LLM fine-tuning is becoming a critical capability. As organizations add Large Language Models (LLMs) to their operational systems, monitoring both model performance and model accuracy becomes increasingly important. This post, which covers LLM fine-tuning, is the first in a series on the challenges and techniques of deploying and managing LLMs.
1. Introduction – How does LLM fine-tuning work?
Although large language models (LLMs) have powerful natural-language capabilities, they are costly to train. Fortunately, tools such as Ollama have been developed for efficient deployment and tuning of LLMs. For example, they allow a data scientist to efficiently fine-tune selected parts of an underlying neural network model with new training data.
In this guide, we show how to fine-tune and improve an existing base LLM using new training data. Broadly speaking, fine-tuning consists of the following steps:
- Collect new training data
- Generate an adapter patch
- Patch the existing base model with the adapter patch
2. Collect new training data
We want to fine-tune a base model on a natural-language corpus, such as a chat-conversation excerpt. First, you need to get this training data into an appropriate format.
In the demo later in this guide, we obtain the chat content from the popular guanaco demo dataset. Alternatively, you can supply a custom chat training dataset in a variety of formats, such as JSON (JavaScript Object Notation).
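As a concrete illustration, here is a minimal sketch of loading such a dataset with the Hugging Face datasets library. The dataset name used below (mlabonne/guanaco-llama2-1k, a small guanaco variant often used in demos) is an assumption for illustration and may differ from your setup:

```python
# Minimal sketch: load a chat-style training dataset.
# Assumes the Hugging Face `datasets` library; the dataset name below is a
# commonly used guanaco demo variant (an assumption, not a requirement).
from datasets import load_dataset

dataset = load_dataset("mlabonne/guanaco-llama2-1k", split="train")
print(dataset[0]["text"])  # inspect one formatted chat training example
```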
3. Generate an adapter patch
With the training data in hand, we can now create an adapter patch model. The adapter patch lets us train LLMs efficiently by updating far fewer parameters, which saves computational resources and supports task-specific customization. Obtaining the adapter patch involves the following main steps; we will contextualize them with a demo setup afterwards.
- Load the training data
  - This step loads the training data (to be used in training, Step 4).
- Load the base model
  - This step loads the base model and extracts its tokenizer (to be used in training, Step 4).
  - The tokenizer defines how text is split into ‘token’ units for natural language processing.
- Load parameters
  - This step prepares the PEFT (Parameter-Efficient Fine-Tuning) parameters using the LoRA (Low-Rank Adaptation of Large Language Models) technique, along with the other training parameters (to be used in training, Step 4).
- Run training with the training data, base model, and parameters
  - This step calls the train function of your LLM library to perform the training.
- Save the new model and tokenizer
  - This step saves the model so it can later be used for inference.
- Convert the adapter patch to the appropriate GGML format with Ollama
  - This step converts the adapter model file into the GGML-formatted file required by the merge step that follows.
Let us try a demo setup via JupyterLab as follows [based on the datacamp link]:
- There is a demo notebook “demo.ipynb” in the current directory [our own generic demo link if provided]
The notebook cells corresponding to the main steps are as follows:
1. Load the training data
  - You may modify the new_model variable to your model-name preference
2. Load the base model and tokenizer
3. Load parameters
4. Run training with the training data, base model, and the parameters
5. Save the new model and tokenizer under the new_model name
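Consolidated into one place, these cells might look roughly like the sketch below. It assumes the Hugging Face transformers, peft, and trl libraries (argument names such as dataset_text_field vary across trl versions), and the base-model and dataset names are illustrative assumptions:

```python
# Sketch of the training steps; library APIs and names are assumptions that
# may vary with your transformers/peft/trl versions.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

base_model = "NousResearch/Llama-2-7b-chat-hf"  # illustrative base model
new_model = "llama-2-7b-chat-custom"            # your model-name preference

# Step 1: load the training data
dataset = load_dataset("mlabonne/guanaco-llama2-1k", split="train")

# Step 2: load the base model and its tokenizer
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token

# Step 3: PEFT (LoRA) parameters and other training parameters
peft_config = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.1, task_type="CAUSAL_LM")
training_args = TrainingArguments(
    output_dir="./results", num_train_epochs=1, per_device_train_batch_size=4
)

# Step 4: run training
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",  # column holding the formatted chat text
    tokenizer=tokenizer,
    args=training_args,
)
trainer.train()

# Step 5: save the adapter model and the tokenizer
trainer.model.save_pretrained(new_model)
tokenizer.save_pretrained(new_model)
```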
After running, the tuned adapter model is saved as a binary file under the specified model directory (e.g. llama-2-7b-chat-custom/adapter_model.bin).
For Step 6, use the Ollama tooling to convert the adapter model file to the GGML file.
This can be encapsulated in a script as follows:
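A minimal sketch follows. It assumes a local clone of the llama.cpp repository, whose convert-lora-to-ggml.py script performed this conversion in older releases; the script's name and location are assumptions that may differ in your version:

```python
# Sketch: convert the saved PEFT adapter to GGML by invoking llama.cpp's
# conversion script (an assumption: present in older llama.cpp releases).
import subprocess

subprocess.run(
    ["python", "llama.cpp/convert-lora-to-ggml.py", "llama-2-7b-chat-custom"],
    check=True,
)
# Expected output: llama-2-7b-chat-custom/ggml-adapter-model.bin
```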
At the end of these steps, we have the GGML-formatted adapter model file (e.g. “ggml-adapter-model.bin”) that we will use to tune the base model.
4. Patch the existing base model with the adapter patch
Given the adapter patch, we can now fine-tune the base model in two steps:
- First, call the merge function of your LLM library to merge the loaded base model with the adapter patch.
  - The Hugging Face PEFT library, for instance, provides a merge_and_unload function for this purpose.
- Then, save the resulting model and tokenizer to your working directory. The model is written as a safetensors serialization file, while the tokenizer is written as JSON.
These steps can be encapsulated in a script.
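Here is a minimal sketch of such a script, assuming the Hugging Face transformers and peft libraries; the model and adapter paths are illustrative:

```python
# Sketch: merge the LoRA adapter into the base model and save the result.
# Assumes transformers + peft; model and adapter paths are illustrative.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "NousResearch/Llama-2-7b-chat-hf", torch_dtype=torch.float16
)
model = PeftModel.from_pretrained(base, "llama-2-7b-chat-custom")
merged = model.merge_and_unload()  # fold the LoRA weights into the base model

merged.save_pretrained("llama-2-7b-chat-merged")     # writes *.safetensors
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf")
tokenizer.save_pretrained("llama-2-7b-chat-merged")  # writes the tokenizer JSON
```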
The tuned model can now be used for inference to support your AI tasks.
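As a quick sanity check, you can run a prompt through the merged model, for example with the transformers pipeline API (the path and prompt below are illustrative):

```python
# Quick inference check with the merged model via the transformers pipeline.
from transformers import pipeline

generator = pipeline("text-generation", model="llama-2-7b-chat-merged")
print(generator("What is LLM fine-tuning?", max_new_tokens=64)[0]["generated_text"])
```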
5. Conclusion
In conclusion, we discussed how to fine-tune LLMs with training data drawn from a natural-language corpus, such as a conversation excerpt.
In particular, we learned how to prepare a custom corpus as structured training data (e.g. JSON-formatted data). We then followed a set of steps to efficiently train a base model on that data and obtain an adapter patch. Finally, we merged the base model with the generated adapter patch.
Congratulations, you now understand how LLM fine-tuning works!