The complete large-model fine-tuning process
It is recommended to follow the process above strictly and avoid skipping steps, which can lead to wasted effort. For example, if the dataset is built carelessly and the fine-tuned model's poor performance ultimately turns out to be a data-quality problem, the earlier work is largely wasted and has to be redone.
Dataset collection and organization
Based on availability, datasets can be divided into two types: publicly available datasets and datasets that are difficult to obtain.
How do I get access to publicly available datasets?
The easiest way to obtain publicly available datasets is to search for and download them on open-source platforms such as GitHub, Hugging Face, Kaggle, and ModelScope. You can also try to collect data from websites with web crawlers, for example from Tieba, Zhihu, or vertical industry sites. Crawling data usually requires some technical work and must comply with the relevant laws and regulations.
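As a minimal sketch of the first approach, a public dataset can be pulled from the Hugging Face Hub with the datasets library. The dataset name below is only an illustrative choice; substitute one that fits your task:

```python
# pip install datasets
from datasets import load_dataset

# Illustrative example: download a public instruction dataset from the Hugging Face Hub.
# Replace "tatsu-lab/alpaca" with whatever dataset fits your task.
dataset = load_dataset("tatsu-lab/alpaca", split="train")

print(dataset[0])                  # inspect one sample
dataset.to_json("raw_data.json")   # save locally for later cleaning
```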
What if the required data is not publicly available or is hard to find online?
When existing public datasets do not meet your needs, another option is to build the dataset yourself. However, writing hundreds or thousands of samples by hand is both tedious and time-consuming. So how can you build a dataset efficiently? Two common approaches are described below:
1. Leveraging the "data augmentation" features of large-model platforms
Many large-model platforms now provide data-augmentation features that can effectively help expand a dataset. For example, platforms such as the Zhipu Open Platform, the iFLYTEK Open Platform, and Volcano Engine can quickly generate more samples from your original data. The workflow is: first prepare a small amount of data by hand (e.g., 50 samples) and upload it to one of these platforms; the platform then applies data-augmentation techniques to expand the dataset quickly.
2. Generating data using large models
Another efficient way to generate data is with the help of a large model. First, prepare a small number of examples (a few dozen) and feed them to the model as references; the model can then generate similar data. To ensure quality, do not generate too much at once in the first round: generate about 20 samples, review them carefully, and identify the typical mistakes the model makes. These error examples can then be fed back to the model to improve subsequent generations.
Example prompt: Referring to the following examples, output 20 similar "recent luck" questions. Keep the language natural and not too exaggerated, avoid qualifiers such as "in activities" or "in competitions", and keep the questions concise, avoiding unnecessary embellishment. Examples: Is this quarter going well for me? Will I get rich this quarter? What is waiting for me lately? What will happen in the near future? Will luck be on my side in the near future?
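A possible sketch of this workflow, assuming access to an OpenAI-compatible chat API; the model name and example prompts are placeholders:

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # assumes an API key in the OPENAI_API_KEY environment variable

EXAMPLES = [
    "Is this quarter going well for me?",
    "Will I get rich this quarter?",
    "What is waiting for me lately?",
]

prompt = (
    "Referring to the following examples, output 20 similar 'recent luck' questions, "
    "one per line. Keep the language natural, concise, and not exaggerated.\n"
    + "\n".join(EXAMPLES)
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder: use whichever model you have access to
    messages=[{"role": "user", "content": prompt}],
)

# Review the generated questions manually before adding them to the dataset.
for line in response.choices[0].message.content.splitlines():
    print(line)
```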
Data cleaning and fine-tuning dataset construction
Raw datasets are usually not directly usable for model training, so they need to be cleaned and converted into a format the model can learn from. In most cases, fine-tuning datasets are constructed in the Alpaca format.
Introduction to the Alpaca format
The Alpaca format has a simple structure, which makes it particularly suitable for tasks such as text generation, translation, and summarization, and especially for single-turn instruction fine-tuning. Its main fields are explained below:
- instruction: the task instruction, similar to a user request (required); it tells the model explicitly what to do.
- input: the input required by the task; it can be an empty string when the task is open-ended or needs no explicit input.
- output: the desired output the model should generate given the instruction and input, i.e. the reference answer (required).
- system: a system prompt that sets the model's behavior, usually some initial instructions for the model (optional).
- history: previous dialogue turns that help the model build context (optional); in multi-turn tasks the model generates its answer based on this history.
For example, Alpaca-format data looks like this:
[ { "instruction": "Form a sentence using the words 'dream', 'perseverance' and 'success'." , "input": "", "output": "Only by persisting in chasing your dreams can you ultimately succeed." }, { "instruction": "Give a short introduction to The Romance of the Three Kingdoms." , "input": "", "output": "Romance of the Three Kingdoms is one of the Four Great Masterpieces of Classical China, depicting the history of nearly a hundred years from the end of the Eastern Han Dynasty to the early years of the Western Jin Dynasty, recounting the chaos of the wars of division among the heroes in the late Eastern Han Dynasty and the political and military struggles among the three kingdoms of Wei, Shu, and Wu, and portraying a number of heroic characters." } ]
Dataset construction methods
To make a fine-tuning dataset, we usually need to convert the data into a JSON file that conforms to the Alpaca format. Two common methods are described below:
- Manual construction: after organizing the required data, convert it into the Alpaca format by writing a small Python script (see the sketch after this list). This method suits simple scenarios and small datasets.
- Automated construction with a large model: generate the dataset automatically by calling a large-model API. This approach suits large-scale datasets, especially when the task's instructions and output patterns are relatively fixed.
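A minimal sketch of the manual approach, assuming the raw data has already been cleaned into question-answer pairs; the file name and sample content are illustrative:

```python
import json

# Illustrative raw data: a list of (question, answer) pairs collected earlier.
raw_pairs = [
    ("Is this quarter going well for me?", "Your fortune this quarter is generally smooth..."),
    ("Will I get rich this quarter?", "There are signs of financial improvement..."),
]

# Convert each pair into an Alpaca-format record.
alpaca_records = [
    {"instruction": question, "input": "", "output": answer}
    for question, answer in raw_pairs
]

# Write the fine-tuning dataset as a JSON file.
with open("train_alpaca.json", "w", encoding="utf-8") as f:
    json.dump(alpaca_records, f, ensure_ascii=False, indent=2)
```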
Full dataset format
The complete Alpaca format is shown below; it contains the task instruction, input, output, system prompt, and dialogue history:
[ { "instruction": "Human Instruction (required)", "input": "Human Input (optional)", "output": "Model Answer (required)", "system": "System prompts (optional)", "history": [ ["Round 2 Instructions (optional)", "Round 2 Answers (optional)" ] ] } ]
This format helps the model learn the mapping from instruction to output, much like giving the model practice problems: instruction + input is the question and output is the answer.
Base Model Selection
- Model Type Selection: Select the base model, such as GPT, LLaMA, or BERT, based on the task requirements.
- Size and Parameters: decide on the model size (e.g., 7B, 13B, or 65B parameters), taking into account computational resources, training time, and inference speed.
- Open Source vs Commercial Models: decide, based on your needs, between open-source models (e.g., LLaMA, Falcon) and commercial closed-source models (e.g., the OpenAI GPT family).
- Comparison Testing: run the candidate models against the same test data to find the best fit (see the sketch after this list).
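One way to run such a comparison, sketched with the Hugging Face transformers pipeline; the candidate model names and test prompts are placeholders, and a real evaluation would normally add task-specific metrics rather than eyeballing outputs:

```python
# pip install transformers torch
from transformers import pipeline

# Placeholder candidates: substitute the base models you are actually considering.
candidate_models = ["Qwen/Qwen2-1.5B-Instruct", "meta-llama/Llama-3.2-1B-Instruct"]

test_prompts = [
    "Is this quarter going well for me?",
    "Give a short introduction to Romance of the Three Kingdoms.",
]

for model_name in candidate_models:
    generator = pipeline("text-generation", model=model_name)
    print(f"=== {model_name} ===")
    for prompt in test_prompts:
        result = generator(prompt, max_new_tokens=100)
        print(result[0]["generated_text"])  # compare outputs across models
```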
Description of model parameters
Five soul-searching questions about fine-tuning
I. What is fine-tuning?
Fine-tuning is the process of further training an already pre-trained model on a new dataset. Pre-trained models have usually learned rich features and knowledge from large datasets and have a degree of generalization ability. The core goal of fine-tuning is to transfer this general knowledge to a new, more specific task or domain so that the model can solve that particular problem better.
II. Why fine-tuning?
1. Savings in computing resources
Training a large model from scratch requires enormous computational resources and time and is very costly. Fine-tuning uses a pre-trained model as the starting point and needs only a relatively small amount of training on the new dataset to achieve good results, greatly reducing computational cost and time.
2. Enhancing model performance
Pre-trained models, while having generalized capabilities, may not perform well on specific tasks. Fine-tuning improves accuracy and efficiency by tuning model parameters with domain-specific data to make them more adept at handling the target task.
3. Adapting to new domains
Generalized pre-trained models may not understand the data characteristics of a specific domain well, and fine-tuning can help models adapt to new domains and make them better at handling the data in a specific task.
III. What does fine-tuning get you?
Fine-tuning yields an optimized, adapted model. This model keeps the structure of the original pre-trained model, but its parameters have been updated to better fit the new task or domain.
Example:
Suppose there is a pre-trained image classification model that recognizes common objects. If there is a need to recognize specific types of flowers, the model can be fine-tuned with a new dataset containing various flower images and labels. After fine-tuning, the parameters of the model are updated to more accurately recognize these flower types.
IV. How to put the fine-tuned model into production use?
1. Deployment to production environment
Integrate the model into a website, mobile app, or other system. Deployment can use a model server or cloud service, for example TensorFlow Serving, TorchServe, or the APIs provided by Hugging Face.
2. Inference tasks
Use the fine-tuned model for inference, for example generating predictions from given inputs or analyzing results (a minimal sketch follows).
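A minimal inference sketch, assuming the fine-tuning produced a LoRA adapter saved locally; the base model name and adapter path are placeholders:

```python
# pip install transformers peft torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_name = "Qwen/Qwen2-1.5B-Instruct"  # placeholder base model
adapter_path = "./finetuned-lora-adapter"     # placeholder path to the saved LoRA adapter

tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(base_model_name)
model = PeftModel.from_pretrained(base_model, adapter_path)  # attach the fine-tuned adapter

prompt = "Is this quarter going well for me?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```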
3. Continuous updating and optimization
Based on new requirements or feedback, the model is further fine-tuned or more data is added for training to maintain optimal model performance.
V. How to choose a fine-tuning method?
- LoRA: low-rank adaptation, which shrinks the set of trainable fine-tuning parameters; suitable for resource-constrained environments (a configuration sketch follows this list).
- QLoRA: LoRA combined with quantization, which makes fine-tuning large models even more memory-efficient.
- P-Tuning: a prompt-learning technique suited to few-shot tasks or small amounts of labeled data.
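For reference, a minimal LoRA configuration using the peft library; the base model, hyperparameters, and target modules are illustrative and depend on the model architecture you choose:

```python
# pip install peft transformers torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B-Instruct")  # placeholder base model

lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,                         # scaling factor for the LoRA updates
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # illustrative; depends on the model architecture
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of parameters are trainable
```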