Previously, SiliconCloud launched an online LoRA fine-tuning feature for language models. By simply uploading corpus data and creating a fine-tuning task, you can obtain your own fine-tuned language model.
Recently, SiliconCloud's online LoRA fine-tuning for LLMs added Qwen2.5-32B, Qwen2.5-14B, and Llama-3.1-8B as fine-tuning base models, which further expands what fine-tuning can do and further reduces the cost of training and serving fine-tuned models.
In fact, fine-tuning a proprietary large language model is quite simple. Using SiliconCloud's fine-tuning feature, you can prepare just a few dozen training samples and make noticeable changes to the base model's behavior.
Preliminaries
First, we need to be clear about what fine-tuning is meant to accomplish. Typically, fine-tuning helps a model learn knowledge not covered by its training data, or develop a distinctive language style.
Our approach varies with the goal. For example, if you want the model to learn certain domain knowledge, the training data needs to contain questions and answers based on that material; if you want the model to learn a particular language style, you need dialogue data written in that style.
Data preparation
Fine-tuning an LLM requires the data to be provided as dialogues, with a structure essentially the same as the messages passed when invoking the LLM. In other words, the fine-tuning data is constructed in the same shape used to call the text-generation interface.
Below is an example of a typical piece of fine-tuning data:
{"messages": [{"role": "system", "content": "You are a fire system expert"}, {"role": "user", "content": "How do I install a fire alarm in my home?"} , {"role": "assistant", "content": "According to the Home Fire Safety System (GB22370-2008), home fire alarms should be installed in accordance with the manufacturer's requirements. Specifically, the installation should ensure that the device is in a normal monitoring state and connected to the appropriate fire detectors and other components. The installation location should avoid high temperature, humidity and dusty places to ensure the normal operation of the device."}]}
To run a fine-tuning job, you usually need to prepare dozens of such records and save them as a .jsonl file, one record per line:
{"messages": [{"role": "system", "content": "You are a fire system expert"}, {"role": "user", "content": "How do I install a fire alarm in my home?"} , {"role": "assistant", "content": "According to the Home Fire Safety System (GB22370-2008), home fire alarms should be installed in accordance with the manufacturer's requirements. Specifically, the installation should ensure that the device is in a normal monitoring state and connected to the appropriate fire detectors and other components. The installation location should avoid high temperatures, humidity and dusty areas to ensure proper operation of the device."}]} {"messages": [{"role": "system", "content": "You are a fire system expert"}, {"role": "user", "content": "What are the power requirements for fire alarms?"} , {"role": "assistant", "content": "According to the Home Fire Safety System (GB22370-2008), the main power supply of the home fire alarm controller should be 220V, 50Hz AC power supply, and the input end of the power cord should be equipped with a terminal. The backup power supply should be a battery that can be charged and discharged several times, and in the case of not exceeding the limit of discharge specified by the manufacturer, the battery should be able to be charged to the rated capacity of more than 80% within 24 hours, and then should be able to be fully charged after 48 hours."}]} {"messages": [{"role": "system", "content": "You are a fire system expert"}, {"role": "user", "content": "What are the requirements for the acoustic and visual signals of a fire alarm?"} , {"role": "assistant", "content": "According to the Home Fire Safety System (GB22370-2008), the sound pressure level of the acoustic signal of the fire alarm shall be not less than 75dB (A-weighted) at 3 meters in at least one direction, and the sound pressure level at 3 meters in any direction shall be not greater than 120 dB (A-weighted) at 3 meters in any direction. 
The optical signal of the fire light alarm shall be clearly visible at a distance of 25 meters under an ambient light of 100lx to 500lx."}]} {"messages": [{"role": "system", "content": "You are an expert in fire protection systems"}, {"role": "user", "content": "What to do with the fault signal of a fire alarm?"} , {"role": "assistant", "content": "According to the Home Fire Safety System (GB22370-2008), when there is a communication failure with the connected components, disconnection or short-circuit of the connecting wire between the backup power charger and the backup power supply, disconnection or short-circuit of the connecting wire between the backup power supply and its loads When there is a communication failure between the connected components, a broken or short-circuited line between the standby power charger and the standby power supply, a broken or short-circuited line between the standby power supply and its loads, or an under-voltage of the main power supply, the home fire alarm controller shall emit an acoustic and optical signal of the fault and indicate the type of the fault within 100 seconds. The audible fault signal shall be able to be canceled manually, and shall be able to be activated again when there is another fault signal input."}]} {"messages": [{"role": "system", "content": "You are a fire system expert"}, {"role": "user", "content": "How does the self-test function of a fire alarm work?"} , {"role": "assistant", "content": "According to the Home Fire Safety System (GB22370-2008), the home fire alarm controller should be able to perform a function check (self-test) on the unit. During the execution of the self-test function, neither the external devices controlled by it nor the output contacts should operate. 
If the self-test time exceeds 1 minute or the self-test function cannot be stopped automatically, the fire alarm functions of the non-self-tested parts and the controller itself shall not be affected."}]} {"messages": [{"role": "system", "content": "You are a fire system expert"}, {"role": "user", "content": "What are the requirements for displaying messages on a fire alarm?"} , {"role": "assistant", "content": "According to the provisions of "Home Fire Safety System" (GB22370-2008), the home fire alarm controller that adopts numerical and alphabetical display of the working status shall arrange the information display level from high to low according to the order of displaying the fire alarm, supervisory alarm and other status. High-level status information should be prioritized for display, and the display of low-level status information should not affect the display of high-level status information. The displayed messages shall be consistent with the corresponding status and easily recognizable."}]} {"messages": [{"role": "system", "content": "You are a fire system expert"}, {"role": "user", "content": "What is the sound pressure level of the fire alarm?"} , {"role": "assistant", "content": "According to the Home Fire Safety System (GB22370-2008), the sound pressure level of the home fire alarm controller in various alarm states (1 meter directly in front) should be not less than 75dB (A weighting)."}]} {"messages": [{"role": "system", "content": "You are a fire system expert"}, {"role": "user", "content": "How to install a combustible gas detector?"} , {"role": "assistant", "content": "According to the Combustible Gas Alarm Controller (GB16808-2008), combustible gas detectors should be installed according to the manufacturer's requirements and connected to the combustible gas alarm controller. 
The installation location should avoid high temperature, humidity and dusty places to ensure the normal operation of the equipment."}]} {"messages": [{"role": "system", "content": "You are a fire protection system expert"}, {"role": "user", "content": "What are the requirements for the alarm function of the combustible gas detector?"} , {"role": "assistant", "content": "According to "Combustible Gas Alarm Controller" (GB16808-2008), combustible gas detector should have the function of low-limit alarm or low-limit and high-limit two-segment alarm. When there is a combustible gas alarm signal input, the controller shall issue an alarm sound and light signal within 10 seconds and hold it until manual reset."}]} {"messages": [{"role": "system", "content": "You are a fire system expert"}, {"role": "user", "content": "How to check the insulation of a fire alarm?"} , {"role": "assistant", "content": "According to the "Household Fire Safety System" (GB22370-2008), measure the insulation resistance value after applying 500V±50V DC voltage to the specimen's external energized terminals and chassis respectively through the insulation resistance test device for 60 seconds±5 seconds. Measure the insulation resistance value. The insulation resistance value between the external energized terminals of the specimen with insulation requirements and the chassis shall be not less than 20MΩ; the insulation resistance value between the power input terminal of the specimen and the chassis shall be not less than 50MΩ."}]}
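Before uploading, it helps to sanity-check the file. Below is a minimal validation sketch; the field names follow the example records above, and the role order (system, user, assistant) is assumed to be fixed as in those examples:

```python
import json

REQUIRED_ROLES = ("system", "user", "assistant")

def validate_record(line: str) -> bool:
    """Return True if a JSONL line looks like a valid fine-tuning record."""
    record = json.loads(line)
    messages = record.get("messages", [])
    roles = [m.get("role") for m in messages]
    # Every record should contain system, user and assistant messages,
    # each with non-empty string content.
    return (
        roles == list(REQUIRED_ROLES)
        and all(isinstance(m.get("content"), str) and m["content"] for m in messages)
    )

def validate_file(path: str) -> list:
    """Return the 1-based line numbers of invalid records in a .jsonl file."""
    bad = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                ok = validate_record(line)
            except json.JSONDecodeError:
                ok = False
            if not ok:
                bad.append(i)
    return bad
```

Calling `validate_file` on your corpus file returns an empty list when every record is well-formed, so any problem lines can be fixed before the upload.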
If we have accumulated a relatively large amount of high-quality data in our daily business, we can start fine-tuning after simple data cleaning to build files that meet the format requirements. If we lack such data, or the cleaning process is complicated, we can construct the data with the help of a large model instead.
Next, let's walk through the data-construction process with an example: training a "firefighting expert" model by building questions and answers from fire protection standards.
Constructing fine-tuning data
SiliconCloud provides a Qwen2.5 model with a 128K context window, which is more than enough to cover quite rich content.
First, prepare locally the file that will be used to construct the data, for example GB22370-2008 Home Fire Safety System.pdf.
Next, extract the contents of the file, write the prompt, and have the Qwen model generate data that meets the above requirements. This can be done in code, or with interface tools such as Dify.
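If you prefer code over a workflow tool, the same steps can be sketched in Python. This is a sketch under several assumptions: it uses `pypdf` as one possible text extractor, the SiliconCloud OpenAI-compatible endpoint URL and the `SILICONFLOW_API_KEY` environment-variable name are assumptions to verify against the platform docs, and the prompt wording is abbreviated:

```python
import json
import os
import urllib.request

def extract_text(pdf_path: str, limit: int = 80_000) -> str:
    """Extract the PDF text and keep only the first `limit` characters."""
    from pypdf import PdfReader  # third-party; one possible extractor
    reader = PdfReader(pdf_path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    return text[:limit]

def build_messages(content: str, trigger: str) -> list:
    """Build chat messages asking the model for JSONL Q&A records."""
    system = (
        "You construct fine-tuning data. For the given Content, list 10 questions, "
        "answer each faithfully to the text, and output canonical JSONL where each "
        f'line has system/user/assistant messages; the system content is "{trigger}".'
    )
    user = f'"Content"\n{content}\n"Trigger word"\n{trigger}'
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

def generate_corpus(pdf_path: str, trigger: str) -> str:
    """Call SiliconCloud's OpenAI-compatible chat endpoint (URL assumed)."""
    payload = {
        "model": "Qwen/Qwen2.5-72B-Instruct-128K",
        "messages": build_messages(extract_text(pdf_path), trigger),
        "temperature": 0.3,  # a low temperature stabilizes the output format
    }
    req = urllib.request.Request(
        "https://api.siliconflow.cn/v1/chat/completions",  # assumed endpoint
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['SILICONFLOW_API_KEY']}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

The model's reply can then be written straight to a .jsonl file and validated before upload.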
Take the example of configuring a workflow in Dify:
1. In the Start node, configure a file-upload field, plus a text field for the System Prompt to be used in the training corpus;
2. Use Dify's built-in document extraction node and a code node to merge the file contents into a single text;
3. In an LLM node, call SiliconCloud's Qwen/Qwen2.5-72B-Instruct-128K model (if the file fits within 32K, you can choose another model provided by the platform). If you can't find this model, add it as follows:
   - In Settings - Model Provider, select SiliconFlow;
   - Click Add Model, fill in the required parameters as shown in the image, and save.
4. Use the output of the large model as the output of the workflow; each run then yields a batch of corpus data;
5. Finally, copy the content, save it as a .jsonl file, and upload it in SiliconCloud to create the fine-tuning task.
If the data generated at one time is not sufficient, multiple runs can be made to construct more data.
It is important to note that more fine-tuning data is not necessarily better. Even with only one record, we can start fine-tuning; conversely, if we have a lot of data of insufficient quality, the fine-tuning results may fall short of expectations.
For reference, the key parts of the Dify workflow DSL are as follows (layout and UI fields omitted):

```yaml
app:
  name: Fine-tune corpus builder
  mode: workflow
  icon: 🤖
  description: >
    Upload a file and generate Q&A content from it, using SiliconCloud's
    Qwen2.5 model with a 128K context; corpus data is output in JSONL format.
    Notes: due to Dify limitations, files longer than 80,000 characters are
    truncated; generated content is for reference only and may contain
    hallucinations or formatting errors, so review it before use.
workflow:
  graph:
    # Start -> Document Extractor -> Code -> LLM -> End
    nodes:
      - type: start
        variables:
          - variable: attachments   # corpus document(s), local file or remote URL
          - variable: trigger       # trigger word (the system prompt in training)
      - type: document-extractor    # extracts the text of the uploaded files
      - type: code                  # merges the sections, keeps the first 80,000 characters
        code: |
          def main(articleSections: list) -> dict:
              try:
                  # Combine the extracted sections into one string
                  combined_text = "\n".join(articleSections)
                  # Keep only the first 80,000 characters
                  truncated_text = combined_text[:80000]
                  return {"result": truncated_text}
              except Exception:
                  # Error handling: return an empty result
                  return {"result": ""}
      - type: llm
        model:
          provider: siliconflow
          name: Qwen/Qwen2.5-72B-Instruct-128K
          completion_params:
            temperature: 0.3        # a lower temperature stabilizes the output format
            max_tokens: 4096
            frequency_penalty: 0.5
        prompt_template:
          - role: system
            text: |
              [Role] You are an LLM scientist who helps users construct
              fine-tuning data that conforms to the specification, based on
              the content the user supplies.
              [Tasks]
              - For the given Content, list 10 generic questions at a time;
              - For each question, provide an answer by quoting the original
                text of the Content together with a reasonable explanation or
                deduction;
              - Organize the questions and answers into standardized JSONL.
              [Requirements]
              1. Questions must NOT quote the Content verbatim; they should be
                 close to contemporary real life;
              2. Questions should be in plain language, avoiding empty
                 generalities;
              3. Answers should be faithful to the original text, and any
                 interpretation should not stray from its main idea.
              [Output specification]
              * Output canonical JSONL, one record per line.
              * Each record contains a messages array with three entries whose
                roles are system, user and assistant.
              * The system message is used as the system prompt in training;
                its content is the user-specified "trigger word".
              * The user message is the question; the assistant message is the
                answer. For example:
              {"messages": [{"role": "system", "content": "You are a contemporary scholar"}, {"role": "user", "content": "How should I study?"}, {"role": "assistant", "content": "He who honors the worthy over beauty, serves his parents with all his strength, serves his ruler with his life, and keeps his word with friends: though he says he has not studied, I would surely say he has."}]}
          - role: user
            text: |
              "Content" {{#1735807761855.result#}}
              "Trigger word" {{#1735807686274.trigger#}}
      - type: end                   # outputs the LLM text as the workflow result
```
In addition to tools such as Dify, directly using models such as Qwen/Qwen2.5-Coder-32B-Instruct to help write scripts is also an efficient way to construct fine-tuning data. It sidesteps various limitations of third-party platforms and makes full use of the SiliconCloud platform's rich model selection and efficient inference to harvest more high-quality corpus data.
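One simple pattern such a script can use: when a standard document exceeds the model's context (or a platform's character limit), split it into chunks with a little overlap and construct corpus data chunk by chunk. A sketch of such a splitter, where the chunk size and overlap are arbitrary choices rather than platform requirements:

```python
def split_chunks(text: str, chunk_size: int = 30_000, overlap: int = 500) -> list:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters, so a clause cut at a
    boundary still appears intact in at least one chunk.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

Each chunk is then sent to the model with the same prompt, and the resulting JSONL fragments are merged into one corpus file.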
Fine-tuning training and validation
Create a new fine-tuning task on the SiliconCloud platform, upload and select the .jsonl file you just saved, and click Start Fine-Tuning. Wait for the task to finish queuing; once execution is complete, you can use the fine-tuned model, or validate the results in the online playground or through the API. The Compare feature of the chat-model playground also lets you compare the responses of multiple checkpoints and base models, and pick the best-performing one to keep using.

Fine-tuned model vs. base model:

Comparison between multiple checkpoints:

At this point, we have used local files to build LLM fine-tuning data and fine-tuned a fire-expert LLM. Of course, following the same steps, you can also build your own domain-specific LLM.
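Once the fine-tuned model is available, it is invoked like any other chat model; only the model identifier changes. A sketch with a hypothetical fine-tuned model ID (copy the real ID from the fine-tuning task page) and the assumed SiliconCloud OpenAI-compatible endpoint and `SILICONFLOW_API_KEY` variable name:

```python
import json
import os
import urllib.request

def build_request(model_id: str, question: str) -> dict:
    """Build a chat-completion payload for the fine-tuned model."""
    return {
        "model": model_id,  # hypothetical ID; use the one shown on the task page
        "messages": [
            # Use the same system prompt (trigger word) as in the training data
            {"role": "system", "content": "You are a fire system expert"},
            {"role": "user", "content": question},
        ],
    }

def ask(model_id: str, question: str) -> str:
    """POST the payload to the assumed OpenAI-compatible endpoint."""
    req = urllib.request.Request(
        "https://api.siliconflow.cn/v1/chat/completions",  # assumed endpoint
        data=json.dumps(build_request(model_id, question)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['SILICONFLOW_API_KEY']}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Reusing the training-time system prompt matters: the fine-tuned behavior was learned under that trigger word, so queries should be sent with it as well.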