General Introduction
LaWGPT is an open source project supported by the Machine Learning and Data Mining Research Group of Nanjing University and is dedicated to building a large language model grounded in Chinese legal knowledge. Starting from general-purpose Chinese base models (e.g., Chinese-LLaMA and ChatGLM), it extends the vocabulary with legal-domain terms and improves the model's semantic comprehension and dialogue ability in legal scenarios through pre-training on a large-scale legal corpus and instruction fine-tuning on legal Q&A datasets. The project is maintained by multiple collaborators and is applicable to scenarios such as legal dialogue and judicial exam training. The model is still limited by its data and capacity, and its output may be unreliable, but its open source nature and community support make it a useful resource for AI research in the legal field.
Function List
- Legal Q&A Generation: Generates answers to input legal questions, suitable for consultation and study.
- Judicial Exam Training: Provides Q&A training based on the China Judicial Exam dataset to help users prepare for the exam.
- Legal Corpus Comprehension: Pre-trained on legal text so that it can parse complex legal instruments and statutory content.
- Command Line Batch Inference: Lets developers process law-related data in batches through scripts.
- Interactive Dialogue Mode: Answers user questions in real time when no predefined data file is supplied.
- Model Weight Support: LoRA weights are provided so that users can combine them with the original base model and make customized adjustments.
Using Help
Installation process
LaWGPT is an open source project hosted on GitHub. Before using it, you need to set up the environment and install the dependencies. The detailed installation steps are as follows:
- Cloning Project Code
Open a terminal and enter the following command to download the code locally:
git clone git@github.com:pengxiao-song/LaWGPT.git
cd LaWGPT
This clones the LaWGPT codebase to your computer and changes into the project directory.
- Creating a Virtual Environment
Use Conda to create a separate Python environment and avoid dependency conflicts:
conda create -n lawgpt python=3.10 -y
conda activate lawgpt
After the environment is activated, all subsequent steps are carried out inside the lawgpt environment.
- Installing Dependencies
The project provides a requirements.txt file that lists the required libraries. Run the following command to install them:
pip install -r requirements.txt
Dependencies include transformers, peft, gradio, etc. Make sure your network connection is stable so that the downloads can complete.
- Getting model weights
Because the full weights of LLaMA and Chinese-LLaMA are not openly released, LaWGPT provides only LoRA weights. You need to:
- Obtain weights for Chinese-LLaMA or other base models from official sources.
- Merge LoRA weights with the base model (see project documentation for details on how to do this).
- Verify Installation
Run the sample script to confirm that the environment is correct:
bash scripts/infer.sh
If you successfully enter interactive mode, the installation is complete.
Usage
Main Function: Legal Q&A and Inference
- Interactive Mode
When no test data path is specified, running bash scripts/infer.sh starts interactive mode. You can enter legal questions directly, for example:
Please explain the content of article 10 of the Contract Law of the People's Republic of China.
The model generates answers in real time and is suitable for quick consultations or learning.
- Batch Inference
To handle multiple questions, prepare a JSON file (for the format, see resources/example_instruction_train.json), for example:
{"instruction": "How is property divided after a divorce?", "output": ""}
Pass the file path into the script:
bash scripts/infer.sh --infer_data_path ./test.json
The model processes and outputs the results line by line, and the results can be saved for subsequent analysis.
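As a minimal sketch of preparing the batch input, the snippet below writes a list of questions to a JSON file using the instruction/output fields shown above. It assumes the file is a JSON array of such objects; check resources/example_instruction_train.json for the exact layout the script expects. The function name write_batch_file is illustrative and not part of LaWGPT.

```python
import json
from pathlib import Path


def write_batch_file(questions, path):
    """Write questions as records with an empty "output" field.

    Assumption: the inference script accepts a JSON array of
    {"instruction": ..., "output": ""} objects.
    """
    # Skip blank entries and trim surrounding whitespace.
    records = [
        {"instruction": q.strip(), "output": ""}
        for q in questions
        if q.strip()
    ]
    # ensure_ascii=False keeps Chinese text readable in the file.
    Path(path).write_text(
        json.dumps(records, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return records
```

Calling write_batch_file(["How is property divided after a divorce?"], "test.json") would produce a file whose path can then be passed via --infer_data_path.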
Featured Function: Judicial Exam Training
- Preparing the dataset
LaWGPT supports training based on the Judicial Exam dataset. You can refer to Awesome Chinese Legal Resources to download a publicly available dataset, or construct your own Q&A pairs in the following format:
{"instruction": "Which of the following is not an element of a crime?", "output": "A. Subject of the crime B. Object of the crime C. Motive for the crime D. Objective aspects of the crime"}
Save the data as a JSON file, e.g. exam_data.json.
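Before training, it can help to check that every Q&A pair has the two expected fields. The following is a rough sketch under the assumption that the dataset is a JSON array of objects with non-empty string values for instruction and output; the helper names validate_records and load_and_validate are hypothetical and not part of LaWGPT.

```python
import json


def validate_records(records):
    """Return a list of (index, problem) tuples for malformed Q&A pairs."""
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append((i, "record is not an object"))
            continue
        for key in ("instruction", "output"):
            value = rec.get(key)
            # Each field must be a non-blank string.
            if not isinstance(value, str) or not value.strip():
                problems.append((i, f"missing or empty '{key}'"))
    return problems


def load_and_validate(path):
    """Load a JSON array of Q&A pairs and check each one."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of Q&A objects")
    return records, validate_records(records)
```

An empty problem list from load_and_validate("exam_data.json") suggests the file matches the assumed layout; any reported index points at the record to fix.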
- Running Training
Use the finetune.py script for instruction fine-tuning, supplying a path after each flag:
python finetune.py --data_path ./exam_data.json --base_model <base model path> --lora_weights <LoRA weight path>
Parameter Description:
--data_path: The dataset path.
--base_model: The base model path.
--lora_weights: The LoRA weight path.
Once training is complete, the model should be better adapted to judicial-exam-style questions.
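If you launch training from a Python script rather than typing the command by hand, one possible approach is to assemble the argument list programmatically. This is only a sketch: it uses the three flags described above, the function name build_finetune_command is hypothetical, and the paths in the usage note are placeholders you would replace with your own.

```python
import shlex
import sys


def build_finetune_command(data_path, base_model, lora_weights):
    """Assemble the finetune.py invocation as an argument list.

    Only the three flags named in this guide are used; any other
    options finetune.py may accept are not covered here.
    """
    return [
        sys.executable, "finetune.py",
        "--data_path", str(data_path),
        "--base_model", str(base_model),
        "--lora_weights", str(lora_weights),
    ]


def format_command(cmd):
    """Render the argument list as a shell-safe string for logging."""
    return shlex.join(cmd)
```

For example, cmd = build_finetune_command("./exam_data.json", "path/to/base_model", "path/to/lora_weights") can be printed with format_command(cmd) for review and then run from the project directory with subprocess.run(cmd, check=True).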
Web Interface Usage
- Starting the WebUI
The project provides a graphical interface via Gradio. Run:
bash scripts/webui.sh
Upon startup, the browser opens a local page (usually http://127.0.0.1:7860).
- Workflow
- Enter a legal question in the input box, e.g., "How do I apply for patent protection?"
- Click "Submit" and wait for the model to generate a response.
- View the output, which can be copied or saved.
The web interface is suitable for non-technical users and is intuitive to use.
Caveats
- Hardware requirements: A GPU (e.g., Tesla V100) is recommended to accelerate inference; running on a CPU may be slow.
- Model selection: The default model is LaWGPT-7B-alpha. To use beta 1.0 or beta 1.1, adjust the model parameters in the script.
- Limitations: The model may generate inaccurate content due to data limitations, so results need to be validated before use, especially in real legal scenarios.
With these steps, you can easily get started with LaWGPT and get efficient support whether you are conducting legal quizzes or preparing for judicial exams.