General Introduction
ImBD (Imitate Before Detect) is a machine-generated text detection project presented at AAAI 2025. As Large Language Models (LLMs) such as ChatGPT become widely used, recognizing AI-generated text is increasingly difficult. ImBD takes an "Imitate Before Detect" approach: it first learns to imitate the stylistic features of machine text, then uses that understanding to improve detection. The method is described as the first to align a model with the style preferences of machine text, and it establishes a comprehensive detection framework that can recognize machine-revised text, that is, text in which human and machine writing are mixed. The project is released under the Apache 2.0 open source license and provides a complete code implementation, pre-trained models, and detailed documentation, so that researchers and developers can build further research and applications on it.
Feature List
- High-precision detection of machine-generated text
- Pre-trained models for immediate deployment and use
- An implementation of a novel text style alignment algorithm
- Detailed experimental datasets and evaluation benchmarks
- Complete training and inference code
- Model fine-tuning on custom training data
- Detailed API documentation and usage examples
- Command-line tools for quick testing and evaluation
- Batch text processing
- Visualization tools for displaying detection results
Usage Guide
1. Environment setup
First, configure your Python environment and install the required dependencies:
git clone https://github.com/Jiaqi-Chen-00/ImBD
cd ImBD
pip install -r requirements.txt
2. Data preparation
Before using ImBD, prepare the training and test data. The data should contain the following two categories:
- Human-written original text
- Machine-generated or machine-modified text
Data format requirements:
- Text files must be UTF-8 encoded
- Each sample occupies one line
- Splitting the dataset into training, validation, and test sets in an 8:1:1 ratio is recommended
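The format requirements above can be sketched in a few lines of Python. This is a minimal illustration, not part of the ImBD code base: the helper names `load_samples`, `split_dataset`, and `write_samples` are hypothetical, and because this guide does not say how the two text categories are stored, the sketch simply splits one list of lines.

```python
import random
from pathlib import Path


def load_samples(path):
    """Read a UTF-8 text file with one sample per line, skipping blank lines."""
    # encoding="utf-8" raises UnicodeDecodeError if the file is not valid UTF-8.
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def split_dataset(samples, ratios=(8, 1, 1), seed=0):
    """Shuffle samples and split them into train/validation/test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # any rounding remainder lands in the test set
    return train, val, test


def write_samples(samples, path):
    """Write one sample per line, UTF-8 encoded."""
    Path(path).write_text("".join(s + "\n" for s in samples), encoding="utf-8")
```

Calling `split_dataset(load_samples("all.txt"))` and passing each returned list to `write_samples` would produce the three files used in the following steps; `all.txt` is a placeholder name.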
3. Model training
Run the following command to start training:
python train.py \
--train_data path/to/train.txt \
--val_data path/to/val.txt \
--model_output_dir path/to/save/model \
--batch_size 32 \
--learning_rate 2e-5 \
--num_epochs 5
4. Model evaluation
Evaluate model performance on the test set:
python evaluate.py \
--model_path path/to/saved/model \
--test_data path/to/test.txt \
--output_file evaluation_results.txt
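This guide does not say which metrics the evaluation step reports. Detectors of this kind are commonly compared by AUROC, so the sketch below shows how that number can be computed from per-sample scores. It is an illustration under assumptions: the function `auroc` is hypothetical, labels are assumed to be 1 for machine text and 0 for human text, and a higher score is assumed to mean "more likely machine".

```python
def auroc(labels, scores):
    """Area under the ROC curve, computed by pairwise comparison.

    Equals the probability that a randomly chosen machine sample (label 1)
    scores higher than a randomly chosen human sample (label 0); ties count
    as half a win.
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The pairwise loop is quadratic in the number of samples, which is fine for a quick check on a small test set; a rank-based implementation is preferable for large ones.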
5. Text detection
Detect a single text:
python detect.py \
--model_path path/to/saved/model \
--input_text "Text content to be detected" \
--output_format json
Detect a batch of texts:
python batch_detect.py \
--model_path path/to/saved/model \
--input_file input.txt \
--output_file results.json
6. Advanced functions
6.1 Model fine-tuning
If you need to optimize for domain-specific text, you can fine-tune the model using your own dataset:
python finetune.py \
--pretrained_model_path path/to/pretrained/model \
--train_data path/to/domain/data \
--output_dir path/to/finetuned/model
6.2 Visualization
Use the built-in visualization tools to analyze the detection results:
python visualize.py \
--results_file path/to/results.json \
--output_dir path/to/visualizations
6.3 API Service Deployment
Deploy the model as a REST API service:
python serve.py \
--model_path path/to/saved/model \
--host 0.0.0.0 \
--port 8000
7. Caveats
- Use GPUs for model training where possible to improve efficiency
- Training data quality has a significant impact on model performance
- Update the model regularly to keep up with the characteristics of newer AI-generated text
- Track model versions when deploying in production environments
- Save detection results for later analysis and model optimization
8. Frequently asked questions
Q: What languages does the model support?
A: The model currently supports mainly English. Other languages require training on corresponding datasets.
Q: How can I improve detection accuracy?
A: Performance can be improved by adding training data, tuning model parameters, and fine-tuning using domain-specific data.
Q: How can detection speed be optimized?
A: Detection speed can be improved by batch processing, model quantization, and using GPU acceleration.
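As a rough illustration of the batch-processing point, the sketch below groups texts into fixed-size batches so that a scoring function is called once per batch instead of once per text. The names `batched`, `detect_all`, and `score_batch` are hypothetical; `score_batch` stands in for whatever batched scoring call the detector actually exposes.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def detect_all(texts, score_batch, batch_size=32):
    """Score every text, calling `score_batch` once per batch.

    `score_batch` is a placeholder: any callable that takes a list of texts
    and returns one result per text, in the same order.
    """
    results = []
    for batch in batched(texts, batch_size):
        results.extend(score_batch(batch))
    return results
```

With a GPU-backed model, larger batches usually raise throughput until memory runs out, so the batch size is worth tuning for the hardware in use.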