ImBD: Detecting Whether Text Was Generated by Artificial Intelligence

General Introduction

ImBD (Imitate Before Detect) is a machine-generated text detection project presented at AAAI 2025. As Large Language Models (LLMs) such as ChatGPT become widely used, recognizing AI-generated text is increasingly difficult. ImBD proposes an "Imitate Before Detect" approach: it first learns to imitate the stylistic features of machine text, then uses that understanding to improve detection. The project describes itself as the first method to align a model with the style preferences of machine text, and it builds a comprehensive detection framework that can recognize both machine-generated and machine-revised text. The project is released under the Apache 2.0 open source license and provides a complete code implementation, pre-trained models, and detailed documentation, so researchers and developers can build further research and applications on it.

Demo address: https://ai-detector.fenz.ai/ai-detector

Feature List

Supports high-precision detection of machine-generated text
Provides pre-trained models for immediate deployment and use
Implements a novel text style feature alignment algorithm
Includes detailed experimental datasets and evaluation benchmarks
Provides complete training and inference code
Supports customized training data for model fine-tuning
Includes detailed API documentation and usage examples
Provides command line tools for quick testing and evaluation
Supports batch text processing
Includes visualization tools to display detection results

Usage Guide

1. Environment configuration

First, clone the repository and install the required dependencies in your Python environment:

git clone https://github.com/Jiaqi-Chen-00/ImBD
cd ImBD
pip install -r requirements.txt

2. Data preparation

Before using ImBD, prepare the training and test data. The data should contain the following two categories:

Human-written original text
Machine-generated or machine-modified text

Data format requirements:

Text files must be UTF-8 encoded
Each sample occupies one line
It is recommended to split the dataset into training, validation, and test sets in a ratio of 8:1:1
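
The 8:1:1 split described above can be sketched in a few lines of Python. This is a minimal illustration and not part of the ImBD code base: the function names split_dataset and split_file, the fixed random seed, and the output file names are all assumptions chosen for the example.

```python
import random
from pathlib import Path


def split_dataset(lines, ratios=(8, 1, 1), seed=0):
    """Shuffle one-sample-per-line text and split it by the given ratio.

    Hypothetical helper for illustration; not an ImBD API.
    """
    # Drop empty lines and trailing newlines so each entry is one sample.
    samples = [ln.strip() for ln in lines if ln.strip()]
    random.Random(seed).shuffle(samples)  # seeded for reproducibility

    total = sum(ratios)
    n_train = len(samples) * ratios[0] // total
    n_val = len(samples) * ratios[1] // total

    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


def split_file(src, out_dir):
    """Read a UTF-8 file and write train.txt, val.txt, and test.txt."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(src, encoding="utf-8") as f:
        parts = split_dataset(f)
    for name, part in zip(("train", "val", "test"), parts):
        text = "\n".join(part) + ("\n" if part else "")
        (out / f"{name}.txt").write_text(text, encoding="utf-8")
```

Calling split_file("all_samples.txt", "data") (both paths are placeholders) would produce files that can be passed to the training and evaluation commands. Human-written and machine text would typically be split separately so that each subset keeps both categories.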

3. Model training

Run the following command to start training:

python train.py \
--train_data path/to/train.txt \
--val_data path/to/val.txt \
--model_output_dir path/to/save/model \
--batch_size 32 \
--learning_rate 2e-5 \
--num_epochs 5

4. Model evaluation

Evaluate model performance on the test set:

python evaluate.py \
--model_path path/to/saved/model \
--test_data path/to/test.txt \
--output_file evaluation_results.txt
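
The article does not say which metrics the evaluation script reports. A common threshold-free metric for detectors is AUROC, the probability that a randomly chosen machine text receives a higher score than a randomly chosen human text. The sketch below computes it directly from labels and scores; the function name auroc and the convention that label 1 means machine text and that higher scores mean "more likely machine" are assumptions made for this example.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for machine text, 0 for human text (assumed convention).
    scores: detector outputs, higher meaning "more likely machine".
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # machine text correctly ranked higher
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for small test sets; a rank-based implementation would be preferable for large ones.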

5. Text detection

Detect a single text:

python detect.py \
--model_path path/to/saved/model \
--input_text "text to be detected" \
--output_format json

Detect multiple texts in batch:

python batch_detect.py \
--model_path path/to/saved/model \
--input_file input.txt \
--output_file results.json

6. Advanced functions

6.1 Model fine-tuning

If you need to optimize for domain-specific text, you can fine-tune the model using your own dataset:

python finetune.py \
--pretrained_model_path path/to/pretrained/model \
--train_data path/to/domain/data \
--output_dir path/to/finetuned/model

6.2 Visualization analysis

Use the built-in visualization tools to analyze the detection results:

python visualize.py \
--results_file path/to/results.json \
--output_dir path/to/visualizations

6.3 API Service Deployment

Deploy the model as a REST API service:

python serve.py \
--model_path path/to/saved/model \
--host 0.0.0.0 \
--port 8000

7. Caveats

Use a GPU for model training to improve efficiency
The quality of the training data has a significant impact on model performance
Update the model regularly to keep up with the characteristics of new AI-generated text
Track model versions when deploying in production environments
Save the detection results for subsequent analysis and model optimization

8. Frequently asked questions

Q: What languages does the model support?
A: Currently, the model mainly supports English. Other languages require training with corresponding datasets.

Q: How can I improve detection accuracy?
A: Accuracy can be improved by adding training data, tuning model parameters, and fine-tuning on domain-specific data.

Q: How can detection speed be optimized?
A: Detection speed can be improved by batch processing, model quantization, and using GPU acceleration.
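
As an illustration of the batch-processing advice, the sketch below groups texts into fixed-size batches before scoring them. It is a generic pattern, not ImBD's implementation: batched and detect_in_batches are hypothetical helpers, and score_batch stands in for whatever function scores a list of texts in one model call.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def detect_in_batches(texts, score_batch, batch_size=32):
    """Score texts batch by batch and return the scores in input order.

    score_batch is a placeholder callable: it takes a list of texts and
    returns one score per text.
    """
    scores = []
    for batch in batched(texts, batch_size):
        scores.extend(score_batch(batch))
    return scores
```

Scoring a whole batch per model call amortizes per-call overhead, which is where most of the speedup on a GPU comes from. The batch size of 32 mirrors the training command above and would need tuning to the available memory.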