General Introduction
WritingBench is an open source project developed by the X-PLUG team and hosted on GitHub. It is a benchmark designed specifically to test the writing ability of large models, providing 1,239 real-world writing tasks. These tasks cover 6 major domains and 100 subdomains, combine style, format, and length requirements, and average 1,546 words per task. The project builds tasks through a combination of model generation and manual refinement to ensure variety and practical value. Each task comes with 5 task-specific scoring criteria, which can be scored either by a large model or by a dedicated judging model. WritingBench's code and data are freely available, making it suitable for developers who want to evaluate and improve the writing capabilities of large models. Note that the project does not provide a requirements.txt file; users need to configure their own environment.
Feature List
- Offers 1,239 authentic writing tasks across six domains: academia, business, law, literature, education, and marketing.
- Covers 100 subdomains, so the tasks stay close to real-world needs.
- Generates 5 dynamic scoring criteria for each task to assess writing quality.
- Supports scoring either with a general-purpose large model or with a dedicated judging (critic) model.
- Includes diverse reference materials such as financial statements or legal templates.
- Open source code, datasets and evaluation scripts are provided and can be freely downloaded and modified by the user.
Usage Guide
WritingBench is an open source project hosted on GitHub; users can visit https://github.com/X-PLUG/WritingBench for all resources. It does not require an online service: just download it and run it locally. The following is a detailed step-by-step guide to its use and functionality:
Access to project resources
- Open your browser and go to https://github.com/X-PLUG/WritingBench.
- Click the green "Code" button in the upper right corner and select "Download ZIP" to download it, or clone it with the Git command:
git clone https://github.com/X-PLUG/WritingBench.git
- Unzip the file locally; the folder contains the code, data, and documentation.
Preparing the runtime environment
WritingBench does not provide a requirements.txt file, so you need to set up the Python environment and install the dependency libraries manually. The steps are as follows:
- Ensure that Python 3.8 or later is installed by running the following in a terminal:
python --version
- Go to the project folder:
cd WritingBench
- Install the basic dependency libraries. Not all dependencies are officially listed, but the following are presumed to be required based on the project's functionality:
pip install torch
(for the judging model; may require GPU support)
pip install transformers
(for running large models)
pip install requests
(may be used for data processing)
- Other libraries can be installed as needed based on error messages.
- If you use the dedicated judging model, you also need PyTorch with CUDA support; refer to https://pytorch.org/get-started/locally/ for the matching versions.
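Before running the judging model, you can confirm that PyTorch and CUDA are set up correctly with a quick check like the following (a minimal sketch; it only verifies that PyTorch can see a GPU):
import torch
# Print the installed PyTorch version and whether a CUDA-capable GPU is visible.
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())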
Description of the project structure
The directory structure after downloading is as follows:
- evaluate_benchmark.py: evaluation script.
- prompt.py: prompt templates.
- evaluator/: evaluation interface directory.
  - critic.py: interface for the dedicated judging model.
  - llm.py: interface for large-model-based evaluation.
- benchmark_query/: task data directory.
  - benchmark_all.jsonl: the complete dataset of 1,239 tasks.
  - requirement/: task subsets categorized by style, format, and length.
Using Writing Task Data
- Open benchmark_query/benchmark_all.jsonl to view the 1,239 tasks (a loading sketch follows this list).
- Each task includes a description, domain fields, and reference materials. For example: "Write a 500-word summary of the 2023 Q3 financial report."
- Generate answers with your own large model; sample code:
# "your_model" is a placeholder for your own model wrapper.
from your_model import Model

task = "Write a 500-word summary of the 2023 Q3 financial report"
model = Model()
response = model.generate(task)

# Save the generated answer for later evaluation.
with open("response.txt", "w", encoding="utf-8") as f:
    f.write(response)
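To iterate over all 1,239 tasks instead of a single hard-coded prompt, the task file can be loaded as JSON Lines (one task per line). The following minimal sketch loads it and prints the available fields of the first record, without assuming any particular key names:
import json

tasks = []
with open("benchmark_query/benchmark_all.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:
            tasks.append(json.loads(line))

print("Loaded", len(tasks), "tasks")
# Inspect the keys of the first task to see which fields are available.
print(sorted(tasks[0].keys()))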
Running the evaluation tools
WritingBench supports two types of evaluation:
Large model scoring
- Edit evaluator/llm.py and add your API configuration, for example:
self.api_key = "your_api_key_here"
self.url = "Your API endpoint"
self.model = "Your model name"
- Run the evaluation script:
python evaluate_benchmark.py --evaluator llm --query_criteria_file benchmark_query/benchmark_all.jsonl --input_file response.txt --output_file scores.jsonl
- The output contains a score and a rationale for each of the 5 scoring criteria.
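The exact output schema depends on the evaluator implementation. Assuming each line of scores.jsonl is a JSON object with a numeric "score" field (a hypothetical field name, not confirmed by the project), an overall average could be computed with a sketch like this:
import json

scores = []
with open("scores.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        # "score" is an assumed field name; adjust to the actual output schema.
        if "score" in record:
            scores.append(float(record["score"]))

if scores:
    print("Average score:", sum(scores) / len(scores))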
Specialized judging model scoring
- Download the judging model from https://huggingface.co/AQuarterMile/WritingBench-Critic-Model-Qwen-7B.
- Place the model in a local path and edit evaluator/critic.py:
self.model = LLM(model="path/to/critic_model", tensor_parallel_size=1)
- Run the evaluation:
python evaluate_benchmark.py --evaluator critic --query_criteria_file benchmark_query/benchmark_all.jsonl --input_file response.txt --output_file scores.jsonl
- The output shows the score (0-10) for each criterion.
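If you prefer to fetch the judging model from a script rather than the Hugging Face web page, the huggingface_hub library can download it (a sketch assuming huggingface_hub is installed; the local directory name is arbitrary):
from huggingface_hub import snapshot_download

# Download the critic model repository into a local folder.
local_path = snapshot_download(
    repo_id="AQuarterMile/WritingBench-Critic-Model-Qwen-7B",
    local_dir="critic_model",
)
print("Model downloaded to:", local_path)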
Customized tasks and scoring
- Add a new JSON file with the task description and materials under benchmark_query/ (see the sketch after this list).
- Modify prompt.py or the evaluation scripts to adjust the scoring criteria.
- After testing, you can upload your changes to GitHub and share them.
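As an illustration of the first step, the following sketch appends a custom task to a new JSONL file; the field names ("query", "domain", "materials") are assumptions and should be matched to the format used in benchmark_all.jsonl:
import json

# Hypothetical custom task; align the keys with those in benchmark_all.jsonl.
custom_task = {
    "query": "Write a 300-word product announcement for a new smartwatch",
    "domain": "marketing",
    "materials": [],
}

with open("benchmark_query/custom_tasks.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(custom_task, ensure_ascii=False) + "\n")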
Data generation process
Tasks are generated in the following ways:
- The large model generates initial tasks from 6 major domains and 100 subdomains.
- Tasks are refined with style, format, and length requirements.
- 30 annotators collect open-source reference materials.
- 5 experts screen the tasks and materials to ensure relevance.
These steps help users get up to speed quickly with WritingBench, testing and optimizing large model writing capabilities.
Application Scenarios
- Model development: developers use WritingBench to test a model's performance on tasks such as academic papers or advertising copy and to identify weaknesses.
- Educational research: researchers analyze how well large models generate instructional materials or grade essays.
- Writing assistance: users draw on the task data for inspiration or check the quality of their articles with the scoring tools.
FAQ
- Why is there no requirements.txt file?
It is not officially provided, probably to give users the flexibility to configure dependencies according to their own model and environment.
- Do I need an internet connection?
Not to run the benchmark itself; you can download it and run it locally, but an internet connection is required to download models or dependencies.
- How do I obtain the judging model?
Download it from https://huggingface.co/AQuarterMile/WritingBench-Critic-Model-Qwen-7B.