
Easy Dataset: an easy tool for creating fine-tuned datasets for large models

General Introduction

Easy Dataset is an open-source tool designed specifically for creating fine-tuning datasets for large language models (LLMs), hosted on GitHub. It provides an easy-to-use interface that lets users upload files, automatically segment their content, generate questions and answers, and finally export a structured dataset suitable for fine-tuning. Developer Conard Li created the tool to help users turn domain knowledge into high-quality training data. It supports multiple export formats, such as JSON and Alpaca, and is compatible with any LLM API that follows the OpenAI format, so both technical experts and casual users can get started and build datasets quickly.
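For context, the Alpaca format mentioned above is a simple instruction-tuning layout that many fine-tuning pipelines accept; a single record looks like this (the values are illustrative, not actual output from the tool):

{
  "instruction": "What is Easy Dataset used for?",
  "input": "",
  "output": "It turns uploaded documents into question-answer pairs for fine-tuning large language models."
}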


Feature List

  • Intelligent document processing: After a Markdown file is uploaded, the tool automatically splits it into smaller segments.
  • Question generation: Automatically generates relevant questions from each text segment.
  • Answer generation: Calls an LLM API to generate a detailed answer for each question.
  • Flexible editing: Questions, answers, and dataset content can be modified at any stage.
  • Multiple export formats: Datasets can be exported as JSON, JSONL, or Alpaca.
  • Broad model support: Compatible with any LLM API that follows the OpenAI format.
  • User-friendly interface: An intuitive design suited to both technical and non-technical users.
  • Custom prompts: Users can add system prompts that steer the model toward a particular answer style.

 

Usage Guide

Installation Process

Easy Dataset can be used in two main ways: deployed via Docker, or run locally from source. The detailed steps for both follow.

Installation via Docker

  1. Install Docker
    If Docker is not already installed on your computer, download and install Docker Desktop. When the installation is complete, open a terminal and check that it succeeded:
docker --version

If a version number is displayed, Docker is installed correctly.

  2. Pull the image and run it
    Enter the following command in the terminal to pull the official image and start the service (a Docker Compose equivalent is sketched after these steps):
docker run -d -p 3000:3000 -v {your-local-path}:/app/local-db --name easy-dataset conardli17/easy-dataset:latest
  • Replace {your-local-path} with the path to a folder on your computer used to store the data, e.g. C:\data (Windows) or /home/user/data (Linux/Mac).
  • -p 3000:3000 maps port 3000 inside the container to port 3000 locally.
  • The -v mount keeps the data from being lost when the container is restarted.
  3. Access the interface
    After a successful startup, open your browser and go to http://localhost:3000. You will see the Easy Dataset homepage; click the "Create Project" button to get started.
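If you prefer Docker Compose, the docker run command above translates into a minimal docker-compose.yml like the following. This is a sketch, not a file shipped with the project: the service name is arbitrary, and the host path should be replaced just as in the command above.

services:
  easy-dataset:
    image: conardli17/easy-dataset:latest
    ports:
      - "3000:3000"
    volumes:
      - /home/user/data:/app/local-db

Start it with docker compose up -d from the directory containing the file.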

Running locally from source

  1. Prepare the environment
  • Make sure Node.js (version 18.x or higher) and npm are installed on your computer.
  • To check, enter node -v and npm -v in the terminal; if version numbers appear, you are ready.
  2. Clone the repository
    Enter in the terminal:
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
  3. Install dependencies
    Run inside the project folder:
npm install
  4. Start the service
    Enter the following commands to build and run:
npm run build
npm run start

Once done, open your browser and visit http://localhost:3000 to access the tool interface.

Main Feature Workflow

Create a project

  1. Once on the homepage, click on the "Create Project" button.
  2. Enter the name of the project, e.g. "My Dataset".
  3. Click "Confirm" and the system will create a new project space for you.

Uploading and processing documents

  1. On the project page, find the "Text Split" option.
  2. Click on "Upload File" and select a local Markdown file (e.g. example.md).
  3. After uploading, the tool automatically splits the file content into small segments. Each segment is displayed in the interface, and you can manually adjust the splitting results.

Generating questions and answers

  1. Go to the "Questions" or "Question Management" page.
  2. Click on the "Generate Questions" button and the tool will generate questions based on each text.
  3. Check the generated question and if you are not satisfied, you can change it by clicking on the Edit button next to the question.
  4. Click "Generate Answers", select an LLM API (you need to configure the API key in advance) and the tool will generate answers for each question.
  5. Once the answers are generated, you can manually edit them to make sure the content meets the requirements.

Exporting a dataset

  1. Go to the Datasets or Dataset Management screen.
  2. Click the "Export" button and choose the export format (e.g. JSON or Alpaca).
  3. The system will generate a file, click Download and save it locally.
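For reference, the difference between the export formats is mostly layout: a JSON export is a single array of records, while JSONL puts one record per line, which is convenient for streaming large files. An Alpaca-style JSONL export might look like this (field names follow the common Alpaca convention; the tool's exact schema may differ):

{"instruction": "Which port does Easy Dataset serve on by default?", "input": "", "output": "Port 3000."}
{"instruction": "What does the -v flag in the docker run command do?", "input": "", "output": "It mounts a local folder into the container so the data survives restarts."}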

Feature Highlights

Configuring the LLM API

  1. On the Settings page, find Model Configuration.
  2. Enter your LLM API key (e.g. OpenAI's API Key).
  3. Select the model type (many common models are supported) and save the configuration.
  4. Once configured, this model is called whenever answers are generated; a sketch of what an OpenAI-format request looks like follows below.
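To make "OpenAI format" concrete: any provider that exposes a chat-completions endpoint shaped like OpenAI's should work. A request in that format looks like the following (the endpoint, model name, and key are placeholders for whatever provider you configure):

curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $YOUR_API_KEY" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Generate a detailed answer for this question."}]
  }'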

Custom System Prompts

  1. On the Settings page, find Prompts or Prompt Templates.
  2. Enter customized prompts, such as "Please answer the question in simple language".
  3. After saving, generated answers follow the style set by your prompt, as illustrated below.
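As an illustration (an assumption about how such prompts are typically applied, not a statement about Easy Dataset's internals), a system prompt in an OpenAI-format request sits ahead of the user message and shapes every answer generated from it:

{
  "messages": [
    {"role": "system", "content": "Please answer the question in simple language."},
    {"role": "user", "content": "What is model fine-tuning?"}
  ]
}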

Dataset Optimization

  1. On the "Datasets" page, click the "Optimize" button.
  2. The system analyzes the dataset and removes duplicates or optimizes the formatting.
  3. The optimized dataset is better suited for direct use in model fine-tuning; a quick duplicate check you can run yourself is shown below.
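If you want to double-check an exported JSONL file for exact duplicate records yourself, a one-line shell check is enough (dataset.jsonl is a placeholder filename):

sort dataset.jsonl | uniq -d | head

uniq -d prints only lines that occur more than once in the sorted input, so any output means duplicates remain.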

Caveats

  • If you deploy with Docker, remember to back up the data in {your-local-path} regularly (a sample backup command follows this list).
  • When running locally, make sure you have network access, since generating answers requires calling an external API.
  • If you encounter an error, check the "Releases" page on GitHub and download the latest version, which may fix the problem.
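A simple backup command, using the example Linux path from the Docker section (adjust the path to your own):

tar -czf easy-dataset-backup-$(date +%F).tar.gz /home/user/data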

 

Application Scenarios

  1. Model developers fine-tuning LLMs
    Developers can use Easy Dataset to process technical documents, generate Q&A pairs, and quickly produce training sets that improve model performance in specific domains.
  2. Educators producing learning materials
    Teachers can upload course handouts and generate questions and answers for student review or for building online course content.
  3. Researchers organizing domain knowledge
    Researchers can upload papers or reports, extract key questions and answers, and organize them into structured data for analysis.

 

FAQ

  1. What file formats does Easy Dataset support?
    Currently Markdown files (.md) are the main supported format; support for other formats may be added in the future.
  2. Do I need to provide my own LLM API?
    Yes. The tool itself does not provide LLM services; users must configure their own API key, for OpenAI or another compatible provider.
  3. What models can the exported dataset be used for?
    The exported dataset can be used directly with any model whose fine-tuning pipeline accepts these formats (e.g. LLaMA, GPT, etc.).