
Easy Dataset: an easy tool for creating fine-tuned datasets for large models

General Introduction

Easy Dataset is an open-source tool designed specifically for creating fine-tuning datasets for large language models (LLMs), hosted on GitHub. It provides an easy-to-use interface that lets users upload files, automatically segment their content, generate questions and answers, and finally export a structured dataset suitable for fine-tuning. Developer Conard Li created the tool to help users turn domain knowledge into high-quality training data. It supports multiple export formats, such as JSON and Alpaca, and is compatible with any LLM API that follows the OpenAI format, so both technical experts and casual users can get started and build datasets quickly.
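For context, the Alpaca format mentioned above is a simple instruction-tuning layout that many fine-tuning pipelines accept; a single record looks like this (the values are illustrative, not actual output from the tool):

{
  "instruction": "What is Easy Dataset used for?",
  "input": "",
  "output": "It turns uploaded documents into question-answer pairs for fine-tuning large language models."
}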


Feature List

  • Intelligent document processing: After a Markdown file is uploaded, the tool automatically splits it into smaller segments.
  • Question generation: Automatically generates relevant questions from each text segment.
  • Answer generation: Calls an LLM API to generate a detailed answer for each question.
  • Flexible editing: Questions, answers, and dataset content can be modified at any stage.
  • Multiple export formats: Datasets can be exported as JSON, JSONL, or Alpaca.
  • Broad model support: Compatible with any LLM API that follows the OpenAI format.
  • User-friendly interface: An intuitive design suited to both technical and non-technical users.
  • Custom prompts: Users can add system prompts that steer the model toward a particular answer style.

 

Usage Guide

Installation Process

Easy Dataset can be used in two main ways: deployed via Docker, or run locally from source. The detailed steps for both follow.

Installation via Docker

  1. Install Docker
    If Docker is not already installed on your computer, download and install Docker Desktop. When the installation is complete, open a terminal and check that it succeeded:
docker --version

If a version number is displayed, Docker is installed correctly.

  2. Pull the image and run it
    Enter the following command in the terminal to pull the official image and start the service (a Docker Compose equivalent is sketched after these steps):
docker run -d -p 3000:3000 -v {your-local-path}:/app/local-db --name easy-dataset conardli17/easy-dataset:latest
  • Replace {your-local-path} with the path to a folder on your computer used to store the data, e.g. C:\data (Windows) or /home/user/data (Linux/Mac).
  • -p 3000:3000 maps port 3000 inside the container to port 3000 locally.
  • The -v mount keeps the data from being lost when the container is restarted.
  3. Access the interface
    After a successful startup, open your browser and go to http://localhost:3000. You will see the Easy Dataset homepage; click the "Create Project" button to get started.
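If you prefer Docker Compose, the docker run command above translates into a minimal docker-compose.yml like the following. This is a sketch, not a file shipped with the project: the service name is arbitrary, and the host path should be replaced just as in the command above.

services:
  easy-dataset:
    image: conardli17/easy-dataset:latest
    ports:
      - "3000:3000"
    volumes:
      - /home/user/data:/app/local-db

Start it with docker compose up -d from the directory containing the file.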

Running locally from source

  1. Prepare the environment
  • Make sure Node.js (version 18.x or higher) and npm are installed on your computer.
  • To check, enter node -v and npm -v in the terminal; if version numbers appear, you are ready.
  2. Clone the repository
    Enter in the terminal:
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
  3. Install dependencies
    Run inside the project folder:
npm install
  4. Start the service
    Enter the following commands to build and run:
npm run build
npm run start

Once done, open your browser and visit http://localhost:3000 to access the tool interface.

Main Feature Workflow

Create a project

  1. Once on the homepage, click on the "Create Project" button.
  2. Enter the name of the project, e.g. "My Dataset".
  3. Click "Confirm" and the system will create a new project space for you.

Uploading and processing documents

  1. On the project page, find the "Text Split" option.
  2. Click on "Upload File" and select a local Markdown file (e.g. example.md).
  3. After uploading, the tool automatically splits the file content into small segments. Each segment is displayed in the interface, and you can manually adjust the splitting results.

Generating questions and answers

  1. Go to the "Questions" or "Question Management" page.
  2. Click on the "Generate Questions" button and the tool will generate questions based on each text.
  3. Check the generated question and if you are not satisfied, you can change it by clicking on the Edit button next to the question.
  4. Click "Generate Answers", select an LLM API (you need to configure the API key in advance) and the tool will generate answers for each question.
  5. Once the answers are generated, you can manually edit them to make sure the content meets the requirements.

Exporting a dataset

  1. Go to the Datasets or Dataset Management screen.
  2. Click the "Export" button and choose the export format (e.g. JSON or Alpaca).
  3. The system will generate a file, click Download and save it locally.
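For reference, the difference between the export formats is mostly layout: a JSON export is a single array of records, while JSONL puts one record per line, which is convenient for streaming large files. An Alpaca-style JSONL export might look like this (field names follow the common Alpaca convention; the tool's exact schema may differ):

{"instruction": "Which port does Easy Dataset serve on by default?", "input": "", "output": "Port 3000."}
{"instruction": "What does the -v flag in the docker run command do?", "input": "", "output": "It mounts a local folder into the container so the data survives restarts."}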

Feature Highlights

Configuring the LLM API

  1. On the Settings page, find Model Configuration.
  2. Enter your LLM API key (e.g. OpenAI's API Key).
  3. Select the model type (many common models are supported) and save the configuration.
  4. Once configured, this model is called whenever answers are generated; a sketch of what an OpenAI-format request looks like follows below.
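To make "OpenAI format" concrete: any provider that exposes a chat-completions endpoint shaped like OpenAI's should work. A request in that format looks like the following (the endpoint, model name, and key are placeholders for whatever provider you configure):

curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $YOUR_API_KEY" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Generate a detailed answer for this question."}]
  }'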

Custom System Prompts

  1. On the Settings page, find Prompts or Prompt Templates.
  2. Enter customized prompts, such as "Please answer the question in simple language".
  3. After saving, generated answers follow the style set by your prompt, as illustrated below.
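As an illustration (an assumption about how such prompts are typically applied, not a statement about Easy Dataset's internals), a system prompt in an OpenAI-format request sits ahead of the user message and shapes every answer generated from it:

{
  "messages": [
    {"role": "system", "content": "Please answer the question in simple language."},
    {"role": "user", "content": "What is model fine-tuning?"}
  ]
}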

Dataset Optimization

  1. On the "Datasets" page, click the "Optimize" button.
  2. The system analyzes the dataset and removes duplicates or optimizes the formatting.
  3. The optimized dataset is better suited for direct use in model fine-tuning; a quick duplicate check you can run yourself is shown below.
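If you want to double-check an exported JSONL file for exact duplicate records yourself, a one-line shell check is enough (dataset.jsonl is a placeholder filename):

sort dataset.jsonl | uniq -d | head

uniq -d prints only lines that occur more than once in the sorted input, so any output means duplicates remain.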

Caveats

  • If you deploy with Docker, remember to back up the data in {your-local-path} regularly (a sample backup command follows this list).
  • When running locally, make sure you have network access, since generating answers requires calling an external API.
  • If you encounter an error, check the "Releases" page on GitHub and download the latest version, which may fix the problem.
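A simple backup command, using the example Linux path from the Docker section (adjust the path to your own):

tar -czf easy-dataset-backup-$(date +%F).tar.gz /home/user/data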

 

Application Scenarios

  1. Model developers fine-tuning LLMs
    Developers can use Easy Dataset to process technical documents, generate Q&A pairs, and quickly produce training sets that improve model performance in specific domains.
  2. Educators producing learning materials
    Teachers can upload course handouts and generate questions and answers for student review or for building online course content.
  3. Researchers organizing domain knowledge
    Researchers can upload papers or reports, extract key questions and answers, and organize them into structured data for analysis.

 

FAQ

  1. What file formats does Easy Dataset support?
    Currently Markdown files (.md) are the main supported format; support for other formats may be added in the future.
  2. Do I need to provide my own LLM API?
    Yes. The tool itself does not provide LLM services; users must configure their own API key, for OpenAI or another compatible provider.
  3. What models can the exported dataset be used for?
    The exported dataset can be used directly with any model whose fine-tuning pipeline accepts these formats (e.g. LLaMA, GPT, etc.).