
A Chinese distillation dataset built from the full-power DeepSeek-R1, providing a Chinese R1 distillation SFT dataset

General Introduction

The Chinese DeepSeek-R1 distillation dataset is an open-source Chinese dataset containing 110K samples, designed to support machine learning and natural language processing research. Released by the Liu Cong NLP team, the dataset contains not only math data but also a large amount of general-purpose data, such as logical reasoning, Xiaohongshu, Zhihu, and more. The distillation process strictly follows the details provided officially by DeepSeek-R1 to ensure the high quality and diversity of the data. Users can download and use the dataset for free on the Hugging Face and ModelScope platforms.



 

Function List

  • Multiple data types: Includes math, logical reasoning, general-purpose data, and more.
  • High-quality data: Distilled in strict accordance with the details provided officially by DeepSeek-R1.
  • Free and open source: Can be downloaded at no cost from the Hugging Face and ModelScope platforms.
  • Supports multiple applications: Applicable to research areas such as machine learning and natural language processing.
  • Detailed data distribution: Provides detailed category and quantity information for the data (see the sketch after this list).
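One quick way to see that distribution yourself is to count samples per category after loading the dataset. The sketch below is a minimal example; it assumes the same category column that the filtering examples later on this page use, so adjust the field name if your downloaded copy differs.

    from collections import Counter
    from datasets import load_dataset

    dataset = load_dataset("Congliu/Chinese-DeepSeek-R1-Distill-data-110k")
    # Count how many samples fall into each category (math, logic, general, ...)
    counts = Counter(dataset['train']['category'])
    for category, n in counts.most_common():
        print(f"{category}: {n}")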

 

Using Help

Installation process

  1. Visit the Hugging Face or ModelScope platforms.
  2. Search for "Chinese-DeepSeek-R1-Distill-data-110k".
  3. Click the download link and select the appropriate format to download (a programmatic alternative is sketched after this list).
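If you prefer to fetch the raw files programmatically rather than through the web page, the snippet below is one possible approach using the huggingface_hub client; downloading via ModelScope works similarly with its own SDK.

    from huggingface_hub import snapshot_download

    # Download the full dataset repository to a local cache directory
    local_dir = snapshot_download(
        repo_id="Congliu/Chinese-DeepSeek-R1-Distill-data-110k",
        repo_type="dataset",
    )
    print("Dataset files downloaded to:", local_dir)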

Usage

  1. Load the dataset: In a Python environment, load the dataset with the datasets library.

    from datasets import load_dataset

    # Load the 110K-sample Chinese R1 distillation dataset from the Hugging Face Hub
    dataset = load_dataset("Congliu/Chinese-DeepSeek-R1-Distill-data-110k")

  2. View the data: Use the dataset object to inspect basic information and samples from the dataset.

    # Print the dataset splits and column names, then the first training sample
    print(dataset)
    print(dataset['train'][0])

  3. Data preprocessing: Preprocess the data according to research needs, for example tokenization and deduplication.

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
    # Tokenize each sample; replace 'text' with the dataset's actual text column if it differs
    tokenized_data = dataset.map(lambda x: tokenizer(x['text'], padding='max_length', truncation=True))

  4. Model training: Train a model using the preprocessed data (an SFT-style alternative is sketched after this list).

    from transformers import BertForSequenceClassification, Trainer, TrainingArguments

    model = BertForSequenceClassification.from_pretrained('bert-base-chinese')
    training_args = TrainingArguments(output_dir='./results', num_train_epochs=3, per_device_train_batch_size=16)
    trainer = Trainer(model=model, args=training_args, train_dataset=tokenized_data['train'])
    trainer.train()
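Since this is an SFT dataset of prompt/response pairs, a common alternative to the classification setup above is supervised fine-tuning of a causal language model. The sketch below only shows how each sample could be turned into a single training string; the 'instruction' and 'output' column names and the base model are assumptions, so check the dataset card and swap in the fields and model you actually use.

    from transformers import AutoTokenizer

    # Hypothetical base model; any Chinese-capable causal LM could be used instead
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

    def build_sft_text(example):
        # 'instruction' and 'output' are assumed column names; verify them on the dataset card
        example['text'] = example['instruction'] + "\n" + example['output']
        return example

    sft_data = dataset['train'].map(build_sft_text)
    # Tokenize the concatenated prompt/response strings for causal-LM fine-tuning
    sft_tokenized = sft_data.map(lambda x: tokenizer(x['text'], truncation=True, max_length=1024))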

Featured Functions Operation Procedure

  1. Mathematical data processing: For math-type data, prepend the prompt "Please reason step by step, and put the final answer in \boxed{}".

    def add_math_prompt(example):
        # Prepend the official R1 math prompt (in Chinese) to each sample
        example['text'] = "请一步步推理,并把最终答案放到 \\boxed{}。" + example['text']
        return example

    # Keep only the math samples and add the prompt to each of them
    math_data = dataset.filter(lambda x: x['category'] == 'math').map(add_math_prompt)
  2. Logical reasoning data processing: Apply special handling to logical reasoning data to keep it logically consistent (the processed subsets can be merged and saved as sketched after this list).

    def process_logic_data(example):
        # Custom logic-processing code goes here
        return example

    # Keep only the logical-reasoning samples and apply the custom processing
    logic_data = dataset.filter(lambda x: x['category'] == 'logic').map(process_logic_data)
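After the per-category processing, the subsets can be merged and saved for later fine-tuning runs. The sketch below is a minimal example; the output path is arbitrary.

    from datasets import concatenate_datasets

    # Merge the processed math and logic subsets of the training split
    combined = concatenate_datasets([math_data['train'], logic_data['train']])
    # Persist the merged subset to disk for reuse in later experiments
    combined.save_to_disk("./processed_r1_distill_subset")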