WebShaper: Alibaba Tongyi Lab's open-source AI training data synthesis system


What is WebShaper

WebShaper is an AI training data synthesis system released by Alibaba's Tongyi Lab. It generates high-quality, scalable training data through formal modeling and an agentic expansion mechanism, helping AI agents improve at complex information retrieval. The system introduces the concept of "knowledge projection", which builds complex question structures from set operations and gives precise control over task complexity. WebShaper combines supervised fine-tuning with reinforcement learning so that trained models perform well on complex tasks in scenarios such as literature organization, market research, intelligent learning assistants, life decision-making, and medical information queries.

WebShaper's main features

Formal modeling: The set-theoretic "knowledge projection" technique decomposes a complex information retrieval task into set operations (e.g., intersection and union), giving precise control over the reasoning path and task complexity and making the problem structure clearer.
Agentic expansion mechanism: The Expander agent starts from simple "seed questions" and expands them step by step into complex reasoning tasks, using search, summarization, and verification tools to keep the question logic clear and the task difficulty controllable.
High-quality data generation: The generated training data is controllable, interpretable, and scalable. It moves beyond the limits of traditional pre-retrieved data, reducing errors and redundant information and improving data quality.
Agent training strategy: Supervised fine-tuning (SFT) is combined with reinforcement learning (e.g., the GRPO algorithm). High-quality training trajectories and a reward mechanism guide the model to perform multi-step reasoning rather than taking "shortcuts" or guessing answers, which improves its performance on complex tasks.
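
The set-operation view of knowledge projection can be illustrated with a minimal Python sketch. It assumes a projection can be modeled as the set of entities linked to a target by a given relation. The fact store, the entity names, and the project helper are all hypothetical and are not taken from the WebShaper code.

```python
from typing import Iterable, Set, Tuple

# Toy fact store of (subject, relation, object) triples.
# Every entity here is made up for illustration.
Triple = Tuple[str, str, str]
FACTS: Set[Triple] = {
    ("alice", "played_for", "team_x"),
    ("bob", "played_for", "team_x"),
    ("carol", "played_for", "team_y"),
    ("alice", "born_in", "1990"),
    ("carol", "born_in", "1990"),
    ("bob", "born_in", "1992"),
}


def project(relation: str, targets: Iterable[str]) -> Set[str]:
    """Return every subject linked by `relation` to any of `targets`."""
    wanted = set(targets)
    return {s for (s, r, o) in FACTS if r == relation and o in wanted}


# "Who played for team_x AND was born in 1990?" is an intersection.
both = project("played_for", ["team_x"]) & project("born_in", ["1990"])

# "Who played for team_x OR team_y?" is a union.
either = project("played_for", ["team_x"]) | project("played_for", ["team_y"])

print(sorted(both))    # ['alice']
print(sorted(either))  # ['alice', 'bob', 'carol']
```

Each extra projection added to an intersection narrows the answer set and adds one more reasoning step, which is one way to picture how composing set operations controls task difficulty.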

WebShaper project links

GitHub repository: https://github.com/Alibaba-NLP/WebAgent
Hugging Face dataset: https://huggingface.co/datasets/Alibaba-NLP/WebShaper
arXiv technical paper: https://arxiv.org/pdf/2507.15061

How to use WebShaper

Access project resources
- GitHub repository: Visit the WebShaper GitHub repository, which provides code, documentation, and sample data.
- Hugging Face dataset: Visit the WebShaper dataset on Hugging Face to download and use the generated training data directly.
Prepare the environment
- Install dependencies: Install the required Python packages from the requirements.txt file in the GitHub repository.

pip install -r requirements.txt

- Set environment variables: If you use external tools (e.g., search engines or APIs), make sure the relevant environment variables are configured correctly.
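
A quick way to confirm the environment is ready is to check for the variables before running anything. The sketch below uses only the Python standard library. The variable names SEARCH_API_KEY and LLM_API_KEY are placeholders invented for illustration, so substitute whatever names your tools actually expect.

```python
import os
from typing import Iterable, List, Mapping

# Hypothetical variable names; replace with the ones your tools require.
REQUIRED_VARS = ("SEARCH_API_KEY", "LLM_API_KEY")


def missing_env_vars(
    names: Iterable[str], env: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the names that are unset or set to an empty string."""
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```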
Run WebShaper
- Run the Expander agent: Start from simple "seed questions" and expand them into complex questions.

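# Module and class names are as given in this article; check the repository for the actual interface.
from webshaper.expander import Expander

# Initialize the Expander agent
expander = Expander()

# Define the seed question
seed_question = "Which team won the 2020 NBA championship?"

# Expand the question step by step
expanded_question = expander.expand(seed_question)
print(expanded_question)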

- Generate training data: Generate high-quality training data through the expansion mechanism.

from webshaper.data_generator import DataGenerator

# Initialize the data generator
data_generator = DataGenerator()

# Generate training data
training_data = data_generator.generate(expanded_question)
print(training_data)

- Train the model: Combine supervised fine-tuning (SFT) and reinforcement learning (e.g., GRPO) to train the AI model.

from webshaper.trainer import Trainer

# Initialize the trainer
trainer = Trainer()

# Train the model
model = trainer.train(training_data)
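
For readers unfamiliar with GRPO, its central idea is to score each sampled answer relative to the other answers sampled for the same question, rather than against a learned value function. The sketch below shows only that group-relative normalization step. The function name and the example rewards are illustrative assumptions, it uses the population standard deviation (implementations differ on this choice), and it is not taken from WebShaper's training code.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-8
) -> List[float]:
    """Normalize rewards within one group of answers to the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every answer gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one question: 1.0 = correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # roughly [1.0, -1.0, -1.0, 1.0]
```

Answers that beat the group average receive a positive advantage and are reinforced, while below-average answers are pushed down. If every answer in a group earns the same reward, all advantages are zero and that group contributes no learning signal.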

WebShaper's core strengths

High-quality data generation: The generated training data is highly controllable, interpretable, and scalable. Complex problem structures can be constructed precisely, reducing errors and redundant information.
Innovative formal modeling: Built on the set-theoretic concept of "knowledge projection", WebShaper decomposes complex tasks into set operations, precisely controlling task complexity and making the problem structure clearer.
Agentic expansion mechanism: The Expander agent starts from simple "seed questions" and expands them into complex tasks, keeping question generation logically consistent and task difficulty under control.
Effective training strategy: Training combines supervised fine-tuning (SFT) with reinforcement learning (GRPO), using a reward mechanism to guide the model through multi-step reasoning, discourage "shortcuts", and improve reasoning ability.
Wide range of application scenarios: Applicable to scenarios such as literature organization, market research, intelligent learning assistants, life decision-making, and medical information queries, providing personalized information support.

Who WebShaper is for

AI researchers: Generate high-quality training data, improve model performance on complex reasoning tasks, and support cutting-edge research.
Data scientists: Generate and optimize training data efficiently, reduce data labeling and cleaning effort, and improve model performance.
Natural language processing (NLP) developers: Generate complex natural language tasks, improve a model's handling of multi-hop reasoning and complex logic, and build systems such as intelligent Q&A.
Business analysts: Quickly collect and organize industry data, automatically generate market research tasks, and support decision-making.
Educators: Generate personalized learning tasks, support students in deep, research-based learning, and develop intelligent learning assistants.