Background
A recent paper titled "Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning" (arxiv.org/pdf/2503.09516) has attracted a lot of attention. The paper proposes a new way to train Large Language Models (LLMs) with reinforcement learning so that they can reason and leverage search engines. Notably, some of the ideas in the paper coincide with the exploration the Qwen team has done on the QwQ-32B model.
Alibaba's recently released QwQ-32B (qwenlm.github.io/zh/blog/qwq-32b/) integrates Agent-related capabilities into the reasoning model. These capabilities allow the model to think critically while using tools and to adjust its reasoning process based on feedback from the environment. In the `added_tokens.json` file of the QwQ-32B model folder, you can see the special tokens added for tool calls and tool responses:
```json
{
  "</think>": 151668,
  "</tool_call>": 151658,
  "</tool_response>": 151666,
  "<think>": 151667,
  "<tool_call>": 151657,
  "<tool_response>": 151665
}
```
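As a quick check, these IDs can be verified by loading the tokenizer from Hugging Face (a minimal sketch; it assumes the transformers library is installed and the `Qwen/QwQ-32B` repository is accessible):

```python
from transformers import AutoTokenizer

# Load the QwQ-32B tokenizer and print the IDs of the added special tokens.
tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B")
for tok in ["<think>", "</think>", "<tool_call>", "</tool_call>",
            "<tool_response>", "</tool_response>"]:
    print(tok, tokenizer.convert_tokens_to_ids(tok))
```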
In this article, we use Agentic RAG as an example to demonstrate the QwQ-32B model's tool-calling capabilities.
Agentic RAG vs. Traditional RAG
To better understand the benefits of Agentic RAG, we first need to distinguish it from the currently prevalent RAG paradigm:
- Traditional RAG: the vast majority of RAG projects today are essentially workflows, i.e., systems that orchestrate LLMs and tools through predefined code paths. These manually pre-defined, hard-coded workflows consist of many interrelated but fragile parts, such as routing, chunking, reranking, query interpretation, query expansion, source contextualization, and search engineering.
- Drawbacks: it is difficult for a hand-crafted workflow to cover every corner case, and its effectiveness is especially limited in complex scenarios that require multiple rounds of retrieval.
- Agentic RAG: an end-to-end approach that simplifies the process. Simply equip the model with a web-search API tool (the Tavily API is used in this article, which offers a certain amount of free credits), and the model does all the remaining work autonomously, including but not limited to:
  - Intent understanding (determining whether a web search is needed)
  - Rewriting or splitting the question
  - API calls
  - Process orchestration (including whether and how to conduct multi-step searches)
  - Citing sources
  - ...
Simply put, the core concept of Agentic RAG is: less structure, more intelligence; Less is More.
This is precisely Anthropic's definition of an Agent: similar to Deep Search, Agents must perform the target task internally; they "dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks".
Overall Process
The following figure illustrates the overall flow of Agentic RAG:
- Fill the user's question into the prompt template.
- Call the model to generate new tokens. If the generated text contains no `<tool_call> ... </tool_call>`, output the result directly.
- If the generated text contains `<tool_call> ... </tool_call>`, the model has initiated a tool-call request during its reasoning. Parse the request, execute `web_search`, wrap the result of the API call in `<tool_response> ... </tool_response>` format, splice it into the LLM's context, and request another round of generation.
- Repeat the above steps until no more `<tool_call>` appears (or the request limit is reached), or until `</think>` appears. A minimal sketch of this loop is given below.
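To make the flow concrete, here is a minimal sketch of the loop described above. The `generate_once` and `web_search` callables are this sketch's own stand-ins (for a `model.generate` call with the custom stopping criteria, and the Tavily wrapper shown later), not the article's exact code:

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>\s*$", re.DOTALL)

def agentic_rag(prompt, generate_once, web_search, max_search_times=5):
    context = prompt
    for _ in range(max_search_times):
        # Generation stops at </tool_call> or </think> via the stopping criteria.
        output = generate_once(context)
        context += output
        match = TOOL_CALL_RE.search(output)
        if match is None:
            return output  # no tool call: this is the final answer
        # Parse the request, e.g. {"action": "web_search", "action_input": {...}}
        call = json.loads(match.group(1))
        result = web_search(call["action_input"]["query"])
        # Wrap the API result in <tool_response> ... </tool_response> and splice it in.
        context += f"\n<tool_response>\n{result}\n</tool_response>\n"
    return generate_once(context)  # search budget exhausted: ask for a final answer
```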
The process is essentially the same as that described in the Search-R1 paper:
Key Technical Points
- Prompt template:
```python
from datetime import date  # needed for date.today() below

user_question = input('Please enter your question: ')
max_search_times = 5
prompt = f"""You are Qwen QwQ, a curious AI built for retrieval augmented generation.
You are at 2025 and current date is {date.today()}.
You have access to the web_search tool to retrieve relevant information to help answer user questions.
You can use web_search tool up to {max_search_times} times to answer a user's question, but try to be efficient and use as few as possible.
Below are some guidelines:
- Use web_search for general internet queries, like finding current events or factual information.
- Always provide a final answer in a clear and concise manner, with citations for any information obtained from the internet.
- If you think you need to use a tool, format your response as a tool call with the `action` and `action_input` within <tool_call>...</tool_call>, like this:\n\n<tool_call>\n{{ "action": "web_search", "action_input": {{ "query": "current stock price of Tesla" }} }}\n</tool_call>\n.
- After using a tool, continue your reasoning based on the web_search result in <tool_response>...</tool_response>.
- Remember that if you need multi-turn web_search to find relevant information, make sure you conduct all search tasks before you provide a final answer.
---
User Question: {user_question}"""
```
- Custom stopping criteria:
During autoregressive generation, generation stops as soon as the model's output is detected to match the `<tool_call>(.*?)</tool_call>\s*$` pattern (a regular-expression match):
```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    StoppingCriteria,
    StoppingCriteriaList,
)
import torch
import re

tool_call_regex = r"<tool_call>(.*?)</tool_call>\s*$"
end_regex = r"</think>\s*$"
# Monitor both patterns: <tool_call> ... </tool_call> and </think>

class RegexStoppingCriteria(StoppingCriteria):
    def __init__(self, tokenizer, patterns):
        self.patterns = patterns
        self.tokenizer = tokenizer

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        decoded_text = self.tokenizer.decode(input_ids[0])
        for pattern in self.patterns:
            if re.search(pattern, decoded_text, re.DOTALL):
                return True
        return False

stopping_criteria = StoppingCriteriaList([
    RegexStoppingCriteria(
        tokenizer,
        patterns=[tool_call_regex, end_regex],
    )
])

# model.generate(..., stopping_criteria=stopping_criteria)  # add stopping_criteria
```
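For completeness, here is a sketch of wiring these criteria into generation; `Qwen/QwQ-32B` is the official Hugging Face repository id, while the generation parameters below are illustrative assumptions, not the article's exact settings:

```python
model_id = "Qwen/QwQ-32B"  # official Hugging Face repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Build the stopping criteria with this tokenizer (see the class above).
stopping_criteria = StoppingCriteriaList(
    [RegexStoppingCriteria(tokenizer, patterns=[tool_call_regex, end_regex])]
)

messages = [{"role": "user", "content": prompt}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=4096, stopping_criteria=stopping_criteria)
# Keep special tokens so a <tool_call> ... </tool_call> block can be parsed out.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=False))
```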
- Web Search API:
The search API used in this article is the Tavily API, which offers a certain amount of free credits that make experimentation and replication easy. The Tavily API lets developers integrate web search functionality into their applications through simple API calls.
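Below is a minimal `web_search` wrapper, sketched with the official tavily-python client (`pip install tavily-python`); the exact result formatting here is this sketch's assumption, not the article's code:

```python
from tavily import TavilyClient

client = TavilyClient(api_key="YOUR_TAVILY_API_KEY")

def web_search(query: str, max_results: int = 5) -> str:
    """Run a Tavily search and format the hits so the model can cite sources."""
    response = client.search(query, max_results=max_results)
    return "\n\n".join(
        f"[{i + 1}] {r['title']} ({r['url']})\n{r['content']}"
        for i, r in enumerate(response["results"])
    )
```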
Practice Code
For detailed practice code, please refer to the following link:
Test Cases
Test question: Please give me more information about the QwQ-32B model recently open-sourced by Alibaba.
Generated results (see the notebook for the full output):
As can be seen from the results, the reasoning model autonomously performs intent understanding (determining whether a web search is required) and search keyword generation (question rewriting or splitting), and it also takes potential multi-round search scenarios into account. After triggering a web search, the model generated a final report based on the search results.
In this case, the model completed only one search API call. This may be because the test question is simple, or because the base model is not yet capable enough to trigger complex multi-round searches. It also shows that to fully realize the model's potential as an agent, post-training and targeted fine-tuning along the lines of Search-R1 are still necessary.
However, judging from the capabilities QwQ-32B already demonstrates, a post-training route that combines well-designed synthetic (or manually curated) training data with reinforcement learning or SFT for the target scenario, while masking out the loss on the tokens returned by the tool interface (the tool-response tokens), is expected to become the mainstream approach to developing and deploying agents. Through such post-training, various actions and corner cases can be accounted for in advance, making deployment simpler and eliminating the need for hand-orchestrated workflows. Section 3.1 of the Search-R1 paper describes this "Loss Masking for Retrieved Tokens" technique in detail: in PPO and GRPO, the retrieved tokens are loss-masked so that Search-R1 optimizes the LLM only on the tokens it generates itself, enhancing the model's ability to interact with search engines and perform reasoning.
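As a rough illustration of the idea (not the paper's implementation), such a mask can be built by zeroing out positions inside tool-response spans, using the token IDs from `added_tokens.json` above:

```python
import torch

TOOL_RESPONSE_START = 151665  # <tool_response>
TOOL_RESPONSE_END = 151666    # </tool_response>

def retrieved_token_mask(token_ids: list[int]) -> torch.Tensor:
    """Return a 0/1 mask over token_ids; 0 marks retrieved (tool-response) tokens."""
    mask = torch.ones(len(token_ids))
    inside = False
    for i, tid in enumerate(token_ids):
        if tid == TOOL_RESPONSE_START:
            inside = True
        if inside:
            mask[i] = 0.0  # exclude the retrieved span, wrapper tokens included
        if tid == TOOL_RESPONSE_END:
            inside = False
    return mask

# Multiply the per-token policy loss by this mask before averaging, so gradients
# flow only through tokens the model generated itself.
```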
In addition, Search-R1 supports multi-turn retrieval and reasoning (Section 3.2 of the paper, "Text Generation with Interleaved Multi-turn Search Engine Call"): a search is triggered by `<search>` and `</search>`, and the retrieved content is placed between `<information>` and `</information>`. The final answer, in turn, is output between `<answer>` and `</answer>`.
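For intuition, an interleaved Search-R1-style rollout (format per Section 3.2 of the paper; the question and contents below are invented for illustration) looks like this:

```
<think> I need the release date of QwQ-32B. </think>
<search> QwQ-32B release date </search>
<information> (retrieved passages inserted here by the system) </information>
<think> The results indicate a March 2025 release. </think>
<answer> QwQ-32B was open-sourced in March 2025. </answer>
```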