[Repost] QwQ-32B's Tool-Calling Capability and Its Application to Agentic RAG

Background

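Recently, the paper Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning (arxiv.org/pdf/2503.09516) has attracted wide attention. It proposes a new method that uses reinforcement learning to train large language models (LLMs) to reason and to use search engines. Notably, some of the paper's ideas coincide with what the Qwen team has explored in the QwQ-32B model.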

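Alibaba's recently released QwQ-32B (qwenlm.github.io/zh/blog/qwq-32b/) integrates agent-related capabilities into a reasoning model. These capabilities let the model think critically while using tools and adjust its reasoning according to feedback from the environment. The added_tokens.json file in the QwQ-32B model folder shows the newly added special tokens for tool calls and tool responses: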

{
"</think>": 151668,
"</tool_call>": 151658,
"</tool_response>": 151666,
"<think>": 151667,
"<tool_call>": 151657,
"<tool_response>": 151665
}
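
To inspect these tokens in a local copy of the model, a small helper along the following lines could be used. This is a minimal sketch: `load_agent_tokens` and the `model_dir` argument are hypothetical names introduced here, and the filter simply keeps entries whose name contains "think" or "tool_".

```python
import json
from pathlib import Path


def load_agent_tokens(model_dir: str) -> dict:
    """Return the think/tool special tokens from added_tokens.json, sorted by token ID."""
    # model_dir is a hypothetical path to a locally downloaded model folder.
    path = Path(model_dir) / "added_tokens.json"
    added = json.loads(path.read_text(encoding="utf-8"))
    # Keep only the reasoning and tool-calling markers; skip all other added tokens.
    wanted = {tok: idx for tok, idx in added.items() if "tool_" in tok or "think" in tok}
    return dict(sorted(wanted.items(), key=lambda item: item[1]))
```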

This article uses Agentic RAG as an example to demonstrate QwQ-32B's tool-calling capability.

Agentic RAG vs. Traditional RAG

To better understand the advantages of Agentic RAG, we first need to distinguish it from the RAG paradigm in common use today:

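Traditional RAG: The vast majority of current RAG projects are essentially workflows, that is, systems that orchestrate LLMs and tools through predefined code paths. Such hand-crafted, hard-coded workflows consist of many interrelated but fragile parts, such as routing, chunking, reranking, query interpretation, query expansion, source contextualization, and search engineering.
- Drawback: A hand-orchestrated workflow can hardly cover every corner case. It is especially limited in complex scenarios such as those that require multiple rounds of retrieval.
Agentic RAG: Takes an end-to-end approach that simplifies the pipeline. The model only needs to be equipped with a web search API tool (this article uses the Tavily API, which offers a free quota); everything else is done autonomously by the model, including but not limited to:
- Intent understanding (deciding whether a web search is needed)
- Question rewriting or decomposition
- API calls
- Workflow orchestration (whether to perform multi-step retrieval, and how)
- Citation and source attribution
- ...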

Put simply, the core idea of Agentic RAG is: less structure, more intelligence. Less is more.

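This matches Anthropic's definition of agents: much like Deep Search, an agent must carry out the target task internally; agents "dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks."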

Overall Workflow

The figure below shows the overall Agentic RAG workflow:

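Fit the user's question into the prompt template.
Call the model to generate new tokens. If no <tool_call> ... </tool_call> appears during generation, return the output directly.
If <tool_call> ... </tool_call> appears, the model has issued a tool call request during its reasoning. Parse the request, execute web_search, wrap the API result in <tool_response> ... </tool_response>, append it to the model's context, and ask the model to continue generating.
Repeat the steps above until there are no more <tool_call> requests (or the call limit is reached) or <|im_end|> appears.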
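
The steps above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the notebook's code: `generate` and `web_search` are hypothetical callables supplied by the caller (the first wraps the model call and returns only the newly generated text, the second wraps the search API and returns results as plain text), and the tool call is assumed to follow the `action` / `action_input` JSON format from the prompt template.

```python
import json
import re
from typing import Callable

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>\s*$", re.DOTALL)


def run_agentic_rag(
    prompt: str,
    generate: Callable[[str], str],    # hypothetical: context -> newly generated text
    web_search: Callable[[str], str],  # hypothetical: query -> search results as text
    max_search_times: int = 5,
) -> str:
    """Alternate between generation and tool execution until no tool call remains."""
    context = prompt
    searches = 0
    while True:
        new_text = generate(context)
        context += new_text
        match = TOOL_CALL_RE.search(new_text)
        if match is None or searches >= max_search_times:
            # No pending tool call, or the search budget is used up: stop here.
            return context[len(prompt):]
        try:
            query = json.loads(match.group(1))["action_input"]["query"]
        except (json.JSONDecodeError, KeyError, TypeError) as exc:
            # Feed the parsing error back so the model can retry with a valid call.
            result = f"Invalid tool call: {exc}"
        else:
            result = web_search(query)
        searches += 1
        # Append the tool output so the next generation step can read it.
        context += f"\n<tool_response>\n{result}\n</tool_response>\n"
```

Returning the parse error as a tool response, rather than raising, is one possible design choice: it keeps the loop running and leaves recovery to the model.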

This workflow is essentially the same as the one described in the Search-R1 paper.

Key Technical Points

Prompt template:

from datetime import date

user_question = input('Enter your question: ')
max_search_times = 5
prompt = f"""You are Qwen QwQ, a curious AI built for retrieval augmented generation.
You are at 2025 and current date is {date.today()}.
You have access to the web_search tool to retrieve relevant information to help answer user questions.
You can use web_search tool up to {max_search_times} times to answer a user's question, but try to be efficient and use as few as possible.
Below are some guidelines:
- Use web_search for general internet queries, like finding current events or factual information.
- Always provide a final answer in a clear and concise manner, with citations for any information obtained from the internet.
- If you think you need to use a tool, format your response as a tool call with the `action` and `action_input` within <tool_call>...</tool_call>, like this:\n<tool_call>\n{{ "action": "web_search", "action_input": {{ "query": "current stock price of Tesla" }} }}\n</tool_call>.
- After using a tool, continue your reasoning based on the web_search result in <tool_response>...</tool_response>.
- Remember that if you need multi-turn web_search to find relevant information, make sure you conduct all search tasks before you provide a final answer.
---
User Question:{user_question}"""

Custom stopping criterion:
Generation stops as soon as the text produced during autoregressive decoding matches the regular expression <tool_call>(.*?)</tool_call>\s*$:

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    StoppingCriteria,
    StoppingCriteriaList,
)
import torch
import re

tool_call_regex = r"<tool_call>(.*?)</tool_call>\s*$"
end_regex = r"<\|im_end\|>\s*$"


# Watch for either a completed <tool_call> block or <|im_end|> at the end of the text
class RegexStoppingCriteria(StoppingCriteria):
    def __init__(self, tokenizer, patterns):
        self.patterns = patterns
        self.tokenizer = tokenizer

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Decode the full sequence (prompt plus generated tokens) and test each pattern
        decoded_text = self.tokenizer.decode(input_ids[0])
        for pattern in self.patterns:
            if re.search(pattern, decoded_text, re.DOTALL):
                return True
        return False


# `tokenizer` is assumed to have been loaded already, e.g. with AutoTokenizer.from_pretrained(...)
stopping_criteria = StoppingCriteriaList([
    RegexStoppingCriteria(
        tokenizer,
        patterns=[tool_call_regex, end_regex],
    )
])
# model.generate(..., stopping_criteria=stopping_criteria)  # pass in the stopping criteria
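
Once generation has stopped on a tool call, the JSON payload still has to be pulled out of the generated text. The helper below is a minimal sketch of that step; `extract_tool_call` is a hypothetical name, and the payload is assumed to follow the `action` / `action_input` format requested in the prompt template.

```python
import json
import re
from typing import Optional

# Same pattern the stopping criterion watches for, applied to the generated text.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>\s*$", re.DOTALL)


def extract_tool_call(generated_text: str) -> Optional[dict]:
    """Return the parsed tool call at the end of the text, or None if absent or malformed."""
    # Start from the last opening tag so earlier mentions of the tag are ignored.
    start = generated_text.rfind("<tool_call>")
    if start == -1:
        return None
    match = TOOL_CALL_RE.search(generated_text[start:])
    if match is None:
        return None
    try:
        call = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    # Accept only a JSON object that names an action.
    if not isinstance(call, dict) or "action" not in call:
        return None
    return call
```

Searching from the last `<tool_call>` matters because the prompt itself, and sometimes the model's reasoning, mentions the tag before the real call is emitted.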

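Web search API:
The search API used here is the Tavily API, which offers a free quota that makes experimentation and reproduction easy. It lets developers add web search to their applications through simple API calls.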
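
A `web_search` tool built on Tavily might look roughly like this. It is a sketch under assumptions rather than the notebook's implementation: it assumes the `tavily-python` package, an API key in a `TAVILY_API_KEY` environment variable, and a response containing a `results` list whose entries carry `title`, `url`, and `content` fields. `format_search_results` is a hypothetical helper that numbers the hits so the model can cite them; wrapping the text in <tool_response> tags is left to the caller.

```python
import os


def format_search_results(results: list, max_results: int = 5) -> str:
    """Render search hits as numbered entries that the model can cite."""
    lines = []
    for i, hit in enumerate(results[:max_results], start=1):
        title = hit.get("title", "")
        url = hit.get("url", "")
        content = hit.get("content", "")
        lines.append(f"[{i}] {title} ({url})\n{content}")
    return "\n\n".join(lines) if lines else "No results found."


def web_search(query: str, max_results: int = 5) -> str:
    """Query Tavily and return the formatted results as plain text."""
    # Imported lazily so the formatter can be used without the package installed.
    from tavily import TavilyClient  # assumes: pip install tavily-python

    client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])
    response = client.search(query, max_results=max_results)
    # Assumption: the response is a dict with a "results" list of hit dicts.
    return format_search_results(response.get("results", []), max_results)
```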

Code

For the complete code, see the following notebook:

DeepSearch复现篇：QwQ-32B ToolCall功能初探，以Agentic RAG为例.ipynb

Test Case

Test question: Please give me a detailed introduction to the QwQ-32B model that Alibaba recently open-sourced.

Generated result: (see the notebook for the complete output)

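The result shows that the reasoning model autonomously handled intent understanding (deciding whether a web search was needed) and search keyword generation (question rewriting or decomposition). It also considered the possibility of multiple search rounds. After triggering a single web search, the model produced its final report from the search results.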

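In this case the model made only one search call. This may be because the question is relatively simple, or because the base model is not yet capable enough to trigger complex multi-round searches. It also suggests that, to fully realize the model's potential as an agent, post-training along the lines of Search-R1, with targeted fine-tuning, is still necessary.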

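That said, judging from the capabilities QwQ-32B already shows, a retraining route looks likely to become mainstream for developing and deploying agents: combine carefully designed synthetic (or manually curated) retraining data with further reinforcement learning or SFT for specific scenarios, and mask the loss on the output tokens returned by tool responses. Retraining lets the various actions and edge cases be accounted for in advance, which simplifies deployment and removes the need for hand-designed workflows. Section 3.1 of the Search-R1 paper describes the "Loss Masking for Retrieved Tokens" technique in detail. By masking the loss on retrieved tokens in PPO and GRPO, Search-R1 optimizes only the tokens generated by the LLM, strengthening the model's ability to interact with the search engine and to reason.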
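
For the SFT case, masking tool-response tokens can be sketched as below. This is an illustration of the general idea only, not Search-R1's implementation (which applies the mask inside the PPO and GRPO objectives). It assumes the token IDs listed in added_tokens.json above and the common convention that a label of -100 is ignored by PyTorch's cross-entropy loss; `mask_tool_response_labels` is a hypothetical helper name.

```python
from typing import List

TOOL_RESPONSE_START = 151665  # <tool_response> in QwQ-32B's added_tokens.json
TOOL_RESPONSE_END = 151666    # </tool_response>
IGNORE_INDEX = -100           # label value skipped by PyTorch's cross-entropy loss


def mask_tool_response_labels(input_ids: List[int]) -> List[int]:
    """Copy input_ids into labels, replacing tool-response spans with IGNORE_INDEX."""
    labels = []
    inside = False
    for token_id in input_ids:
        if token_id == TOOL_RESPONSE_START:
            inside = True
        # Tokens from the opening tag through the closing tag contribute no loss.
        labels.append(IGNORE_INDEX if inside else token_id)
        if token_id == TOOL_RESPONSE_END:
            inside = False
    return labels
```

In a real training pipeline the prompt tokens would usually be masked as well, so that the loss covers only what the model itself is expected to generate.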

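In addition, Search-R1 supports multi-turn retrieval and reasoning (Section 3.2 of the paper, "Text Generation with Interleaved Multi-turn Search Engine Call"). A search is triggered with <search> and </search>, the retrieved content is placed between <information> and </information>, and the final answer is output between <answer> and </answer>.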
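
To make the tag protocol concrete, the sketch below classifies a generated segment by the tags it contains. It is not taken from the Search-R1 code base; `next_action` and `wrap_information` are hypothetical helpers, and only the tag names come from the description above.

```python
import re
from typing import Optional, Tuple

SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def next_action(generated_text: str) -> Tuple[str, Optional[str]]:
    """Classify a generated segment as a search request, a final answer, or neither."""
    answer = ANSWER_RE.search(generated_text)
    if answer:
        return "answer", answer.group(1).strip()
    searches = SEARCH_RE.findall(generated_text)
    if searches:
        # Use the most recent query if the segment contains several.
        return "search", searches[-1].strip()
    return "continue", None


def wrap_information(retrieved: str) -> str:
    """Wrap retrieved passages in the tags that mark search results."""
    return f"<information>{retrieved}</information>"
```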