ToolGen is a framework that integrates tool knowledge directly into large language models (LLMs) by representing each tool as a unique token, enabling seamless tool invocation alongside language generation. It was developed by Renxi Wang et al. to improve tool retrieval and task completion.
- Tool tokenization: each tool is converted into a unique token that the model can invoke directly.
- Tool call generation: the model generates tool calls together with their parameters.
- Task completion: complex tasks are automated through tool calls.
- Dataset support: rich datasets are provided for model training and evaluation.
Abstract
As large language models (LLMs) evolve, their inability to autonomously perform tasks by directly interacting with external tools becomes particularly evident. Traditional approaches rely on placing tool descriptions in the context, which is limited by context length and requires a separate, often inefficient retrieval mechanism. We propose ToolGen, a paradigm that integrates tool knowledge directly into the LLM's parameters by representing each tool as a unique token. This allows the LLM to treat tool invocations and their parameters as part of its next-token prediction, seamlessly blending tool calls with language generation. Our framework lets the LLM access and use a large number of tools without an additional retrieval step, significantly improving both performance and scalability. Experimental results on over 47,000 tools show that ToolGen not only achieves superior results in tool retrieval and autonomous task completion, but also lays the foundation for a new generation of AI agents that can adapt to tools across a wide range of domains. By fundamentally transforming tool retrieval into a generative process, ToolGen paves the way for more flexible, efficient, and autonomous AI systems. ToolGen extends the utility of LLMs by supporting end-to-end tool learning and opening opportunities for integration with other advanced techniques such as chain-of-thought reasoning and reinforcement learning.
1 Introduction
Large Language Models (LLMs) have demonstrated impressive capabilities in processing external inputs, performing operations, and autonomously accomplishing tasks (Gravitas, 2023; Qin et al., 2023; Yao et al., 2023; Shinn et al., 2023; Wu et al., 2024a; Liu et al., 2024). Among the various approaches for enabling LLMs to interact with the external world, tool calls via APIs have become one of the most common and effective. However, as the number of tools grows into the tens of thousands, existing methods of tool retrieval and execution become difficult to scale efficiently.
In real-world scenarios, a common approach combines tool retrieval with tool execution: a retrieval model first filters out relevant tools, which are then handed to the LLM for final selection and execution (Qin et al., 2023; Patil et al., 2023). While this combination helps when dealing with a large number of tools, it has clear limitations: retrieval models often rely on small encoders that struggle to fully capture the semantics of complex tools and queries, and separating retrieval from execution can introduce inefficiencies and misalignment between stages during task completion.
Furthermore, LLMs and their tokenizers are pre-trained primarily on natural language data (Brown et al., 2020; Touvron et al., 2023) and have limited intrinsic knowledge of tool-related functionality. This knowledge gap leads to poor performance, especially when LLMs must rely on retrieved tool descriptions for decision making.
In this study, we introduce ToolGen, a new framework that integrates real-world tool knowledge directly into the LLM's parameters and transforms tool retrieval and execution into a unified generation task. Specifically, ToolGen makes more effective use of the LLM's pre-existing knowledge for tool retrieval and invocation by extending the LLM vocabulary with tool-specific virtual tokens and training the model to generate these tokens in a dialogue context.
Specifically, each tool is represented as a unique virtual token in the LLM vocabulary. Starting from a pre-trained LLM, ToolGen's training consists of three stages: tool memorization, retrieval training, and agent training. During tool memorization, the model associates each virtual tool token with its documentation. During retrieval training, the model learns to generate relevant tool tokens from user queries. Finally, in end-to-end agent tuning, the model is trained to act as an autonomous agent, generating plans and tool calls and determining the appropriate parameters to accomplish tasks. By invoking tools and obtaining feedback from the external environment, the model can process user queries efficiently and in an integrated manner. Figure 1 shows how ToolGen compares with the traditional paradigm.
We validate ToolGen's superiority in two scenarios: a tool retrieval task, where the model retrieves the correct tool for a given query, and an LLM-based agent task, where the model accomplishes complex tasks involving real API calls. On a dataset of 47,000 real-world tools, ToolGen performs comparably to leading tool retrieval methods at significantly lower cost and with greater efficiency. It also goes beyond the traditional tool learning paradigm, highlighting its potential for advancing more effective tool-use systems.
Figure 1: Comparison of ToolGen with previous retrieval-based approaches. Previous approaches use a retriever to fetch relevant tools by similarity matching and then place these tools into the LLM prompt for selection. ToolGen performs tool retrieval by directly generating tool tokens and can accomplish the task without relying on any external retriever.
ToolGen represents a new paradigm for tool interaction, fusing retrieval and generation into a unified model. This innovation lays the foundation for a new generation of AI agents that can adapt to tools across many domains. In addition, ToolGen creates new opportunities for combining advanced techniques such as chain-of-thought reasoning and reinforcement learning with a unified generative approach to tool use, extending the capabilities of large language models in real-world applications.
To summarize, our contributions include:
- A novel framework, ToolGen, that integrates tool retrieval and execution into the large language model's generation process via virtual tokens.
- A three-stage training process that makes ToolGen efficient and scalable for tool retrieval and API calls.
- Experimental validation showing that ToolGen matches the best available tool retrieval methods on large-scale tool repositories at lower cost and higher efficiency, while going beyond the traditional tool learning paradigm.
2 Related work
2.1 Tool retrieval
Tool retrieval is crucial for the task execution of large language model agents, where tools are usually described by their documentation. Traditional approaches such as sparse retrieval (e.g., BM25 (Robertson et al., 2009)) and dense retrieval (e.g., DPR (Karpukhin et al., 2020), ANCE (Xiong et al., 2021)) rely on large document indexes and external modules, leading to inefficiencies and difficulties in end-to-end optimization within an agent framework. Several studies have explored alternative approaches. For example, Chen et al. (2024b) rewrite the query and extract its intent for an unsupervised retrieval setting, although the results fall short of supervised approaches. Xu et al. (2024) propose iteratively refining the query based on tool feedback, which improves retrieval accuracy but increases latency.
Recently, generative retrieval has emerged as a promising paradigm in which models directly generate relevant document identifiers rather than relying on traditional retrieval mechanisms (Wang et al., 2022; Sun et al., 2023; Kishore et al., 2023; Mehta et al., 2023; Chen et al., 2023b). Inspired by this, ToolGen represents each tool as a unique token so that tool retrieval and invocation can both be handled as generation tasks. Besides simplifying retrieval, this design integrates smoothly with other LLM and agent capabilities such as chain-of-thought reasoning (Wei et al., 2023) and ReAct (Yao et al., 2023). By unifying retrieval and task execution in a single LLM agent, it reduces latency and computational overhead and improves the efficiency and effectiveness of task completion.
2.2 Large Language Model Agents with Tool Calls
Large language models show strong potential for mastering the tools required for various tasks. However, most existing research focuses on a limited set of actions (Chen et al., 2023a; Zeng et al., 2023; Yin et al., 2024; Wang et al., 2024). For example, Toolformer (Schick et al., 2023) fine-tunes GPT-J to handle only five tools (e.g., a calculator). While this approach works well on narrow tasks, it struggles in realistic scenarios with large action spaces. ToolBench (Qin et al., 2023) extends the study by introducing over 16,000 tools, highlighting the challenges of tool selection in complex environments.
For tool selection, current approaches typically employ a retrieve-then-generate pipeline, where the large language model first retrieves relevant tools and then utilizes them (Patil et al., 2023; Qin et al., 2023). However, this pipeline faces two main problems: errors propagated from the retrieval step, and the difficulty for the large language model to fully understand and utilize tools from simple prompts alone.
To alleviate these problems, researchers have represented actions as tokens, turning action prediction into a generation task. For example, RT-2 (Brohan et al., 2023) generates tokens representing robot actions, and Self-RAG (Asai et al., 2023) uses special tokens to decide when to retrieve a document. ToolkenGPT (Hao et al., 2023) introduces tool-specific tokens to trigger tool use, the concept closest to our approach.
Our approach differs significantly from ToolkenGPT. First, we focus on real tools that require flexible parameters for complex tasks (e.g., YouTube channel search), whereas ToolkenGPT is limited to simpler tools with fewer inputs (e.g., math functions taking two numbers). Second, ToolkenGPT relies on few-shot prompting, whereas ToolGen integrates tool knowledge directly into the large language model through full-parameter fine-tuning, allowing it to retrieve tools and perform tasks autonomously. Finally, our experiments involve a much larger tool set: 47,000 tools, compared to ToolkenGPT's 13-300.
Other studies, such as ToolPlanner (Wu et al., 2024b) and AutoAct (Qiao et al., 2024), use reinforcement learning or multi-agent systems to enhance tool learning or task completion (Qiao et al., 2024; Liu et al., 2023; Shen et al., 2024; Chen et al., 2024a). We do not compare against these approaches for two reasons. (1) Most of them rely on feedback mechanisms such as Reflexion (Shinn et al., 2023) or reward models, similar to ToolBench's evaluation design in which a large language model serves as the evaluator without access to ground-truth answers; this is not the focus of our study, and our end-to-end experiments do not rely on such feedback mechanisms. (2) Our approach does not conflict with these methods and can be integrated with them; we leave such integration to future research.
3 ToolGen
In this section, we first introduce the notation used in the paper. We then describe ToolGen in detail, including tool virtualization, tool memorization, retrieval training, and end-to-end agent tuning, as shown in Figure 2. Finally, we present our inference method.
Figure 2: Schematic of the ToolGen framework. In the tool virtualization stage, tools are mapped to virtual tokens. In the three subsequent training stages, ToolGen first memorizes tools by predicting tool tokens from their documentation; it then learns to retrieve tools by predicting tool tokens from queries. Finally, pipeline data (i.e., trajectories) are used to fine-tune the retrieval model, yielding the ToolGen agent model.
3.1 Preliminaries
For a given user query q, the goal of tool learning is to solve q using tools from a large set D = {d1, d2, ..., dN}, where |D| = N is so large that it is impractical to include all of D in the LLM context. Therefore, current research typically uses a retriever R to retrieve k relevant tools from D, denoted Dk,R = {dr1, dr2, ..., drk} = R(q, k, D), where |Dk,R| ≪ N. The final prompt is the concatenation of q and Dk,R, denoted Prompt = [q, Dk,R]. To accomplish a task (query), LLM-based agents typically employ a four-stage iteration (Qu et al., 2024): generate a plan pi, select a tool dsi, determine the tool parameters ci, and collect information from the tool feedback fi. We denote the steps of the i-th iteration as pi, dsi, ci, fi. The model iterates through these steps until the task is completed, at which point the final answer a is generated. The entire trajectory can be represented as Traj = [Prompt, (p1, ds1, c1, f1), ..., (pt, dst, ct, ft), a] = [q, R(q, D), (p1, ds1, c1, f1), ..., (pt, dst, ct, ft), a]. This iterative approach lets the model dynamically adjust and refine its actions at each step based on the feedback received, improving its performance on complex tasks.
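To make the notation concrete, the following is a minimal sketch of the traditional retrieve-then-act loop described above. Here `retriever`, `llm_step`, and `execute_tool` are hypothetical stand-ins for the retriever R, the LLM, and the tool environment; they are not part of ToolGen itself.

```python
def solve_with_pipeline(q, tools, retriever, llm_step, execute_tool, max_turns=6):
    """Traditional paradigm: retrieve D_{k,R} = R(q, k, D), then iterate
    (plan p_i, tool d_{s_i}, parameters c_i, feedback f_i) until done."""
    d_k = retriever(q, k=5, corpus=tools)      # D_{k,R}, with |D_{k,R}| << N
    traj = [{"query": q, "tools": d_k}]        # Prompt = [q, D_{k,R}]
    for _ in range(max_turns):
        p_i, d_i, c_i = llm_step(traj)         # plan, selected tool, parameters
        if d_i is None:                        # the model decided to answer
            return p_i, traj                   # p_i is the final answer a
        f_i = execute_tool(d_i, c_i)           # feedback from the environment
        traj.append((p_i, d_i, c_i, f_i))
    return None, traj                          # gave up within the turn budget
```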
3.2 Tool virtualization
In ToolGen, we implement tool virtualization by mapping each tool to a unique new token through an approach we call atomic indexing. Each tool is assigned a unique token by extending the vocabulary of the large language model. The embedding of each tool token is initialized to the average embedding of the sub-tokens of its corresponding tool name, providing a semantically meaningful starting point for each tool.
Formally, the token set is defined as T = {Index(d) | ∀d ∈ D}, where Index is the function that maps tools to tokens. We show that atomic indexing is more efficient and reduces hallucination compared with other indexing methods such as semantic and numeric mappings (see Sections 4.3 and 5.4 for discussion).
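A minimal sketch of atomic indexing with HuggingFace Transformers is shown below, assuming tool tokens of the form <<tool name&&API name>> (see Appendix B); the paper's exact initialization details may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def add_tool_tokens(model, tokenizer, tool_names):
    """Extend the vocabulary with one atomic token per tool; initialize each new
    input embedding to the mean embedding of the tool name's sub-tokens."""
    tool_tokens = [f"<<{name}>>" for name in tool_names]
    tokenizer.add_tokens(tool_tokens)
    model.resize_token_embeddings(len(tokenizer))
    emb = model.get_input_embeddings().weight
    out = model.get_output_embeddings().weight  # lm_head rows (untied in Llama-3)
    with torch.no_grad():
        for name, tok in zip(tool_names, tool_tokens):
            sub_ids = tokenizer(name, add_special_tokens=False).input_ids
            tok_id = tokenizer.convert_tokens_to_ids(tok)
            emb[tok_id] = emb[sub_ids].mean(dim=0)  # semantically meaningful start
            out[tok_id] = out[sub_ids].mean(dim=0)
    return model, tokenizer

# Usage sketch:
# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
# model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
# add_tool_tokens(model, tokenizer, ["YouTube Hub&&Get Video Details"])
```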
3.3 Tool Memorization
After assigning tokens to tools, the large language model still lacks any knowledge of them. To address this, we fine-tune the model with each tool's description as input and its corresponding token as output, a process we call tool memorization. We use the following loss:
$$\mathcal{L}_{\text{tool}} = \sum_{d \in D} -\log p_{\theta}\left(\text{Index}(d) \mid d_{\text{doc}}\right)$$
where θ denotes the parameters of the large language model and d_doc denotes the tool description. This step equips the large language model with foundational knowledge of the tools and their associated operations.
3.4 Retrieval Training
We then train the large language model to associate the hidden space of virtual tool tokens (and their documentation) with the space of user queries, so that the model can generate the correct tool token from a user's query. To this end, we fine-tune the model with user queries as input and the corresponding tool tokens as output:
$$\mathcal{L}_{\text{retrieval}} = \sum_{q \in Q} \sum_{d \in D_q} -\log p_{\theta'}\left(\text{Index}(d) \mid q\right)$$
where θ′ denotes the large language model parameters after tool memorization, Q is the set of user queries, and Dq is the set of tools associated with each query. This process produces the ToolGen retriever, which generates the appropriate tool token for a given user query.
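Both training stages use the same next-token objective over the extended vocabulary; only the input changes (the tool document d_doc for memorization, the user query q for retrieval). Below is a minimal sketch of the per-example loss, assuming the model and tokenizer have already been extended as in Section 3.2:

```python
import torch
import torch.nn.functional as F

def tool_token_loss(model, tokenizer, input_text, tool_token):
    """-log p(Index(d) | input): cross-entropy of the tool token
    as the next token following the input text."""
    ids = tokenizer(input_text, return_tensors="pt").input_ids
    target = torch.tensor([tokenizer.convert_tokens_to_ids(tool_token)])
    logits = model(ids).logits[:, -1, :]  # next-token distribution, extended vocab
    return F.cross_entropy(logits, target)

# L_tool sums this loss over (d_doc, Index(d)) pairs;
# L_retrieval sums it over (q, Index(d)) pairs.
```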
3.5 End-to-End Agent Tuning
After retrieval training, the large language model can generate tool tokens from queries. In the final stage, we fine-tune the model on agent task-completion trajectories, adopting an Agent-FLAN-like (Chen et al., 2024c) tuning strategy. Our pipeline follows an iterative process: the model first generates a thought, then generates the corresponding action token, which is used to fetch the tool documentation; the model then uses this documentation to generate the necessary parameters. The process iterates until the model generates a finish token or reaches the maximum number of turns. The generated trajectory is denoted Traj = [q, (p1, Index(ds1), c1, f1), ..., (pt, Index(dst), ct, ft), a]. In this structure, relevant tools no longer need to be provided in the prompt.
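A sketch of one ToolGen iteration at inference time, split over three dialogue turns as described above. `chat` is a hypothetical wrapper around the fine-tuned model, `tool_docs` maps tool tokens to their documentation, and the prompts are paraphrased from Figure 6.

```python
def toolgen_iteration(chat, history, tool_docs):
    """One iteration: free-form thought -> constrained action token -> parameters."""
    thought = chat(history)                                # 1) generate the plan
    history += [("assistant", thought), ("user", "Generate the action.")]
    action = chat(history, constrain_to_tool_tokens=True)  # 2) one tool token
    doc = tool_docs[action]                                # fetch its documentation
    history += [("assistant", action),
                ("user", f"Give the input. Here is the documentation: {doc}")]
    params = chat(history)                                 # 3) tool call parameters
    history.append(("assistant", params))
    return action, params
```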
3.6 Inference
During inference, the large language model may generate action tokens outside the predefined set of tool tokens. To prevent this, we design a constrained beam search that restricts the output tokens to the tool token set. We apply this constrained beam search both to tool retrieval (where the model selects a tool given a query) and to the end-to-end agent system, effectively reducing hallucination in the action generation step. See Section 5.4 for a detailed analysis and Appendix C for implementation details.
4 Tool retrieval evaluation
4.1 Experimental setup
We use the pre-trained Llama-3-8B model (Dubey et al., 2024) as the base model, with a vocabulary of 128,256. During tool virtualization, 46,985 tokens were added through the atomic indexing approach, resulting in a final vocabulary of 175,241. We fine-tuned the model using the Llama-3 chat template with a cosine learning rate scheduler and a 3% warm-up; the maximum learning rate was 4 × 10⁻⁵. All models were trained with DeepSpeed ZeRO 3 (Rajbhandari et al., 2020) on 4 × A100 GPUs. We ran eight epochs of tool memorization and one epoch of retrieval training.
Dataset
Our experiments are based on ToolBench, a real-world tool benchmark containing more than 16,000 tool collections, each with multiple APIs, for a total of about 47,000 unique APIs. Each API is documented with a dictionary containing the API name, description, and invocation parameters; see Appendix A for a real-world example. We treat each API as an action and map it to a token. Our retrieval and end-to-end agent-tuning data are converted from ToolBench's raw data; see Appendix G for details. Although each tool may contain multiple APIs, for simplicity this paper refers to each API as a tool.
We follow the data partitioning of Qin et al. (2023) and divide the 200,000 (query, relevant API) pairs into three categories: I1 (single-tool queries), I2 (intra-category multi-tool queries), and I3 (intra-collection multi-tool instructions), containing 87,413, 84,815, and 25,251 instances, respectively.
Baselines
We compare ToolGen with the following baselines:
- BM25: a classical unsupervised retrieval method based on TF-IDF word similarity between a query and a document.
- Embedding similarity (EmbSim): sentence embeddings generated with OpenAI's embedding model; our experiments use text-embedding-3-large.
- Re-Invoke (Chen et al., 2024b): an unsupervised retrieval method incorporating query rewriting and document expansion.
- IterFeedback (Xu et al., 2024): a BERT-based retriever that uses gpt-3.5-turbo-0125 as a feedback model for up to 10 rounds of iterative feedback.
- ToolRetriever (Qin et al., 2023): a BERT-based retriever trained with contrastive learning.
Settings
We conduct experiments in two settings. The first, In-Domain Retrieval, restricts the search space to tools within the same domain; for example, when evaluating queries from I1, only tools in I1 may be retrieved. This is consistent with the ToolBench setting. The second, Multi-Domain Retrieval, is more challenging: the search space is extended to tools from all three domains, and the model is trained on the merged data, increasing both the search space and the task complexity. Unlike the ToolBench setting, this multi-domain setting reflects realistic scenarios in which retrieval tasks may involve overlapping or mixed domains, and it assesses the model's ability to generalize across domains and handle more complex, diverse retrieval tasks.
Metrics
We evaluate retrieval performance with normalized Discounted Cumulative Gain (NDCG) (Järvelin & Kekäläinen, 2002), a widely used metric for ranking tasks, including tool retrieval. NDCG accounts for both the relevance and the ranking position of the retrieved tools.
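For reference, a standard NDCG@k implementation is sketched below (linear-gain variant; the exponential-gain form 2^rel − 1 is also common):

```python
import math

def ndcg_at_k(ranked_ids, relevance, k):
    """NDCG@k = DCG@k of the ranking / DCG@k of the ideal ranking.
    `relevance` maps tool ids to graded relevance (missing ids count as 0)."""
    dcg = sum(relevance.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Example: the only relevant tool "t1" is ranked second out of two.
# ndcg_at_k(["t3", "t1"], {"t1": 1, "t2": 1}, k=5)
```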
Table 1: Tool retrieval evaluation in two settings: (1) in-domain retrieval, where models are trained and evaluated on the same domain, and (2) multi-domain retrieval, where models are trained on all domains and evaluated on the complete set of tools from all domains. BM25, EmbSim, and Re-Invoke are untrained, unsupervised baselines. IterFeedback is a retrieval system with multiple models and a feedback mechanism. ToolRetriever is trained with contrastive learning, while ToolGen is trained with next-token prediction. Results marked with * are from models not implemented by us; their numbers come from the original papers and are therefore listed only under the in-domain setting. For ToolGen in the in-domain setting, we allow the generation space to include all tokens, a more challenging scenario than that of the other models. The best results in each category are in bold.
Model | I1 | I2 | I3 | ||||||
---|---|---|---|---|---|---|---|---|---|
NDCG@1 | NDCG@3 | NDCG@5 | NDCG@1 | NDCG@3 | NDCG@5 | NDCG@1 | NDCG@3 | NDCG@5 | 
In-Domain | |||||||||
BM25 | 29.46 | 31.12 | 33.27 | 24.13 | 25.29 | 27.65 | 32.00 | 25.88 | 29.78 |
EmbSim | 63.67 | 61.03 | 65.37 | 49.11 | 42.27 | 46.56 | 53.00 | 46.40 | 52.73 |
Re-Invoke* | 69.47 | - | 61.10 | 54.56 | - | 53.79 | 59.65 | - | 59.55 |
IterFeedback* | 90.70 | 90.95 | 92.47 | 89.01 | 85.46 | 87.10 | 91.74 | 87.94 | 90.20 |
ToolRetriever | 80.50 | 79.55 | 84.39 | 71.18 | 64.81 | 70.35 | 70.00 | 60.44 | 64.70 |
ToolGen | 89.17 | 90.85 | 92.67 | 91.45 | 88.79 | 91.13 | 87.00 | 85.59 | 90.16 |
Multi-Domain | |||||||||
BM25 | 22.77 | 22.64 | 25.61 | 18.29 | 20.74 | 22.18 | 10.00 | 10.08 | 12.33 |
EmbSim | 54.00 | 50.82 | 55.86 | 40.84 | 36.67 | 39.55 | 18.00 | 17.77 | 20.70 |
ToolRetriever | 72.31 | 70.30 | 74.99 | 64.54 | 57.91 | 63.61 | 52.00 | 39.89 | 42.92 |
ToolGen | 87.67 | 88.84 | 91.54 | 83.46 | 86.24 | 88.84 | 79.00 | 79.80 | 84.79 |
4.2 Results
Table 1 shows the results of the tool retrieval. As expected, all trained models significantly outperform the untrained baseline (BM25, EmbSim, and Re-Invoke) on all metrics, showing the advantages of training on tool retrieval data.
Our proposed ToolGen model consistently performs best in both settings. In the in-domain setting, ToolGen delivers highly competitive results, with performance comparable to IterFeedback, a system that uses multiple models and feedback mechanisms. As a single model, ToolGen significantly outperforms ToolRetriever on all metrics and even surpasses IterFeedback in several scenarios (e.g., NDCG@5 on I1 and NDCG@1/@3/@5 on I2).
In the multi-domain setting, ToolGen remains robust, outperforming ToolRetriever and leading all baselines despite the larger search space and the usual drop in overall performance. This shows that ToolGen, as a single model, can compete with complex retrieval systems like IterFeedback, demonstrating its ability to handle complex real-world retrieval tasks with unclear domain boundaries.
4.3 Comparison of Indexing Methods
While ToolGen uses atomic indexing for tool virtualization, we also explore several alternative generative retrieval methods. In this section, we compare atomic indexing with the following three approaches:
- Numeric: maps each tool to a unique number. The resulting tokens are purely numeric, providing no intrinsic semantic information, only a unique identifier for each tool.
- Hierarchical: clusters tools into non-overlapping groups and recursively divides these clusters to form a hierarchy. The index from the root to a leaf in this structure represents the tool at that leaf, similar to Brown clustering.
- Semantic: maps each tool to its name, exposing the Large Language Model (LLM) to the semantic content of the tool name. The name directly provides a meaningful representation of the tool's function.
Figure 3: The distribution of the number of sub-tokens per tool under each indexing method. Atomic indexing represents each tool with a single token, while numeric indexing encodes a tool as N tokens when its number lies in the range (10^(N−1), 10^N]. In contrast, semantic and hierarchical indexing produce variable numbers of sub-tokens, with semantic indexing yielding longer sub-token sequences and more outliers.
Specific implementation details are described in Appendix B.
First, we analyze the number of sub-tokens each method requires to represent a tool, as shown in Figure 3. The figure highlights the advantage of atomic indexing: each tool is represented by a single token, whereas the other methods require multiple tokens. This efficiency lets ToolGen generate fewer tokens and shortens inference time in both retrieval and agent scenarios.
Next, we examine the effectiveness of the different indexing methods. As shown in Table 2, semantic indexing achieves the best retrieval performance across most metrics and scenarios, with atomic indexing a close second in many cases. We attribute this to semantic indexing aligning better with the LLM's pre-training data. However, this advantage diminishes as training data and task types grow; for example, Section 5.3 shows that atomic indexing achieves better end-to-end results. Weighing these factors, we choose atomic indexing for ToolGen's tool virtualization.
Table 2: Retrieval evaluations of different indexing methods in the multidomain setting. The best results are bolded and the second best results are underlined.
Model | I1 | I2 | I3 | ||||||
---|---|---|---|---|---|---|---|---|---|
NDCG@1 | NDCG@3 | NDCG@5 | NDCG@1 | NDCG@3 | NDCG@5 | NDCG@1 | NDCG@3 | NDCG@5 | 
Numeric | 83.17 | 84.99 | 88.73 | 79.20 | 79.23 | 83.88 | 71.00 | 74.81 | 82.95 |
Hierarchical | 85.67 | 87.38 | 90.26 | 82.22 | 82.70 | 86.63 | 78.50 | 79.47 | 84.15 |
Semantic | 89.17 | 91.29 | 93.29 | 83.71 | 84.51 | 88.22 | 82.00 | 78.86 | 85.43 |
Atomic | 87.67 | 88.84 | 91.54 | 83.46 | 86.24 | 88.84 | 79.00 | 79.80 | 84.79 |
Table 3: Ablation studies for tool retrieval. We separately evaluate the effect of removing retrieval training, tool memorization, and constrained beam search on ToolGen's performance.
Model | I1 | I2 | I3 | ||||||
---|---|---|---|---|---|---|---|---|---|
NDCG@1 | NDCG@3 | NDCG@5 | NDCG@1 | NDCG@3 | NDCG@5 | NDCG@1 | NDCG@3 | NDCG@5 | 
ToolGen | 87.67 | 88.84 | 91.54 | 83.46 | 86.24 | 88.84 | 79.00 | 79.80 | 84.79 |
− Tool memorization | 84.00 | 86.77 | 89.35 | 82.21 | 83.20 | 86.78 | 77.00 | 77.71 | 84.37 |
− Retrieval training | 10.17 | 12.31 | 13.89 | 5.52 | 7.01 | 7.81 | 3.00 | 4.00 | 4.43 |
− Constrained beam search | 87.67 | 88.79 | 91.45 | 83.46 | 86.24 | 88.83 | 79.00 | 79.93 | 84.92 |
4.4 Ablation experiments
We conduct ablation experiments to evaluate the effect of ToolGen's different training stages, as shown in Table 3. The results show that retrieval training is the key factor for tool retrieval performance, since it directly aligns with the retrieval task, where the input is a query and the output is a tool token. Removing tool memorization causes a slight performance drop, although memorization helps generalization, as discussed further in Appendix F. Similarly, constrained beam search contributes little to the retrieval task, but it helps prevent hallucination and is therefore valuable in end-to-end agent tasks; see Section 5.4.
5 End-to-end evaluation
5.1 Experimental setup
We made several changes to the trajectory data from ToolBench to adapt it to the ToolGen framework. For example, since ToolGen does not require explicitly selected relevant tools as input, we removed this information from the system prompts; see Appendix G for details. We then fine-tuned the retrieval model on the reformatted data to obtain the end-to-end ToolGen agent.
Baselines
- GPT-3.5: We use gpt-3.5-turbo-0613 as one of the baseline models. The implementation is the same as StableToolBench (Guo et al., 2024), where the tool-calling functionality of GPT-3.5 is used to form tool agents.
- ToolLlama-2: Qin et al. (2023) introduced ToolLlama-2 by fine-tuning the Llama-2 (Touvron et al., 2023) model on ToolBench data.
- ToolLlama-3: To ensure a fair comparison, we created the ToolLlama-3 baseline model by fine-tuning the same base model Llama-3 as ToolGen on the ToolBench dataset. For the remainder of this paper, ToolLlama-3 is referred to as ToolLlama to distinguish it from ToolLlama-2.
Settings
- Ground-Truth Tools (G.T.): following Qin et al. (2023), we define the ground-truth tools for a query as the tools selected by ChatGPT. For ToolLlama, we put the ground-truth tools directly in the prompt, in the same format as its training data. For ToolGen, which was not trained with pre-selected tools, we add the prefix "I am using the following tool: [tool tokens]" during the planning stage, where [tool tokens] are the virtual tokens corresponding to the ground-truth tools.
- Retriever (R.): in the end-to-end experiments, we use a retrieval-based setting. For the baseline models, we use the tools retrieved by ToolRetriever as the relevant tools. ToolGen generates tool tokens directly and therefore uses no retriever.
All models are fine-tuned with a cosine scheduler and a maximum learning rate of 4 × 10⁻⁵. The context length is truncated to 6,144 and the total batch size is 512. We additionally use Flash-Attention (Dao et al., 2022; Dao, 2024) and DeepSpeed ZeRO 3 (Rajbhandari et al., 2020) to save memory.
ToolGen and ToolLlama follow different paradigms to complete tasks: ToolLlama generates thought, action, and parameters in a single turn, while ToolGen separates these steps. For ToolGen, we set a cap of 16 turns, allowing 5 rounds of actions plus one final answer; we compare this with ToolLlama's 6-turn limit.
In addition, we introduce a retry mechanism for all models to prevent premature termination; details are in Appendix D. Specifically, if a model generates a response containing "give up" or "I'm sorry", we prompt it to regenerate the response at a higher temperature.
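A sketch of this retry mechanism is shown below; the marker strings come from the description above, while the temperature schedule is an assumption (the paper defers details to Appendix D).

```python
GIVE_UP_MARKERS = ("give up", "I'm sorry")

def generate_with_retry(generate, prompt, max_retries=3, base_temperature=0.7):
    """Regenerate at increasing temperature when the model terminates prematurely.
    `generate` is a hypothetical sampling wrapper taking a temperature argument."""
    temperature = base_temperature
    response = generate(prompt, temperature=temperature)
    for _ in range(max_retries):
        if not any(marker in response for marker in GIVE_UP_MARKERS):
            break
        temperature += 0.3  # assumed increment, not specified in the paper
        response = generate(prompt, temperature=temperature)
    return response
```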
Metrics
For end-to-end evaluation, we use StableToolBench (Guo et al., 2024), a stable tool-evaluation benchmark that selects solvable queries from ToolBench and uses GPT-4 (OpenAI, 2024) to simulate the outputs of failed tools. We evaluate performance with two metrics: solvable pass rate (SoPR), the percentage of queries successfully solved, and solvable win rate (SoWR), the percentage of answers that outperform the reference model (GPT-3.5 in this study). We also report micro-averaged scores for each category.
Table 4: End-to-end evaluation on unseen instructions in two settings. In the R. setting, GPT-3.5 and ToolLlama use ToolRetriever, while ToolGen uses no external retriever. All SoPR and SoWR evaluations are run three times and reported as means.
Model | SoPR | SoWR | ||||||
---|---|---|---|---|---|---|---|---|
I1 | I2 | I3 | Avg. | I1 | I2 | I3 | Avg. | 
Ground-Truth Tools (G.T.) | ||||||||
GPT-3.5 | 56.60 | 47.80 | 54.64 | 50.91 | - | - | - | - |
ToolLlama-2 | 53.37 | 41.98 | 46.45 | 48.43 | 47.27 | 59.43 | 27.87 | 47.58 |
ToolLlama | 55.93 | 48.27 | 52.19 | 52.78 | 50.31 | 53.77 | 31.15 | 47.88 |
ToolGen | 61.35 | 49.53 | 43.17 | 54.19 | 51.53 | 57.55 | 31.15 | 49.70 |
Using the Retriever (R.) | ||||||||
GPT-3.5 | 51.43 | 41.19 | 34.43 | 45.00 | 53.37 | 53.77 | 37.70 | 50.60 |
ToolLlama-2 | 56.13 | 49.21 | 34.70 | 49.95 | 50.92 | 53.77 | 21.31 | 46.36 |
ToolLlama | 54.60 | 49.96 | 51.37 | 51.55 | 49.08 | 61.32 | 31.15 | 49.70 |
ToolGen | 56.13 | 52.20 | 47.54 | 53.28 | 50.92 | 62.26 | 34.42 | 51.51 |
5.2 Results
Table 4 shows the end-to-end performance of each model in two settings: with ground-truth tools (G.T.) and with a retriever (R.). In the G.T. setting, ToolGen achieves the best average SoPR of 54.19, outperforming GPT-3.5 and ToolLlama, and also the highest SoWR of 49.70. ToolGen likewise leads in the retriever setting, with an average SoPR of 53.28 and SoWR of 51.51. ToolLlama is competitive, outperforming ToolGen on some individual instances. End-to-end ablation studies for ToolGen are presented in Appendix G.
Table 5: End-to-end evaluation of different indexing methods.
Indexing Method | SoPR | SoWR | ||||||
---|---|---|---|---|---|---|---|---|
I1 | I2 | I3 | Avg. | I1 | I2 | I3 | Avg. | 
Numeric | 34.76 | 29.87 | 46.99 | 35.45 | 25.77 | 33.02 | 29.51 | 28.79 |
Hierarchical | 50.20 | 45.60 | 32.79 | 45.50 | 38.04 | 43.40 | 29.51 | 38.18 |
Semantic | 58.79 | 45.28 | 44.81 | 51.87 | 49.69 | 57.55 | 26.23 | 47.88 |
Atomic | 58.08 | 56.13 | 44.81 | 55.00 | 47.85 | 57.55 | 29.51 | 47.58 |
5.3 Comparison of Indexing Methods
As with the retrieval-task comparison of indexing methods (Section 4.3), Table 5 compares the different indexing methods on the end-to-end agent task. In this setting, constrained decoding is removed, allowing the agent to freely generate thoughts, actions, and parameters. The atomic indexing method performs best among the four. We attribute this to the higher hallucination rates of the other methods, as discussed in Section 5.4.
Figure 4: Hallucination rates of different models when generating nonexistent tools. With constrained decoding, ToolGen never generates nonexistent tools. Without this constraint, ToolGen with atomic indexing generates 7% non-tool tokens in the action generation stage, and the hallucination rate is higher with semantic indexing. For ToolLlama and GPT-3.5, hallucination occurs even when five ground-truth tools are provided in the prompt; when no tools are specified in the prompt, ToolLlama generates nonexistent tool names more than 50% of the time.
5.4 Hallucinations
We evaluate the models' hallucination behavior during tool generation in an end-to-end agent scenario. To do so, we input queries formatted consistently with each model's training. Specifically, for ToolGen, we input the query directly and prompt the model to respond according to the ToolGen agent paradigm (i.e., thought, tool, and parameters in order), testing action decoding without the beam search constraint described in Section 3.6. For ToolLlama and GPT-3.5, we input the query along with 5 ground-truth tools. In all settings, we report the proportion of generated tools that do not exist in the dataset, over all tool generation actions. Figure 4 shows the resulting hallucination rates. Despite being given only five ground-truth tools, ToolLlama and GPT-3.5 still generate nonexistent tool names, whereas ToolGen completely avoids hallucination thanks to its constrained decoding design.
6 Conclusion
In this paper, we introduced ToolGen, a framework that unifies tool retrieval and execution in a large language model (LLM) by embedding tool-specific virtual tokens into the model's vocabulary, transforming tool interaction into a generation task. With its three-stage training process, ToolGen enables LLMs to efficiently retrieve and invoke tools in real-world scenarios. This unified approach sets a new benchmark for scalable, efficient AI agents capable of handling large tool libraries. Going forward, ToolGen opens the door to integrating advanced techniques such as chain-of-thought reasoning, reinforcement learning, and ReAct, further enhancing the autonomy and versatility of LLMs in real-world applications.
References
- Asai et al. (2023) Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-RAG: Learning to retrieve, generate, and critique through self-reflection, 2023. URL https://arxiv.org/abs/2310.11511.
- Brohan et al. (2023) Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Lisa Lee, Tsang-Wei Edward Lee, Sergey Levine, Yao Lu, Henryk Michalewski, Igor Mordatch, Karl Pertsch, Kanishka Rao, Krista Reymann, Michael Ryoo, Grecia Salazar, Pannag Sanketi, Pierre Sermanet, Jaspiar Singh, Anikait Singh, Radu Soricut, Huong Tran, Vincent Vanhoucke, Quan Vuong, Ayzaan Wahid, Stefan Welker, Paul Wohlhart, Jialin Wu, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, and Brianna Zitkovich. RT-2: Vision-language-action models transfer web knowledge to robotic control, 2023. URL https://arxiv.org/abs/2307.15818.
- Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.
- Chen et al. (2023a) Baian Chen, Chang Shu, Ehsan Shareghi, Nigel Collier, Karthik Narasimhan, and Shunyu Yao. FireAct: Toward language agent fine-tuning. arXiv preprint arXiv:2310.05915, 2023a.
- Chen et al. (2023b) Jiangui Chen, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Wei Chen, Yixing Fan, and Xueqi Cheng. Continual learning for generative retrieval over dynamic corpora. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, CIKM '23, pp. 306-315, New York, NY, USA, 2023b. Association for Computing Machinery. ISBN 9798400701245. doi: 10.1145/3583780.3614821. URL https://dl.acm.org/doi/10.1145/3583780.3614821.
- Chen et al. (2024a) Junzhi Chen, Juhao Liang, and Benyou Wang. Smurfs: Leveraging multiple proficiency agents with context-efficiency for tool planning, 2024a. URL https://arxiv.org/abs/2405.05955.
- Chen et al. (2024b) Yanfei Chen, Jinsung Yoon, Devendra Singh Sachan, Qingze Wang, Vincent Cohen-Addad, Mohammadhossein Bateni, Chen-Yu Lee, and Tomas Pfister. Re-Invoke: Tool invocation rewriting for zero-shot tool retrieval. arXiv preprint arXiv:2408.01875, 2024b.
- Chen et al. (2024c) Zehui Chen, Kuikun Liu, Qiuchen Wang, Wenwei Zhang, Jiangning Liu, Dahua Lin, Kai Chen, and Feng Zhao. Agent-FLAN: Designing data and methods of effective agent tuning for large language models, 2024c. URL https://arxiv.org/abs/2403.12881.
- Dao (2024) Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024.
- Dao et al. (2022) Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
- Dubey et al. (2024) Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- Gravitas (2023) Gravitas. AutoGPT, 2023. URL https://github.com/Significant-Gravitas/AutoGPT.
- Guo et al. (2024) Zhicheng Guo, Sijie Cheng, Hao Wang, Shihao Liang, Yujia Qin, Peng Li, Zhiyuan Liu, Maosong Sun, and Yang Liu. StableToolBench: Towards stable large-scale benchmarking on tool learning of large language models, 2024. URL https://arxiv.org/abs/2403.07714.
- Hao et al. (2023) Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu. ToolkenGPT: Augmenting frozen language models with massive tools via tool embeddings. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10-16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/8fd1a81c882cd45f64958da6284f4a3f-Abstract-Conference.html.
- Järvelin & Kekäläinen (2002) Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS), 20(4):422-446, 2002.
- Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6769-6781, Online, 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.550. URL https://aclanthology.org/2020.emnlp-main.550.
- Kishore et al. (2023) Varsha Kishore, Chao Wan, Justin Lovelace, Yoav Artzi, and Kilian Q. Weinberger. IncDSI: Incrementally updatable document retrieval. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp. 17122-17134. PMLR, 2023. URL https://proceedings.mlr.press/v202/kishore23a.html.
- Liu et al. (2024) Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating LLMs as agents. In The Twelfth International Conference on Learning Representations, 2024.
- Liu et al. (2023) Zhiwei Liu, Weiran Yao, Jianguo Zhang, Le Xue, Shelby Heinecke, Rithesh Murthy, Yihao Feng, Zeyuan Chen, Juan Carlos Niebles, Devansh Arpit, et al. BOLAA: Benchmarking and orchestrating LLM-augmented autonomous agents. arXiv preprint arXiv:2308.05960, 2023.
- Mehta et al. (2023) Sanket Vaibhav Mehta, Jai Gupta, Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Jinfeng Rao, Marc Najork, Emma Strubell, and Donald Metzler. DSI++: Updating transformer memory with new documents. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 8198-8213, Singapore, 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.510. URL https://aclanthology.org/2023.emnlp-main.510.
- OpenAI (2024) OpenAI. GPT-4 technical report, 2024. URL https://arxiv.org/abs/2303.08774.
- Patil et al. (2023) Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive APIs, 2023. URL https://arxiv.org/abs/2305.15334.
- Qiao et al. (2024) Shuofei Qiao, Ningyu Zhang, Runnan Fang, Yujie Luo, Wangchunshu Zhou, Yuchen Eleanor Jiang, Chengfei Lv, and Huajun Chen. AutoAct: Automatic agent learning from scratch for QA via self-planning, 2024. URL https://arxiv.org/abs/2401.05268.
- Qin et al. (2023) Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. ToolLLM: Facilitating large language models to master 16000+ real-world APIs, 2023. URL https://arxiv.org/abs/2307.16789.
- Qu et al. (2024) Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, and Ji-Rong Wen. Tool learning with large language models: A survey. arXiv preprint arXiv:2405.17935, 2024.
- Rajbhandari et al. (2020) Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models, 2020. URL https://arxiv.org/abs/1910.02054.
- Robertson et al. (2009) Stephen Robertson, Hugo Zaragoza, et al. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333-389, 2009.
- Schick et al. (2023) Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools, 2023. URL https://arxiv.org/abs/2302.04761.
- Shen et al. (2024) Weizhou Shen, Chenliang Li, Hongzhan Chen, Ming Yan, Xiaojun Quan, Hehong Chen, Ji Zhang, and Fei Huang. Small LLMs are weak tool learners: A multi-LLM agent, 2024. URL https://arxiv.org/abs/2401.07324.
- Shinn et al. (2023) Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 8634-8652, 2023.
- Sun et al. (2023) Weiwei Sun, Lingyong Yan, Zheng Chen, Shuaiqiang Wang, Haichao Zhu, Pengjie Ren, Zhumin Chen, Dawei Yin, Maarten de Rijke, and Zhaochun Ren. Learning to tokenize for generative retrieval. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10-16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/91228b942a4528cdae031c1b68b127e8-Abstract-Conference.html.
- Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models, 2023. URL https://arxiv.org/abs/2302.13971.
- Wang et al. (2024) Renxi Wang, Haonan Li, Xudong Han, Yixuan Zhang, and Timothy Baldwin. Learning from failure: Integrating negative examples when fine-tuning large language models as agents. arXiv preprint arXiv:2402.11651, 2024.
- Wang et al. (2022) Yujing Wang, Yingyan Hou, Haonan Wang, Ziming Miao, Shibin Wu, Qi Chen, Yuqing Xia, Chengmin Chi, Guoshuai Zhao, Zheng Liu, Xing Xie, Hao Sun, Weiwei Deng, Qi Zhang, and Mao Yang. A neural corpus indexer for document retrieval. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/a46156bd3579c3b268108ea6aca71d13-Abstract-Conference.html.
- Wei et al. (2023) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903.
- Wu et al. (2024a) Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversation framework. In COLM, 2024a.
- Wu et al. (2024b) Qinzhuo Wu, Wei Liu, Jian Luan, and Bin Wang. ToolPlanner: A tool augmented LLM for multi granularity instructions with path planning and feedback, 2024b. URL https://arxiv.org/abs/2409.14826.
- Xiong et al. (2021) Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul N. Bennett, Junaid Ahmed, and Arnold Overwijk. Approximate nearest neighbor negative contrastive learning for dense text retrieval. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=zeFrfgyZln.
- Xu et al. (2024) Qiancheng Xu, Yongqi Li, Heming Xia, and Wenjie Li. Enhancing tool retrieval with iterative feedback from large language models. arXiv preprint arXiv:2406.17465, 2024.
- Yao et al. (2023) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023.
- Yin et al. (2024) Da Yin, Faeze Brahman, Abhilasha Ravichander, Khyathi Chandu, Kai-Wei Chang, Yejin Choi, and Bill Yuchen Lin. Agent Lumos: Unified and modular training for open-source language agents. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12380-12403, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.670. URL https://aclanthology.org/2024.acl-long.670.
- Zeng et al. (2023) Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, and Jie Tang. AgentTuning: Enabling generalized agent abilities for LLMs, 2023.
A An example of a real tool
Figure 5 shows an example of a real tool. Each tool may contain multiple APIs. The following fields are used in our experiments: "tool_name" is the name of the tool; "tool_description" describes the tool, e.g., what it does. Within each API, "name" is the name of the API; "description" describes the API; "method" is the HTTP method used to call the API; "required_parameters" are the parameters that must be provided when calling the API; and "optional_parameters" can be used to set additional (optional) parameters.
{
    "tool_name": "YouTube Hub",
    "tool_description": "Get details on the number of likes, views, title, thumbnail, etc. for a single video.",
    "home_url": "https://rapidapi.com/itsrohitofficial-XBPdXttOUQ/api/youtube-hub/",
    "host": "youtube-hub.p.rapidapi.com",
    "api_list": [
        {
            "name": "Get Video Details",
            "url": "https://youtube-hub.p.rapidapi.com/",
            "description": "Get all the basic information about a video",
            "required_parameters": [
                {
                    "name": "id",
                    "type": "STRING",
                    "description": "",
                    "default": "fD6SzYIRr4c"
                }
            ],
            "optional_parameters": []
        }
    ]
}
Figure 5: An example of a real tool. This tool contains one API; unnecessary fields have been removed for simplicity.
B Tool virtualization implementation
ToolGen uses a single unique token to represent each tool, which underpins its strength in tool retrieval and invocation. We also implemented other tool indexing methods for comparison, including semantic, numeric, and hierarchical indexing. Below we detail how each type of index is implemented.
Atomic
This is the indexing method used in ToolGen. Unlike the other methods, it represents each tool with a single token and cannot fabricate a tool that does not exist. We combine the tool name and the API name into a single token of the form <<tool name&&API name>>; for example, the tool in Appendix A yields the token <<YouTube Hub&&Get Video Details>>.
Semantic
This indexing maps each tool to the name used in ToolBench, which is itself a combination of the tool name and the API name. The name decomposes into multiple tokens, so the model can perceive its semantic meaning. For the example in Appendix A, the resulting mapping is get_video_details_for_youtube_hub.
Numeric
This indexing maps each tool to a unique number. We first obtain the list of all tools, roughly 47,000 long, and represent each tool with a five-digit number whose digits are separated by spaces. For example, if the tool in Appendix A is the 128th element of the list, we represent it as 0 0 1 2 8. Since the Llama-3 tokenizer groups consecutive digits into single tokens, separating the digits with spaces ensures that each digit is encoded individually.
Hierarchical
This indexing also maps each tool to a number, but unlike numeric indexing it injects structural information into the representation through iterative clustering. In each iteration, we cluster the tools into ten clusters, each assigned a digit from 0 to 9, and repeat this process within each cluster until a cluster contains only one tool. These steps form a clustering tree, and the digits on the path from the root to a leaf represent the tool at that leaf. The tool in Appendix A may thus be assigned a number longer than five digits, such as 0 1 2 2 3 3 3.
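A sketch of the hierarchical index construction described above, using k-means for the ten-way clustering; the paper does not specify the clustering algorithm, and the tool embeddings are assumed to be given.

```python
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_index(tool_ids, embeddings, branching=10, prefix=()):
    """Recursively cluster tools into at most `branching` groups; the digit
    path from the root to a leaf becomes that tool's index string."""
    if len(tool_ids) == 1:
        return {tool_ids[0]: " ".join(map(str, prefix))}
    k = min(branching, len(tool_ids))
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(embeddings)
    index = {}
    for c in range(k):
        mask = labels == c
        members = [t for t, keep in zip(tool_ids, mask) if keep]
        if not members:                       # skip the rare empty cluster
            continue
        index.update(hierarchical_index(members, embeddings[mask],
                                        branching, prefix + (c,)))
    return index
```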
C Constrained beam search details
When retrieving tools and completing end-to-end agent tasks, we use constrained beam search to restrict the generated actions to valid tool tokens. The detailed steps are described in Algorithm 1. The basic idea is to restrict the search space at each beam search step. To do this, we first construct a prefix tree (trie) in which each node represents a tool token ID and the children of a node are all feasible IDs that may follow it. Using this tree, we can determine every possible next token ID from the IDs decoded so far; at each beam search step, we mask out the logits of all other, infeasible tokens, forcing only feasible IDs to be sampled or searched.
For retrieval, this constraint can be applied directly during generation. For the end-to-end agent task, since we decompose each reasoning step into three dialogue turns, we can easily detect when ToolGen needs to generate an action and apply the constraint there. Figure 6 shows an end-to-end inference example for ToolGen: no candidate tools are given for ToolGen to choose from; it simply generates the tool token and completes the task.
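A sketch of the constraint using HuggingFace's `prefix_allowed_tokens_fn` hook, which masks the logits of all tokens not allowed at each step. With atomic indexing every tool is a single token, so the trie has depth one, but the same code covers multi-token indices; the EOS fallback is an assumption.

```python
def build_trie(tool_token_id_sequences):
    """Prefix tree over valid tool-token id sequences: each node maps a
    token id to its child node, so a node's keys are the feasible next ids."""
    root = {}
    for seq in tool_token_id_sequences:
        node = root
        for tid in seq:
            node = node.setdefault(tid, {})
    return root

def make_prefix_allowed_tokens_fn(trie, prompt_len, eos_id):
    """Adapter for model.generate(..., prefix_allowed_tokens_fn=...)."""
    def allowed(batch_id, input_ids):
        node = trie
        for tid in input_ids[prompt_len:].tolist():  # walk the generated suffix
            node = node.get(tid)
            if node is None:                         # off-trie: force stop
                return [eos_id]
        return list(node.keys()) or [eos_id]         # complete index -> stop
    return allowed

# Usage sketch with beam search:
# fn = make_prefix_allowed_tokens_fn(trie, input_ids.shape[1], tokenizer.eos_token_id)
# out = model.generate(input_ids, num_beams=5, prefix_allowed_tokens_fn=fn)
```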
Table 6: Statistics of the datasets for the three training stages. For tool memorization there are some duplicates, so the sample count is slightly larger than the number of tools used.
Dataset | Tool memorization | Retrieval training | End-to-end agent tuning | |||
---|---|---|---|---|---|---|
I1 | I2 | I3 | Full | ||
#num | 49,936 | 194,086 | 222,783 | 72,833 | 489,702 | 183,336 |
SYSTEM: You are an AutoGPT, capable of utilizing numerous tools and functions to complete the given task.
1. First, I will provide you with a description of the task you are about to begin.
2. At each step, you will determine the next action by generating an action token.
3. After the token, you will receive the documentation of the action corresponding to that token. You will need to generate the input for that action, which transitions you to a new state. You will then decide the next step and repeat the process.
4. After several iterations of generating actions and inputs, you will finally complete the task and provide your final answer.
... ...
User: I recently visited Viva City and was impressed by the menu. I would like to know more about the menu items of venue 5a76c927bb5d660001e09731. Can you provide the menu categories, prices, availability, rules, set menu items, and other relevant details? Please provide the menu details in the language 'zh-hans'.
Assistant: I need to call the <<Viva City Documentation&&MenuDetails>> function to retrieve the menu details for venue 5a76c927bb5d660001e09731. The reason for calling this function is to provide the user with the menu categories, prices, availability, rules, set menu items, and other relevant details. The function is called with the parameter "lang" set to "zh-hans" and "is_id" set to "5a76c927bb5d660001e09731" to ensure that the menu details are in the desired language and specific to the venue.
: Generate action.
: <>
: Give the input. Here is the documentation: {'name': '<>', 'description': 'Use KV to get venue details (need to re-sync to get the latest data) and filter the dynamic multi-language data based on query parameters.' , 'required': [{'name': 'id', 'type': 'STRING', 'description': '', 'default': '56c467cb7b0d464811a42017'}, {'name': 'lang', 'type': 'STRING', ' description': '', 'default': 'zh-hans'}], 'optional': []}
::{\n "is_id": "5a76c927bb5d660001e09731",\n "lang": "zh-hans"\n}
: {'error': '', 'response': {'menu_id': '5a76c927bb5d660001e09731', 'language': 'zh-hans', 'menu_details': {'sections': [{'section_name': ' Appetizers', 'items': [{'item_name': 'Spring Rolls', 'description': 'Crispy spring rolls with special sauce', 'price': 12.99}, {'item_name': 'Chicken Skewers', 'description': 'Spicy chicken kebabs, crispy on the outside and tender on the inside', 'price'. 15.99}]}, {'section_name': 'Main Dishes', 'items': [{'item_name': 'Kung Pao Chicken', 'description': 'Classic Szechuan dish, spicy and flavorful', 'price': 18.99}, {'item_name': 'Steamed Seabass', ' description': 'Fresh and light, nutritious and healthy', 'price': 22.99}]}], 'special_notes': 'Vegetarian options are available on this menu, please inform your server.'}}
: Venue 5a76c927bb5d660001e09731 The menu details are as follows:\n- Menu Category: Chinese Food\nPrice: \n - Item 1: 12.99 CNY\n - Item 2: 15.99 CNY\n - Item 3: 18.99 CNY\n - Item 4: 22.99 CNY\nAvailability: \n - Item 1: Available\n - Item Item 2: Available\n - Item 3: Available\n - Item 4: Available\n- Rules: \n - Item 1: Specific rules not mentioned\n - Item 2: Specific rules not mentioned\n - Item 3: Specific rules not mentioned\n - Item 4: Specific rules not mentioned\n- Package Items: \n - Item 1: 2 Packages\n - Item 2: 2 Packages\n - Item 3: 2 Packages\n - Item 4 : 2 Set Meals\n- Other relevant details:\n - Special Note: Menus are subject to change and prices are subject to change. Please check with the venue for the latest information. \n\nPlease let me know if there is anything else I can help you with!
: Generate action.
: <>
: Please give input. Here is the documentation: {'required': [], 'optional': [{'name': 'give_answer', 'description': 'Output answer', 'type': 'string'}, {'name': 'give_up_and_restart', ' description': 'Unable to process task from this step', 'type': 'string'}]}
: {"return_type": "give_answer", "final_answer": "The menu details for Venue 5a76c927bb5d660001e09731 are as follows:\n- Menu Category: Chinese Cuisine\n- Price: \n - Item 1: 12.99 CNY\n - Item 2: 15.99 CNY\n- Item 3: 18.99 CNY\n - Item 4: 22.99 CNY\n- Availability: \n - Item 1: Available \n - Item 2: Available \n - Item 3: Available \n - Item 4: Available \n- Rules: \n - Item 1: Specific rule not mentioned \n - Item 2: Specific rule not mentioned \n - Item 3: Specific rule not mentioned \n - Item 4: Specific rule not mentioned Rules\n- Set menu items: \n - Item 1: 2 set menus\n - Item 2: 2 set menus\n - Item 3: 2 set menus\n - Item 4: 2 set menus\n- Other relevant details: \n - Special Note: Menus are subject to change and prices are subject to change. Please check with the venue for the latest information."}
Figure 6: Example of inference with ToolGen. First a system prompt is given, containing no associated tools. The user then gives a task query; ToolGen generates a thought, and we use the user role to prompt the model to generate an action. After the action is generated, we again use the user role to provide the corresponding tool documentation, from which the model generates the tool input.
E Ablation study
Table 7 shows the ablation results for the end-to-end evaluation. For unseen instructions, the ToolGen Agent performs slightly better without tool memorization or retrieval training. However, for unseen tools and categories, both SoPR and SoWR drop without the first two training stages. This suggests that the first two stages contribute to ToolGen's generalization ability, and that retrieval training matters more than tool memorization.
Table 7: Ablation results for the end-to-end evaluation of ToolGen. Inst. denotes queries (instructions) unseen during training, while Tool. and Cat. denote tools and categories unseen during training.
| Model | SoPR | | | | SoWR | | | |
|---|---|---|---|---|---|---|---|---|
| | I1-Inst. | I2-Inst. | I3-Inst. | Avg. | I1-Inst. | I2-Inst. | I3-Inst. | Avg. |
| ToolGen | 54.60 | 52.36 | 43.44 | 51.82 | 50.31 | 54.72 | 26.23 | 47.28 |
| w/o retrieval training | 56.95 | 46.70 | 50.27 | 52.42 | 49.69 | 50.94 | 34.43 | 47.27 |
| w/o memorization | 56.03 | 47.96 | 57.38 | 53.69 | 49.08 | 59.43 | 34.43 | 49.70 |
| | I1-Tool. | I1-Cat. | I2-Cat. | Avg. | I1-Tool. | I1-Cat. | I2-Cat. | Avg. |
| ToolGen | 56.54 | 49.46 | 51.96 | 52.66 | 40.51 | 39.87 | 37.90 | 39.53 |
| w/o retrieval training | 49.47 | 40.31 | 37.90 | 42.84 | 36.71 | 30.07 | 36.29 | 34.18 |
| w/o memorization | 58.86 | 46.19 | 49.87 | 51.70 | 37.34 | 38.56 | 42.74 | 39.32 |
F Generalization capability
For the ToolGen Agent, we measure the model's performance on queries involving tools not seen during training. Table 8 shows the end-to-end evaluation on unseen tools. The performance of the ToolGen Agent is lower than that of ToolLlama, which also indicates weaker generalization when completing the full task. The generalization problem is prevalent in generative retrieval and is beyond the scope of this paper, so we leave it for future research.
Table 8: Generalization results for ToolGen. We tested ToolGen against other models on queries involving tools unseen during training. GT. means the ground-truth tools are provided in the prompt; Retrieval means the tools are obtained from a retriever.
| Model | Setting | SoPR | | | | SoWR | | | |
|---|---|---|---|---|---|---|---|---|---|
| | | I1-Tool. | I1-Cat. | I2-Cat. | Avg. | I1-Tool. | I1-Cat. | I2-Cat. | Avg. |
| GPT-3.5 | GT. | 58.90 | 60.70 | 54.60 | 58.07 | - | - | - | - |
| ToolLlama | GT. | 57.38 | 58.61 | 56.85 | 57.68 | 43.04 | 50.31 | 54.84 | 49.04 |
| ToolGen | GT. | 52.32 | 40.46 | 39.65 | 47.67 | 39.24 | 38.56 | 40.32 | 39.30 |
| GPT-3.5 | Retrieval | 57.59 | 53.05 | 46.51 | 52.78 | 46.20 | 54.25 | 54.81 | 51.58 |
| ToolLlama | Retrieval | 57.70 | 61.76 | 45.43 | 54.96 | 48.73 | 50.98 | 44.35 | 48.30 |
| ToolGen | - | 56.54 | 49.46 | 51.96 | 52.66 | 40.51 | 39.87 | 37.90 | 39.53 |
G Adapting ToolBench data to ToolGen
Our ToolGen data is adapted and transformed from ToolBench data. Specifically, we use tool documents as the data for tool memorization training, where the input is a tool document and the output is the corresponding tool token.
For retrieval training, we use the data annotated for tool retrieval in ToolBench, where each query is annotated with multiple relevant tools. We take the query as input and convert the relevant tools into virtual tokens, which serve as the outputs for retrieval training.
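A hedged sketch of these two conversions (the field names and the `tool_to_token` mapping are assumptions for illustration, not ToolBench's exact schema):

```python
def memorization_sample(tool_doc: dict, tool_to_token: dict[str, str]) -> dict:
    """Tool memorization: input is the tool document, output is its virtual token."""
    text = f"Tool Name: {tool_doc['name']} Tool Description: {tool_doc['description']}"
    return {"user": text, "assistant": tool_to_token[tool_doc["name"]]}

def retrieval_sample(annotation: dict, tool_to_token: dict[str, str]) -> dict:
    """Retrieval training: input is the query, output is the tokens of its relevant tools."""
    tokens = " ".join(tool_to_token[t] for t in annotation["relevant_tools"])
    return {"user": annotation["query"], "assistant": tokens}
```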
For end-to-end agent tuning, we use the interaction trajectories as the source, with the following transformations: (1) Each trajectory contains the available tools in its system prompt, which can be used to solve the query. When completing a task, ToolLlama relies on the tools retrieved into the system prompt, whereas ToolGen can generate tool tokens directly, so we remove the tools from the system prompt. (2) We replace all tool names in the trajectory with the corresponding virtual tool tokens. (3) In the original trajectory, the agent model generates Thought, Action, and Action Input sequentially (also known as ReAct). We decompose each ReAct step into three dialog rounds, as sketched below. In the first round, the agent model generates a Thought, and we use the user role to prompt the model to generate an Action. In the second round, the model generates the action, i.e., the virtual tool token, and we then supply the documentation corresponding to that token so the model knows which parameters to specify. In the third round, the model generates the parameters for the tool.
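The three-round decomposition of a single ReAct step might look like the following sketch (the roles follow Figure 8; the function and its arguments are illustrative):

```python
def react_step_to_rounds(thought: str, tool_token: str,
                         tool_doc: str, tool_input: str) -> list[dict]:
    """Split one (Thought, Action, Action Input) step into three dialog rounds."""
    return [
        # Round 1: the model produces the Thought; the user prompts for an action.
        {"role": "assistant", "content": thought},
        {"role": "user", "content": "Generate the action."},
        # Round 2: the model emits the virtual tool token; the user then supplies
        # the matching documentation so the model knows which parameters to specify.
        {"role": "assistant", "content": tool_token},
        {"role": "user", "content": f"Please give the input. Here is the documentation: {tool_doc}"},
        # Round 3: the model generates the tool-call parameters.
        {"role": "assistant", "content": tool_input},
    ]
```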
The number of samples in each dataset is shown in Table 6. Samples for tool memorization and retrieval training are shown in Figure 7, and samples for end-to-end agent tuning are shown in Figure 8.
# Tool Memorization
User: Tool Name: Thai Drivers License OCR Tool Description: Extracts information from Thai driver's licenses and returns text results, such as the driver's license number and personal information. API Name: Driver's License API Description: Extracts information from Thai driver's licenses and returns text results, such as the driver's license number and personal information.
Assistant: <<Thai Drivers License OCR&&Driver's License>>
# Retrieval Training
User: My friend and I are organizing a hackathon on "Web Development" and "Mobile Application Development". We need some inspiration and guidance. Can you get popular stories on these topics from Medium.com?
Assistant: <>
Figure 7: Example data for tool memorization and retrieval training. We use the user role to represent inputs and the assistant role to represent outputs.
# End-to-End Agent Tuning
System:
You are an AutoGPT and are able to utilize numerous tools and features to complete the given task.
1. First, I will provide you with a task description, and your task will begin.
2. At each step, you determine the next action by generating an action token.
3. After the token, you will receive the documentation for the action corresponding to that token. You need to generate the input for the action, which will help you move to a new state. You will then decide on the next step and repeat the process.
4. After several iterations of generating multiple actions and inputs, you will eventually complete the task and provide your final answer.
Remember:
1. State changes are irreversible, and you cannot return to a previous state.
2. Keep your actions simple and limited to what best suits the current query.
3. You can make multiple attempts. If you plan to keep trying different conditions, try one condition at a time.
4. If you think you have gathered enough information, generate the action "<> with argument give_answer" to provide your answer to the task.
5. If you feel that you cannot process the task at this step, generate the action "<> with argument give_up_and_restart".
Let's get started!
TASK DESCRIPTION: You should use actions to process real-time user queries. Remember:
1. Always generate "<>" at the end of the task. The final answer should contain enough information to present to the user. If you are unable to process the task, or find that the actions always fail (the function is not currently working), use the action <> with give_up_and_restart.
2. Generate only actions and inputs.
User:
Can you get the address for the postcode 'PL11DN'? I would like to know the number of items found, district, sub-district, county, country and geocode details (eastings, northings, latitude and longitude).
Start!
Assistant:
Based on the user's query, they would like to get the address for the postcode 'PL11DN' and retrieve specific details such as the number of items found, district, sub-district, county, country, and geocode details. The current API call is to the function 'find_by_postcode' with the parameter 'postcode' set to 'PL11DN'. This API is useful because it can provide addresses based on a given UK postcode. By using this API, I will be able to retrieve the information required by the user's query.
User:
Generate the action.
Assistant:
<>
User:
Give the input. Here is the documentation: {'name': '<>', 'description': 'The "Find By PostCode" API gets addresses for a given UK postcode. This API provides a JSON interface to search for addresses by UK postcode.', 'required': [{'name': 'postcode', 'type': 'string', 'description': '', 'default': 'PL11DN'}], 'optional': []}
Assistant:
{
"postcode": "PL11DN"
}
Tool:
{ "error": "", "response": "{'status': 200, 'message': 'Success', 'noOfItems': 37, 'result': {'postcode': 'PL1 1DN', 'district': 'Plymouth', 'district': 'St. Peter's and Waterfront ', 'county': 'Devon', 'country': 'England', 'geocode': {'eastings': '248025', 'northings': '0054420', 'lattitude': '50.370001', 'longitude': '- 4.138397'}, 'addresses': [{'po_box': '', 'line_2': '', 'postcode_type': 'S', 'line_3': '', 'su_organization_indicator': ' ', 'building_number'. '5', 'udprn': '18911184', 'thoroughfare': 'St. Andrew's Cross', 'line_1': '5 St. Andrew's Cross', 'postcode_outcode': 'PL1', 'postcode_compact': 'PL11DN', 'dependant_ thoroughfare': '', 'premise': '5', 'post_town': 'Plymouth', 'building_name': '', 'organization_name': '', 'postcode': 'PL1 1DN', 'department_name'. '', 'postcode_incode': '1DN', 'address': '5 St. Andrew's Cross, Plymouth, PL1 1DN', 'double_dependant_locality': '', 'sub_building_name': '', 'dependent_ locality': '', 'delivery_point_suffix': '4T'}, {'po_box': '', 'line_2': '2 St. Andrew's Cross', 'p...
User:
Generate the action.
Assistant:
<<Finish>>
User:
Please give the input. Here is the documentation: {'required': [], 'optional': [{'name': 'give_answer', 'description': 'Output answer', 'type': 'string'}, {'name': 'give_up_and_restart', 'description': 'Unable to process the task from this step', 'type': 'string'}]}
Assistant:
{
"return_type": "give_answer",
"final_answer": "Number of items found: 37\nDistrict: Plymouth\nSub-district: St. Peter's and the Waterfront\nCounty: Devon\nCountry: England\nGeocode details:\n- Eastings: 248025\n- Northings: 0054420\n- Latitude: 50.370001\n- Longitude: -4.138397"
}
Figure 8: Example of end-to-end agent tuning.