
Evaluating Reasoning Models in Modular RAG Systems

This article summarizes Kapa.ai's recent exploration of using OpenAI's o3-mini reasoning model inside a Retrieval-Augmented Generation (RAG) system.



Kapa.ai is an AI assistant powered by a large language model (LLM). It integrates a RAG pipeline with a knowledge base so that it can answer users' technical questions and process technical support tickets.

Building and maintaining a stable and versatile RAG system is not an easy task. Many parameters and settings affect the quality of the final output, and there are complex interactions between these factors:

  • Prompt templates
  • Context size
  • Query expansion
  • Chunking
  • Reranking
  • And so on

When making adjustments to a RAG system, especially when integrating a new model, revisiting and optimizing these parameters is critical to maintaining good performance. However, this task is not only time-consuming, but also requires a great deal of experience to do well.

Newer reasoning models such as DeepSeek-R1 and OpenAI's o3-mini have achieved impressive results by using built-in Chain-of-Thought (CoT) prompting to "think" about a problem, reason step by step, and self-correct when necessary. These models reportedly perform better on complex tasks that require logical reasoning and verifiable answers. Related reading: DeepSeek R1 in RAG: Practical Experience Summary; Project-level code generation results are in! o3/Claude 3.7 leads the way, R1 is in the top tier!

This led Kapa.ai to an idea: if reasoning models can decompose complex problems and self-correct, could they be applied to the RAG process to handle tasks such as query expansion, document retrieval, and reranking? By building an information retrieval toolkit and handing it to the reasoning model, it might be possible to build a more adaptive system that reduces the need for manual parameter tuning.

This paradigm is sometimes called Modular Retrieval-Augmented Generation (Modular RAG). In this article, we share Kapa.ai's recent findings from refactoring the standard RAG process into a reasoning-model-driven process.

 

Hypothesis

Kapa.ai's main goal in exploring this idea was to simplify the RAG process and reduce reliance on manual parameter tuning. The core components of the RAG process are dense embeddings and document retrieval. A typical high-level RAG process looks like this:

  1. Receive the user prompt.
  2. Preprocess the query to improve information retrieval.
  3. Find relevant documents via similarity search in a vector database.
  4. Rerank the results and keep the most relevant documents.
  5. Generate a response.


Each step in the process is optimized through heuristics such as filtering rules and sorting adjustments to prioritize relevant data. These hard-coded optimizations define the behavior of the process but also limit its adaptability.
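For orientation, here is a minimal sketch of what such a fixed pipeline with hard-coded heuristics can look like. The helper functions, the score threshold, and the "official_docs" boost are illustrative assumptions, not Kapa.ai's actual implementation.

```python
# Minimal sketch of a fixed, linear RAG pipeline with hard-coded heuristics.
# The retriever, reranker, and generator below are hypothetical stubs; they
# only show where each step and each heuristic sits in the flow.

def expand_query(prompt: str) -> str:
    return prompt  # placeholder: real systems add synonyms, strip noise, etc.

def vector_search(query: str, top_k: int = 50) -> list[dict]:
    return []      # placeholder: similarity search against the vector database

def rerank(query: str, docs: list[dict]) -> list[dict]:
    return docs    # placeholder: cross-encoder or other reranking model

def generate_answer(prompt: str, context: list[dict]) -> str:
    return "..."   # placeholder: LLM call with the retrieved context

def answer_query(user_prompt: str) -> str:
    query = expand_query(user_prompt)                       # 2. preprocess the query
    candidates = vector_search(query, top_k=50)             # 3. similarity search
    candidates = [d for d in candidates if d.get("score", 1.0) > 0.2]  # filtering rule
    for d in candidates:
        if d.get("source") == "official_docs":
            d["score"] = d.get("score", 1.0) * 1.2           # hard-coded boost
    context = rerank(query, candidates)[:8]                  # 4. rerank, keep the best
    return generate_answer(user_prompt, context)             # 5. generate the response
```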

For the reasoning model to make use of the various components of the RAG process, Kapa.ai needed to set up the system differently. Instead of defining a linear sequence of steps, each component is treated as a separate module that the model can call.


In this architecture, instead of following a fixed process, models with reasoning capabilities can dynamically control their workflow to a greater extent. By utilizing tools, models can decide when and how often to run full or simplified searches, and what search parameters to use. If successful, this approach has the potential to replace traditional RAG orchestration frameworks such as LangGraph.
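As a rough illustration of this idea, the sketch below exposes two hypothetical retrieval modules to a reasoning model through OpenAI-style function calling and lets the model decide which tool to invoke and with what arguments. The tool names, schemas, and stub implementations are assumptions for illustration only, not Kapa.ai's code.

```python
# Sketch: exposing retrieval components as tools a reasoning model can call.
# Tool names, schemas, and stub implementations are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()


def search_documents(query: str, top_k: int = 20) -> list[dict]:
    """Stub standing in for a similarity search against the vector database."""
    return [{"id": "doc-1", "text": "..."}]  # placeholder result


def rerank_documents(query: str, document_ids: list[str]) -> list[str]:
    """Stub standing in for a reranking module."""
    return document_ids  # placeholder: return the ids unchanged


TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_documents",
            "description": "Similarity search over the product knowledge base.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "top_k": {"type": "integer"},
                },
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "rerank_documents",
            "description": "Rerank previously retrieved documents for a query.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "document_ids": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["query", "document_ids"],
            },
        },
    },
]

DISPATCH = {"search_documents": search_documents, "rerank_documents": rerank_documents}

# The model, not a fixed pipeline, decides which tool to call and with what arguments;
# the application simply dispatches each requested call.
response = client.chat.completions.create(
    model="o3-mini",
    messages=[{"role": "user", "content": "How do I rotate my API key?"}],
    tools=TOOLS,
)
for call in response.choices[0].message.tool_calls or []:
    result = DISPATCH[call.function.name](**json.loads(call.function.arguments))
```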

In addition, a more modular system offers some additional benefits:

  • Individual modules can be replaced or upgraded without completely revamping the entire process.
  • Clearer segregation of duties makes debugging and testing easier to manage.
  • Different modules (e.g., retrievers with different embeddings) can be tested and replaced to compare performance.
  • Modules can be extended independently for different data sources.
  • It could potentially allow Kapa.ai to build different modules customized for specific tasks or domains.

Finally, Kapa.ai also wanted to explore whether this approach could help "short-circuit" abusive or off-topic queries more effectively. The most difficult cases usually involve ambiguity, i.e., it is not clear whether the query is relevant to the product, and abusive queries are often deliberately designed to evade detection. While simpler cases can already be handled effectively, Kapa.ai hoped the reasoning model would help identify these harder cases and exit from them earlier.
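One way to picture such a short-circuit, purely as a sketch, is to give the model an explicit "exit" tool it can call instead of running retrieval. The tool name, schema, and refusal text below are hypothetical.

```python
# Sketch: an early-exit tool the reasoning model can call for abusive or
# clearly off-topic queries, instead of invoking any retrieval tools.
# The name, schema, and refusal message are illustrative assumptions.
OFF_TOPIC_TOOL = {
    "type": "function",
    "function": {
        "name": "end_conversation_off_topic",
        "description": (
            "Call this when the user's question is abusive or clearly unrelated "
            "to the product, instead of running any retrieval tools."
        ),
        "parameters": {
            "type": "object",
            "properties": {"reason": {"type": "string"}},
            "required": ["reason"],
        },
    },
}


def end_conversation_off_topic(reason: str) -> str:
    # Return a fixed refusal instead of continuing with retrieval and generation.
    return "This assistant only answers questions about the product documentation."
```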

 

Test setup

To experiment with this workflow, Kapa.ai built a sandboxed RAG system containing the necessary components, static data, and an LLM-as-judge evaluation suite. In one configuration, Kapa.ai used a typical fixed linear process with hard-coded optimizations built in.
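For context, an LLM-as-judge harness can be as simple as the following sketch. The judge prompt, the choice of grading model, and the 1-5 scale are assumptions, not Kapa.ai's actual evaluation suite.

```python
# Minimal LLM-as-judge sketch: score a generated answer against a reference.
# The judge prompt, grading model, and 1-5 scale are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Rate factual accuracy from 1 (wrong) to 5 (fully correct). Reply with the number only."""


def judge(question: str, reference: str, candidate: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o",  # any capable model could serve as the judge here
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, reference=reference, candidate=candidate
            ),
        }],
    )
    # The prompt asks for a bare number, so parse it directly.
    return int(response.choices[0].message.content.strip())
```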

For the modular RAG process, Kapa.ai used o3-mini as the reasoning model and ran various configurations of the process under different policies to evaluate which approaches worked and which did not:

  • Tool use: Kapa.ai tried giving the model full access to all tools across the entire process, and also tried restricting tool use to a single tool combined with a fixed linear process.
  • Prompting and parameterization: Kapa.ai tested both open-ended prompts with minimal instructions and highly structured prompts. Kapa.ai also experimented with varying degrees of pre-parameterized tool calls, rather than letting the model choose its own parameters.

In all tests conducted by Kapa.ai, the number of tool calls was capped at 20: for any given query, the model was allowed at most 20 tool calls. Kapa.ai also ran all tests at medium and high reasoning effort:

  • Medium: shorter Chain-of-Thought (CoT) steps
  • High: longer CoT steps with more detailed reasoning
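As a sketch of how these two constraints could be wired up with the OpenAI SDK: `reasoning_effort` is the parameter the API exposes for o-series models such as o3-mini, while the 20-call budget is enforced in the application loop. The loop structure, dispatch table, and helper names are assumptions.

```python
# Sketch: medium/high reasoning effort plus a hard cap of 20 tool calls per query.
# The loop structure and dispatch dict are assumptions for illustration.
import json
from openai import OpenAI

client = OpenAI()
MAX_TOOL_CALLS = 20  # hard cap used for every query in these experiments


def run_query(messages: list, tools: list, dispatch: dict, effort: str = "medium") -> str:
    calls_used = 0
    while True:
        response = client.chat.completions.create(
            model="o3-mini",
            reasoning_effort=effort,  # "medium" or "high" in these tests
            messages=messages,
            tools=tools,
        )
        message = response.choices[0].message
        if not message.tool_calls:
            return message.content  # final answer, no further tool use
        messages.append(message)
        for call in message.tool_calls:
            if calls_used >= MAX_TOOL_CALLS:
                return "Tool-call budget exhausted."  # stop after 20 calls
            calls_used += 1
            result = dispatch[call.function.name](**json.loads(call.function.arguments))
            messages.append(
                {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)}
            )
```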

In total, Kapa.ai conducted 58 evaluations of different modular RAG configurations.

 

Results

The experimental results were mixed. In some configurations, Kapa.ai observed modest improvements, most notably in code generation and, to a limited extent, factual accuracy. However, key metrics such as information retrieval quality and knowledge extraction remained largely unchanged compared to Kapa.ai's traditional, manually tuned workflow.

A recurring issue throughout testing was that Chain-of-Thought (CoT) reasoning adds latency. While deeper reasoning allows the model to decompose complex queries and self-correct, it comes at the cost of the extra time required for iterative tool calls.

The biggest challenge Kapa.ai identified is the "reasoning ≠ experience fallacy": the reasoning model, despite being able to think step by step, lacks prior experience with the retrieval tools. Even with carefully structured prompts, it struggled to retrieve high-quality results and to distinguish good outputs from bad ones. The model was often hesitant to use the tools Kapa.ai provided, much like the experiments Kapa.ai conducted last year with the o1 model. This highlights a broader problem: reasoning models excel at abstract problem solving, but using tools well without prior training remains an open challenge.

 

Main findings

  • The experiments revealed a clear "reasoning ≠ experience fallacy": the reasoning model does not inherently "understand" the retrieval tools. It understands what a tool is for, but not how to use it well; that is the kind of tacit knowledge a human builds up by actually using the tool. Unlike traditional processes, where that experience is encoded in heuristics and optimizations, the reasoning model must be explicitly taught how to use the tools effectively.
  • Although o3-mini can handle larger contexts, Kapa.ai observed no significant improvement over models such as 4o or Sonnet in knowledge extraction. Simply increasing the context size is not a panacea for improving retrieval performance.
  • Increasing the model's reasoning effort only slightly improved factual accuracy. Kapa.ai's dataset focuses on technical content drawn from real-world use cases, rather than math contest problems or advanced coding challenges. The impact of reasoning effort may vary by domain and could produce different results on datasets with more structured or computationally complex queries.
  • One area in which the model does excel is code generation, suggesting that inference models may be particularly well suited to domains that require structured, logical output rather than pure retrieval.
  • Reasoning models do not come with tool-specific knowledge out of the box.

 

The reasoning ≠ experience fallacy

The key conclusion of the experiments is that reasoning models do not naturally possess tool-specific knowledge. Unlike a carefully tuned RAG process, which encodes retrieval logic into predefined steps, a reasoning model approaches each retrieval call from scratch. This leads to inefficiency, indecision, and sub-optimal tool usage.

To mitigate this, several strategies could be considered. Further refinement of the prompting strategy, i.e., writing tool-specific instructions that give the model more explicit guidance, may help. Pre-training or fine-tuning the model for tool use could also familiarize it with specific retrieval mechanisms.

In addition, a hybrid approach could be considered, in which predefined heuristics handle certain tasks and the reasoning model intervenes selectively when needed.
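A minimal sketch of that hybrid routing idea, assuming a cheap complexity heuristic and two placeholder pipelines (`linear_rag_pipeline` and `modular_reasoning_rag` are hypothetical names):

```python
# Sketch of a hybrid router: predefined heuristics handle routine queries, and
# the reasoning-model-driven modular flow is invoked only for complex ones.
# The routing heuristic and both pipeline functions are placeholders.

def linear_rag_pipeline(query: str) -> str:
    return "answer from the fixed pipeline"          # placeholder

def modular_reasoning_rag(query: str) -> str:
    return "answer from the reasoning-model flow"    # placeholder

def looks_complex(query: str) -> bool:
    # Crude illustrative heuristic: long, multi-part, or code-heavy questions.
    return len(query.split()) > 60 or "```" in query or query.count("?") > 1

def answer(query: str) -> str:
    if looks_complex(query):
        return modular_reasoning_rag(query)   # reasoning model orchestrates the tools
    return linear_rag_pipeline(query)         # fast, hard-coded default path
```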

These ideas are still speculative, but they point to ways of bridging the gap between reasoning ability and effective tool use in practice.

 

Summary

While the modular, reasoning-based RAG did not show significant advantages over the traditional process for Kapa.ai's use cases, the experiment did provide valuable insight into its potential and limitations. The flexibility of a modular approach remains attractive: it can improve adaptability, simplify upgrades, and adjust dynamically to new models or data sources.

Looking ahead, a number of promising technologies deserve further exploration:

  • Use different prompting strategies and pre-training/fine-tuning to improve how the model understands and interacts with retrieval tools.
  • Use reasoning models strategically in certain parts of the process, e.g., for specific use cases or tasks such as complex question answering or code generation, rather than orchestrating the entire workflow.

At this stage, reasoning models like o3-mini have not outperformed traditional RAG processes for core retrieval tasks within reasonable time constraints. As models advance and tool usage strategies evolve, modular reasoning-based RAG systems may become a viable alternative, especially for domains requiring dynamic, logic-intensive workflows.
