Introduction
In the application domain of Large Language Models (LLMs), and especially in retrieval-augmented generation (RAG) systems, text chunking plays a crucial role. The quality of text chunking directly determines how useful the retrieved context is, which in turn affects the accuracy and completeness of the answers the LLM generates. Traditional text chunking methods, such as fixed-size character chunking and recursive text splitting, have inherent limitations: they may cut in the middle of a sentence or semantic unit, causing context loss and semantic incoherence. In this article, we delve into a more intelligent chunking strategy, Agentic Chunking. This approach aims to mimic the human judgment process to create semantically coherent text chunks, thereby significantly improving the performance of RAG systems. Detailed code examples are provided to help readers get started.
What is Agentic Chunking?
Agentic Chunking is an LLM-based chunking method that simulates human understanding and judgment during text segmentation, aiming to generate semantically coherent text chunks. The core idea is to focus on the "agentic" elements of the text, such as people and organizations, and to aggregate the sentences related to these elements into meaningful semantic units.
Core Idea: The essence of Agentic Chunking is that it does not rely simply on character counts or predefined separators to split text. Instead, it leverages the LLM's semantic understanding to combine closely related sentences into chunks, even when those sentences are not contiguous in the original text. This captures the intrinsic structure and semantic associations of the text more accurately. A small illustration follows.
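To make this concrete, here is a tiny hand-made illustration; the sentences and the grouping below are invented for this example (not produced by any model) and simply show how non-contiguous sentences can land in the same chunk:

# Illustrative only: invented sentences, hand-labeled grouping.
sentences = [
    "Apollo 11 launched on July 16, 1969.",              # topic: the mission
    "Neil Armstrong was born in Ohio in 1930.",          # topic: Armstrong's life
    "The mission landed on the Moon four days later.",   # topic: the mission
    "Armstrong earned his pilot's license at sixteen.",  # topic: Armstrong's life
]

# A position-based splitter would keep sentences 1+2 and 3+4 together.
# Agentic Chunking instead groups by topic, even across positional gaps:
agentic_chunks = {
    "Apollo 11 mission timeline": [sentences[0], sentences[2]],
    "Neil Armstrong's early life": [sentences[1], sentences[3]],
}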
Why do you need Agentic Chunking?
Traditional text chunking methods have limitations that are hard to ignore (the sketch after this list illustrates the truncation problem):
- Fixed-Size Character Chunking:
- This approach mechanically splits the text into blocks of a predefined fixed length. It may cut in the middle of a sentence, or even in the middle of a word, severely damaging the semantic integrity of the text.
- It completely ignores the intrinsic structure of the document, such as headings, lists, etc., resulting in chunking results that are disconnected from the logical structure of the document.
- Arbitrary segmentation may also mix otherwise unrelated topics in the same chunk, further impairing contextual coherence.
- Recursive Text Splitting:
- Recursive text splitting relies on a predefined hierarchy of separators (paragraphs, then sentences, then words) to segment text.
- This approach may not be able to effectively handle complex document structures, such as multi-level headings, tables, etc., resulting in loss of structural information.
- It may still truncate in the middle of semantic units such as paragraphs or bulleted lists, affecting semantic integrity.
- Crucially, recursive splitting lacks any deep understanding of the text's semantics; it relies only on surface structure.
- Semantic Chunking:
- Semantic chunking attempts to group sentences based on the similarity of their embedding vectors, aiming to create semantically relevant chunks.
- However, if sentences within a paragraph differ significantly in semantics, semantic chunking may incorrectly classify these sentences into different chunks, resulting in impaired coherence within the paragraph.
- In addition, semantic chunking usually requires many similarity computations, and the computational cost grows significantly when processing large documents.
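Before moving on, here is a concrete sketch of the truncation problem described above. This is a minimal illustration using LangChain's standard splitters; the sample text and chunk sizes are arbitrary choices for demonstration:

# Minimal illustration of positional chunking; text and sizes are arbitrary.
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter

text = (
    "Agentic Chunking mimics human judgment. "
    "It groups semantically related sentences together. "
    "Traditional splitters instead cut text at arbitrary positions."
)

# Fixed-size chunking: hard cuts every 50 characters, ignoring sentence boundaries.
fixed = CharacterTextSplitter(separator="", chunk_size=50, chunk_overlap=0)
for chunk in fixed.split_text(text):
    print(repr(chunk))  # chunks routinely end mid-word or mid-sentence

# Recursive splitting: tries "\n\n", "\n", " ", "" in order. Boundaries are
# cleaner, but grouping remains positional rather than semantic.
recursive = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=10)
for chunk in recursive.split_text(text):
    print(repr(chunk))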
Agentic Chunking effectively overcomes the limitations of the traditional methods mentioned above through the following advantages:
- Semantic Coherence: Agentic Chunking generates semantically more meaningful chunks of text, which significantly improves the accuracy of retrieving relevant information.
- Context Preservation: It better preserves contextual coherence within blocks of text, which allows LLM to generate more accurate and contextualized responses.
- Flexibility: The Agentic Chunking method demonstrates a high degree of flexibility and is able to adapt to documents of different lengths, structures, and content types for a wider range of applications.
- Robustness: Agentic Chunking includes guardrail and fallback mechanisms that keep chunking effective and stable even with unusually complex document structures or LLM failures.
How Agentic Chunking Works
The workflow of Agentic Chunking consists of the following key steps:
- Mini-Chunk Creation:
- First, Agentic Chunking uses recursive text splitting to break the input document into small mini-chunks, for example limited to about 300 characters each.
- During splitting, it takes care that mini-chunks are not cut in the middle of a sentence, preserving basic semantic integrity.
- Marking Mini-Chunks:
- Next, a unique marker is added to each mini-chunk. These markers help the LLM recognize the boundaries of each mini-chunk in subsequent processing.
- Note that LLMs process text as tokens rather than exact character counts, but they excel at recognizing structural and semantic patterns. The markers let the LLM identify chunk boundaries even though it cannot count characters exactly.
- LLM-Assisted Chunk Grouping:
- The marked document is provided to the LLM together with specific grouping instructions.
- The LLM's task is to analyze the sequence of mini-chunks and combine them into larger, semantically coherent text chunks based on semantic relatedness.
- During grouping, constraints such as the maximum number of mini-chunks per chunk can be set to control chunk size as required.
- Chunk Assembly:
- The mini-chunks grouped by the LLM are concatenated to produce the final output of Agentic Chunking: coherent text chunks.
- To better manage and use these chunks, metadata can be attached to each one, such as the source document and the chunk's index position within it.
- Chunk Overlap for Context Preservation:
- To ensure contextual coherence between chunks, the final chunks usually overlap to some extent with the preceding and following mini-chunks. This overlap helps the LLM better understand contextual information when processing neighboring chunks and avoids information fragmentation.
- Guardrails and Fallback Mechanisms:
- Chunk Size Limit: A maximum chunk size is enforced so that generated chunks always fit within the LLM's input length limit, avoiding problems caused by overly long inputs.
- Context Window Management: For very long documents whose length exceeds the LLM context window limit, Agentic Chunking can intelligently split them into multiple manageable parts and process them in batches to ensure processing efficiency and effectiveness.
- Validation: After chunking is complete, Agentic Chunking runs a validation step to confirm that all mini-chunks have been included in the final chunks, avoiding missing information.
- Fallback to Recursive Chunking: When LLM processing fails or is unavailable for any reason, Agentic Chunking gracefully falls back to traditional recursive text chunking methods, ensuring that basic chunking functionality is provided in all cases.
- Parallel Processing: Agentic Chunking supports parallel processing; multi-threading and similar techniques can significantly speed up chunking, which is especially beneficial for large documents. A minimal sketch of the workflow follows.
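To make these steps concrete, here is a minimal sketch of the marker-based workflow under explicit assumptions: the prompt wording, the <mini-chunk id=...> marker format, the 300-character limit, and the Grouping response schema are illustrative choices of ours, not a fixed specification; only standard LangChain APIs (RecursiveCharacterTextSplitter, ChatPromptTemplate, with_structured_output) are used:

# A minimal sketch of the marker-based workflow (steps 1-5 above).
# Prompt text, marker format, and the response schema are illustrative assumptions.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_openai import ChatOpenAI

# Step 1: mini-chunk creation (~300 characters, sentence-friendly separators).
splitter = RecursiveCharacterTextSplitter(
    chunk_size=300, chunk_overlap=0, separators=["\n\n", "\n", ". ", " "]
)

# Step 3 schema: the LLM returns groups of mini-chunk ids.
class Grouping(BaseModel):
    groups: list[list[int]] = Field(
        description="Each inner list contains ids of semantically related mini-chunks."
    )

grouping_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "The document below is split into marked mini-chunks. "
            "Group the ids of semantically related mini-chunks together. "
            "Use at most {max_per_group} mini-chunks per group.",
        ),
        ("user", "{marked_document}"),
    ]
)
grouping_llm = ChatOpenAI(model="gpt-4o", temperature=0).with_structured_output(Grouping)
grouping_chain = grouping_prompt | grouping_llm

def agentic_chunk(document: str, max_per_group: int = 5) -> list[str]:
    # Step 1: mini-chunk creation.
    mini_chunks = splitter.split_text(document)
    # Step 2: marking, so the LLM can refer to mini-chunks by id.
    marked = "\n".join(
        f"<mini-chunk id={i}>{c}</mini-chunk>" for i, c in enumerate(mini_chunks)
    )
    # Step 3: LLM-assisted grouping with a size constraint.
    grouping = grouping_chain.invoke(
        {"marked_document": marked, "max_per_group": max_per_group}
    )
    # Step 5 (validation): any id the LLM dropped becomes its own chunk.
    seen = {i for group in grouping.groups for i in group}
    leftover = [[i] for i in range(len(mini_chunks)) if i not in seen]
    # Step 4: chunk assembly.
    return [
        " ".join(mini_chunks[i] for i in group)
        for group in grouping.groups + leftover
    ]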
Applications of Agentic Chunking
Agentic Chunking technology shows strong potential for application in a number of areas:
1. Enhanced Learning
- Definition and Explanation: Agentic RAG optimizes the learning process by breaking down complex information into manageable units, thereby enhancing learners' comprehension and retention. This approach focuses specifically on the "Agentic" elements of a text (e.g., characters, organizations), and by organizing information around these core elements, Agentic RAG is able to create more coherent and accessible learning content.
- Role in the Learning Process: Agentic RAG frameworks are playing an increasingly important role in modern educational methods. By using intelligent agents based on RAG technology, educators are able to customize content more flexibly and precisely meet the individual needs of different learners.
- Applications in Education: More and more educational institutions are using Agentic RAG technology to innovate their teaching strategies, develop more engaging and personalized curricula, and improve teaching and learning outcomes.
- Impact on Student Engagement: By presenting information in clearly structured, easy-to-understand chunks of text, Agentic Chunking is effective in increasing student focus and motivation, and stimulating interest in learning.
- Effective Pattern Recognition: In-depth analysis and identification of effective patterns in the use of Agentic RAG systems in educational applications is critical to the continuous optimization of educational outcomes.
2. Improved Information Retention
- Cognitive Processes: Agentic RAG technology leverages the natural tendency of human cognitive processes to organize and correlate information to enhance information retention. The brain prefers to organize data into manageable units, which greatly simplifies the process of information retrieval and recall.
- Improved Memory Recall: By focusing on the "Agentic" elements involved in a text (e.g., individuals or organizations), learners are able to more easily connect the learning material to their existing body of knowledge, and thus recall and consolidate the information learned more effectively.
- Long-Term Retention Strategies: Integrating Agentic RAG technology into daily learning practices helps build effective strategies for continuous learning and knowledge accumulation, enabling long-term knowledge retention and development.
- Practical applications: In areas such as education and business training, Agentic RAG's content presentation can be customized to meet the needs of specific audiences for optimal information delivery and uptake.
3. Efficient Decision-Making
- Business Applications: In the business world, Agentic RAG systems are changing how leaders make decisions by providing a structured decision-making framework, significantly improving the rigor of strategic planning and operational efficiency.
- Decision Framework: Agentic RAG is able to break down complex business data and information into smaller, more manageable pieces, helping business decision makers focus on key elements, avoid getting lost in the mass of information, and improve decision-making efficiency.
- Benefits for Business Leaders: Agentic RAG helps business leaders gain a deeper understanding of market trends and customer needs, thus providing more accurate decision support for corporate strategic adjustments and market responses.
- Implementation Steps:
- Identify key business areas where Agentic RAG technology can add value to your organization.
- Develop a customized implementation of Agentic RAG that is highly aligned with the organization's strategic goals.
- Train employees on the Agentic RAG system application to ensure that the system can be effectively implemented and applied.
- Continuously monitor the Agentic RAG system's performance in operation and adjust optimization strategies based on actual usage to get the most out of the system.
Benefits of Agentic Chunking
- Semantic Coherence: Agentic Chunking generates semantically more meaningful chunks of text, significantly improving the accuracy of retrieved information.
- Context Preservation: Agentic Chunking effectively maintains contextual coherence within blocks of text, allowing LLM to generate more accurate and contextualized responses.
- Flexibility: Agentic Chunking demonstrates excellent flexibility in adapting to documents of different lengths, structures and content types.
- Robustness: Agentic Chunking has built-in protection and fallback mechanisms to ensure system stability even in the event of document structure anomalies or LLM performance limitations.
- Adaptability: Agentic Chunking integrates seamlessly with different LLMs and supports fine-tuned optimization for specific application requirements.
Agentic Chunking in action
- Reduced False Assumptions by 92%: A shortcoming of traditional chunking methods is that inaccurate chunks can lead an AI system to make false assumptions. Agentic Chunking reduces such errors by a reported 92%.
- Improved Answer Completeness: Agentic Chunking significantly improves answer completeness, providing users with more comprehensive and accurate responses and a markedly better experience.
Implementation of Agentic Chunking (Python example)
This section provides a Python implementation of Agentic Chunking based on the LangChain framework, with a step-by-step explanation of the code to help readers get started quickly.
Prerequisite:
- Ensure that the LangChain libraries, including the OpenAI integration, are installed:
pip install langchain langchain-openai
- Configure the OpenAI API key.
Sample code:
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain import hub
# 1. Text Propositioning
# Example text
text = """
On July 20, 1969, astronaut Neil Armstrong walked on the moon.
He was leading NASA's Apollo 11 mission.
Armstrong famously said, "That's one small step for man, one giant leap for mankind" as he stepped onto the lunar surface.
Later, he planted the American flag.
The mission was a success.
"""
# Get the propositionalized prompt template from Langchain hub
obj = hub.pull("wfh/proposal-indexing")
# Use the GPT-4o model
llm = ChatOpenAI(model="gpt-4o")
# Define Pydantic model to extract sentences
class Sentences(BaseModel):
    sentences: list[str]
# Create LLM for structured outputs
extraction_llm = llm.with_structured_output(Sentences)
# Create sentence extraction chain
extraction_chain = obj | extraction_llm
# Split the text into paragraphs (for simplicity, this example assumes the input contains a single paragraph; real applications can handle multi-paragraph text)
paragraphs = [text]
propositions = []
for p in paragraphs:
    sentences = extraction_chain.invoke({"input": p})
    propositions.extend(sentences.sentences)
print("Propositions:", propositions)
# 2. Create LLM Agent
# Define the Chunk Metadata Model
class ChunkMeta(BaseModel):
    title: str = Field(description="The title of the chunk.")
    summary: str = Field(description="The summary of the chunk.")
# LLM for generating the summary and title (low temperature, output bound to ChunkMeta)
summary_llm = ChatOpenAI(temperature=0).with_structured_output(ChunkMeta)
# LLM for chunk allocation
allocation_llm = ChatOpenAI(temperature=0)
# Dictionary to store created text chunks
chunks = {}
# 3. Function to create a new chunk
def create_new_chunk(chunk_id, proposition):
    summary_prompt_template = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "Generate a new summary and a title based on the propositions.",
            ),
            (
                "user",
                "propositions:{propositions}",
            ),
        ]
    )
    summary_chain = summary_prompt_template | summary_llm
    chunk_meta = summary_chain.invoke(
        {
            "propositions": [proposition],
        }
    )
    chunks[chunk_id] = {
        "chunk_id": chunk_id,  # record the chunk id explicitly
        "summary": chunk_meta.summary,
        "title": chunk_meta.title,
        "propositions": [proposition],
    }
    return chunk_id
# 4. Function to add a proposition to an existing chunk
def add_proposition(chunk_id, proposition):
    summary_prompt_template = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "If the current_summary and title is still valid for the propositions, return them. "
                "If not, generate a new summary and a title based on the propositions.",
            ),
            (
                "user",
                "current_summary:{current_summary}\ncurrent_title:{current_title}\npropositions:{propositions}",
            ),
        ]
    )
    summary_chain = summary_prompt_template | summary_llm
    chunk = chunks[chunk_id]
    current_summary = chunk["summary"]
    current_title = chunk["title"]
    current_propositions = chunk["propositions"]
    all_propositions = current_propositions + [proposition]
    chunk_meta = summary_chain.invoke(
        {
            "current_summary": current_summary,
            "current_title": current_title,
            "propositions": all_propositions,
        }
    )
    chunk["summary"] = chunk_meta.summary
    chunk["title"] = chunk_meta.title
    chunk["propositions"] = all_propositions
# 5. Core Logic of Agent
def find_chunk_and_push_proposition(proposition):
    class ChunkID(BaseModel):
        chunk_id: int = Field(description="The chunk id.")

    allocation_prompt = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "You have the chunk ids and the summaries. "
                "Find the chunk that best matches the proposition. "
                "If no chunk matches, return a new chunk id. "
                "Return only the chunk id.",
            ),
            (
                "user",
                "proposition:{proposition}\nchunks_summaries:{chunks_summaries}",
            ),
        ]
    )
    allocation_chain = allocation_prompt | allocation_llm.with_structured_output(ChunkID)
    chunks_summaries = {
        chunk_id: chunk["summary"] for chunk_id, chunk in chunks.items()
    }
    # chunks may be empty initially, which would make the allocation call fail
    if not chunks_summaries:
        # If there are no existing chunks, create a new chunk directly
        next_chunk_id = 1
        create_new_chunk(next_chunk_id, proposition)
        return
    best_chunk_id = allocation_chain.invoke(
        {"proposition": proposition, "chunks_summaries": chunks_summaries}
    ).chunk_id
    if best_chunk_id not in chunks:
        # If the returned chunk_id does not exist, create a new chunk
        next_chunk_id = max(chunks.keys(), default=0) + 1
        create_new_chunk(next_chunk_id, proposition)
    else:
        add_proposition(best_chunk_id, proposition)
# Iterate through the list of propositions and chunk them
for proposition in propositions:
    find_chunk_and_push_proposition(proposition)
# Print the final chunk
print("\nFinal Chunks:")
for chunk_id, chunk in chunks.items():
    print(f"Chunk {chunk_id}:")
    print(f"  Title: {chunk['title']}")
    print(f"  Summary: {chunk['summary']}")
    print(f"  Propositions: {chunk['propositions']}")
    print("-" * 20)
Code Explanation:
- Propositionalization:
- The code example first uses hub.pull("wfh/proposal-indexing") to load a predefined propositionalization prompt template from the LangChain Hub.
- Next, the LLM instance is initialized using ChatOpenAI(model="gpt-4o") and the GPT-4o model is chosen for better performance.
- Define Sentences Pydantic model for structured parsing of the list of sentences output by LLM.
- The prompt template and the extraction LLM are composed into the extraction_chain.
- To simplify the example, the input text is assumed to contain a single paragraph; real applications can handle multi-paragraph text. The code wraps the sample text in a list of paragraphs.
- Loop over the paragraphs and use extraction_chain to turn each paragraph into a list of propositions (note that the hub prompt expects its text under the "input" key).
- Create the LLM Agent:
- Define the ChunkMeta Pydantic model that defines the block metadata structure (title and summary).
- Create two LLM instances, summary_llm and allocation_llm. summary_llm generates the title and summary of a chunk (its output is bound to the ChunkMeta schema via with_structured_output), while allocation_llm decides which existing chunk a proposition belongs to, or whether a new chunk is needed.
- Initializes the chunks dictionary, which is used to store created blocks of text.
- create_new_chunk function:
- The function accepts chunk_id and proposition as input parameters.
- Based on the propositions, the title and summary of the block are generated using summary_prompt_template and summary_llm.
- The new chunk is then stored in the chunks dictionary.
- add_proposition function:
- The function also takes chunk_id and proposition as input.
- Retrieves existing block information from the chunks dictionary.
- Updates the list of propositions for the current block.
- Reassess and update the block title and summary.
- Finally, the metadata of the corresponding chunk in the chunks dictionary is updated.
- find_chunk_and_push_proposition function (Agent core logic):
- Define ChunkID Pydantic model for parsing block IDs for LLM output.
- Creates an allocation_prompt that instructs the LLM to find the existing block that best matches the current proposition, or return a new block ID.
- Build allocation_chain, composing the prompt template with allocation_llm (bound to the ChunkID schema).
- Constructs the chunks_summaries dictionary, which stores the ID and summary information for existing blocks.
- If the chunks dictionary is empty (i.e., there aren't any chunks of text yet), a new chunk is created directly.
- Use allocation_chain to call LLM to get the ID of the best matching block.
- If the chunk_id returned by LLM is not in the chunks dictionary, indicating that a new chunk of text needs to be created, the create_new_chunk function is called.
- If the returned chunk_id already exists in the chunks dictionary, indicating that the current proposition should be added to an existing text block, call the add_proposition function.
- Main loop:
- Loop over the list of propositions.
- For each proposition, the find_chunk_and_push_proposition function is called and the proposition is assigned to the appropriate text block.
- Output results:
- Final output of the generated text block, including title, summary and list of propositions included.
Code Improvement Notes:
- Improve the find_chunk_and_push_proposition function by calling the create_new_chunk function directly when the chunks dictionary is empty to avoid potential errors.
- In the create_new_chunk function, a chunk_id key-value pair is added to the chunks[chunk_id] dictionary to explicitly record the block ID.
- Optimize the next_chunk_id generation logic so that IDs are generated robustly and correctly across scenarios.
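As a complement to these notes, the fallback guardrail described in the workflow section can be wrapped around the example above. The following is a minimal sketch under stated assumptions: it reuses the extraction_chain, find_chunk_and_push_proposition, and chunks objects from the sample code, and the fallback chunk sizes are arbitrary:

# A minimal sketch of the fallback guardrail, reusing extraction_chain,
# find_chunk_and_push_proposition, and chunks from the sample code above.
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_with_fallback(text: str) -> list[str]:
    try:
        # Agentic path: propositionalize, then route each proposition to a chunk.
        result = extraction_chain.invoke({"input": text})
        for proposition in result.sentences:
            find_chunk_and_push_proposition(proposition)
        return [" ".join(c["propositions"]) for c in chunks.values()]
    except Exception:
        # Fallback path: plain recursive chunking keeps the pipeline usable
        # when the LLM call fails or is unavailable (sizes are arbitrary).
        splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
        return splitter.split_text(text)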
Build vs. Buy
Although Agentic Chunking is only one part of an AI agent workflow, it is critical for generating semantically coherent text chunks. Building your own Agentic Chunking solution and buying an off-the-shelf one each have advantages and disadvantages:
Advantages of building your own:
- High degree of control and customization: The self-built solution allows users to make in-depth customization according to their specific needs, from the design of prompts to the optimization of algorithms, all of which can be perfectly matched with the actual application scenarios.
- Precise targeting: Enterprises can tailor the most appropriate text chunking strategy for optimal performance based on their unique data characteristics and application needs.
Disadvantages of building your own:
- High engineering costs: Building your own Agentic Chunking solution requires specialized knowledge of natural language processing technology and a significant investment of development time, which is costly.
- Unpredictability of LLM Behavior: The behavior of large language models is sometimes difficult to predict and control, which creates technical challenges for self-built solutions.
- Ongoing Maintenance Overhead: Generative AI technology is evolving rapidly, and self-built solutions require ongoing investment in maintenance and updates to keep up with the pace of technology development.
- Production Challenges: Getting good results in the prototyping phase is one thing; actually deploying Agentic Chunking in production and reaching accuracy of 99% or more remains a significant challenge.
Summary
Agentic Chunking is a powerful text chunking technique that mimics human comprehension and judgment to create semantically coherent chunks of text, thereby significantly improving the performance of RAG systems. Agentic Chunking overcomes many of the limitations of traditional text chunking methods, enabling LLMs to generate more accurate, complete and contextualized answers.
This article has walked through the working principles and implementation of Agentic Chunking with detailed code examples and step-by-step explanations. Admittedly, implementing Agentic Chunking requires some technical investment, but the performance gains and application value it brings are clear. For scenarios that process large amounts of text data and place high performance demands on the RAG system, Agentic Chunking is undoubtedly an effective technical solution.
Future Trends: Future directions for Agentic Chunking may include:
- Deep integration with graph databases to construct knowledge graphs, further developing Graph RAG (retrieval-augmented generation over graph structures) for deeper knowledge mining and utilization.
- Continuous optimization of the LLM's prompt engineering and instruction design to further improve chunking accuracy and processing efficiency.
- Developing smarter chunk merging and splitting strategies to handle more complex and diverse document structures, improving the generality of Agentic Chunking.
- Exploring Agentic Chunking in a broader range of applications, such as intelligent text summarization and high-quality machine translation, to expand its application boundaries.