
zChunk: a generic semantic chunking strategy based on Llama-70B

General Introduction

zChunk is a chunking strategy developed by ZeroEntropy to provide generic semantic chunking. It is based on the Llama-70B model, which is prompted to mark chunk boundaries in a document so that a high signal-to-noise ratio is maintained during information retrieval. zChunk is particularly suited to RAG (Retrieval-Augmented Generation) applications that require high-precision retrieval, and it addresses the limitations of traditional chunking methods on complex documents. With zChunk, users can partition documents into meaningful chunks more effectively, improving the accuracy and efficiency of information retrieval.

The core prompt is simple:

Your job is to act as a chunker.
You should insert the "paragraph" token throughout the input.
Your goal is to separate the content into semantically relevant groupings.



 

Feature List

  • Llama-70B-based chunking: uses the Llama-70B model to generate semantic chunk boundaries via prompting.
  • High signal-to-noise chunking: optimizes the chunking strategy so that retrieved information has a high signal-to-noise ratio.
  • Multiple chunking strategies: supports fixed-size chunking, embedding-similarity-based chunking, and more.
  • Hyperparameter tuning: provides a tuning pipeline so users can adjust chunk size and overlap to their specific needs.
  • Open source: the full source code is available and can be freely used and modified.

 

Usage Guide

Installation process

  1. Clone the repository:
   git clone https://github.com/zeroentropy-ai/zchunk.git
   cd zchunk
  2. Install the dependencies:
   pip install -r requirements.txt

Usage

  1. Prepare the input file: save the document to be chunked as a text file, e.g. example_input.txt.
  2. Run the chunking script:
   python test.py --input example_input.txt --output example_output.txt
  3. View the output file: the chunking results are saved in example_output.txt.

Detailed function operation flow

  1. Choose a chunking strategy:
    • NaiveChunk: fixed-size chunking, for simple documents.
    • SemanticChunk: chunking based on embedding similarity, for documents that need to preserve semantic integrity.
    • zChunk algorithm: generates chunks by prompting the Llama-70B model, for complex documents.
  2. Adjust the hyperparameters:
    • Chunk size: set the size of each chunk via the chunk_size parameter.
    • Overlap ratio: set the percentage of overlap between chunks via the overlap_ratio parameter, to ensure continuity of information.
  3. Run hyperparameter tuning:
   python hyperparameter_tuning.py --input example_input.txt --output tuned_output.txt

The script will automatically adjust the chunk size and overlap ratio based on the input document to generate optimal chunking results.

  4. Evaluate the chunking results:
    • Use the provided evaluation script to check the effectiveness of the chunking strategy:
   python evaluate.py --input example_input.txt --output example_output.txt
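As a concrete illustration of the chunk_size and overlap_ratio hyperparameters, here is a minimal sketch of fixed-size chunking with overlap, in the style of NaiveChunk. The function name naive_chunk and its signature are illustrative assumptions, not the repository's actual API.

```python
def naive_chunk(text: str, chunk_size: int = 1000, overlap_ratio: float = 0.1) -> list[str]:
    """Split text into fixed-size chunks, with a fraction of each chunk
    repeated at the start of the next one (the overlap)."""
    # Advance by less than chunk_size so consecutive chunks overlap.
    step = max(1, int(chunk_size * (1 - overlap_ratio)))
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = naive_chunk("abcdefghij" * 50, chunk_size=100, overlap_ratio=0.2)
```

With overlap_ratio=0.2, the last 20% of each chunk reappears at the start of the next, so information near a boundary is never split across two chunks without context.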

Example

Suppose we have a text of the U.S. Constitution that needs to be chunked:

Original text:

Section. 1.
All legislative Powers herein granted shall be vested in a Congress of the United States, which shall consist of a Senate and House of Representatives.
Section. 2.
The House of Representatives shall be composed of Members chosen every second Year by the People of the several States, and the Electors in each State shall have the Qualifications requisite for Electors of the most numerous Branch of the State Legislature.
No Person shall be a Representative who shall not have attained to the Age of twenty five Years, and been seven Years a Citizen of the United States, and who shall not, when elected, be an Inhabitant of that State in which he shall be chosen.

Chunking is performed using the zChunk algorithm:

  1. Select the special token: choose a token (e.g., "paragraph") that does not appear in the corpus.
  2. Insert the token: have Llama insert the token throughout the user message.
   SYSTEM_PROMPT (simplified):
Your job is to act as a chunker.
You should insert the "paragraph" token in the input.
Your goal is to separate the content into semantically relevant groupings.
  3. Generate the chunks:
   Section. 1.
All legislative Powers herein granted shall be vested in a Congress of the United States, which shall consist of a Senate and House of Representatives.
Section. 2.
The House of Representatives shall be composed of Members chosen every second Year by the People of the several States, and the Electors in each State shall have the Qualifications requisite for Electors of the most numerous Branch of the State Legislature.
No Person shall be a Representative who shall not have attained to the Age of twenty five Years, and been seven Years a Citizen of the United States, and who shall not, when elected, be an Inhabitant of that State in which he shall be chosen.

In this way, we can segment the document into semantically related blocks, each of which can be retrieved independently, improving the signal-to-noise ratio and accuracy of information retrieval.
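The post-processing step implied above, recovering clean chunks from the model's annotated output, can be sketched as follows. The token string and the function name split_on_token are illustrative assumptions; the actual token used by zChunk is chosen so that it never occurs in the corpus.

```python
SPLIT_TOKEN = "paragraph"  # assumed placeholder; zChunk picks a token absent from the corpus

def split_on_token(annotated_text: str, token: str = SPLIT_TOKEN) -> list[str]:
    """Turn the model's annotated output back into a list of clean chunks."""
    parts = [p.strip() for p in annotated_text.split(token)]
    # Drop empty fragments produced by leading/trailing tokens.
    return [p for p in parts if p]

annotated = (
    "Section. 1.\nAll legislative Powers herein granted shall be vested in a Congress of the United States..."
    "paragraph"
    "Section. 2.\nThe House of Representatives shall be composed of Members chosen every second Year..."
)
chunks = split_on_token(annotated)
```

Each element of chunks can then be embedded and retrieved independently.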

Advantages

  • By running Llama inference locally, entire passages can be processed efficiently, and logprobs can be examined to determine chunk locations.
  • Processing 450,000 characters takes about 15 minutes, which could be reduced significantly with further code optimization.
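The logprob inspection mentioned above can be sketched like this: assuming we already have, for each candidate boundary position, the log-probability the model assigns to the split token, a boundary is placed wherever that probability crosses a threshold. The function name and the threshold value are illustrative assumptions.

```python
import math

def pick_boundaries(split_logprobs: list[float], threshold: float = math.log(0.3)) -> list[int]:
    """Return the positions where the model is likely enough to emit the
    split token; those positions become chunk boundaries."""
    return [i for i, lp in enumerate(split_logprobs) if lp > threshold]

# Toy log-probabilities of the split token at six candidate positions;
# high values mark likely semantic breaks.
logprobs = [math.log(p) for p in [0.01, 0.05, 0.90, 0.02, 0.60, 0.04]]
boundaries = pick_boundaries(logprobs)
```

Because the decision only requires reading logprobs rather than sampling full generations, a single forward pass over the passage is enough to score every candidate boundary.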

Benchmarks

  • On the LegalBenchConsumerContractsQA dataset, zChunk achieves higher retrieval and signal-to-noise scores than NaiveChunk and embedding-based semantic chunking.

With the zChunk algorithm, we can easily segment any type of document without relying on regular expressions or manually created rules, improving the efficiency and accuracy of RAG applications.

May not be reproduced without permission: Chief AI Sharing Circle » zChunk: a generic semantic chunking strategy based on Llama-70B
