Chonkie: Lightweight RAG Text Chunking Library

General Introduction

Chonkie is a lightweight, efficient text chunking library for RAG (Retrieval-Augmented Generation) that helps developers chunk text quickly and easily. It supports several chunking methods, including token-, word-, sentence-, and semantic similarity-based chunking, and suits a range of text processing and natural language processing tasks. The default installation requires only 21MB (similar libraries require 80-171MB), and all major chunkers are supported.

Function List

TokenChunker: Splits text into fixed-size token chunks.
WordChunker: Splits text into chunks based on words.
SentenceChunker: Splits text into chunks based on sentences.
SemanticChunker: Splits text into chunks based on semantic similarity.
SDPMChunker: Splits text using a Semantic Double-Pass Merge approach.
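
To make the idea of fixed-size token chunking concrete, here is a minimal pure-Python sketch. It is not Chonkie's implementation: the function name `chunk_tokens`, its parameters, and the use of whitespace splitting in place of a real tokenizer are all assumptions made for illustration.

```python
def chunk_tokens(tokens, chunk_size=8, chunk_overlap=2):
    """Split a token list into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Whitespace splitting stands in for a real tokenizer (assumption for this sketch).
tokens = "Woah! Chonkie, the chunking library is so cool! I love the tiny hippo hehe.".split()
chunks = chunk_tokens(tokens, chunk_size=6, chunk_overlap=2)
for chunk in chunks:
    print(len(chunk), chunk)
```

The overlap means the last tokens of one chunk reappear at the start of the next, which helps keep context that would otherwise be cut at a chunk boundary.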

Usage Guide

Installation

To install Chonkie, simply run the following command:

pip install chonkie

Chonkie keeps its default installation minimal and recommends installing only the chunkers you need. You can also install everything at once if you would rather not think about dependencies, although this is not recommended:

pip install "chonkie[all]"
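
After installing, one way to confirm that the package is visible to your interpreter is to query its metadata with the standard library. This is a small sketch; the helper name `installed_version` is hypothetical and not part of Chonkie.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version if chonkie is installed in this environment.
print("chonkie:", installed_version("chonkie") or "not installed")
```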

Usage

Here is a basic example to help you get started quickly:

First import the desired chunker:
```
from chonkie import TokenChunker
```
Import your favorite tokenizer library (AutoTokenizers, TikToken and AutoTikTokenizer are supported):
```
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("gpt2")
```
Initialize the chunker:
```
chunker = TokenChunker(tokenizer)
```

Chunk the text:

chunks = chunker("Woah! Chonkie, the chunking library is so cool! I love the tiny hippo hehe.")

Access the chunking results:

for chunk in chunks:
    print(f"Chunk: {chunk.text}")
    print(f"Tokens: {chunk.token_count}")

Supported Methods

Chonkie offers several chunkers to help you split text efficiently for your RAG applications. The following is a brief overview of the available chunkers:

TokenChunker: Splits text into fixed-size token chunks.
WordChunker: Splits text into chunks based on words.
SentenceChunker: Splits text into chunks based on sentences.
SemanticChunker: Splits text into chunks based on semantic similarity.
SDPMChunker: Splits text using a Semantic Double-Pass Merge approach.
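
The general idea behind similarity-based chunking can be sketched as follows: split the text into sentences, then start a new chunk whenever a sentence is not similar enough to the one before it. This is a toy illustration, not Chonkie's algorithm. The word-count "embedding", the `threshold` value, and all function names are assumptions; a real semantic chunker would use a neural embedding model.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9']+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(text, threshold=0.2):
    """Group consecutive sentences while each one stays similar to the previous one."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    groups = []
    for sentence in sentences:
        if groups and cosine(embed(groups[-1][-1]), embed(sentence)) >= threshold:
            groups[-1].append(sentence)  # similar enough: extend the current chunk
        else:
            groups.append([sentence])  # topic shift: start a new chunk
    return [" ".join(group) for group in groups]


text = ("Hippos live in rivers. Hippos eat grass near rivers. "
        "Tokenizers split text into tokens. Tokens feed the text model.")
for chunk in semantic_chunks(text):
    print(chunk)
```

With this sample, the two sentences about hippos share enough words to stay together, and the first sentence about tokenizers starts a second chunk.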

Benchmarks

Chonkie performs well in several benchmarks:

Size: The default installation is only 9.7MB (compared to 80-171MB for alternatives), and it remains lighter than the competition even when semantic chunking is included.
Speed: Token chunking is 33x faster than the slowest alternative, sentence chunking is nearly 2x faster than competitors, and semantic chunking is 2.5x faster than other methods.
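
If you want to measure chunking speed on your own data, a simple timing harness along these lines may help. It is only a sketch: `time_chunker` and `whitespace_chunker` are hypothetical helpers, the stand-in chunker is not Chonkie, and the harness does not reproduce the figures quoted above. You could pass any callable that takes a string, such as an initialized chunker.

```python
import time


def time_chunker(chunk_fn, text, repeats=5):
    """Return the best wall-clock time in seconds over several runs of chunk_fn(text)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        chunk_fn(text)
        best = min(best, time.perf_counter() - start)
    return best


def whitespace_chunker(text, chunk_size=128):
    """Stand-in chunker: fixed-size groups of whitespace-separated words."""
    words = text.split()
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]


sample = "Chonkie is a tiny hippo that chunks text. " * 2000
elapsed = time_chunker(whitespace_chunker, sample)
print(f"best of 5 runs: {elapsed * 1000:.3f} ms")
```

Taking the best of several runs reduces noise from other processes; for a fair comparison, time every chunker on the same text and the same machine.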

Detailed Steps

Install: Install Chonkie and the required tokenizer library via pip.
Import the libraries: Import Chonkie and the tokenizer library in your Python script.
Initialize the chunker: Select and initialize the chunker that fits your needs.
Chunk the text: Call the initialized chunker on your text.
Process the results: Iterate over the chunks for further processing or analysis.
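
For the final step, a common pattern is to turn chunks into plain records that can be passed to an embedding or indexing stage. The sketch below assumes only the `text` and `token_count` attributes used in the usage example; the `Chunk` dataclass is a hypothetical stand-in for the library's chunk objects, and the helper `to_records`, its `max_tokens` filter, and the token counts are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in exposing the two attributes used in the usage example."""
    text: str
    token_count: int


def to_records(chunks, max_tokens=None):
    """Convert chunks to dicts, optionally skipping chunks above a token budget."""
    records = []
    for index, chunk in enumerate(chunks):
        if max_tokens is not None and chunk.token_count > max_tokens:
            continue  # too large for the downstream model's budget
        records.append({"id": index, "text": chunk.text, "token_count": chunk.token_count})
    return records


# Illustrative chunks; the token counts are invented, not tokenizer output.
chunks = [Chunk("Woah! Chonkie,", 4), Chunk("the chunking library is so cool!", 7)]
records = to_records(chunks, max_tokens=5)
print(records)
```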
