zChunk: Llama-70Bに基づく一般的な意味的チャンキング戦略

1.4K 00

はじめに

zChunkは、ZeroEntropy社が開発した、汎用的なセマンティック・チャンキングのためのソリューションを提供する新しいチャンキング戦略です。このストラテジーはLlama-70Bモデルに基づいており、チャンク生成を促すことでドキュメントのチャンキングプロセスを最適化し、情報検索時に高いS/N比が維持されるようにします。zChunkは特に、高精度の検索を必要とするRAG（Retrieval Augmentation Generation）アプリケーションに適しており、複雑なドキュメントを扱う際の従来のチャンキング手法の限界を解決します。zChunk を使用することで、ユーザーはより効果的にドキュメントを意味のあるチャンクに分割することができ、情報検索の精度と効率が向上します。

君の仕事はチャンカーとして行動することだ。
段落」は、入力の至るところに挿入する必要がある。
あなたの目標は、コンテンツを意味的に関連性のあるグループに分けることです。
方法論と LLM OCRの限界：華やかさの下にある文書解析の課題紹介したPROMPTにはいくつかの共通点がある。

機能一覧

Llama-70BベースのチャンキングアルゴリズムLlama-70Bモデルを用いた意味チャンキングのためのキューの生成。
高S/N比のチャンキングチャンキング戦略を最適化し、検索された情報が高いS/N比を持つようにする。
複数のチャンキング戦略固定サイズのチャンキング、埋め込み類似度ベースのチャンキングなど、様々な戦略をサポートします。
ハイパーパラメータ・チューニング: ハイパーパラメータチューニングパイプラインを提供し、ユーザーが特定のニーズに応じてチャンクサイズやオーバーラップパラメータを調整できるようにします。
オープンソース完全なオープンソースコードが提供され、ユーザーが自由に使用および変更することができます。

ヘルプの使用

設置プロセス

クローン倉庫::

   git clone https://github.com/zeroentropy-ai/zchunk.git
cd zchunk

依存関係のインストール::

   pip install -r requirements.txt

使用方法

入力ファイルの準備チャンクする文書をテキストファイルとして保存します。example_input.txt.
チャンキング・スクリプトの実行::

   python test.py --input example_input.txt --output example_output.txt

出力ファイルの表示チャンキング結果はexample_output.txt真ん中だ。

詳細な機能操作の流れ

チャンキング戦略の選択::
- ナイーブチャンクシンプルな文書には固定サイズのチャンク。
- セマンティックチャンク意味的整合性を維持する必要がある文書のための、埋め込み類似度に基づくチャンキング。
- zチャンク・アルゴリズム複雑な文書に対して、Llama-70Bモデルからのヒントに基づいてチャンクを生成する。
ハイパーパラメータの調整::
- チャンクサイズこれは、パラメーターchunk_sizeで各チャンクのサイズを設定する。
- オーバーラップ率パラメーターoverlap_ratioチャンク間の重複率を設定し、情報の連続性を確保する。
ハイパーパラメータ・チューニングの実行::

   python hyperparameter_tuning.py --input example_input.txt --output tuned_output.txt

このスクリプトは、入力文書に基づいてチャンクサイズとオーバーラップ率を自動的に調整し、最適なチャンキング結果を生成する。

チャンキングの効果を評価する::
- 提供された評価スクリプトを使ってチャンキング結果を評価し、チャンキング戦略の有効性を確認する。

   python evaluate.py --input example_input.txt --output example_output.txt

典型例

チャンク化が必要な合衆国憲法のテキストがあるとする：

原文ママ

Section. 1.
All legislative Powers herein granted shall be vested in a Congress of the United States, which shall consist of a Senate and House of Representatives.
Section. 2.
The House of Representatives shall be composed of Members chosen every second Year by the People of the several States, and the Electors in each State shall have the Qualifications requisite for Electors of the most numerous Branch of the State Legislature.
No Person shall be a Representative who shall not have attained to the Age of twenty five Years, and been seven Years a Citizen of the United States, and who shall not, when elected, be an Inhabitant of that State in which he shall be chosen.

zChunkアルゴリズムによるチャンキング：

キュー・ワードを選ぶコーパスにない特別なトークン（例えば "パラグラフ"）を選択する。
キューの挿入Llamaにトークンをユーザーメッセージに挿入させる。

   SYSTEM_PROMPT (简化版)：
你的任务是作为一个分块器。
你应该在输入中插入“段”标记。
你的目标是将内容分成语义相关的组。

チャンクの生成::

   Section. 1.
All legislative Powers herein granted shall be vested in a Congress of the United States, which shall consist of a Senate and House of Representatives.段
Section. 2.
The House of Representatives shall be composed of Members chosen every second Year by the People of the several States, and the Electors in each State shall have the Qualifications requisite for Electors of the most numerous Branch of the State Legislature.段
No Person shall be a Representative who shall not have attained to the Age of twenty five Years, and been seven Years a Citizen of the United States, and who shall not, when elected, be an Inhabitant of that State in which he shall be chosen.段

このようにして、文書を意味的に関連するブロックに分割し、それぞれを独立して検索できるようにすることで、S/N比と情報検索の精度を向上させることができる。