ExtractThinker: Extract and Classify Documents into Structured Data to Streamline Document Processing


Overview

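ExtractThinker is a flexible document intelligence tool that uses large language models (LLMs) to extract and classify structured data from documents, with an ORM-like workflow for document processing. It supports several document loaders, including Tesseract OCR, Azure Form Recognizer, AWS Textract, and Google Document AI. Users define custom extraction contracts as Pydantic models, so each extracted field has a declared name and type. The tool also supports asynchronous processing, handles multiple document formats (PDFs, images, spreadsheets, and others), and integrates with a range of LLM providers such as OpenAI, Anthropic, and Cohere.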

Features

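Flexible document loaders: supports several document loaders, including Tesseract OCR, Azure Form Recognizer, AWS Textract, and Google Document AI.
Custom extraction contracts: define extraction contracts as Pydantic models to specify exactly which fields to extract and their types.
Advanced classification: classify documents or document sections using custom classifications and strategies.
Asynchronous processing: process large documents efficiently with asynchronous calls.
Multi-format support: handles various document formats, such as PDFs, images, and spreadsheets.
ORM-style interaction: work with documents and LLMs in an ORM-like style to simplify development.
Splitting strategies: choose lazy or eager splitting to process a document page by page or as a whole.
LLM integration: integrates with different LLM providers, such as OpenAI, Anthropic, and Cohere.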
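
The difference between lazy and eager splitting can be illustrated with plain Python. This is a minimal sketch of the general idea only, not ExtractThinker's internals: PAGES, eager_chunks, and lazy_chunks are hypothetical names, and a "document" is assumed to be a list of page strings.

```python
from typing import Iterator, List

# Hypothetical stand-in for a loaded document: one string per page.
PAGES = ["Invoice 1001", "Line items", "Totals"]


def eager_chunks(pages: List[str]) -> List[str]:
    """Eager: join the whole document and hand it over as a single chunk."""
    return ["\n".join(pages)]


def lazy_chunks(pages: List[str]) -> Iterator[str]:
    """Lazy: yield pages one at a time, only when the consumer asks for the next one."""
    for page in pages:
        yield page


if __name__ == "__main__":
    print("eager chunk count:", len(eager_chunks(PAGES)))
    for chunk in lazy_chunks(PAGES):
        print("lazy chunk:", chunk)
```

Eager handling gives the model the full context in one call, which suits short documents. Lazy handling keeps each request small, which tends to matter once a document no longer fits comfortably in a single prompt.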

Getting Started

Installation

Install ExtractThinker with pip:

   pip install extract_thinker
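
To confirm that the package is visible to the current Python interpreter after installation, a quick check along these lines may help. This is a sketch using only the standard library. The helper name is_installed is hypothetical, and the module name extract_thinker is taken from the import statements used in this article.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module with this name can be located."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("extract_thinker importable:", is_installed("extract_thinker"))
```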

Usage Guide

Basic Extraction Example

The following example loads a document with PyPdf and extracts the fields defined in a contract:

import os
from dotenv import load_dotenv
from extract_thinker import Extractor, DocumentLoaderPyPdf, Contract

# Load API keys (for example OPENAI_API_KEY) from a .env file
load_dotenv()

# Contract describing the fields to extract
class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

# Path to the document to process
test_file_path = os.path.join("path_to_your_files", "invoice.pdf")

# Initialize the extractor
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderPyPdf())
extractor.load_llm("gpt-4o-mini")  # or any other supported model

# Extract data from the document
result = extractor.extract(test_file_path, InvoiceContract)
print("Invoice Number:", result.invoice_number)
print("Invoice Date:", result.invoice_date)
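
The extract call above handles one file and blocks until it returns. When many files need processing, one generic option is to run the blocking calls in worker threads and gather the results. The sketch below shows that pattern with the standard library only (Python 3.9 or later). It does not use ExtractThinker's own asynchronous API, and extract_fields is a hypothetical stand-in for a call such as extractor.extract(path, InvoiceContract).

```python
import asyncio
from typing import Dict, List


def extract_fields(path: str) -> Dict[str, str]:
    """Hypothetical stand-in for a blocking extraction call on one file."""
    # A real implementation would call the extractor here; this stub only echoes the path.
    return {"source": path, "invoice_number": "UNKNOWN"}


async def extract_many(paths: List[str]) -> List[Dict[str, str]]:
    """Run one blocking extraction per file in worker threads and collect the results."""
    # asyncio.to_thread wraps each blocking call so the event loop stays free.
    tasks = [asyncio.to_thread(extract_fields, path) for path in paths]
    # gather returns results in the same order as the input paths.
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    for record in asyncio.run(extract_many(["a.pdf", "b.pdf"])):
        print(record)
```

Threads suit this workload because most of the time is spent waiting on network calls to the LLM provider rather than on local computation.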

Classification Example

ExtractThinker can classify documents or document sections using custom classifications:

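import os
from dotenv import load_dotenv
from extract_thinker import Extractor, DocumentLoaderPyPdf, Classification, ClassificationStrategy

load_dotenv()

class CustomClassification(Classification):
    category: str

# Path to the document to classify
test_file_path = os.path.join("path_to_your_files", "invoice.pdf")

# Initialize the extractor with a document loader and a model
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderPyPdf())
extractor.load_llm("gpt-4o-mini")  # or any other supported model
extractor.load_classification_strategy(ClassificationStrategy.CUSTOM)

# Define the classification
classification = CustomClassification(category="Invoice")

# Classify the document
result = extractor.classify(test_file_path, classification)
print("Category:", result.category)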
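
Once a document has a category, a common next step is to choose which fields to extract for it. The following is a minimal sketch of that routing step using only the standard library. All names here (InvoiceFields, ReceiptFields, CONTRACTS_BY_CATEGORY, contract_for) and the "Receipt" category are hypothetical and are not part of ExtractThinker.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class InvoiceFields:
    invoice_number: str
    invoice_date: str


@dataclass
class ReceiptFields:
    merchant: str
    total: str


# Hypothetical registry mapping a category label to the fields wanted for it.
CONTRACTS_BY_CATEGORY: Dict[str, type] = {
    "Invoice": InvoiceFields,
    "Receipt": ReceiptFields,
}


def contract_for(category: str) -> type:
    """Return the field definition registered for a category, or raise ValueError."""
    try:
        return CONTRACTS_BY_CATEGORY[category]
    except KeyError:
        raise ValueError(f"No contract registered for category {category!r}") from None


if __name__ == "__main__":
    print(contract_for("Invoice").__name__)
```

Failing loudly on an unknown category is a deliberate choice here: silently falling back to a default contract would produce records with the wrong fields.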