General Introduction
ExtractThinker is a flexible document intelligence tool that uses Large Language Models (LLMs) to extract and classify structured data from documents, providing a seamless ORM-like document processing workflow. It supports a variety of document loaders, including Tesseract OCR, Azure Form Recognizer, AWS Textract, and Google Document AI. Users define custom extraction contracts using Pydantic models for accurate data extraction. The tool also supports asynchronous processing and multi-format documents (PDFs, images, spreadsheets, and more), and integrates with a range of LLM providers such as OpenAI, Anthropic, and Cohere.
Function List
- Flexible document loaders: Support for multiple document loaders, including Tesseract OCR, Azure Form Recognizer, AWS Textract, and Google Document AI.
- Custom extraction contracts: Define custom extraction contracts using Pydantic models for accurate data extraction (see the contract sketch after this list).
- Advanced classification: Classify documents or document sections using custom classifications and strategies.
- Asynchronous processing: Process large documents efficiently with asynchronous execution.
- Multi-format support: Seamlessly handle a variety of document formats such as PDFs, images, and spreadsheets.
- ORM-style interactions: Interact with documents and LLMs in an ORM style for straightforward development.
- Splitting strategies: Apply lazy or eager splitting strategies to process documents page by page or as a whole.
- LLM integration: Easily integrate with different LLM providers such as OpenAI, Anthropic, and Cohere.
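As a taste of the extraction-contract feature referenced above, contracts are plain Pydantic models and can be nested. A minimal sketch (the class and field names here are illustrative, not part of the library):
from typing import List
from extract_thinker import Contract

class LineItem(Contract):
    description: str
    amount: float

class DetailedInvoiceContract(Contract):
    invoice_number: str
    invoice_date: str
    items: List[LineItem]  # nested models capture repeating line items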
Usage Guide
Installation
- Install ExtractThinker using pip:
pip install extract_thinker
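The examples below call load_dotenv(), so LLM credentials are typically kept in a .env file next to your script. A minimal sketch for OpenAI-backed models (the variable name assumes OpenAI; use whatever your chosen provider requires):
# .env (illustrative placeholder, not a real key)
OPENAI_API_KEY=your-api-key-here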
Usage instructions
Basic Extraction Example
The following example demonstrates how to use the PyPdf document loader to load a document and extract the fields defined in a contract:
import os
from dotenv import load_dotenv
from extract_thinker import Extractor, DocumentLoaderPyPdf, Contract
load_dotenv()
class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str
# Set the path to the document to be processed
test_file_path = os.path.join("path_to_your_files", "invoice.pdf")
# Initialize the extractor
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderPyPdf())
extractor.load_llm("gpt-4o-mini") # or any other supported models
# extract data from the document
result = extractor.extract(test_file_path, InvoiceContract)
print("Invoice Number:", result.invoice_number)
print("Invoice Date:", result.invoice_date)
Classification Example
ExtractThinker also lets you classify documents or document sections using custom classifications:
import os
from dotenv import load_dotenv
from extract_thinker import Extractor, DocumentLoaderPyPdf, Classification, Process, ClassificationStrategy
load_dotenv()
class CustomClassification(Classification):
    category: str
# Initialize the extractor and load a document loader and LLM, as in the previous example
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderPyPdf())
extractor.load_llm("gpt-4o-mini")
# Use a custom classification strategy
extractor.load_classification_strategy(ClassificationStrategy.CUSTOM)
# Define the classification to apply
classification = CustomClassification(category="Invoice")
# Set the path to the document and classify it
test_file_path = os.path.join("path_to_your_files", "invoice.pdf")
result = extractor.classify(test_file_path, classification)
print("Category:", result.category)
Detailed workflow
- Load documents: Load documents using a supported document loader (e.g. PyPdf or Tesseract OCR).
- Define extraction contracts: Define a custom extraction contract using a Pydantic model, specifying the fields to extract.
- Initialize the extractor: Create an Extractor instance and load the document loader and the LLM.
- Extract data: Call the extract method to extract data from the document; the result is returned with the fields defined in the contract.
- Classify documents: Call the classify method with a custom classification policy to classify a document or a section of it.
With the above steps, users can efficiently extract and classify data from documents of various formats and optimize the document processing flow.
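For example, a minimal end-to-end sketch that applies these steps to every PDF in a folder (the folder path is a placeholder and InvoiceContract is the contract from the basic example):
import os
from extract_thinker import Extractor, DocumentLoaderPyPdf

extractor = Extractor()
extractor.load_document_loader(DocumentLoaderPyPdf())
extractor.load_llm("gpt-4o-mini")

folder = "path_to_your_files"  # placeholder folder containing the PDFs to process
for name in os.listdir(folder):
    if name.lower().endswith(".pdf"):
        result = extractor.extract(os.path.join(folder, name), InvoiceContract)
        print(name, "->", result.invoice_number)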