
ExtractThinker: extracting and classifying documents into structured data to optimize the document processing flow

General Introduction

ExtractThinker is a flexible document intelligence tool that utilizes Large Language Models (LLMs) to extract and classify structured data from documents, providing a seamless ORM-like document processing workflow. It supports a variety of document loaders, including Tesseract OCR, Azure Form Recognizer, AWS Textract, and Google Document AI, among others. Users can define custom extraction contracts using Pydantic models for accurate data extraction. The tool also supports asynchronous processing, multi-format document processing (e.g., PDF, images, spreadsheets, etc.), and integrates with a variety of LLM providers (e.g., OpenAI, Anthropic, Cohere, etc.).

Feature List

  • Flexible document loaders: Supports multiple document loaders, including Tesseract OCR, Azure Form Recognizer, AWS Textract, and Google Document AI.
  • Custom extraction contracts: Define custom extraction contracts using Pydantic models for accurate data extraction.
  • Advanced classification: Classify documents or document sections using custom classifications and strategies.
  • Asynchronous processing: Process large documents efficiently using asynchronous processing.
  • Multi-format support: Seamlessly handle a variety of document formats such as PDF, images, spreadsheets, and more.
  • ORM-style interaction: Interact with documents and LLMs in an ORM style for easier development.
  • Segmentation strategies: Use lazy or eager segmentation strategies to process documents page by page or as a whole.
  • LLM integration: Easily integrate with different LLM providers (e.g., OpenAI, Anthropic, Cohere).
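The difference between lazy and eager segmentation can be illustrated with a small plain-Python sketch. It does not use ExtractThinker itself: the function names `process_eager`, `process_lazy`, and `summarize` are hypothetical stand-ins, and `summarize` replaces what would be an LLM call. The eager version handles the whole document in one call, while the lazy version yields one result per page as the caller consumes it.

```python
from typing import Callable, Iterator, List


def process_eager(pages: List[str], handle: Callable[[str], str]) -> List[str]:
    """Eager: join all pages and process the document as a whole, in one call."""
    return [handle("\n".join(pages))]


def process_lazy(pages: List[str], handle: Callable[[str], str]) -> Iterator[str]:
    """Lazy: yield one result per page, only when the caller asks for it."""
    for page in pages:
        yield handle(page)


def summarize(text: str) -> str:
    """Hypothetical stand-in for an LLM call: reports the input size."""
    return f"{len(text)} chars"


pages = ["Invoice 001", "Total: 40 EUR", "Thank you"]

eager_results = process_eager(pages, summarize)      # one call, whole document
lazy_results = list(process_lazy(pages, summarize))  # one call per page

print(eager_results)
print(lazy_results)
```

Processing page by page keeps each request small, which matters for long documents; processing as a whole keeps cross-page context together.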

 

Usage Guide

Installation

  1. Install ExtractThinker using pip:
   pip install extract_thinker

Usage

Basic Extraction Example

The following example demonstrates how to use PyPdf to load a document and extract specific fields defined in a contract:

import os
from dotenv import load_dotenv
from extract_thinker import Extractor, DocumentLoaderPyPdf, Contract

# Load API keys and other settings from a .env file
load_dotenv()

# Contract describing the fields to extract
class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

# Path to the document to process
test_file_path = os.path.join("path_to_your_files", "invoice.pdf")

# Initialize the extractor
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderPyPdf())
extractor.load_llm("gpt-4o-mini")  # or any other supported model

# Extract data from the document
result = extractor.extract(test_file_path, InvoiceContract)
print("Invoice Number:", result.invoice_number)
print("Invoice Date:", result.invoice_date)
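To make the role of a contract more concrete, the following is a minimal sketch of the general idea: a typed schema that a model's JSON reply is checked against before the fields are used. It is an assumption-laden illustration, not ExtractThinker's implementation. It uses a standard-library dataclass instead of the library's Pydantic-based `Contract`, and `InvoiceFields`, `parse_invoice`, and the sample `reply` are all hypothetical.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class InvoiceFields:
    """Stand-in for the InvoiceContract above (not the library's Contract class)."""
    invoice_number: str
    invoice_date: str


def parse_invoice(raw_json: str) -> InvoiceFields:
    """Parse a JSON reply and check that every expected field is a string."""
    data = json.loads(raw_json)
    values = {}
    for field in fields(InvoiceFields):
        if field.name not in data:
            raise ValueError(f"missing field: {field.name}")
        if not isinstance(data[field.name], str):
            raise TypeError(f"field {field.name} must be a string")
        values[field.name] = data[field.name]
    return InvoiceFields(**values)


# Hypothetical model reply; real replies depend on the document and the LLM.
reply = '{"invoice_number": "INV-001", "invoice_date": "2024-01-31"}'
invoice = parse_invoice(reply)
print(invoice.invoice_number, invoice.invoice_date)
```

Rejecting incomplete or mistyped replies at this boundary means downstream code can rely on attribute access such as `invoice.invoice_number` without further checks.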

Classification Example

ExtractThinker can classify documents or document sections using custom classifications:

import os
from dotenv import load_dotenv
from extract_thinker import Extractor, Classification, Process, ClassificationStrategy

# Load API keys and other settings from a .env file
load_dotenv()

# Custom classification with a single category field
class CustomClassification(Classification):
    category: str

# Path to the document to classify
test_file_path = os.path.join("path_to_your_files", "invoice.pdf")

# Initialize the extractor
extractor = Extractor()
extractor.load_classification_strategy(ClassificationStrategy.CUSTOM)

# Define the classification
classification = CustomClassification(category="Invoice")

# Classify the document
result = extractor.classify(test_file_path, classification)
print("Category:", result.category)
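As a rough mental model of classification, a document is compared against a set of candidate classes and the best match is returned. The toy sketch below swaps the LLM for a simple keyword-overlap score so that it runs without any external service; it is not how ExtractThinker classifies documents. `Candidate`, `classify_text`, and the sample descriptions are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """A possible document class with a short textual description."""
    name: str
    description: str


def classify_text(text: str, candidates: List[Candidate]) -> str:
    """Return the candidate whose description shares the most words with the text."""
    words = set(text.lower().split())

    def overlap(candidate: Candidate) -> int:
        return len(words & set(candidate.description.lower().split()))

    return max(candidates, key=overlap).name


candidates = [
    Candidate("Invoice", "invoice number total amount due"),
    Candidate("Driver License", "license name date of birth class"),
]

label = classify_text("Invoice number 123 with total amount 40", candidates)
print("Category:", label)
```

An LLM-based classifier replaces the overlap score with a model judgment, but the shape is the same: candidate classes go in, one label comes out.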

Detailed Workflow

  1. Load documents: Load documents using a supported document loader (e.g., PyPdf or Tesseract OCR).
  2. Define extraction contracts: Define a custom extraction contract as a Pydantic model, specifying the fields to extract.
  3. Initialize the extractor: Create an Extractor instance, then load the document loader and the LLM.
  4. Extract data: Call the extract method to extract data from the document; the result exposes the fields defined in the contract.
  5. Classify documents: To classify a document or part of a document with a custom classification strategy, call the classify method.

With these steps, users can efficiently extract and classify data from documents in various formats and streamline document processing.
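When the same workflow is applied to many documents, the asynchronous processing mentioned in the feature list becomes relevant. The sketch below shows the general pattern with Python's standard `asyncio` module only. `extract_fields` is a hypothetical stand-in that performs no real loading or LLM call, and `process_all` is likewise an illustrative name rather than part of ExtractThinker.

```python
import asyncio
from typing import Dict, List


async def extract_fields(path: str) -> Dict[str, str]:
    """Hypothetical stand-in for a load + extract call; does no real I/O."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return {"path": path, "status": "extracted"}


async def process_all(paths: List[str]) -> List[Dict[str, str]]:
    """Run one extraction task per document and gather results in input order."""
    return await asyncio.gather(*(extract_fields(p) for p in paths))


results = asyncio.run(process_all(["a.pdf", "b.pdf", "c.pdf"]))
for item in results:
    print(item["path"], item["status"])
```

Because `asyncio.gather` returns results in the order the tasks were passed in, each result can be matched back to its input file by position.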

May not be reproduced without permission: Chief AI Sharing Circle » ExtractThinker: extracting and classifying documents into structured data to optimize the document processing flow
