Text Extraction API (text-extract-api): visual extraction of text information, anonymized PDF extraction tool


General Introduction

The Text Extraction API (text-extract-api) extracts and parses content from a variety of document formats, such as PDF, Word, and PPTX. It combines Optical Character Recognition (OCR) with Ollama-supported models to convert documents and images into structured JSON or Markdown. Key features include high-accuracy text extraction, removal of personally identifiable information (PII), multiple storage strategies, and distributed task processing. The API is built with FastAPI, uses Celery for asynchronous task processing, and uses Redis to cache OCR results.

The project also appears under the name pdf-extract-api. Beyond plain text, it supports high-precision extraction of tabular data, numbers, and mathematical formulas, and it can anonymize documents by removing PII during conversion.


Function List

High-precision OCR: Uses PyTorch, Marker, Llama3.2-vision, and other OCR strategies for high-precision text extraction.
Document conversion: Converts PDF, Word, PPTX, and other documents into Markdown or JSON.
PII removal: Automatically recognizes and removes personally identifiable information from documents.
Distributed processing: Uses Celery to distribute tasks across workers and improve throughput.
Caching mechanism: Uses Redis to cache OCR results so that repeated requests are not reprocessed.
Multi-storage strategy: Supports the local file system, Google Drive, and other storage backends.
CLI tools: Provides command-line tools for submitting tasks and retrieving results.

Usage Guide

Installation process

Download and install Ollama.
Download and install Docker.
Clone the text-extract-api repository:

   git clone https://github.com/CatchTheTornado/text-extract-api.git

Go to the project directory and start the Docker container:

   cd text-extract-api
   docker-compose up

Usage

Document Conversion

Place the document to be converted in a directory the CLI can read.
Use the CLI tool to submit a conversion task:

   python client/cli.py ocr_upload --file examples/example.pdf --prompt_file examples/example-to-json-prompt.txt

The conversion result will be saved in JSON or Markdown format in the specified directory.
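When many documents need converting, it can help to build the CLI call programmatically. The following is a minimal sketch that only assembles the argument list for the `ocr_upload` command shown above; the helper name `build_ocr_upload_command` and its parameters are hypothetical, and only the `ocr_upload`, `--file`, and `--prompt_file` arguments come from the command in this article.

```python
import shlex


def build_ocr_upload_command(file_path, prompt_file=None,
                             cli_path="client/cli.py", python="python"):
    """Return the argv list for the ocr_upload CLI call (hypothetical helper)."""
    # Base command: interpreter, CLI script, subcommand, and the input file.
    cmd = [python, cli_path, "ocr_upload", "--file", str(file_path)]
    # The prompt file is optional; add it only when one is given.
    if prompt_file is not None:
        cmd += ["--prompt_file", str(prompt_file)]
    return cmd


command = build_ocr_upload_command(
    "examples/example.pdf",
    prompt_file="examples/example-to-json-prompt.txt",
)
# Print a copy-pasteable shell line; nothing is executed here.
print(shlex.join(command))
```

The list can be passed to `subprocess.run(command, check=True)` from the project directory once the Docker containers are running.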

Removal of PII

Upload a document that contains PII.
Use the CLI tool to submit a PII-removal task:

   python client/cli.py ocr_upload --file examples/example-pii.pdf --prompt_file examples/example-remove-pii.txt

Processed documents will have all personally identifiable information removed.
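To make the idea of PII removal concrete, here is a small illustrative sketch. It is not the tool's method: the command above passes a prompt file, which suggests the removal is driven by a model prompt, whereas this stand-in uses two plain regular expressions. The patterns, placeholder labels, and the function name `redact_pii` are all assumptions chosen for illustration, and they only catch simple email addresses and phone numbers.

```python
import re

# Deliberately simple patterns; real PII detection needs far more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text):
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


print(redact_pii("Contact jane.doe@example.com or +1 555-123-4567."))
```

A regex pass like this misses names, addresses, and identifiers without a fixed shape, which is one reason a model-based approach may be preferred for real documents.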

Detailed function operation flow

High-Precision OCR: Different OCR strategies (e.g. Marker, Llama3.2-vision) can be configured, so users can pick the one best suited to the document type.
Document Conversion: PDF, Word, PPTX, and other document formats are converted to Markdown or JSON to simplify downstream data processing and analysis.
Removal of PII: Personally identifiable information is automatically recognized and removed from documents to protect data privacy.
Distributed Processing: Celery distributes tasks across workers, which supports large-scale document processing.
Caching Mechanism: Redis caches OCR results, which avoids reprocessing the same document and shortens response times.
Multi-Storage Policy: Several storage backends are supported, such as the local file system and Google Drive, and users can choose the one that fits their needs.
CLI Tools: Command-line tools let users submit tasks and retrieve results with simple commands.
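The caching mechanism described above can be sketched as follows. This is a minimal illustration of the general technique, not the project's implementation: a plain dictionary stands in for Redis, and the class name `OcrCache`, the key format, and the `ocr_fn` callback are all hypothetical. The cache key is derived from a hash of the file contents and the strategy name, so the same file processed with the same strategy is only run through OCR once.

```python
import hashlib


class OcrCache:
    """Content-addressed OCR result cache (dict used as a stand-in for Redis)."""

    def __init__(self, store=None):
        # Any mapping-like object works here; a dict keeps the sketch runnable.
        self.store = {} if store is None else store

    @staticmethod
    def key_for(file_bytes, strategy):
        # Hash the file contents so identical files share one cache entry.
        digest = hashlib.sha256(file_bytes).hexdigest()
        return f"ocr:{strategy}:{digest}"

    def get_or_compute(self, file_bytes, strategy, ocr_fn):
        key = self.key_for(file_bytes, strategy)
        if key in self.store:
            return self.store[key]  # cache hit: skip the OCR call
        result = ocr_fn(file_bytes)  # cache miss: run OCR and remember it
        self.store[key] = result
        return result


calls = []


def fake_ocr(data):
    # Placeholder for a real OCR strategy; records each invocation.
    calls.append(data)
    return data.decode("utf-8").upper()


cache = OcrCache()
first = cache.get_or_compute(b"hello", "marker", fake_ocr)
second = cache.get_or_compute(b"hello", "marker", fake_ocr)
print(first, second, len(calls))
```

Including the strategy name in the key keeps results from different OCR strategies separate, since they may produce different text for the same file.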
