General Introduction
The Text Extraction API (text-extract-api, also referred to as pdf-extract-api) extracts and parses content from a variety of document formats such as PDF, Word, and PPTX. It uses Optical Character Recognition (OCR) and Ollama-supported models to convert documents and images into structured JSON or Markdown, including high-precision extraction of tabular data, numbers, and mathematical formulas. Key features include removal of personally identifiable information (PII) for document anonymization, support for multiple storage strategies, and distributed task processing. The API is built with FastAPI, uses Celery for asynchronous task processing, and uses Redis to cache OCR results so that documents are processed efficiently and reliably.
Feature List
- High-precision OCR: Uses PyTorch, Marker, Llama3.2-vision, and other OCR strategies for high-precision text extraction.
- Document conversion: Converts PDF, Word, PPTX, and other documents into Markdown or JSON.
- PII removal: Automatically recognizes and removes personally identifiable information from documents.
- Distributed processing: Uses Celery for distributed task processing to improve processing efficiency.
- Caching mechanism: Uses Redis to cache OCR results and reduce repeated processing time.
- Multi-storage strategy: Supports the local file system, Google Drive, and other storage methods.
- CLI tools: Provides command-line tools for sending tasks and processing the results.
Usage Guide
Installation process
- Download and install Ollama.
- Download and install Docker.
- Clone the text-extract-api repository:
git clone https://github.com/CatchTheTornado/text-extract-api.git
- Go to the project directory and start the Docker container:
cd text-extract-api
docker-compose up
Usage
Document Conversion
- Upload the documents to be converted to the specified directory.
- Use the CLI tool to send conversion tasks:
python client/cli.py ocr_upload --file examples/example.pdf --prompt_file examples/example-to-json-prompt.txt
- The conversion result will be saved in JSON or Markdown format in the specified directory.
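When several documents need converting, the CLI call above can be scripted. The following is a minimal sketch, not part of the project: the helper names `build_ocr_upload_command` and `upload_all` are hypothetical, and it assumes the script runs from the repository root so that `client/cli.py` resolves. It uses only the `ocr_upload`, `--file`, and `--prompt_file` arguments shown above.

```python
import subprocess
import sys
from pathlib import Path


def build_ocr_upload_command(file_path, prompt_file, cli_path="client/cli.py"):
    """Return the argv list for one ocr_upload call (hypothetical helper)."""
    return [
        sys.executable,          # the Python interpreter running this script
        cli_path,                # assumed path, relative to the repository root
        "ocr_upload",
        "--file", str(file_path),
        "--prompt_file", str(prompt_file),
    ]


def upload_all(directory, prompt_file, dry_run=True):
    """Build (and optionally run) one ocr_upload command per PDF in `directory`."""
    commands = [
        build_ocr_upload_command(pdf, prompt_file)
        for pdf in sorted(Path(directory).glob("*.pdf"))
    ]
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError if the CLI exits with an error
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Dry run: print the commands that would be executed for the examples folder.
    for command in upload_all("examples", "examples/example-to-json-prompt.txt"):
        print(" ".join(command))
```

With `dry_run=True` (the default) nothing is executed, so the commands can be inspected before any task is sent.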
Removal of PII
- Upload a document that contains PII.
- Use the CLI tool to send a PII removal task:
python client/cli.py ocr_upload --file examples/example-pii.pdf --prompt_file examples/example-remove-pii.txt
- Processed documents will have all personally identifiable information removed.
Detailed Feature Overview
- High-precision OCR: Configurable OCR strategies (e.g. Marker, Llama3.2-vision) provide high-precision text extraction for various documents. Users can choose the strategy best suited to the document type.
- Document conversion: PDF, Word, PPTX, and other document formats are converted to Markdown or JSON to facilitate subsequent data processing and analysis.
- PII removal: Automatically recognizes and removes personally identifiable information from documents to protect data privacy and security.
- Distributed processing: Uses Celery for distributed task processing to support large-scale document processing and improve efficiency.
- Caching mechanism: Uses Redis to cache OCR results, reducing repeated processing and improving system response time.
- Multi-storage strategy: Supports storage methods such as the local file system and Google Drive. Users can choose the storage strategy that fits their needs.
- CLI tools: Command-line tools let users send tasks and process results with simple commands.
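The caching idea in the list above can be illustrated with a small sketch. This is not the project's implementation: the class `OcrCache`, the key format, and the use of a plain dictionary in place of Redis are all assumptions made for illustration. It shows only the general technique of keying OCR results by a hash of the file content, so that an identical upload skips the OCR step.

```python
import hashlib


class OcrCache:
    """Content-addressed cache for OCR results (in-memory stand-in for Redis)."""

    def __init__(self):
        self._store = {}  # assumption: a dict replaces the Redis connection

    @staticmethod
    def key_for(file_bytes, strategy):
        # Hash the file content so identical uploads map to the same key, and
        # include the strategy name so results from different strategies stay apart.
        digest = hashlib.sha256(file_bytes).hexdigest()
        return f"ocr:{strategy}:{digest}"

    def get_or_compute(self, file_bytes, strategy, run_ocr):
        """Return the cached result, calling run_ocr only on a cache miss."""
        key = self.key_for(file_bytes, strategy)
        if key not in self._store:
            self._store[key] = run_ocr(file_bytes)
        return self._store[key]


if __name__ == "__main__":
    calls = []

    def fake_ocr(data):
        # Placeholder for a real OCR call; records each invocation.
        calls.append(data)
        return data.decode().upper()

    cache = OcrCache()
    cache.get_or_compute(b"hello", "marker", fake_ocr)
    cache.get_or_compute(b"hello", "marker", fake_ocr)
    print(len(calls))  # prints 1: the second request was served from the cache
```

Swapping the dictionary for a Redis client would keep the same structure; only the lookup and store calls would change.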