AI Personal Learning
and practical guidance

pdf-extract-api: a locally runnable PDF extraction tool that anonymizes personal information

General Introduction

pdf-extract-api is a document extraction and parsing API that supports document anonymization using state-of-the-art OCR technology and models supported by Ollama. It converts documents and images into structured JSON or Markdown and supports high-precision extraction of tabular data, numbers, and mathematical formulas. Built on FastAPI, the API uses Celery for asynchronous task processing and Redis to cache OCR results, which keeps document processing efficient and reliable.



 

Feature List

  • Highly accurate PDF to Markdown conversion
  • PDF to JSON Conversion
  • Improved OCR results using an LLM (e.g. Llama 3.1)
  • Removal of Personally Identifiable Information (PII)
  • Distributed queue processing (using Celery)
  • Results caching (using Redis)
  • CLI tools for sending tasks and processing results

 

Usage Guide

Installation process

  1. Clone the repository::

        git clone https://github.com/CatchTheTornado/pdf-extract-api.git
        cd pdf-extract-api

  2. Install dependencies:
    Ensure that Docker and Docker Compose are installed, then run the following command::

        docker-compose up

Usage Process

  1. Convert PDF to Markdown:
    Use the CLI tool to send a task and process the result, for example::

        python client/cli.py ocr --file examples/example-mri.pdf --prompt_file examples/example-mri-2-json-prompt.txt

    This converts the PDF file to Markdown format.

  2. Convert PDF to JSON and remove PII::

        python client/cli.py ocr --file examples/example-invoice.pdf --prompt_file examples/example-invoice-remove-pii.txt

    This converts the PDF file to JSON format and removes personally identifiable information.

  3. Cache OCR results:
    OCR results are cached in Redis to improve processing efficiency.
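
The caching step above can be illustrated with a minimal sketch. It is not the project's actual implementation: the key scheme (``ocr:`` plus a SHA-256 digest of the file contents), the function names ``cache_key`` and ``cached_ocr``, and the use of a plain dictionary in place of Redis are all assumptions made for illustration.

```python
import hashlib
from typing import Callable, Dict


def cache_key(file_bytes: bytes) -> str:
    """Derive a stable key from the file contents (hypothetical key scheme)."""
    return "ocr:" + hashlib.sha256(file_bytes).hexdigest()


def cached_ocr(
    file_bytes: bytes,
    run_ocr: Callable[[bytes], str],
    cache: Dict[str, str],
) -> str:
    """Return the cached OCR text if present; otherwise run OCR and store it."""
    key = cache_key(file_bytes)
    if key in cache:
        return cache[key]
    text = run_ocr(file_bytes)
    cache[key] = text
    return text


if __name__ == "__main__":
    def fake_ocr(data: bytes) -> str:
        # Stand-in for a real OCR engine; it only decodes the bytes.
        return data.decode("utf-8")

    store: Dict[str, str] = {}  # a plain dict standing in for Redis
    print(cached_ocr(b"sample page", fake_ocr, store))  # runs fake_ocr
    print(cached_ocr(b"sample page", fake_ocr, store))  # served from the cache
```

Keying on a hash of the contents rather than the file name means a renamed copy of the same document still hits the cache, while an edited file under the same name does not.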

Detailed Operation Procedure

  • Starting the services: Make sure the Docker containers are running properly. Once the services have started, OCR tasks can be sent via the CLI tool.
  • Sending tasks: Use the CLI tool to send OCR tasks, specifying the input file and the prompt file that controls the conversion.
  • Processing results: When a task completes, the results are output in the specified format and can be used directly or processed further.
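
The task-sending step can also be scripted. The sketch below wraps the ``ocr`` subcommand and its ``--file`` and ``--prompt_file`` flags exactly as they appear in the commands shown earlier. The helper names ``build_ocr_command`` and ``run_ocr`` are hypothetical, and the sketch assumes that the script runs from the repository root, that the services are already up, and that the CLI prints its result to standard output.

```python
import subprocess
import sys
from typing import List


def build_ocr_command(pdf_path: str, prompt_file: str) -> List[str]:
    """Build the argument list for the `ocr` subcommand of client/cli.py."""
    return [
        sys.executable,  # the current Python interpreter
        "client/cli.py",
        "ocr",
        "--file", pdf_path,
        "--prompt_file", prompt_file,
    ]


def run_ocr(pdf_path: str, prompt_file: str) -> str:
    """Run the CLI and return whatever it writes to standard output.

    Raises subprocess.CalledProcessError if the CLI exits with an error.
    """
    result = subprocess.run(
        build_ocr_command(pdf_path, prompt_file),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, ``run_ocr("examples/example-invoice.pdf", "examples/example-invoice-remove-pii.txt")`` would issue the same request as the PII-removal command above.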
May not be reproduced without permission: Chief AI Sharing Circle
