General Introduction
pdf-extract-api is a document extraction and parsing API that combines state-of-the-art OCR with models served through Ollama, and it can also anonymize documents. It converts documents and images into structured JSON or Markdown, with high-precision extraction of tabular data, numbers, and mathematical formulas. Built on FastAPI, the API uses Celery for asynchronous task processing and Redis to cache OCR results, so that a document that has already been processed does not need to go through OCR again.
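To make the architecture concrete, here is a minimal sketch of the cache-then-queue pattern the introduction describes. It is not the project's code: a plain dict stands in for Redis, a `ThreadPoolExecutor` stands in for Celery workers, and `fake_ocr`, `OcrService`, and the hash-based key scheme are hypothetical names chosen only for illustration.

```python
import hashlib
from concurrent.futures import Future, ThreadPoolExecutor


def fake_ocr(data: bytes) -> str:
    """Hypothetical stand-in for the real OCR step."""
    return f"# Extracted text ({len(data)} bytes)"


class OcrService:
    """Cache-then-queue sketch: dict ~ Redis, thread pool ~ Celery workers."""

    def __init__(self, ocr=fake_ocr, workers: int = 2):
        self._ocr = ocr
        self._cache: dict[str, str] = {}
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self.ocr_calls = 0  # counts how often the expensive step actually ran

    @staticmethod
    def cache_key(data: bytes) -> str:
        # Identical file contents map to the same key (assumed key scheme).
        return "ocr:" + hashlib.sha256(data).hexdigest()

    def _run(self, key: str, data: bytes) -> str:
        self.ocr_calls += 1
        text = self._ocr(data)
        self._cache[key] = text  # store the result for later requests
        return text

    def submit(self, data: bytes) -> Future:
        """Return a future immediately; the caller collects the result later."""
        key = self.cache_key(data)
        if key in self._cache:
            done: Future = Future()
            done.set_result(self._cache[key])  # cache hit: no OCR work queued
            return done
        return self._pool.submit(self._run, key, data)


service = OcrService()
first = service.submit(b"%PDF-1.7 example").result()
second = service.submit(b"%PDF-1.7 example").result()  # served from the cache
```

The second call returns the same text without running `fake_ocr` again, which is the saving a result cache is meant to provide.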
Feature List
- Highly accurate PDF to Markdown conversion
- PDF to JSON Conversion
- Improvement of OCR results using an LLM (e.g. Llama 3.1)
- Removal of Personally Identifiable Information (PII)
- Distributed queue processing (using Celery)
- Results caching (using Redis)
- CLI tools for sending tasks and processing results
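To illustrate what PII removal means for extracted text, the sketch below redacts two kinds of identifiers with regular expressions. This is a deliberately simpler technique swapped in for illustration: judging by the example prompt file used later in this document, the project appears to delegate PII removal to an LLM prompt, not to regexes. The names `PII_PATTERNS` and `redact` and both patterns are assumptions, and the patterns are far too naive for production use.

```python
import re

# Deliberately simple, illustrative patterns; real PII detection needs far more.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text: str) -> str:
    """Replace each match with a bracketed label such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


sample = "Contact Jane at jane.doe@example.com or +1 555-010-9999."
print(redact(sample))  # prints: Contact Jane at [EMAIL] or [PHONE].
```

Note that the name "Jane" survives: names, addresses, and free-form identifiers are exactly where pattern matching falls short and a language model can help.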
Usage Guide
Installation process
1. **Clone the repository**:
git clone https://github.com/CatchTheTornado/pdf-extract-api.git && cd pdf-extract-api
2. **Install dependencies**:
Ensure that Docker and Docker Compose are installed, then run the following command:
```bash
docker-compose up
```
Usage Process
- Convert PDF to Markdown:
Use the CLI tool to send a task and process the result, for example:
python client/cli.py ocr --file examples/example-mri.pdf --prompt_file examples/example-mri-2-json-prompt.txt
This runs OCR on the PDF and converts it to Markdown; the file passed with --prompt_file carries the instructions that are applied to the extracted text.
- Convert PDF to JSON and remove PII:
python client/cli.py ocr --file examples/example-invoice.pdf --prompt_file examples/example-invoice-remove-pii.txt
This will convert the PDF file to JSON format and remove personally identifiable information.
- Cache OCR results:
OCR results are cached in Redis to improve processing efficiency.
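The two CLI invocations above differ only in the input file and the prompt file, so they are easy to wrap in a small helper. The following is a sketch, not part of the project: `build_ocr_command` and `run_ocr` are hypothetical names, only the `ocr` subcommand and the `--file` and `--prompt_file` flags shown above are used, and it is an assumption that the CLI writes its result to standard output.

```python
import subprocess
import sys
from pathlib import Path


def build_ocr_command(pdf: str, prompt_file: str) -> list[str]:
    """Assemble the CLI call using only the flags shown above."""
    return [
        sys.executable, "client/cli.py", "ocr",
        "--file", pdf,
        "--prompt_file", prompt_file,
    ]


def run_ocr(pdf: str, prompt_file: str) -> str:
    """Run the CLI from the repository root and return whatever it prints."""
    for path in (pdf, prompt_file):
        if not Path(path).is_file():
            raise FileNotFoundError(path)  # fail early on a mistyped path
    completed = subprocess.run(
        build_ocr_command(pdf, prompt_file),
        capture_output=True, text=True, check=True,
    )
    return completed.stdout  # assumption: the result is written to stdout


command = build_ocr_command(
    "examples/example-invoice.pdf",
    "examples/example-invoice-remove-pii.txt",
)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces, and `check=True` turns a non-zero exit code into an exception instead of a silently empty result.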
Detailed Operation Procedure
- Starting services: Make sure the Docker containers are running properly; once the services have started, OCR tasks can be sent with the CLI tool.
- Sending tasks: Use the CLI tool to send an OCR task, specifying the input file and the conversion format.
- Retrieving results: When the task completes, the result is output in the specified format and can be used directly or processed further.