PDF-Extract-Kit: An Open-Source Toolkit for Extracting Content from Structurally Complex PDFs

Latest AI Resources | Published 6 months ago by AI Sharing Circle

General Introduction

PDF-Extract-Kit is an open-source project developed by the OpenDataLab team that focuses on efficiently extracting high-quality content from complex and diverse PDF documents. It integrates document parsing technologies for layout detection, formula recognition, table extraction, and OCR, and is suited to academic papers, research reports, financial documents, and similar material. The tool has a modular design, so users can configure it to their needs and build customized document processing applications. PDF-Extract-Kit also provides comprehensive evaluation benchmarks to help users choose the most suitable model, and it is updated regularly: recent additions include the faster DocLayout-YOLO layout model and StructTable-InternVL2-1B, which supports multiple table output formats. Both developers and researchers can use it for efficient document content extraction.

Function List

Layout Detection: Recognizes page layout regions in PDFs, including headings, paragraphs, images, and tables, with support for efficient models such as DocLayout-YOLO.
Formula Recognition: Extracts mathematical formulas from documents and converts them to LaTeX, relying on models such as UniMERNet.
Table Extraction: Recognizes and extracts complex table content, with output in LaTeX, HTML, or Markdown.
OCR Processing: Converts text in scanned documents or images into editable text using technologies such as PaddleOCR.
Modular Configuration: Provides flexible configuration files that let users combine different models and quickly build applications.
Content Evaluation: Includes a variety of PDF parsing benchmarks to help users evaluate the effectiveness of different models.
Image and Text Extraction: Extracts images from PDFs and recognizes the text they contain.

Usage Guide

Installation process

PDF-Extract-Kit runs on multiple operating systems (e.g. Ubuntu, Windows, or macOS). The installation steps below use Ubuntu 20.04 as an example:

1. Environment preparation

Make sure Python 3.10 is installed on your system:

sudo apt update
sudo apt install python3.10 python3.10-dev python3-pip

Create and activate a conda environment:

conda create -n pdf-extract-kit python=3.10
conda activate pdf-extract-kit
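
Before installing anything into the new environment, it can help to confirm that the active interpreter really is Python 3.10. The following is a minimal sketch, not part of PDF-Extract-Kit; the function name `python_matches` is made up for this illustration.

```python
import sys

REQUIRED = (3, 10)  # the version named in the installation steps above


def python_matches(required=REQUIRED, version=None):
    """Return True when the interpreter's major.minor equals `required`.

    `version` defaults to the running interpreter; passing a tuple makes
    the check easy to exercise without switching environments.
    """
    current = tuple(version or sys.version_info)[:2]
    return current == tuple(required)


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if python_matches() else "mismatch"
    print(f"Python {found}: {status} (expected {REQUIRED[0]}.{REQUIRED[1]})")
```

Running this inside the activated environment shows at a glance whether `conda activate` picked up the intended interpreter.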

2. Install dependencies

Clone the code repository:

git clone https://github.com/opendatalab/PDF-Extract-Kit.git
cd PDF-Extract-Kit

Install the core dependencies (use requirements-cpu.txt instead if no GPU is available):
```
pip install -r requirements.txt
```
Note: If the doclayout-yolo installation fails, you can install it manually:
```
pip3 install doclayout-yolo==0.0.2 --extra-index-url=https://pypi.org/simple
```
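
The choice between the two requirements files can be scripted. The sketch below assumes that finding `nvidia-smi` on the PATH is a good enough hint that an NVIDIA GPU is usable; that heuristic and both function names are assumptions for illustration, not something the project prescribes.

```python
import shutil


def pick_requirements(has_gpu):
    """Map GPU availability to the requirements file named in the text."""
    return "requirements.txt" if has_gpu else "requirements-cpu.txt"


def gpu_probably_available():
    """Heuristic (assumption): an NVIDIA driver install puts nvidia-smi on PATH."""
    return shutil.which("nvidia-smi") is not None


if __name__ == "__main__":
    # Only prints the command; run it yourself once it looks right.
    print(f"pip install -r {pick_requirements(gpu_probably_available())}")
```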

3. Download model weights

Refer to the official tutorial to download the model files (full or partial download is supported):
- Automated downloads using Python scripts:
```
python scripts/download_models_hf.py
```
- Or download it manually from Hugging Face:
```
git lfs install
git clone https://huggingface.co/opendatalab/PDF-Extract-Kit-1.0
```
After the download is complete, place the model files in the paths specified in the project configuration (see configs/model_configs.yaml).
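
A common failure with the manual route is cloning without Git LFS active, which leaves small text "pointer" files where the weights should be. Git LFS pointer files begin with a fixed `version https://git-lfs.github.com/spec/v1` line, so they are easy to detect. The helper below is an illustrative sketch; `find_lfs_pointers` is a hypothetical name, and the directory you pass it is whatever your clone created.

```python
from pathlib import Path

# First line of every Git LFS pointer file.
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(root):
    """Return files under `root` that are still Git LFS pointer stubs."""
    stubs = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        with path.open("rb") as fh:
            # Reading only the header keeps this fast on multi-GB weights.
            if fh.read(len(LFS_HEADER)) == LFS_HEADER:
                stubs.append(path)
    return stubs
```

If the returned list is not empty, running `git lfs pull` inside the cloned model repository should replace the stubs with the real files.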

4. Verify the installation

Run the sample script to test that the environment is working:
```
python pdf_extract.py --pdf assets/examples/example.pdf
```
The output will be saved in the outputs folder.
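
To repeat the same check over a whole folder, one option is to generate one command per PDF. This sketch only builds the argument lists, using the `pdf_extract.py --pdf` invocation shown above; `build_commands` is a hypothetical helper, and actually executing the commands is left to the caller.

```python
import sys
from pathlib import Path


def build_commands(pdf_dir, script="pdf_extract.py"):
    """Return one argv list per *.pdf file in `pdf_dir`, sorted by name."""
    pdfs = sorted(Path(pdf_dir).glob("*.pdf"))
    # sys.executable keeps the runs inside the currently active environment.
    return [[sys.executable, script, "--pdf", str(pdf)] for pdf in pdfs]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the project root, which stops at the first failing file.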

Feature Workflows

Layout Detection

Prepare the PDF files: Place the PDFs to be processed in the project directory (e.g. assets/examples/).

Run layout detection:

Modify the input and output paths in configs/layout_detection.yaml:

pdf_path: "assets/examples/example.pdf"
output_dir: "outputs/layout_detection"

Execute the command:

python scripts/layout_detection.py --config=configs/layout_detection.yaml

View the results: the outputs/layout_detection folder will contain an image and a JSON file with the layout regions labeled.
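
The schema of the JSON file is not described here, so a schema-agnostic way to get oriented is to print its shape before writing any parsing code. The sketch below makes no assumption about key names; `summarize` and `summarize_file` are hypothetical helpers.

```python
import json


def summarize(value, depth=0, max_depth=2):
    """Describe the shape of parsed JSON without assuming any schema."""
    if isinstance(value, dict):
        if depth >= max_depth:
            return f"object with {len(value)} keys"
        return {k: summarize(v, depth + 1, max_depth) for k, v in value.items()}
    if isinstance(value, list):
        # Only the first element is inspected; lists are assumed homogeneous.
        first = summarize(value[0], depth + 1, max_depth) if value else None
        return {"list_length": len(value), "first_item": first}
    return type(value).__name__


def summarize_file(path):
    """Load a JSON file and return its structural summary."""
    with open(path, encoding="utf-8") as fh:
        return summarize(json.load(fh))
```

Calling `summarize_file` on the generated JSON shows which keys exist and how many regions were detected, which is usually enough to decide how to post-process the result.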

Formula Recognition

Run formula extraction:
- Use the default configuration:
```
python pdf_extract.py --pdf your_file.pdf --render
```
- The --render parameter renders each formula as an image for easy verification.
View the output: formulas are stored in the output JSON in LaTeX format and can be used directly in academic writing or further processing.
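
Once the LaTeX strings have been pulled out of the output JSON (how to locate them depends on the output schema, which is not covered here), a quick way to proofread them is to wrap them in a minimal document and compile it. This is a sketch under that assumption; `formulas_to_tex` is a hypothetical helper that takes a plain list of LaTeX strings.

```python
def formulas_to_tex(formulas):
    """Wrap LaTeX formula strings in a minimal compilable document."""
    body = "\n".join(
        "\\begin{equation}\n" + formula.strip() + "\n\\end{equation}"
        for formula in formulas
    )
    return (
        "\\documentclass{article}\n"
        "\\usepackage{amsmath}\n"
        "\\begin{document}\n"
        f"{body}\n"
        "\\end{document}\n"
    )
```

Writing the returned string to a `.tex` file and compiling it gives a numbered list of equations that can be compared against the source PDF.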

Table Extraction

Run table recognition:
- Make sure the StructTable-InternVL2-1B model has been downloaded.
- Run the full extraction:
```
python pdf_extract.py --pdf your_file.pdf
```
Select the output format:
- In the configuration file configs/model_configs.yaml, set table_format to latex, html, or markdown.
View the results: the table contents are saved to the output directory in the specified format.
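
If the format value comes from user input or a wrapper script, validating it before editing the configuration avoids a failed run. The sketch below checks only the three values named above; `normalize_table_format` is a hypothetical helper, and the tool itself may accept or reject values differently.

```python
# The three formats named in the text above.
ALLOWED_TABLE_FORMATS = ("latex", "html", "markdown")


def normalize_table_format(value):
    """Return the lower-cased format, or raise ValueError if unsupported."""
    key = value.strip().lower()
    if key not in ALLOWED_TABLE_FORMATS:
        allowed = ", ".join(ALLOWED_TABLE_FORMATS)
        raise ValueError(f"unsupported table_format {value!r}; expected one of: {allowed}")
    return key
```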

OCR Processing

Process scanned PDFs:
- For image-based PDFs, make sure OCR is enabled:
```
python pdf_extract.py --pdf scan_file.pdf --vis
```
- The --vis parameter generates visualization results that annotate the recognized text regions.
Check the output: the text content is saved in an editable format, and the visualization makes the recognition results easy to review.

Featured Functions

Modular Configuration

Edit configs/model_configs.yaml and adjust the parameters:
- img_size: Input image resolution.
- conf_thres: Confidence threshold.
- device: Select cuda (GPU) or cpu.

Example:

model_args:
  img_size: 1024
  conf_thres: 0.5
  device: "cuda"
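
When these values are set from a script, it is worth checking them before a long run. The dataclass below mirrors the three parameters from the example; the class name `ModelArgs` and the validation rules (positive size, threshold in [0, 1], device limited to the two values named above) are assumptions for illustration, not the project's own code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelArgs:
    """Illustrative mirror of the model_args block; not the project's class."""

    img_size: int = 1024
    conf_thres: float = 0.5
    device: str = "cuda"

    def __post_init__(self):
        if self.img_size <= 0:
            raise ValueError("img_size must be a positive integer")
        if not 0.0 <= self.conf_thres <= 1.0:
            raise ValueError("conf_thres must be between 0 and 1")
        # Only the two values mentioned in the text are accepted here.
        if self.device not in ("cuda", "cpu"):
            raise ValueError("device must be 'cuda' or 'cpu'")
```

For example, `ModelArgs(device="cpu")` passes, while `ModelArgs(conf_thres=1.5)` raises a `ValueError` before any model is loaded.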

High Performance Optimization

Batch processing can be enabled on high-end devices (≥16GB of video memory):
```
python pdf_extract.py --pdf your_file.pdf --batch-size 128
```
This can increase parsing speed by 50% or more, which makes it well suited to processing documents in bulk.
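
Whether to pass `--batch-size` can be decided from the GPU's reported memory. The sketch below parses the output of `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits`, which prints one value in MiB per GPU, and adds the flag only above a threshold. The helper names and the 16 × 1024 MiB default are assumptions; cards marketed as 16GB often report slightly less, so the threshold may need lowering.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"]


def parse_total_mib(output):
    """Parse nvidia-smi output: one integer (MiB) per GPU, one per line."""
    return [int(line) for line in output.splitlines() if line.strip()]


def query_total_mib():
    """Return total memory per GPU in MiB, or [] if nvidia-smi is unavailable."""
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_total_mib(result.stdout)


def extraction_args(pdf, total_mib, threshold_mib=16 * 1024, batch_size=128):
    """Build the pdf_extract.py arguments, adding --batch-size on large GPUs."""
    args = ["pdf_extract.py", "--pdf", pdf]
    if total_mib and max(total_mib) >= threshold_mib:
        args += ["--batch-size", str(batch_size)]
    return args
```

A call such as `extraction_args("your_file.pdf", query_total_mib())` then yields the plain command on small or missing GPUs and the batched one otherwise.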

Multi-language support

Set lang to auto. The tool then detects the language of the document automatically and selects the appropriate OCR model:
```
ocr_args:
  lang: "auto"
```

Notes

Hardware requirements: A GPU (e.g. an NVIDIA card) can dramatically increase processing speed; ≥8GB of video memory is recommended.
Common problems:
- If you get an error about a missing cv2 module, run pip install opencv-python.
- If the model download is incomplete, check your network connection or switch to the other download method.
Community support: If you have questions, ask them in the project's GitHub Discussions or Issues.
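
Missing-module errors such as the cv2 one can be caught before a run by checking whether each import can be resolved. The sketch below uses the standard library's `importlib.util.find_spec`; the `PIP_NAMES` mapping contains only the cv2 entry mentioned above, and `missing_modules` is a hypothetical helper.

```python
from importlib.util import find_spec

# Import name -> pip package name; only the pair named in the text is listed.
PIP_NAMES = {"cv2": "opencv-python"}


def missing_modules(names):
    """Return the top-level module names that cannot be imported."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    for name in missing_modules(PIP_NAMES):
        print(f"{name} is missing; try: pip install {PIP_NAMES[name]}")
```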

Through the above steps, users can easily get started with PDF-Extract-Kit and efficiently complete the extraction of complex PDF content.