General Introduction
PDF-Extract-Kit is an open-source project developed by the OpenDataLab team that focuses on efficiently extracting high-quality content from complex and diverse PDF documents. It integrates document parsing models for layout detection, formula recognition, table extraction, and OCR, and is applicable to academic papers, research reports, financial documents, and similar material. The tool has a modular design, so users can configure it flexibly to build customized document processing applications. PDF-Extract-Kit also provides comprehensive evaluation benchmarks to help users choose the most suitable model, and it is updated regularly; recent additions include the faster DocLayout-YOLO and StructTable-InternVL2-1B, which supports multiple output formats. Both developers and researchers can use it for efficient document content extraction.
Function List
- Layout Detection: Recognize page layouts in PDF, including areas such as headings, paragraphs, images and tables, with support for efficient models such as DocLayout-YOLO.
- Formula Recognition: Extract and parse mathematical formulas from documents and convert them to LaTeX, relying on models such as UniMERNet.
- Table Extraction: Recognize and extract complex table content, with output in LaTeX, HTML, and Markdown formats.
- OCR Processing: Convert text in scanned documents or images into editable text through technologies such as PaddleOCR.
- Modular Configuration: Provides flexible profiles that allow users to combine different models and quickly build applications.
- Content Evaluation: A variety of built-in PDF parsing benchmarks help users evaluate the effectiveness of different models.
- Image and Text Extraction: Support for extracting images from PDFs and recognizing their text content.
Usage Guide
Installation process
PDF-Extract-Kit is supported on multiple operating systems (e.g. Ubuntu, Windows, or macOS). The detailed installation steps below use Ubuntu 20.04 as an example:
1. Environment preparation
- Make sure Python 3.10 is installed on your system:
sudo apt update
sudo apt install python3.10 python3.10-dev python3-pip
- Create and activate a virtual environment:
conda create -n pdf-extract-kit python=3.10
conda activate pdf-extract-kit
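Once the environment is active, it can help to confirm that the interpreter really is Python 3.10 before installing anything. The following is a minimal sketch; the helper name is_supported_python is hypothetical and not part of PDF-Extract-Kit.

```python
import sys


def is_supported_python(version_info=None, required=(3, 10)):
    """Return True if the major.minor version equals `required`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == tuple(required)


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if is_supported_python() else "unexpected version"
    print(f"Python {found}: {status}")
```

Running the script inside the activated environment prints the detected version and whether it matches 3.10.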
2. Installation of dependencies
- Clone the code repository:
git clone https://github.com/opendatalab/PDF-Extract-Kit.git
cd PDF-Extract-Kit
- Install the core dependencies (if no GPU is available, use requirements-cpu.txt instead):
pip install -r requirements.txt
Note: If the doclayout-yolo installation fails, you can install it manually:
pip3 install doclayout-yolo==0.0.2 --extra-index-url=https://pypi.org/simple
3. Download model weights
- Refer to the official tutorial to download the model files (full or partial download is supported):
  - Download automatically using the Python script:
    python scripts/download_models_hf.py
  - Or download manually from Hugging Face:
    git lfs install
    git clone https://huggingface.co/opendatalab/PDF-Extract-Kit-1.0
- After the download is complete, place the model files in the path specified in the project directory (refer to configs/model_configs.yaml).
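Because a partial download is possible, a quick check that the expected weight files are actually in place can save a failed run later. The sketch below assumes you supply the list of relative paths yourself; the file names in the example are hypothetical placeholders, and the real ones should be taken from configs/model_configs.yaml.

```python
from pathlib import Path


def find_missing_files(root, expected):
    """Return the relative paths in `expected` that are not files under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    # Hypothetical file names for illustration only; take the real paths
    # from configs/model_configs.yaml.
    expected = ["models/layout/weights.pt", "models/formula/weights.pth"]
    for rel in find_missing_files(".", expected):
        print(f"missing: {rel}")
```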
4. Verification of installation
- Run the sample script to test that the environment is working:
python pdf_extract.py --pdf assets/examples/example.pdf
The output will be saved in the outputs folder.
Feature Workflows
Layout Detection
- Prepare the PDF file: Place the PDF to be processed in the project directory (e.g. assets/examples/).
- Run layout detection:
  - Modify the input path in configs/layout_detection.yaml:
    pdf_path: "assets/examples/example.pdf"
    output_dir: "outputs/layout_detection"
  - Execute the command:
    python scripts/layout_detection.py --config=configs/layout_detection.yaml
- View the results: The outputs/layout_detection folder will contain an image and a JSON file with the layout regions labeled.
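To confirm that a run produced both the annotated images and the JSON files, the output folder can be summarized by file type. This is a minimal sketch that only inspects file extensions and makes no assumption about the JSON schema; the function name count_outputs_by_suffix is hypothetical.

```python
from collections import Counter
from pathlib import Path


def count_outputs_by_suffix(output_dir):
    """Count files under `output_dir` (recursively) by lowercase extension."""
    root = Path(output_dir)
    if not root.is_dir():
        return Counter()
    return Counter(p.suffix.lower() for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    counts = count_outputs_by_suffix("outputs/layout_detection")
    for suffix, count in sorted(counts.items()):
        print(f"{suffix or '(no extension)'}: {count}")
```

An empty result means the folder is missing or contains no files, which usually points to a wrong output_dir or a failed run.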
Formula Recognition
- Run formula extraction:
  - Use the default configuration:
    python pdf_extract.py --pdf your_file.pdf --render
    The --render parameter renders the formula as an image for easy verification.
- View the output: Formulas are stored in the output JSON in LaTeX format and can be used directly in academic writing or further processing.
Table Extraction
- Run table recognition:
  - Make sure the StructTable-InternVL2-1B model has been downloaded.
  - Run the full extraction:
    python pdf_extract.py --pdf your_file.pdf
- Select the output format:
  - Modify the configuration file configs/model_configs.yaml, setting table_format to latex, html, or markdown.
- View the results: The table contents will be saved to the output directory in the specified format.
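To make the choice between the three table_format values concrete, the sketch below renders the same small table as Markdown, HTML, and LaTeX. It is purely illustrative: render_table is a hypothetical helper, it does not call PDF-Extract-Kit, and the exact markup produced by StructTable-InternVL2-1B may differ.

```python
def render_table(rows, table_format):
    """Render `rows` (first row is the header) as latex, html, or markdown.

    Cell contents are inserted as-is; no escaping is performed.
    """
    if not rows:
        raise ValueError("rows must contain at least a header row")
    header, body = rows[0], rows[1:]
    if table_format == "markdown":
        lines = ["| " + " | ".join(header) + " |",
                 "| " + " | ".join("---" for _ in header) + " |"]
        lines += ["| " + " | ".join(row) + " |" for row in body]
        return "\n".join(lines)
    if table_format == "html":
        head = "<tr>" + "".join(f"<th>{c}</th>" for c in header) + "</tr>"
        rest = ["<tr>" + "".join(f"<td>{c}</td>" for c in row) + "</tr>"
                for row in body]
        return "<table>" + head + "".join(rest) + "</table>"
    if table_format == "latex":
        lines = ["\\begin{tabular}{" + "l" * len(header) + "}"]
        lines += [" & ".join(row) + " \\\\" for row in rows]
        lines.append("\\end{tabular}")
        return "\n".join(lines)
    raise ValueError(f"unsupported table_format: {table_format!r}")


if __name__ == "__main__":
    table = [["Model", "Task"], ["DocLayout-YOLO", "layout detection"]]
    for fmt in ("markdown", "html", "latex"):
        print(f"--- {fmt} ---")
        print(render_table(table, fmt))
```

Markdown is the easiest to read in plain text, HTML suits web display, and LaTeX fits directly into academic manuscripts.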
OCR Processing
- Process scanned PDFs:
  - For image-based PDFs, make sure OCR is enabled:
    python pdf_extract.py --pdf scan_file.pdf --vis
    The --vis parameter generates visualization results that annotate the recognized text regions.
- Check the output: The text content is saved in an editable format, and the image-text recognition results can be reviewed at a glance.
Advanced Features
Modular Configuration
- Edit configs/model_configs.yaml and adjust the parameters:
  - img_size: image resolution.
  - conf_thres: confidence threshold.
  - device: cuda (GPU) or cpu.
- Example:
model_args:
  img_size: 1024
  conf_thres: 0.5
  device: "cuda"
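Before writing values into the configuration file, it can be useful to sanity-check them. The following is a minimal sketch that mirrors the three parameters above; the ModelArgs class and its validation ranges (positive image size, threshold between 0 and 1) are assumptions for illustration, not the project's own schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelArgs:
    """Hypothetical container mirroring the model_args keys shown above."""
    img_size: int = 1024
    conf_thres: float = 0.5
    device: str = "cuda"

    def __post_init__(self):
        # Assumed constraints; the project itself may accept other values.
        if self.img_size <= 0:
            raise ValueError("img_size must be a positive integer")
        if not 0.0 <= self.conf_thres <= 1.0:
            raise ValueError("conf_thres must lie in [0, 1]")
        if self.device not in ("cuda", "cpu"):
            raise ValueError("device must be 'cuda' or 'cpu'")


if __name__ == "__main__":
    print(ModelArgs())               # defaults match the example above
    print(ModelArgs(device="cpu"))   # CPU-only machine
```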
High Performance Optimization
- Batch processing can be enabled on high-end devices (≥16GB of video memory):
python pdf_extract.py --pdf your_file.pdf --batch-size 128
- This increases parsing speed by 50% or more and is suited to bulk workloads.
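When many PDFs need to be processed, a small driver script can issue one command per file. The sketch below reuses only the --pdf and --batch-size flags shown above and assumes it is run from the repository root, where pdf_extract.py lives; build_commands is a hypothetical helper.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(pdf_dir, batch_size=None):
    """Return one pdf_extract.py command (argument list) per PDF in `pdf_dir`."""
    commands = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        cmd = [sys.executable, "pdf_extract.py", "--pdf", str(pdf)]
        if batch_size is not None:
            cmd += ["--batch-size", str(batch_size)]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    for cmd in build_commands("assets/examples", batch_size=128):
        # check=True stops the loop at the first file that fails.
        subprocess.run(cmd, check=True)
```

Building the argument lists separately from running them makes it easy to print and inspect the commands before committing GPU time.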
Multi-language support
- Set lang to auto so that the language of the document is recognized automatically and the appropriate OCR model is selected:
ocr_args:
  lang: "auto"
Notes
- Hardware requirements: A GPU (e.g. an NVIDIA card) can dramatically increase processing speed; ≥8GB of video memory is recommended.
- Common problems:
  - If you are told that cv2 is missing, run pip install opencv-python.
  - If the model download is incomplete, check the network or change the download method.
- Community support: If you have questions, ask them on the GitHub Discussions or Issues boards.
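Missing-module errors such as the cv2 case can be caught before running the pipeline. The following is a minimal sketch using the standard library's importlib; the module-to-package mapping contains only the cv2 example from the list above, and the helper name missing_modules is hypothetical.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names in `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Import name -> pip package name; only cv2 is taken from this guide.
    pip_names = {"cv2": "opencv-python"}
    for module in missing_modules(pip_names):
        print(f"{module} is missing; try: pip install {pip_names[module]}")
```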
Through the above steps, users can easily get started with PDF-Extract-Kit and efficiently complete the extraction of complex PDF content.