
PDF-Extract-Kit: An Open Source Tool for Extracting Content from Complex PDFs

General Introduction

PDF-Extract-Kit is an open source project developed by the OpenDataLab team that focuses on efficiently extracting high-quality content from complex and diverse PDF documents. It integrates document parsing models for layout detection, formula recognition, table extraction and OCR, and is suited to academic papers, research reports, financial documents and similar material. The tool has a modular design, so users can configure and combine components as needed to build customized document processing applications. PDF-Extract-Kit also provides comprehensive evaluation benchmarks to help users choose the most suitable model, and it is updated regularly: recent additions include the faster DocLayout-YOLO and StructTable-InternVL2-1B, which supports multiple output formats. Both developers and researchers can use it for efficient document content extraction.



 

Function List

  • Layout Detection: Recognizes page layouts in PDFs, including regions such as headings, paragraphs, images and tables, with support for efficient models such as DocLayout-YOLO.
  • Formula Recognition: Extracts mathematical formulas from documents and converts them to LaTeX, relying on models such as UniMERNet.
  • Table Extraction: Recognizes and extracts complex table content, with output in LaTeX, HTML or Markdown format.
  • OCR Processing: Converts text in scanned documents or images into editable text using technologies such as PaddleOCR.
  • Modular Configuration: Provides flexible configuration files that let users combine different models and quickly build applications.
  • Content Evaluation: Includes several built-in PDF parsing benchmarks to help users evaluate the effectiveness of different models.
  • Image and Text Extraction: Extracts images from PDFs and recognizes the text they contain.

 

Usage Guide

Installation process

PDF-Extract-Kit runs on multiple operating systems (e.g. Ubuntu, Windows or macOS). The detailed installation steps below use Ubuntu 20.04 as the example:

1. Environment preparation

  • Make sure Python 3.10 is installed on your system:
    sudo apt update
    sudo apt install python3.10 python3.10-dev python3-pip
  • Create and activate a virtual environment:
    conda create -n pdf-extract-kit python=3.10
    conda activate pdf-extract-kit
    

2. Installation of dependencies

  • Clone the code repository:
    git clone https://github.com/opendatalab/PDF-Extract-Kit.git
    cd PDF-Extract-Kit
    
  • Install the core dependencies (use requirements-cpu.txt instead if no GPU is available):
    pip install -r requirements.txt
    

    Note: If the doclayout-yolo installation fails, you can install it manually:

    pip3 install doclayout-yolo==0.0.2 --extra-index-url=https://pypi.org/simple
    

3. Download model weights

  • Refer to the official tutorial to download the model files (full or partial download is supported):
    • Automated downloads using Python scripts:
      python scripts/download_models_hf.py
      
    • Or download it manually from Hugging Face:
      git lfs install
      git clone https://huggingface.co/opendatalab/PDF-Extract-Kit-1.0
      
  • After the download is complete, place the model files in the paths specified in the project directory (see configs/model_configs.yaml).
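
If the manual git clone route is used without Git LFS fully working, large weight files can be left behind as small text stubs. The following is a minimal sketch, not part of PDF-Extract-Kit, that looks for such stubs. It relies only on the fact that a Git LFS pointer file begins with the line "version https://git-lfs.github.com/spec/v1"; the function name find_lfs_pointers and the 1 KB size cutoff are assumptions made for illustration.

```python
from pathlib import Path

# Every Git LFS pointer file starts with this line (pointer spec v1).
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(model_dir):
    """Return files under model_dir that are still Git LFS pointer stubs."""
    root = Path(model_dir)
    if not root.is_dir():
        return []
    stubs = []
    for path in sorted(root.rglob("*")):
        # Skip directories and anything inside the repository's .git folder.
        if not path.is_file() or ".git" in path.relative_to(root).parts:
            continue
        # Assumed cutoff: pointer stubs are tiny, real weights are far larger.
        if path.stat().st_size > 1024:
            continue
        with path.open("rb") as fh:
            if fh.read(len(LFS_HEADER)) == LFS_HEADER:
                stubs.append(path)
    return stubs
```

Calling find_lfs_pointers("PDF-Extract-Kit-1.0") on the cloned folder and getting an empty list suggests the weights were fetched; any paths it returns are files that still need to be downloaded.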

4. Verification of installation

  • Run the sample script to test that the environment is working:
    python pdf_extract.py --pdf assets/examples/example.pdf
    

    The output will be saved in the outputs folder.

Feature Workflows

Layout Detection

  1. Prepare the PDF file: Place the PDF to be processed in the project directory (e.g. assets/examples/).
  2. Run layout detection:
    • Set the input and output paths in configs/layout_detection.yaml:
      pdf_path: "assets/examples/example.pdf"
      output_dir: "outputs/layout_detection"
      
    • Execute the command:
      python scripts/layout_detection.py --config=configs/layout_detection.yaml
      
  3. View results: The outputs/layout_detection folder will contain an image and a JSON file with the layout regions labeled.
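
To confirm that the run produced both kinds of result, the output folder can be inspected with a few lines of Python. This is a minimal sketch under stated assumptions: the helper name summarize_outputs is hypothetical, and the image extensions it looks for (.png, .jpg, .jpeg) are a guess, since the article does not say which image format the tool writes.

```python
from pathlib import Path

# Assumed image formats; the article does not state which one is used.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def summarize_outputs(output_dir):
    """Group the files in output_dir into annotated images and JSON results."""
    root = Path(output_dir)
    if not root.is_dir():
        return {"images": [], "json": []}
    files = [p for p in sorted(root.iterdir()) if p.is_file()]
    return {
        "images": [p.name for p in files if p.suffix.lower() in IMAGE_SUFFIXES],
        "json": [p.name for p in files if p.suffix.lower() == ".json"],
    }
```

For example, summarize_outputs("outputs/layout_detection") returns a dictionary whose "images" and "json" lists should both be non-empty after a successful run.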

Formula Recognition

  1. Run formula extraction:
    • Use the default configuration:
      python pdf_extract.py --pdf your_file.pdf --render
      
    • The --render parameter renders the formulas as images for easy verification.
  2. View output: Formulas are stored in the output JSON in LaTeX format and can be used directly in academic writing or further processing.

Table Extraction

  1. Run table recognition:
    • Make sure the StructTable-InternVL2-1B model has been downloaded.
    • Run the full extraction:
      python pdf_extract.py --pdf your_file.pdf
      
  2. Select the output format:
    • In the configuration file configs/model_configs.yaml, set table_format to latex, html or markdown.
  3. View results: The table contents are saved to the output directory in the specified format.

OCR Processing

  1. Process scanned PDFs:
    • For image-based PDFs, make sure OCR is enabled:
      python pdf_extract.py --pdf scan_file.pdf --vis
      
    • The --vis parameter generates visualization results that annotate the recognized text regions.
  2. Check output: The text content is saved in an editable format, and the visualization makes the recognition results easy to check against the page.

Featured Functions

Modular Configuration

  • Edit configs/model_configs.yaml and adjust the parameters:
    • img_size: Image resolution.
    • conf_thres: Confidence threshold.
    • device: Select cuda (GPU) or cpu.
  • Example:
    model_args:
      img_size: 1024
      conf_thres: 0.5
      device: "cuda"
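
Before launching a long run, it can help to sanity-check these three values. The sketch below is not part of PDF-Extract-Kit: validate_model_args is a hypothetical helper that takes the model_args section as a plain dictionary (for instance after loading the YAML file with a parser of your choice). The rule that conf_thres lies between 0 and 1 is an assumption based on how confidence thresholds are usually defined.

```python
# Device names taken from the parameter list above.
ALLOWED_DEVICES = ("cuda", "cpu")


def validate_model_args(model_args):
    """Return a list of problems found in a model_args mapping (empty if none)."""
    problems = []

    img_size = model_args.get("img_size")
    if isinstance(img_size, bool) or not isinstance(img_size, int) or img_size <= 0:
        problems.append("img_size must be a positive integer")

    conf_thres = model_args.get("conf_thres")
    # Assumption: a confidence threshold is a probability in [0, 1].
    if (isinstance(conf_thres, bool)
            or not isinstance(conf_thres, (int, float))
            or not 0.0 <= conf_thres <= 1.0):
        problems.append("conf_thres must be a number between 0 and 1")

    if model_args.get("device") not in ALLOWED_DEVICES:
        problems.append("device must be 'cuda' or 'cpu'")

    return problems
```

With the example values above (img_size 1024, conf_thres 0.5, device "cuda") the function returns an empty list; a mistyped device or an out-of-range threshold produces one message per problem.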
    

High Performance Optimization

  • Batch processing can be enabled on high-end devices (≥16GB of video memory):
    python pdf_extract.py --pdf your_file.pdf --batch-size 128
    
  • This can increase parsing speed by 50% or more, which makes it suitable for processing documents in bulk.
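
When a whole folder of PDFs needs processing, the same command can be generated once per file. The following is a minimal sketch, assuming the pdf_extract.py script and the --pdf, --batch-size and --vis flags behave as described in this article (they are not verified here against the repository); build_commands is a hypothetical helper name.

```python
import sys
from pathlib import Path


def build_commands(pdf_dir, batch_size=None, visualize=False):
    """Build one pdf_extract.py command (as an argv list) per PDF in pdf_dir."""
    commands = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        # Flags are the ones quoted in this article, not checked against the CLI.
        cmd = [sys.executable, "pdf_extract.py", "--pdf", str(pdf)]
        if batch_size is not None:
            cmd += ["--batch-size", str(batch_size)]
        if visualize:
            cmd.append("--vis")
        commands.append(cmd)
    return commands
```

Each returned list can be passed to subprocess.run(cmd, check=True) from the repository root. Building argv lists rather than shell strings avoids quoting problems with file names that contain spaces.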

Multi-language support

  • Set lang to auto. The tool then automatically recognizes the language of the document and selects the appropriate OCR model:
    ocr_args:
      lang: "auto"
    

Notes

  • Hardware requirements: A GPU (e.g. an NVIDIA card) can dramatically increase processing speed; ≥8GB of video memory is recommended.
  • Common problems:
    • If you see an error about a missing cv2 module, run pip install opencv-python.
    • If the model download is incomplete, check the network connection or switch to the other download method.
  • Community support: If you have questions, ask them in the Discussions or Issues sections of the GitHub repository.
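
Problems such as the missing cv2 module can be caught before a run with a short environment check. This is a minimal sketch using only the Python standard library; the helper names are hypothetical, and treating 3.10 as a minimum version (rather than the exact version named in the installation steps) is an assumption.

```python
import importlib.util
import sys


def python_version_ok(version_info=None, minimum=(3, 10)):
    """Return True if the interpreter is at least `minimum` (assumed floor)."""
    vi = sys.version_info if version_info is None else version_info
    return tuple(vi[:2]) >= tuple(minimum)


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

For example, missing_modules(["cv2"]) returns ["cv2"] when OpenCV is not installed, which is the case the pip install opencv-python fix addresses.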

Through the above steps, users can easily get started with PDF-Extract-Kit and efficiently complete the extraction of complex PDF content.
