
PDF-Extract-Kit: An Open-Source Tool for Extracting Structured Content from Complex PDFs

General Introduction

PDF-Extract-Kit is an open-source project developed by the OpenDataLab team that focuses on extracting high-quality content from complex and diverse PDF documents. It integrates document parsing models for layout detection, formula recognition, table extraction, and OCR, and is suited to academic papers, research reports, financial documents, and similar material. The tool is modular: users can configure and combine components as needed to build customized document processing applications. PDF-Extract-Kit also provides evaluation benchmarks to help users choose the most suitable model, and it is updated regularly; recent additions include the faster DocLayout-YOLO layout model and StructTable-InternVL2-1B, which supports multiple table output formats. Both developers and researchers can use it for efficient document content extraction.


Function List

  • Layout Detection: Recognizes page layout in PDFs, including regions such as headings, paragraphs, images, and tables, with support for efficient models such as DocLayout-YOLO.
  • Formula Recognition: Extracts and parses mathematical formulas from documents and converts them to LaTeX, relying on models such as UniMERNet.
  • Table Extraction: Recognizes and extracts complex table content, with output in LaTeX, HTML, or Markdown.
  • OCR Processing: Converts text in scanned documents or images into editable text using tools such as PaddleOCR.
  • Modular Configuration: Provides flexible configuration files that let users combine different models and build applications quickly.
  • Content Evaluation: Includes a variety of built-in PDF parsing benchmarks to help users compare the effectiveness of different models.
  • Image and Text Extraction: Extracts images from PDFs and recognizes the text they contain.

 

Usage Guide

Installation process

PDF-Extract-Kit runs on multiple operating systems (e.g. Ubuntu, Windows, or macOS). The installation steps below use Ubuntu 20.04 as an example:

1. Environment preparation

  • Make sure Python 3.10 is installed on your system:
    sudo apt update
    sudo apt install python3.10 python3.10-dev python3-pip
  • Create and activate a virtual environment:
    conda create -n pdf-extract-kit python=3.10
    conda activate pdf-extract-kit
    

2. Installation of dependencies

  • Clone the code repository:
    git clone https://github.com/opendatalab/PDF-Extract-Kit.git
    cd PDF-Extract-Kit
    
  • Install the core dependencies (if you have no GPU, use requirements-cpu.txt instead):
    pip install -r requirements.txt
    

    Note: If the doclayout-yolo installation fails, you can install it manually:

    pip3 install doclayout-yolo==0.0.2 --extra-index-url=https://pypi.org/simple
    

3. Download model weights

  • Refer to the official tutorial to download the model files (full or partial download is supported):
    • Download automatically with the Python script:
      python scripts/download_models_hf.py
      
    • Or download it manually from Hugging Face:
      git lfs install
      git clone https://huggingface.co/opendatalab/PDF-Extract-Kit-1.0
      
  • After the download is complete, place the model files in the path specified in the project's configuration (see configs/model_configs.yaml).

4. Verification of installation

  • Run the sample script to test that the environment is working:
    python pdf_extract.py --pdf assets/examples/example.pdf
    

    The output will be saved in the outputs folder.

Functional operation flow

Layout Detection

  1. Prepare the PDF file: Place the PDF to be processed in the project directory (e.g. assets/examples/).
  2. Run layout detection:
    • Edit the input and output paths in configs/layout_detection.yaml:
      pdf_path: "assets/examples/example.pdf"
      output_dir: "outputs/layout_detection"
      
    • Execute the command:
      python scripts/layout_detection.py --config=configs/layout_detection.yaml
      
  3. View the results: The outputs/layout_detection folder will contain an image and a JSON file with the layout regions labeled.
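To get a quick overview of what the layout step found, the JSON result can be summarized per category. The following is a minimal sketch, not the project's own tooling: it assumes the file holds a list of page records, each with a `layout_dets` list whose items carry a `category_id`. Those field names are assumptions, so check them against your actual output before relying on the counts.

```python
import json
from collections import Counter
from pathlib import Path


def count_layout_regions(pages):
    """Count detected regions per category across all pages.

    Assumed (hypothetical) schema: `pages` is a list of page records, and
    each record may have a "layout_dets" list whose items carry "category_id".
    """
    counts = Counter()
    for page in pages:
        for det in page.get("layout_dets", []):
            # Regions without a category are grouped under "unknown".
            counts[det.get("category_id", "unknown")] += 1
    return counts


def summarize_layout_file(json_path):
    """Load one layout JSON file and return its per-category counts."""
    with Path(json_path).open(encoding="utf-8") as fh:
        return count_layout_regions(json.load(fh))
```

Calling `summarize_layout_file` on a file in outputs/layout_detection returns a `Counter`, which makes it easy to spot pages where, say, no table regions were detected.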

Formula Recognition

  1. Run formula extraction:
    • Use the default configuration:
      python pdf_extract.py --pdf your_file.pdf --render
      
    • The --render parameter renders each formula as an image for easy verification.
  2. View the output: Formulas are stored in the output JSON in LaTeX format and can be used directly in academic writing or further processing.
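Because the exact layout of the output JSON is not described here, one schema-tolerant way to pull the formulas out is to walk the parsed JSON and collect every string stored under a given key. The sketch below assumes that key is named `latex`; that name is a guess, and the `key` argument lets you change it once you have inspected a real output file.

```python
def collect_latex(node, key="latex"):
    """Recursively gather every string stored under `key` in parsed JSON.

    The default key name "latex" is an assumption about the output schema.
    """
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key and isinstance(value, str):
                found.append(value)
            else:
                # Keep descending into nested objects and arrays.
                found.extend(collect_latex(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_latex(item, key))
    return found
```

Pass it the result of `json.load` on the output file; the returned list preserves document order, so formulas can be written out one per line for review.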

Table Extraction

  1. Run table recognition:
    • Make sure the StructTable-InternVL2-1B model has been downloaded.
    • Run the full extraction:
      python pdf_extract.py --pdf your_file.pdf
      
  2. Select the output format:
    • In the configuration file configs/model_configs.yaml, set table_format to latex, html, or markdown.
  3. View the results: The table contents are saved to the output directory in the specified format.
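If you switch table formats often, the setting can be changed with a small helper instead of by hand. This is only a text-level sketch: it assumes the setting appears as a `table_format: value` line in the YAML file, it does not parse YAML, and any trailing comment on that line would be dropped. The function name is illustrative.

```python
import re

# The three formats named in the text above.
ALLOWED_FORMATS = ("latex", "html", "markdown")

# Matches a whole "table_format: ..." line; group 1 keeps indentation and key.
_TABLE_FORMAT_LINE = re.compile(
    r"^([ \t]*table_format[ \t]*:[ \t]*).*$", re.MULTILINE
)


def set_table_format(config_text, fmt):
    """Return `config_text` with every table_format value replaced by `fmt`."""
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"table_format must be one of {ALLOWED_FORMATS}")
    new_text, replaced = _TABLE_FORMAT_LINE.subn(
        lambda match: match.group(1) + fmt, config_text
    )
    if replaced == 0:
        raise KeyError("no table_format line found in the configuration text")
    return new_text
```

Read the file, pass its text through `set_table_format`, and write the result back; keeping a copy of the original file is a sensible precaution.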

OCR Processing

  1. Process scanned PDFs:
    • For image-based PDFs, make sure OCR is enabled:
      python pdf_extract.py --pdf scan_file.pdf --vis
      
    • The --vis parameter generates visualization results that annotate the recognized text regions.
  2. Check the output: The text content is saved in an editable format, and the visualization makes the OCR results easy to inspect.

Featured Functions

Modular Configuration

  • Edit configs/model_configs.yaml to adjust the parameters:
    • img_size: Image resolution.
    • conf_thres: Confidence threshold.
    • device: Either cuda (GPU) or cpu.
  • Example:
    model_args:
      img_size: 1024
      conf_thres: 0.5
      device: "cuda"
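A typo in these three values usually surfaces only once a model starts loading, so it can help to check them up front. The following is a minimal sketch of such a check on an already-parsed `model_args` mapping. The accepted ranges are assumptions drawn from the parameter descriptions above (a positive integer size, a threshold between 0 and 1, and the two device names listed), not limits documented by the project.

```python
def validate_model_args(args):
    """Return a list of problems found in a model_args mapping (empty if none)."""
    problems = []

    img_size = args.get("img_size")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(img_size, int) or isinstance(img_size, bool) or img_size <= 0:
        problems.append("img_size must be a positive integer")

    conf = args.get("conf_thres")
    if (
        not isinstance(conf, (int, float))
        or isinstance(conf, bool)
        or not 0.0 <= conf <= 1.0
    ):
        problems.append("conf_thres must be a number between 0 and 1")

    # Only the two device names mentioned in the text are accepted here.
    if args.get("device") not in ("cuda", "cpu"):
        problems.append('device must be "cuda" or "cpu"')

    return problems
```

For the example configuration above, `validate_model_args({"img_size": 1024, "conf_thres": 0.5, "device": "cuda"})` returns an empty list.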
    

High Performance Optimization

  • Batch processing can be enabled on high-end devices (≥16 GB of video memory):
    python pdf_extract.py --pdf your_file.pdf --batch-size 128
    
  • This can increase parsing speed by 50% or more, which makes it well suited to large batch jobs.
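To apply the same command to a whole folder of PDFs, the invocations can be generated and run from a short driver script. This sketch uses only the `--pdf` and `--batch-size` flags shown above; the helper names are illustrative, and it assumes the script is run from the project root where pdf_extract.py lives.

```python
import subprocess
import sys
from pathlib import Path


def find_pdfs(pdf_dir):
    """Return the PDF files directly inside `pdf_dir`, sorted by name."""
    return sorted(Path(pdf_dir).glob("*.pdf"))


def build_commands(pdf_paths, batch_size=128, script="pdf_extract.py"):
    """Build one command line (as an argument list) per PDF path."""
    return [
        [sys.executable, script, "--pdf", str(pdf), "--batch-size", str(batch_size)]
        for pdf in pdf_paths
    ]


def run_all(pdf_dir, batch_size=128):
    """Run the extraction for each PDF in turn, stopping at the first failure."""
    for cmd in build_commands(find_pdfs(pdf_dir), batch_size):
        subprocess.run(cmd, check=True)
```

Building the argument lists separately from running them makes it easy to print and review the commands first, and to lower `batch_size` on cards with less video memory.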

Multi-language support

  • Set lang to auto. The OCR module then detects the document's language automatically and selects the appropriate OCR model:
    ocr_args:
      lang: "auto"
    

Notes

  • Hardware requirements: A GPU (e.g. an NVIDIA card) can dramatically increase processing speed; ≥8 GB of video memory is recommended.
  • Common problems:
    • If you are told that cv2 is missing, run pip install opencv-python.
    • If the model download is incomplete, check your network connection or switch to the other download method.
  • Community support: If you have questions, ask them in the project's GitHub Discussions or Issues.
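The missing-cv2 problem listed above can be caught before a long run starts by checking whether the module is importable. A minimal sketch follows; the mapping contains only the one module and package named in the text, and you would extend it yourself for other dependencies.

```python
import importlib.util

# Maps an importable module name to the pip package that provides it.
REQUIRED = {"cv2": "opencv-python"}


def missing_packages(required=REQUIRED):
    """Return the pip package names whose modules cannot be found."""
    return [
        package
        for module, package in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages, install with: pip install " + " ".join(missing))
    else:
        print("All checked modules are importable.")
```

`importlib.util.find_spec` only locates the module without importing it, so the check is fast and has no side effects.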

With the steps above, users can get started with PDF-Extract-Kit and extract content from complex PDFs efficiently.
