
MinerU: Extracting PDF documents and converting them to multimodal Markdown, with OCR support for scanned e-books

General Introduction

MinerU is an open-source data extraction tool developed by the OpenDataLab team at the Shanghai Artificial Intelligence Laboratory. It focuses on efficiently extracting content from complex PDF documents, web pages, and e-books. It converts multimodal PDF documents containing images, formulas, tables, and other elements into Markdown that is easy to analyze, which makes preparing corpora for AI work considerably more efficient. MinerU consists of two main components: Magic-PDF, which processes PDF documents, and Magic-Doc, which processes web pages and e-books. The tool is cross-platform and runs on Windows, Linux, and macOS.

An online demo of MinerU is available on ModelScope and Hugging Face.


 


 

Feature List

  • Automatically remove headers, footers, footnotes and page numbers from PDFs
  • Preserve the structure and formatting of the original document such as headings, paragraphs, lists, etc.
  • Convert images and tables in documents to Markdown formatting
  • Convert math formulas in PDFs to LaTeX
  • Compatible with Windows, Linux and macOS operating systems
  • Support for extracting content from web pages and eBooks

 

Usage Guide

Installation process

  1. Environment preparation:
    • Make sure that Python 3.9 or later is installed on your system.
    • A virtual environment (such as venv or conda) is recommended to avoid dependency conflicts.
  2. Create a virtual environment:
    • Create a virtual environment using conda:
      conda create -n MinerU python=3.10
      conda activate MinerU
      
    • Or use venv:
      python -m venv MinerU
      source MinerU/bin/activate  # on Linux or macOS
      MinerU\Scripts\activate  # on Windows
      
  3. Install Magic-PDF:
    • The full-featured package depends on detectron2, which normally has to be compiled and installed from source. To skip compilation, install the precompiled detectron2 package with the following command (Python 3.10 only):
      pip install detectron2 --extra-index-url https://wheels.myhloli.com
      
    • Install the full-featured package of Magic-PDF:
      pip install "magic-pdf[full]==0.6.2b1"
      
  4. Download the model weights file:
    • Download the model weights file according to the instructions in the project documentation and move it to a directory with sufficient disk space, preferably an SSD.
  5. Configure Magic-PDF:
    • Copy the magic-pdf.template.json configuration file from the root directory of the repository to your home directory and rename it magic-pdf.json:
      cp magic-pdf.template.json ~/magic-pdf.json
      
    • Configure "models-dir" in the magic-pdf.json file to point to the directory where the model weights are located:
      {
        "models-dir": "/tmp/models"
      }
      
  6. Acceleration configuration (optional):
    • If you have an Nvidia GPU or a Mac with Apple Silicon, you can use CUDA or MPS for acceleration. For CUDA, install the PyTorch build that matches your CUDA version (the example below targets CUDA 11.8):
      pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118
      
    • Set the "device-mode" value in the magic-pdf.json configuration file to enable acceleration, for example "cuda" for an Nvidia GPU or "mps" for Apple Silicon.
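
The configuration steps above can also be scripted. The following is a minimal sketch, not part of MinerU itself: the helper name update_magic_pdf_config is hypothetical, and the "device-mode" values it is called with ("cpu", "cuda", "mps") are assumptions that should be checked against the project documentation for your version. It only rewrites the two keys discussed above and leaves any other keys copied from the template untouched.

```python
import json
from pathlib import Path


def update_magic_pdf_config(config_path, models_dir, device_mode="cpu"):
    """Set "models-dir" and "device-mode" in a magic-pdf.json file.

    Hypothetical helper: keys already present in the file (for example
    those copied from magic-pdf.template.json) are preserved.
    """
    config_path = Path(config_path)
    config = {}
    if config_path.exists():
        # Start from the existing configuration so other settings survive.
        config = json.loads(config_path.read_text(encoding="utf-8"))
    config["models-dir"] = str(models_dir)
    # Assumed values: "cpu", "cuda" or "mps"; no validation is done here.
    config["device-mode"] = device_mode
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For example, update_magic_pdf_config(Path.home() / "magic-pdf.json", "/tmp/models", "cuda") would point the configuration in your home directory at /tmp/models and request CUDA acceleration.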

Using Magic-PDF

Use Magic-PDF via the command line:

magic-pdf pdf-command --pdf "pdf_path" --inside_model true

This will process the specified PDF file and save the resulting Markdown file in the /tmp/magic-pdf directory.
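
To convert several PDFs in one go, the command can be wrapped in a short script. This is a sketch under the assumption that the magic-pdf executable is on your PATH and accepts exactly the arguments shown above; the function names build_command and convert_folder are hypothetical and not part of Magic-PDF.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path):
    """Return the magic-pdf argument list for one PDF (same flags as the command above)."""
    return ["magic-pdf", "pdf-command", "--pdf", str(pdf_path), "--inside_model", "true"]


def convert_folder(folder):
    """Run magic-pdf on every *.pdf directly inside `folder`.

    Returns the list of PDF paths whose conversion exited with a non-zero status.
    """
    failed = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        # Passing a list (no shell) keeps paths with spaces intact.
        result = subprocess.run(build_command(pdf))
        if result.returncode != 0:
            failed.append(pdf)
    return failed
```

Calling convert_folder("papers") would process each PDF in a folder named papers one after another and return the files that failed, so they can be retried or inspected separately.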

Using Magic-Doc

The installation and configuration process for Magic-Doc is similar to that of Magic-PDF, but the specific commands and configuration details may differ. Refer to the project's documentation for more information.
