General Introduction
MinerU is an open-source data extraction tool developed by the OpenDataLab team at the Shanghai Artificial Intelligence Laboratory. It focuses on efficiently extracting content from complex PDF documents, web pages, and eBooks. It converts multimodal PDF documents containing images, formulas, tables, and other elements into easy-to-analyze Markdown, which speeds up AI corpus preparation. MinerU consists of two main components: Magic-PDF, which processes PDF documents, and Magic-Doc, which processes web pages and eBooks. The tool is cross-platform and runs on Windows, Linux, and macOS.
An online demo of MinerU is available on ModelScope and Hugging Face.
Feature List
- Automatically remove headers, footers, footnotes and page numbers from PDFs
- Preserve the structure and formatting of the original document such as headings, paragraphs, lists, etc.
- Convert images and tables in documents to Markdown formatting
- Convert math formulas in PDF to LaTeX format
- Compatible with Windows, Linux and macOS operating systems
- Support for extracting content from web pages and eBooks
Usage Guide
Installation process
- Environment preparation:
- Make sure that Python 3.9 or later is installed on your system.
- A virtual environment (such as venv or conda) is recommended to avoid dependency conflicts.
- Installation of dependencies:
- Create a virtual environment using conda:
conda create -n MinerU python=3.10
conda activate MinerU
- Or use venv:
python -m venv MinerU
source MinerU/bin/activate # on Linux or macOS
MinerU\Scripts\activate # on Windows
- Install Magic-PDF:
- Install the dependencies first. The full-featured package requires detectron2, which otherwise has to be compiled from source. Use the following command to install a precompiled detectron2 package (Python 3.10 only):
pip install detectron2 --extra-index-url https://wheels.myhloli.com
- Install the full-featured package of Magic-PDF:
pip install magic-pdf[full]==0.6.2b1
- Download the model weights file:
- Download the model weights file according to the instructions in the project documentation and move it to a directory with sufficient disk space, preferably an SSD.
- Configure Magic-PDF:
- Copy the magic-pdf.template.json configuration file from the root directory of the repository to your home directory and rename it magic-pdf.json:
cp magic-pdf.template.json ~/magic-pdf.json
- Configure "models-dir" in the magic-pdf.json file to point to the directory where the model weights are located:
{ "models-dir": "/tmp/models" }
- Acceleration configuration (if required):
- If you have an NVIDIA GPU or a Mac with Apple Silicon, you can use CUDA or MPS for acceleration. For CUDA, install the PyTorch build that matches your CUDA version (the command below targets CUDA 11.8):
pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118
- Set the "device-mode" value in the magic-pdf.json configuration file to "cuda" (NVIDIA GPU) or "mps" (Apple Silicon) to enable acceleration.
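The preparation and configuration steps above can also be scripted. The following is a minimal sketch, not part of MinerU itself: the helper names check_environment and write_config are hypothetical, and the set of accepted "device-mode" values ("cpu", "cuda", "mps") is an assumption based on the acceleration options described above. It checks the Python version, reports whether a virtual environment appears to be active, and writes magic-pdf.json from the template.

```python
import json
import os
import sys
from pathlib import Path

# Assumed set of accepted "device-mode" values; verify against the project docs.
ASSUMED_DEVICE_MODES = {"cpu", "cuda", "mps"}


def check_environment(min_version=(3, 9)):
    """Report whether the interpreter meets the version and isolation advice."""
    return {
        # The guide asks for Python 3.9 or later.
        "python_ok": sys.version_info >= min_version,
        # Inside a venv, sys.prefix differs from sys.base_prefix.
        "in_venv": sys.prefix != sys.base_prefix,
        # conda sets CONDA_DEFAULT_ENV whenever an environment (including base) is active.
        "in_conda": bool(os.environ.get("CONDA_DEFAULT_ENV")),
    }


def write_config(template_path, target_path, models_dir, device_mode="cpu"):
    """Copy the template config, setting "models-dir" and "device-mode"."""
    if device_mode not in ASSUMED_DEVICE_MODES:
        raise ValueError(f"unexpected device-mode: {device_mode!r}")
    config = json.loads(Path(template_path).read_text(encoding="utf-8"))
    config["models-dir"] = str(models_dir)
    config["device-mode"] = device_mode
    Path(target_path).write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

For example, write_config("magic-pdf.template.json", Path.home() / "magic-pdf.json", "/tmp/models", device_mode="cuda") would mirror the manual copy-and-edit steps, assuming the template sits in the current directory.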
Using Magic-PDF
Use Magic-PDF via the command line:
magic-pdf pdf-command --pdf "pdf_path" --inside_model true
This will process the specified PDF file and save the resulting Markdown file in the /tmp/magic-pdf directory.
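To call the same command from a script, it can be wrapped with Python's subprocess module. This is a minimal sketch that assumes the magic-pdf executable is on PATH and uses exactly the flags shown above. The helper names build_command and convert_pdf are hypothetical, and the recursive search for .md files is an assumption, since the exact layout under /tmp/magic-pdf is not described here.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path):
    """Build the argument list for the CLI call shown in this guide."""
    return [
        "magic-pdf", "pdf-command",
        "--pdf", str(pdf_path),
        "--inside_model", "true",
    ]


def convert_pdf(pdf_path, output_root="/tmp/magic-pdf"):
    """Run Magic-PDF on one file and return the Markdown files found afterwards."""
    pdf_path = Path(pdf_path)
    if not pdf_path.is_file():
        raise FileNotFoundError(pdf_path)
    # check=True raises CalledProcessError if the tool exits with an error.
    subprocess.run(build_command(pdf_path), check=True)
    # Assumption: results are written somewhere below output_root as .md files.
    return sorted(Path(output_root).rglob("*.md"))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.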
Using Magic-Doc
The installation and configuration process for Magic-Doc is similar to that of Magic-PDF, but the specific commands and configuration details may differ. Refer to the project's documentation for details.