Marker: an open source tool for quickly converting PDF to Markdown

Latest AI Resources | updated 5 months ago | AI Sharing Circle

General Introduction

Marker is a deep learning-based document processing tool that converts PDF files to Markdown quickly and accurately. It supports a wide range of document types and is optimized in particular for books and scientific papers. Marker removes redundant content such as headers and footers, formats tables and code blocks, and extracts and saves images. It also converts most equations to LaTeX and runs on GPU, CPU, or MPS.

Function List

Convert PDF files to Markdown format
Support multiple document types, including books and scientific papers
Remove redundant content such as headers and footers
Format tables and code blocks
Extract and save images
Convert most equations to LaTeX format
Run on GPU, CPU, or MPS

Usage

Installation process

Install dependencies: make sure Python 3.9 or later is installed, then install Marker with pip:
```
pip install marker-pdf
```

Running example:

marker_single /path/to/file.pdf /path/to/output/folder --batch_multiplier 2 --max_pages 10
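
Before running the example, it can help to confirm that the command-line tools ended up on your PATH. The following is a minimal sketch, assuming that `pip install marker-pdf` exposes the `marker_single` and `marker` commands used in this guide; the helper name `find_marker_commands` is made up for illustration and is not part of Marker.

```python
import shutil

# Command names taken from the examples in this guide.
MARKER_COMMANDS = ("marker_single", "marker")


def find_marker_commands(commands=MARKER_COMMANDS):
    """Map each command name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in commands}


if __name__ == "__main__":
    for name, path in find_marker_commands().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

If a command is reported as not found, the most likely cause is that pip installed into a different Python environment than the one active in your shell.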

Guidelines for use

Converting individual files

marker_single /path/to/file.pdf /path/to/output/folder --batch_multiplier 2 --max_pages 10

--batch_multiplier multiplies the default batch size, which is useful if you have spare VRAM. Higher values use more VRAM but process faster. The default is 2, and the default batch size needs approximately 3GB of VRAM.
--max_pages is the maximum number of pages to process. Omit it to convert the entire document.
--langs is an optional comma-separated list of the languages in the document, used for OCR. It is optional by default but required if Tesseract is used.
--ocr_all_pages is an optional parameter that forces OCR on every page of the PDF. OCR is forced if either this parameter or the environment variable `OCR_ALL_PAGES` is true.

The list of languages supported by Surya OCR is published by the Surya project. If you need more languages, you can use any language supported by Tesseract by setting OCR_ENGINE to ocrmypdf. If OCR is not required, Marker can work with any language.
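
When calling `marker_single` from a script, the options above can be assembled programmatically. This is a rough sketch rather than an official wrapper: the function name `build_marker_single_cmd` is hypothetical, it uses only the options described in this section, and it assumes `--ocr_all_pages` is passed as a bare flag.

```python
import shlex


def build_marker_single_cmd(pdf_path, output_dir, batch_multiplier=None,
                            max_pages=None, langs=None, ocr_all_pages=False):
    """Build the argument list for marker_single from the options in this guide."""
    cmd = ["marker_single", str(pdf_path), str(output_dir)]
    if batch_multiplier is not None:
        cmd += ["--batch_multiplier", str(batch_multiplier)]
    if max_pages is not None:
        cmd += ["--max_pages", str(max_pages)]
    if langs:
        # --langs takes a comma-separated list of document languages.
        cmd += ["--langs", ",".join(langs)]
    if ocr_all_pages:
        # Assumption: the option is a switch with no value.
        cmd.append("--ocr_all_pages")
    return cmd


cmd = build_marker_single_cmd("/path/to/file.pdf", "/path/to/output/folder",
                              batch_multiplier=2, max_pages=10)
print(shlex.join(cmd))
# marker_single /path/to/file.pdf /path/to/output/folder --batch_multiplier 2 --max_pages 10
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` to run the conversion.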

Converting multiple files

marker /path/to/input/folder /path/to/output/folder --workers 4 --max 10 --min_length 10000

--workers is the number of PDFs converted at the same time. The default is 1; raising it increases throughput at the cost of higher CPU/GPU utilization. Each worker process uses about 5GB of VRAM at peak and 3.5GB on average.
--max is the maximum number of PDFs to convert. Omit it to convert every PDF in the folder.
--min_length is the minimum number of characters that must be extractable from a PDF for it to be processed. If you are processing many PDFs, setting this value is recommended so that PDFs consisting mainly of images are skipped rather than sent through slow OCR.
--metadata_file is an optional path to a JSON file containing metadata about the PDFs. If provided, it is used to set the language for each PDF. Setting the language is optional for Surya (the default OCR engine) but required for Tesseract. The format is as follows:

{
  "pdf1.pdf": {"languages": ["English"]},
  "pdf2.pdf": {"languages": ["Spanish", "Russian"]},
  ...
}
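
For a large folder, a metadata file in the format shown above is easier to generate than to write by hand. The sketch below is one possible way to do that; `build_metadata` and `write_metadata_file` are hypothetical helper names, and the file names and languages simply mirror the sample above.

```python
import json
from pathlib import Path


def build_metadata(languages_by_pdf):
    """Turn {"file.pdf": ["English", ...]} into the metadata layout shown above."""
    return {name: {"languages": list(langs)}
            for name, langs in languages_by_pdf.items()}


def write_metadata_file(languages_by_pdf, path):
    """Write the metadata as JSON and return the path that was written."""
    path = Path(path)
    text = json.dumps(build_metadata(languages_by_pdf), indent=2, ensure_ascii=False)
    path.write_text(text, encoding="utf-8")
    return path


metadata = build_metadata({
    "pdf1.pdf": ["English"],
    "pdf2.pdf": ["Spanish", "Russian"],
})
print(json.dumps(metadata, indent=2))
```

The path returned by `write_metadata_file` is what you would pass to `--metadata_file`.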

You can use language names or codes. The exact codes depend on the OCR engine: see the Surya documentation for the complete list of Surya codes, and the Tesseract documentation for Tesseract codes.
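
Returning to the `--workers` guidance above, the peak figure of 5GB of VRAM per worker gives a simple upper bound on how many workers fit on one GPU. The following sketch applies that arithmetic and then builds the batch command. The 24GB card is a made-up example, the helper names are hypothetical, and real memory use will vary with your documents.

```python
import shlex


def max_workers_for_vram(total_vram_gb, peak_gb_per_worker=5.0):
    """Return how many workers fit if each one hits its peak VRAM at the same time.

    A result of 0 means even a single worker may exceed the available VRAM.
    """
    if total_vram_gb < 0 or peak_gb_per_worker <= 0:
        raise ValueError("VRAM figures must be positive")
    return int(total_vram_gb // peak_gb_per_worker)


def build_marker_cmd(input_dir, output_dir, workers=None, max_files=None,
                     min_length=None, metadata_file=None):
    """Build the argument list for the batch `marker` command from this guide."""
    cmd = ["marker", str(input_dir), str(output_dir)]
    if workers is not None:
        cmd += ["--workers", str(workers)]
    if max_files is not None:
        cmd += ["--max", str(max_files)]
    if min_length is not None:
        cmd += ["--min_length", str(min_length)]
    if metadata_file is not None:
        cmd += ["--metadata_file", str(metadata_file)]
    return cmd


workers = max_workers_for_vram(24)  # hypothetical 24GB GPU -> 4 workers
cmd = build_marker_cmd("/path/to/input/folder", "/path/to/output/folder",
                       workers=workers, max_files=10, min_length=10000)
print(shlex.join(cmd))
# marker /path/to/input/folder /path/to/output/folder --workers 4 --max 10 --min_length 10000
```

Sizing by the peak figure is the conservative choice; sizing by the 3.5GB average would allow more workers but risks running out of memory when several workers peak together.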

Configuring Marker Environment Variables in FastGPT

To enable the custom file parsing service, configure the following environment variables in FastGPT:

CUSTOM_READ_FILE_URL=http://xxxx.com/v1/parse/file
CUSTOM_READ_FILE_EXTENSION=pdf

CUSTOM_READ_FILE_URL - the address of the custom parsing service. Replace the host with the address of the parsing service you deployed and leave the path unchanged.
CUSTOM_READ_FILE_EXTENSION - the file extensions supported for parsing. Separate multiple file types with commas.
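
The two rules above (swap only the host, and separate extensions with commas) can be illustrated with a short sketch. The host `192.0.2.10:8000` is a placeholder from a documentation-only address range, both helper names are hypothetical, and how FastGPT itself parses these values is not shown in this guide, so treat the splitting logic as an assumption.

```python
from urllib.parse import urlsplit, urlunsplit


def with_host(url, new_host):
    """Replace only the host[:port] part of a URL, keeping scheme and path."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(netloc=new_host))


def parse_extensions(value):
    """Split a comma-separated extension list, ignoring blanks and whitespace."""
    return [ext.strip() for ext in value.split(",") if ext.strip()]


# Placeholder host; substitute the address of your own deployed service.
print(with_host("http://xxxx.com/v1/parse/file", "192.0.2.10:8000"))
# http://192.0.2.10:8000/v1/parse/file

print(parse_extensions("pdf"))
# ['pdf']
```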

Verify the parsing result

After completing the configuration, verify the parsing result with the following steps:

Upload a PDF file to the Knowledge Base and confirm the upload.
Check the system log (LOG_LEVEL must be set to info or debug).
If the PDF content parsed by Marker contains full image links, the parsing was successful.
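
The final check can also be done on a copy of the parsed text instead of by eye. The sketch below assumes that the image links appear in Markdown image syntax and that "full" means an absolute http or https URL; neither detail is spelled out above, and the sample text is invented for illustration.

```python
import re
from urllib.parse import urlsplit

# Matches Markdown images such as ![alt](target) and captures the target.
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)")


def image_links(markdown):
    """Return the link targets of all Markdown images in the text."""
    return IMAGE_LINK.findall(markdown)


def is_full_link(link):
    """True if the link is an absolute http(s) URL rather than a relative path."""
    parts = urlsplit(link)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# Invented sample standing in for text copied from the log.
sample = (
    "Intro text\n"
    "![figure](http://example.com/img/0_image_0.png)\n"
    "![relative](images/1.png)\n"
)
for link in image_links(sample):
    print(link, "full" if is_full_link(link) else "not full")
```

Any link reported as not full suggests the images were extracted but not served from an address that FastGPT can reach.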