
Marker: an open-source tool for quickly converting PDF to Markdown

General Introduction

Marker is a deep learning-based document processing tool that quickly and accurately converts PDF files to Markdown. It supports a wide range of document types and is especially optimized for books and scientific papers. Marker removes redundant content such as headers and footers, formats tables and code blocks, and extracts and saves images. It also converts most equations to LaTeX and can run on GPU, CPU, or MPS.

 



 

Function List

  • Convert PDF files to Markdown format
  • Support multiple document types, including books and scientific papers
  • Remove redundant content such as headers and footers
  • Format tables and code blocks
  • Extract and save images
  • Convert most equations to LaTeX format
  • Run on GPU, CPU, or MPS

 

 

Usage Help

Installation

  1. Install dependencies: make sure Python 3.9 or later is installed, then install Marker with pip:
    pip install marker-pdf
    
  2. Run an example:
    marker_single /path/to/file.pdf /path/to/output/folder --batch_multiplier 2 --max_pages 10
    

 

Guidelines for use

 

Converting individual files

marker_single /path/to/file.pdf /path/to/output/folder --batch_multiplier 2 --max_pages 10
  • --batch_multiplier multiplies the default batch size, which is useful if you have extra VRAM. Higher values use more VRAM but process faster. The default is 2. The default batch size requires approximately 3GB of VRAM.
  • --max_pages is the maximum number of pages to process. Omit it to convert the entire document.
  • --langs is an optional comma-separated list of the document's languages, used for OCR. It is optional by default but must be supplied if Tesseract is used.
  • --ocr_all_pages is an optional parameter that forces OCR on every page of the PDF. OCR is forced if either this parameter or the environment variable `OCR_ALL_PAGES` is true.

The languages supported by Surya OCR are listed in the Surya project. If you need other languages, you can use any language supported by Tesseract by setting OCR_ENGINE to ocrmypdf. If OCR is not required, Marker can work with any language.
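The options above can be assembled programmatically when conversions are scripted. The following is a minimal sketch, not part of Marker: the helper name `build_marker_single_cmd` is hypothetical, and it assumes `--ocr_all_pages` is passed as a bare switch with no value. It only builds the argument list and does not run anything.

```python
import shlex


def build_marker_single_cmd(pdf_path, output_dir, batch_multiplier=None,
                            max_pages=None, langs=None, ocr_all_pages=False):
    """Assemble a marker_single argument list (hypothetical helper)."""
    cmd = ["marker_single", str(pdf_path), str(output_dir)]
    if batch_multiplier is not None:
        cmd += ["--batch_multiplier", str(batch_multiplier)]
    if max_pages is not None:
        cmd += ["--max_pages", str(max_pages)]
    if langs:
        # --langs takes a comma-separated list of document languages
        cmd += ["--langs", ",".join(langs)]
    if ocr_all_pages:
        # Assumption: the flag is a bare switch that takes no value
        cmd.append("--ocr_all_pages")
    return cmd


cmd = build_marker_single_cmd(
    "/path/to/file.pdf", "/path/to/output/folder",
    batch_multiplier=2, max_pages=10, langs=["English"],
)
# Print the command as it would be typed in a shell
print(shlex.join(cmd))
```

To actually run the conversion, the list can be passed to `subprocess.run(cmd, check=True)` on a machine where Marker is installed.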

 

Converting multiple files

marker /path/to/input/folder /path/to/output/folder --workers 4 --max 10 --min_length 10000
  • --workers is the number of PDFs converted simultaneously. The default setting is 1, but you can increase this value to increase throughput at the cost of increased CPU/GPU utilization. Each worker process will use 5GB of VRAM at peak and 3.5GB on average.
  • --max is the maximum number of PDFs to convert. Omitting this item will convert all PDFs in the folder.
  • --min_length is the minimum number of characters that must be extractable from a PDF for it to be processed. PDFs below this value are skipped. If you are processing many PDFs, setting this value is recommended so that PDFs consisting mainly of images are not sent through OCR, which slows down processing.
  • --metadata_file is an optional path to a JSON file containing metadata about the PDFs. If provided, it is used to set the language for each PDF. Setting the language is optional for Surya (the default) but required for Tesseract. The format is as follows:
{
"pdf1.pdf": {"languages": ["English"]},
"pdf2.pdf": {"languages": ["Spanish", "Russian"]},
...
}

You can use language names or codes. The exact codes depend on the OCR engine. Surya and Tesseract each publish their own complete list of codes.
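As a rough illustration of the batch options, the sketch below writes a metadata file in the format shown above, estimates a worker count from the 5GB peak VRAM per worker quoted earlier, and builds the corresponding `marker` command. The helper names and the 24GB VRAM figure are hypothetical, chosen only for the example. Nothing is executed.

```python
import json
import tempfile
from pathlib import Path


def write_metadata(path, languages_by_pdf):
    """Write a metadata file mapping each PDF name to its OCR languages."""
    metadata = {name: {"languages": list(langs)}
                for name, langs in languages_by_pdf.items()}
    Path(path).write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return metadata


def max_workers_for_vram(vram_gb, peak_per_worker_gb=5.0):
    """Largest worker count whose combined peak VRAM fits, never below 1."""
    return max(1, int(vram_gb // peak_per_worker_gb))


def build_marker_cmd(input_dir, output_dir, workers, max_files=None,
                     min_length=None, metadata_file=None):
    """Assemble a marker argument list for batch conversion."""
    cmd = ["marker", str(input_dir), str(output_dir), "--workers", str(workers)]
    if max_files is not None:
        cmd += ["--max", str(max_files)]
    if min_length is not None:
        cmd += ["--min_length", str(min_length)]
    if metadata_file is not None:
        cmd += ["--metadata_file", str(metadata_file)]
    return cmd


meta_path = Path(tempfile.mkdtemp()) / "metadata.json"
write_metadata(meta_path, {
    "pdf1.pdf": ["English"],
    "pdf2.pdf": ["Spanish", "Russian"],
})
workers = max_workers_for_vram(24)  # 24GB is an example value, not a recommendation
cmd = build_marker_cmd("/path/to/input/folder", "/path/to/output/folder",
                       workers, min_length=10000, metadata_file=meta_path)
print(" ".join(cmd))
```

Sizing by peak rather than average VRAM is the conservative choice here: it leaves headroom for the moments when every worker peaks at once.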

 

Configuring Marker Environment Variables in FastGPT

To enable the custom parsing service, configure the following environment variables in FastGPT:

CUSTOM_READ_FILE_URL=http://xxxx.com/v1/parse/file
CUSTOM_READ_FILE_EXTENSION=pdf

  • CUSTOM_READ_FILE_URL - the address of the custom parsing service. Change the host to the address of the parsing service you deployed, and leave the path unchanged.
  • CUSTOM_READ_FILE_EXTENSION - the file extensions the parsing service supports. Separate multiple file types with commas.
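Before restarting FastGPT, the two values can be sanity-checked against the rules above. This is a minimal sketch and not part of FastGPT or Marker: the function `check_fastgpt_settings` is hypothetical, and it assumes the path from the example (`/v1/parse/file`) is the one that must stay unchanged.

```python
from urllib.parse import urlsplit

# Path taken from the example above; only the host is meant to change
EXPECTED_PATH = "/v1/parse/file"


def check_fastgpt_settings(env):
    """Return a list of problems found in the two CUSTOM_READ_FILE_* values."""
    problems = []

    url = urlsplit(env.get("CUSTOM_READ_FILE_URL", ""))
    if url.scheme not in ("http", "https") or not url.netloc:
        problems.append("CUSTOM_READ_FILE_URL must be an http(s) URL with a host")
    if url.path != EXPECTED_PATH:
        problems.append(f"CUSTOM_READ_FILE_URL path should stay {EXPECTED_PATH}")

    # Multiple file types are separated by commas; reject empty entries
    raw = env.get("CUSTOM_READ_FILE_EXTENSION", "")
    extensions = [item.strip() for item in raw.split(",")]
    if not all(extensions):
        problems.append("CUSTOM_READ_FILE_EXTENSION must be a comma-separated "
                        "list without empty entries")
    return problems


settings = {
    "CUSTOM_READ_FILE_URL": "http://xxxx.com/v1/parse/file",
    "CUSTOM_READ_FILE_EXTENSION": "pdf",
}
print(check_fastgpt_settings(settings))
```

An empty list means both values follow the stated rules. It does not confirm that the service is reachable.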

Verify the parsing effect

After completing the configuration, you can verify the parsing effect by following the steps below:

  1. Upload a PDF file to the Knowledge Base and confirm the upload.
  2. View the system log (LOG_LEVEL must be set to info or debug).
  3. Check that the PDF content parsed by Marker contains full image links, which indicates that parsing succeeded.