Marker: an open source tool for quickly converting PDF to Markdown

General Introduction

Marker is a deep learning-based document processing tool designed to convert PDF files to Markdown quickly and accurately. It supports a wide range of document types and is especially optimized for books and scientific papers. Marker removes redundant content such as headers and footers, formats tables and code blocks, and extracts and saves images. It also converts most formulas to LaTeX and can run on GPU, CPU, or MPS.

 


Feature List

  • Convert PDF files to Markdown format
  • Support multiple document types, including books and scientific papers
  • Remove redundant content such as headers and footers
  • Format tables and code blocks
  • Extract and save images
  • Convert most equations to LaTeX format
  • Run on GPU, CPU, or MPS

 

 

Usage Help

Installation process

  1. Install dependencies: make sure Python 3.9 or later is installed, then install Marker with pip:
    pip install marker-pdf
    
  2. Run an example:
    marker_single /path/to/file.pdf /path/to/output/folder --batch_multiplier 2 --max_pages 10
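
After installation, it can be useful to confirm that the two command-line entry points used in this article, marker_single and marker, are actually reachable. The following is a minimal Python sketch using only the standard library; the helper name find_marker_commands is hypothetical and not part of Marker.

```python
import shutil
from typing import Dict, Optional, Sequence


def find_marker_commands(
    names: Sequence[str] = ("marker_single", "marker"),
) -> Dict[str, Optional[str]]:
    """Map each command name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


found = find_marker_commands()
for name, location in found.items():
    # A missing command usually means pip installed into a different environment.
    print(f"{name}: {location or 'not found on PATH'}")
```

If either command is reported as missing, check that the Python environment where marker-pdf was installed is the one that is currently active.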
    

 

Guidelines for use

 

Converting individual files

marker_single /path/to/file.pdf /path/to/output/folder --batch_multiplier 2 --max_pages 10
  • --batch_multiplier multiplies the default batch size; raise it if you have spare VRAM. Higher values use more VRAM but process faster. The default is 2. The default batch size requires approximately 3GB of VRAM.
  • --max_pages is the maximum number of pages to process. Omit it to convert the entire document.
  • --langs is an optional comma-separated list of the document's languages, used for OCR. It is optional by default but required if Tesseract is used.
  • --ocr_all_pages is an optional parameter that forces OCR on every page of the PDF. OCR is forced if either this parameter or the environment variable `OCR_ALL_PAGES` is true.
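
The options above can also be assembled programmatically, for example when calling marker_single from a script. The sketch below only builds the argument list and does not run Marker. The function name build_marker_single_command is hypothetical, and it assumes that --ocr_all_pages is a bare flag that takes no value.

```python
import shlex
from typing import List, Optional


def build_marker_single_command(
    pdf_path: str,
    output_dir: str,
    batch_multiplier: Optional[int] = None,
    max_pages: Optional[int] = None,
    langs: Optional[List[str]] = None,
    ocr_all_pages: bool = False,
) -> List[str]:
    """Assemble an argument list for marker_single from the options described above."""
    cmd = ["marker_single", pdf_path, output_dir]
    if batch_multiplier is not None:
        cmd += ["--batch_multiplier", str(batch_multiplier)]
    if max_pages is not None:
        cmd += ["--max_pages", str(max_pages)]
    if langs:
        # --langs expects one comma-separated value, not several arguments.
        cmd += ["--langs", ",".join(langs)]
    if ocr_all_pages:
        # Assumption: this is a bare flag with no value.
        cmd.append("--ocr_all_pages")
    return cmd


cmd = build_marker_single_command(
    "/path/to/file.pdf",
    "/path/to/output/folder",
    batch_multiplier=2,
    max_pages=10,
    langs=["English", "Spanish"],
)
# Print a shell-ready version of the command for inspection.
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once the paths point at real files.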

A list of the languages supported by Surya OCR is available in the Surya repository. If you need more languages, you can use any language supported by Tesseract by setting OCR_ENGINE to ocrmypdf. If OCR is not required, Marker can support any language.
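
The environment variables OCR_ENGINE and OCR_ALL_PAGES mentioned above can be set for a single run without changing the global shell environment. The following is a minimal Python sketch, assuming Marker reads both variables from the process environment and accepts the string "true" as a truthy value; the helper name marker_env is hypothetical.

```python
import os
from typing import Dict, Optional


def marker_env(
    ocr_engine: Optional[str] = None,
    ocr_all_pages: bool = False,
    base: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment with Marker's OCR settings applied."""
    env = dict(os.environ if base is None else base)
    if ocr_engine is not None:
        env["OCR_ENGINE"] = ocr_engine
    if ocr_all_pages:
        # Assumption: the string "true" is parsed as a boolean true.
        env["OCR_ALL_PAGES"] = "true"
    return env


# Switch to the Tesseract-backed engine for one run only.
env = marker_env(ocr_engine="ocrmypdf", base={"PATH": "/usr/bin"})
print(env)
# To use it: subprocess.run(["marker_single", pdf, out_dir], env=env, check=True)
```

Passing a copied environment keeps the setting local to one subprocess, so other runs continue to use the default engine.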

 

Converting multiple files

marker /path/to/input/folder /path/to/output/folder --workers 4 --max 10 --min_length 10000
  • --workers is the number of PDFs converted simultaneously. The default setting is 1, but you can increase this value to increase throughput at the cost of increased CPU/GPU utilization. Each worker process will use 5GB of VRAM at peak and 3.5GB on average.
  • --max is the maximum number of PDFs to convert. Omitting this item will convert all PDFs in the folder.
  • --min_length is the minimum number of characters that must be extracted from a PDF for it to be processed. If you are processing many PDFs, setting this value is recommended to avoid running OCR on PDFs that are mostly images, which slows down processing.
  • --metadata_file is an optional path to a JSON file containing metadata about the PDFs. If provided, it is used to set the language for each PDF. Setting the language is optional for Surya (the default) but required for Tesseract. The format is as follows:
{
  "pdf1.pdf": {"languages": ["English"]},
  "pdf2.pdf": {"languages": ["Spanish", "Russian"]},
  ...
}

You can use language names or codes. The exact codes depend on the OCR engine. For the complete list of Surya codes, see the Surya repository; for Tesseract, see the Tesseract documentation.
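
For batch runs, the metadata file and the worker count can be prepared in a short script. The sketch below writes a metadata file in the format shown above, estimates a worker count from the 5GB peak VRAM per worker quoted earlier, and builds the marker argument list without running it. The function names and the 24GB VRAM figure are hypothetical and chosen only for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List, Optional

PEAK_VRAM_GB_PER_WORKER = 5.0  # peak figure quoted in the text above


def max_workers_for_vram(total_vram_gb: float) -> int:
    """Largest worker count whose combined peak VRAM fits; never below 1."""
    return max(1, int(total_vram_gb // PEAK_VRAM_GB_PER_WORKER))


def write_metadata_file(path: Path, languages: Dict[str, List[str]]) -> None:
    """Write {"file.pdf": {"languages": [...]}} entries as JSON."""
    metadata = {name: {"languages": langs} for name, langs in languages.items()}
    path.write_text(json.dumps(metadata, indent=2, ensure_ascii=False), encoding="utf-8")


def build_marker_command(
    input_dir: str,
    output_dir: str,
    workers: int,
    max_files: Optional[int] = None,
    min_length: Optional[int] = None,
    metadata_file: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list for the batch `marker` command."""
    cmd = ["marker", input_dir, output_dir, "--workers", str(workers)]
    if max_files is not None:
        cmd += ["--max", str(max_files)]
    if min_length is not None:
        cmd += ["--min_length", str(min_length)]
    if metadata_file is not None:
        cmd += ["--metadata_file", metadata_file]
    return cmd


workers = max_workers_for_vram(24)  # hypothetical 24GB GPU

with tempfile.TemporaryDirectory() as tmp:
    metadata_path = Path(tmp) / "metadata.json"
    write_metadata_file(
        metadata_path,
        {"pdf1.pdf": ["English"], "pdf2.pdf": ["Spanish", "Russian"]},
    )
    # Read the file back to confirm it has the expected structure.
    loaded = json.loads(metadata_path.read_text(encoding="utf-8"))
    cmd = build_marker_command(
        "/path/to/input/folder",
        "/path/to/output/folder",
        workers=workers,
        min_length=10000,
        metadata_file=str(metadata_path),
    )
    print(cmd)
```

Sizing the worker count by peak rather than average VRAM is the conservative choice: it leaves throughput on the table but avoids out-of-memory failures when several workers peak at the same time.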

May not be reproduced without permission. Source: Chief AI Sharing Circle.