General Introduction
Marker is a deep learning-based document processing tool that converts PDF files to Markdown quickly and accurately. It supports a wide range of document types and is optimized for books and scientific papers. Marker removes redundant content such as headers and footers, formats tables and code blocks, and extracts and saves images. It also converts most equations to LaTeX and runs on GPU, CPU, or MPS.
Function List
- Convert PDF files to Markdown format
- Support multiple document types, including books and scientific papers
- Remove redundant content such as headers and footers
- Format tables and code blocks
- Extract and save images
- Convert most equations to LaTeX format
- Run on GPU, CPU, or MPS
Usage
Installation process
- Install dependencies: make sure Python 3.9 or above is installed, then install Marker with pip:
pip install marker-pdf
- Running example:
marker_single /path/to/file.pdf /path/to/output/folder --batch_multiplier 2 --max_pages 10
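After installation, it can help to confirm that the command-line entry points are reachable before converting anything. The following is a minimal sketch, assuming that `pip install marker-pdf` places the `marker_single` and `marker` commands on the PATH. The helper name `find_marker_commands` is hypothetical and not part of Marker.

```python
import shutil


def find_marker_commands(names=("marker_single", "marker")):
    """Map each command name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Report which of the Marker commands can be found in this environment.
    for name, path in find_marker_commands().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

If either command is reported as missing, the install most likely went into a different Python environment than the one currently active.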
Guidelines for use
Converting individual files
marker_single /path/to/file.pdf /path/to/output/folder --batch_multiplier 2 --max_pages 10
- `--batch_multiplier` multiplies the default batch size, which is useful if you have spare VRAM. Higher values use more VRAM but process faster. The default setting is 2. The default batch size requires approximately 3GB of VRAM.
- `--max_pages` is the maximum number of pages to process. Omit it to convert the entire document.
- `--langs` is an optional comma-separated list of the document's languages, used for OCR. It is optional by default but must be supplied if Tesseract is used.
- `--ocr_all_pages` is an optional parameter that forces OCR on all pages of the PDF. OCR is forced if either this parameter or the environment variable `OCR_ALL_PAGES` is true.
The list of languages supported by Surya OCR is available in the Surya documentation. If you need more languages, you can use any language supported by Tesseract by setting `OCR_ENGINE` to `ocrmypdf`. If OCR is not required, Marker can support any language.
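The options above can also be assembled programmatically, for example when Marker is called from a larger script. This is a minimal sketch, not part of Marker itself: the helper names `build_marker_single_cmd` and `run_marker_single` are hypothetical, and it assumes that `--ocr_all_pages` is a bare switch that takes no value.

```python
import os
import subprocess


def build_marker_single_cmd(pdf_path, output_dir, batch_multiplier=None,
                            max_pages=None, langs=None, ocr_all_pages=False):
    """Return the marker_single argument list for one PDF."""
    cmd = ["marker_single", str(pdf_path), str(output_dir)]
    if batch_multiplier is not None:
        cmd += ["--batch_multiplier", str(batch_multiplier)]
    if max_pages is not None:
        cmd += ["--max_pages", str(max_pages)]
    if langs:
        # --langs takes a comma-separated list of document languages.
        cmd += ["--langs", ",".join(langs)]
    if ocr_all_pages:
        # Assumption: --ocr_all_pages is a bare switch with no value.
        cmd.append("--ocr_all_pages")
    return cmd


def run_marker_single(cmd, ocr_engine=None):
    """Run the command, optionally setting OCR_ENGINE for the child process."""
    env = dict(os.environ)
    if ocr_engine is not None:
        env["OCR_ENGINE"] = ocr_engine  # for example "ocrmypdf"
    # check=True raises CalledProcessError if the conversion fails.
    return subprocess.run(cmd, env=env, check=True)
```

Calling `build_marker_single_cmd("/path/to/file.pdf", "/path/to/output/folder", batch_multiplier=2, max_pages=10)` reproduces the command shown at the start of this section. Building the command as a list and keeping it separate from execution avoids shell quoting problems with paths that contain spaces.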
Converting multiple files
marker /path/to/input/folder /path/to/output/folder --workers 4 --max 10 --min_length 10000
- `--workers` is the number of PDFs converted simultaneously. The default setting is 1; increasing it raises throughput at the cost of higher CPU/GPU utilization. Each worker process uses 5GB of VRAM at peak and 3.5GB on average.
- `--max` is the maximum number of PDFs to convert. Omit it to convert all PDFs in the folder.
- `--min_length` is the minimum number of characters that must be extracted from a PDF for it to be processed. If you are processing many PDFs, setting this value is recommended to avoid running OCR on PDFs that are mostly images, which slows down processing.
- `--metadata_file` is an optional path to a JSON file containing metadata about the PDFs. If provided, it is used to set the language for each PDF. Setting the language is optional for Surya (the default) but required for Tesseract. The format is as follows:
{
  "pdf1.pdf": {"languages": ["English"]},
  "pdf2.pdf": {"languages": ["Spanish", "Russian"]},
  ...
}
You can use language names or codes. The exact codes depend on the OCR engine. For the complete lists, see the Surya language list and the Tesseract language list.
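For batch runs, the worker count and the metadata file can be prepared in a short script. The sketch below is illustrative only: the helper names `max_workers_for_vram`, `write_metadata_file`, and `build_marker_cmd` are hypothetical, and the worker estimate simply divides available VRAM by the 5GB peak figure quoted above, which is an assumption about how to size the pool rather than a rule from Marker.

```python
import json
from pathlib import Path

PEAK_VRAM_PER_WORKER_GB = 5.0  # peak per-worker figure quoted in the text


def max_workers_for_vram(vram_gb, per_worker_gb=PEAK_VRAM_PER_WORKER_GB):
    """Largest worker count whose peak VRAM use fits in vram_gb (never below 1)."""
    return max(1, int(vram_gb // per_worker_gb))


def write_metadata_file(path, languages_by_pdf):
    """Write {"name.pdf": {"languages": [...]}} in the format shown above."""
    metadata = {name: {"languages": list(langs)}
                for name, langs in languages_by_pdf.items()}
    Path(path).write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return metadata


def build_marker_cmd(input_dir, output_dir, workers=1, max_files=None,
                     min_length=None, metadata_file=None):
    """Return the marker argument list for a folder of PDFs."""
    cmd = ["marker", str(input_dir), str(output_dir), "--workers", str(workers)]
    if max_files is not None:
        cmd += ["--max", str(max_files)]
    if min_length is not None:
        cmd += ["--min_length", str(min_length)]
    if metadata_file is not None:
        cmd += ["--metadata_file", str(metadata_file)]
    return cmd
```

With 24GB of VRAM, `max_workers_for_vram(24)` suggests 4 workers, which matches the `--workers 4` value in the example command. Sizing by the peak figure rather than the 3.5GB average is the conservative choice, since all workers may hit their peak at the same time.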