Language models (LMs) have become a central driver of innovation in AI. From pre-training to real-world applications, language models depend on plain text data to function. Whether training on trillions of tokens or supporting data-intensive AI applications, the quality of text data is crucial. Low-quality text data can not only destabilize training and degrade model performance, but also lead to poor outputs when models serve user requests.
However, not all data needed for language modeling exists in an easily parsable format such as web pages. In many domains, valuable information is stored in electronic document files, particularly PDFs. The PDF format poses unique challenges for data processing because it was originally designed to render content on a fixed-size page rather than to preserve the logical structure of the text. A PDF stores text as a series of character codes, along with the position and formatting of each character on the page. While this representation is very efficient, it makes it extremely difficult to recover text units such as headings, paragraphs, tables, and formulas, and to arrange them in the correct reading order.
To better handle electronic documents, we are proud to present olmOCR, a high-performance toolkit designed to convert PDFs and document images into clean, structured plain text. olmOCR is unique in the following ways:
Superior performance
To ensure that olmOCR accurately extracts text from a wide range of documents, the development team fine-tuned the model on 250,000 PDF pages drawn from diverse sources, including both born-digital documents and scanned copies of public domain books. This diverse dataset helps olmOCR maintain excellent performance across many document types.
Extremely cost-effective
Processing one million pages of PDF documents with the olmOCR toolkit costs about $190, roughly 1/32 of the cost of batch-processing the same number of pages with the GPT-4o API, significantly lowering the economic barrier to document processing.
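The back-of-the-envelope arithmetic behind these figures works out as follows (a quick sketch based only on the numbers quoted above; the per-page breakdown is derived, not separately reported):

```python
# Cost comparison derived from the reported figures:
# olmOCR processes one million pages for about $190, roughly 1/32 the
# cost of batch-processing the same pages with the GPT-4o API.

OLMOCR_COST_PER_MILLION = 190.0  # USD, as reported
GPT4O_MULTIPLIER = 32            # olmOCR is ~1/32 the cost

gpt4o_cost_per_million = OLMOCR_COST_PER_MILLION * GPT4O_MULTIPLIER
olmocr_cost_per_page = OLMOCR_COST_PER_MILLION / 1_000_000

print(f"GPT-4o batch cost per million pages: ~${gpt4o_cost_per_million:,.0f}")
print(f"olmOCR cost per page: ~${olmocr_cost_per_page:.5f}")
```

In other words, batch-processing a million pages with GPT-4o would run on the order of $6,000, versus well under a cent per page with olmOCR.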
Markdown format output
olmOCR outputs text in Markdown format, which is easy to parse and process. It handles formulas, tables, and even handwritten content, and ensures that the output follows the correct reading order even for the most complex multi-column document layouts.
Fully functional, right out of the box
olmOCR is a fully optimized pipeline that works with both the SGLang and vLLM inference engines. It scales from a single GPU to hundreds of GPUs and has built-in heuristics to handle common parsing failures and metadata errors.
Completely open source
olmOCR is built on Qwen2-VL-7B-Instruct. The development team has open-sourced all components of the toolkit, including the model weights, the fine-tuning dataset, and the training and inference code.
To see how olmOCR stacks up against other leading document extraction tools, and to learn more about how olmOCR was built, follow the links below. If you're ready to try olmOCR, visit the GitHub repository and start using it in your projects!
Interactive Tools Comparison
By comparing sample documents, you can see how olmOCR performs relative to other leading document extraction tools. Use the tabs below to view each tool's output and get a sense of the key differences in processing quality.
The road to building olmOCR
Traditional OCR techniques often struggle with PDF documents that have complex layouts. To obtain high-quality data for training olmOCR, the development team devised a new technique called document anchoring, which leverages the text and metadata already present in a PDF file to significantly improve the quality of text extraction.
Figure 1: How document anchoring works on a typical page. Relevant image positions and text blocks are extracted, concatenated, and injected into the model prompt. When prompting the VLM (vision language model) for a plain-text version of the document, the anchored text is used together with the rasterized image of the page.
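The anchoring step can be illustrated with a small helper that serializes extracted text blocks and their page positions into a textual prompt prefix that travels alongside the page image. The block format, sort order, and function name here are illustrative assumptions, not olmOCR's actual prompt template:

```python
def build_anchor_prompt(blocks: list[tuple[float, float, str]]) -> str:
    """Serialize (x, y, text) blocks, sorted top-to-bottom then left-to-right,
    into a textual 'anchor' that accompanies the rasterized page image."""
    # PDF coordinates grow upward, so sort by descending y for reading order.
    ordered = sorted(blocks, key=lambda b: (-b[1], b[0]))
    lines = [f"[{x:.0f}x{y:.0f}] {text}" for x, y, text in ordered]
    return "Page text anchors:\n" + "\n".join(lines)

# Blocks as (x, y, text), e.g. extracted with a library such as pdfminer.six.
blocks = [
    (72.0, 100.0, "In conclusion, ..."),
    (72.0, 700.0, "1. Introduction"),
    (72.0, 650.0, "Language models rely on text data."),
]
prompt = build_anchor_prompt(blocks)
```

Because the model sees both the born-digital text (with positions) and the rendered image, it can transcribe accurately even when one of the two sources is noisy or incomplete.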
Using document anchoring, the development team labeled 250,000 pages with GPT-4o. The dataset comes from a wide range of sources, including publicly available PDF documents crawled from the web and public domain books scanned by the Internet Archive. The pages span a variety of document types: 60% academic papers, 12% brochures, 11% legal documents, 6% charts and diagrams, 5% slides, and 4% other.
For model training, the olmOCR team fine-tuned the Qwen2-VL-7B-Instruct checkpoint and used SGLang to enable large-scale batch processing and optimize the inference pipeline. olmOCR ultimately converted one million PDF pages for only $190, 1/32 of the cost of the GPT-4o API. Experimental results show that olmOCR not only significantly reduces costs compared to other popular OCR tools, but also demonstrates superior performance in human evaluation.
Figure 2: Boxplot of olmOCR's Elo rating against other popular tools.
To fully evaluate olmOCR's performance, the team compared its output with other popular PDF extraction tools, including Marker, MinerU, and GOT-OCR 2.0. Eleven researchers were invited to make pairwise judgments: across 2,017 PDF documents, 452 sets of meaningful comparisons were collected, and performance was quantified by computing Elo scores. The results show that olmOCR achieves an Elo score above 1800, significantly outperforming all competitors. In head-to-head comparisons, olmOCR was preferred 61.3% of the time against Marker, 58.6% against GOT-OCR, and even more often, 71.4%, against MinerU, fully demonstrating olmOCR's ability to generate clean, well-structured text.
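Elo scores like these are computed by iterating over pairwise judgments and nudging each tool's rating toward the observed outcomes. Below is a minimal sketch of the standard Elo update rule; the comparison data is made up, and the K-factor and base rating are conventional defaults rather than the evaluation's exact settings:

```python
def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict[str, float], winner: str, loser: str, k: float = 32.0) -> None:
    """Shift the winner up and the loser down in proportion to surprise."""
    e_w = expected(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e_w)
    ratings[loser] -= k * (1.0 - e_w)

ratings = {"olmOCR": 1500.0, "Marker": 1500.0, "MinerU": 1500.0}
# Toy pairwise judgments (winner, loser) — illustrative, not the study's data.
for w, l in [("olmOCR", "Marker"), ("olmOCR", "MinerU"), ("Marker", "MinerU")]:
    update(ratings, w, l)
```

Because each update is zero-sum, the total rating mass is conserved; tools that win more of their pairwise comparisons simply accumulate a larger share of it.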
You can see more detailed information and other evaluation results in the Technical Report.
How to use olmOCR
The first version of olmOCR includes a demo, model weights, fine-tuned datasets, a brief technical report, and, most importantly, an efficient inference pipeline.
Visit the GitHub repository to install olmOCR and review the documentation. After that, on a machine with a GPU, simply run the following command:
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/horribleocr.pdf
The development team hopes to release more quantitative benchmarks in the near future to help develop better PDF extraction models and evaluate their performance more effectively.