Language models (LMs) have become a central driver of innovation in AI. From pre-training to real-world applications, language models depend on plain text data to function. Whether training on trillions of tokens or supporting data-intensive AI applications, the quality of text data is crucial. Low-quality text data can not only destabilize training and degrade model performance, but also lead to poor outputs when models serve user requests.
However, not all data needed for language modeling exists in an easily parsable format such as web pages. In many domains, valuable information is stored in electronic document files, particularly PDFs. The PDF format poses unique challenges for data processing because it was originally designed to render content on a fixed-size page rather than to preserve the logical structure of the text. A PDF stores text as a series of character codes, along with the position and formatting of each character on the page. While this representation is very efficient, it makes it extremely difficult to recover text units such as headings, paragraphs, tables, and formulas, and to arrange them in the correct reading order.
To better handle electronic documents, we are proud to present olmOCR, a high-performance toolkit designed to convert PDFs and document images into clean, structured plain text. olmOCR is unique in the following ways:
Superior performance
To ensure that olmOCR accurately extracts text from a wide range of documents, the development team fine-tuned the model on 250,000 PDF pages drawn from diverse sources, including both born-digital documents and scanned copies of public domain books. This diverse dataset helps olmOCR maintain excellent performance across many document types.
Extremely cost-effective
Processing one million pages of PDF documents with the olmOCR toolkit costs about $190, roughly 1/32 of the cost of batch-processing the same number of pages with the GPT-4o API, significantly lowering the economic barrier to document processing.
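The back-of-the-envelope arithmetic behind these figures works out as follows (a quick sketch based only on the numbers quoted above; the per-page breakdown is derived, not separately reported):

```python
# Cost comparison derived from the reported figures:
# olmOCR processes one million pages for about $190, roughly 1/32 the
# cost of batch-processing the same pages with the GPT-4o API.

OLMOCR_COST_PER_MILLION = 190.0  # USD, as reported
GPT4O_MULTIPLIER = 32            # olmOCR is ~1/32 the cost

gpt4o_cost_per_million = OLMOCR_COST_PER_MILLION * GPT4O_MULTIPLIER
olmocr_cost_per_page = OLMOCR_COST_PER_MILLION / 1_000_000

print(f"GPT-4o batch cost per million pages: ~${gpt4o_cost_per_million:,.0f}")
print(f"olmOCR cost per page: ~${olmocr_cost_per_page:.5f}")
```

In other words, batch-processing a million pages with GPT-4o would run on the order of $6,000, versus well under a cent per page with olmOCR.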
Markdown format output
olmOCR outputs text in Markdown format, which is easy to parse and process. It handles formulas, tables, and even handwritten content, and ensures that the output follows the correct reading order even for the most complex multi-column document layouts.
Fully functional, right out of the box
olmOCR is a fully optimized pipeline that works with both the SGLang and vLLM inference engines. It scales from a single GPU to hundreds of GPUs and has built-in heuristics to handle common parsing failures and metadata errors.
Completely open source
olmOCR is built on Qwen2-VL-7B-Instruct. The development team has open-sourced all components of the toolkit, including the model weights, the fine-tuning dataset, and the training and inference code.
To see how olmOCR stacks up against other leading document extraction tools, and to learn more about how olmOCR was built, follow the links below. If you're ready to try olmOCR, visit the GitHub repository and start using it in your projects!
Interactive Tools Comparison
By comparing sample documents, you can see how olmOCR performs relative to other leading document extraction tools. Use the tabs below to view each tool's output and get a sense of the key differences in processing quality.
The road to building olmOCR
Traditional OCR techniques often struggle with PDF documents that have complex layouts. To obtain high-quality data for training olmOCR, the development team devised a new technique called document anchoring, which leverages the text and metadata already present in a PDF file to significantly improve the quality of text extraction.
Figure 1: How document anchoring works on a typical page. Relevant image positions and text blocks are extracted, concatenated, and injected into the model prompt. When prompting the VLM (vision language model) for a plain-text version of the document, the anchored text is used together with the rasterized image of the page.
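The anchoring step can be illustrated with a small helper that serializes extracted text blocks and their page positions into a textual prompt prefix that travels alongside the page image. The block format, sort order, and function name here are illustrative assumptions, not olmOCR's actual prompt template:

```python
def build_anchor_prompt(blocks: list[tuple[float, float, str]]) -> str:
    """Serialize (x, y, text) blocks, sorted top-to-bottom then left-to-right,
    into a textual 'anchor' that accompanies the rasterized page image."""
    # PDF coordinates grow upward, so sort by descending y for reading order.
    ordered = sorted(blocks, key=lambda b: (-b[1], b[0]))
    lines = [f"[{x:.0f}x{y:.0f}] {text}" for x, y, text in ordered]
    return "Page text anchors:\n" + "\n".join(lines)

# Blocks as (x, y, text), e.g. extracted with a library such as pdfminer.six.
blocks = [
    (72.0, 100.0, "In conclusion, ..."),
    (72.0, 700.0, "1. Introduction"),
    (72.0, 650.0, "Language models rely on text data."),
]
prompt = build_anchor_prompt(blocks)
```

Because the model sees both the born-digital text (with positions) and the rendered image, it can transcribe accurately even when one of the two sources is noisy or incomplete.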
Using document anchoring, the development team labeled 250,000 pages with GPT-4o. The dataset comes from a wide range of sources, including publicly available PDF documents crawled from the web and public domain books scanned by the Internet Archive. The pages span a variety of document types: 60% academic papers, 12% brochures, 11% legal documents, 6% charts and diagrams, 5% slides, and 4% other.
For model training, the olmOCR team fine-tuned the Qwen2-VL-7B-Instruct checkpoint and used SGLang to enable large-scale batch processing and optimize the inference pipeline. olmOCR ultimately converted one million PDF pages for only $190, 1/32 of the cost of the GPT-4o API. Experimental results show that olmOCR not only significantly reduces costs compared to other popular OCR tools, but also demonstrates superior performance in human evaluation.
Figure 2: Boxplot of olmOCR's Elo rating against other popular tools.
To fully evaluate olmOCR's performance, the team compared its output with other popular PDF extraction tools, including Marker, MinerU, and GOT-OCR 2.0. Eleven researchers were invited to make pairwise judgments: across 2,017 PDF documents, 452 sets of meaningful comparisons were collected, and performance was quantified by computing Elo scores. The results show that olmOCR achieves an Elo score above 1800, significantly outperforming all competitors. In head-to-head comparisons, olmOCR was preferred 61.3% of the time against Marker, 58.6% against GOT-OCR, and even more often, 71.4%, against MinerU, fully demonstrating olmOCR's ability to generate clean, well-structured text.
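Elo scores like these are computed by iterating over pairwise judgments and nudging each tool's rating toward the observed outcomes. Below is a minimal sketch of the standard Elo update rule; the comparison data is made up, and the K-factor and base rating are conventional defaults rather than the evaluation's exact settings:

```python
def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict[str, float], winner: str, loser: str, k: float = 32.0) -> None:
    """Shift the winner up and the loser down in proportion to surprise."""
    e_w = expected(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e_w)
    ratings[loser] -= k * (1.0 - e_w)

ratings = {"olmOCR": 1500.0, "Marker": 1500.0, "MinerU": 1500.0}
# Toy pairwise judgments (winner, loser) — illustrative, not the study's data.
for w, l in [("olmOCR", "Marker"), ("olmOCR", "MinerU"), ("Marker", "MinerU")]:
    update(ratings, w, l)
```

Because each update is zero-sum, the total rating mass is conserved; tools that win more of their pairwise comparisons simply accumulate a larger share of it.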
You can see more detailed information and other evaluation results in the Technical Report.
How to use olmOCR
The first version of olmOCR includes a demo, model weights, fine-tuned datasets, a brief technical report, and, most importantly, an efficient inference pipeline.
Visit the GitHub repository to install olmOCR and review the documentation. After that, on a machine with a GPU, simply run the following command:
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/horribleocr.pdf
The development team hopes to release more quantitative benchmarks in the near future to help develop better PDF extraction models and evaluate their performance more effectively.