MinerU2.5: an open-source document parsing model from Shanghai AI Lab and Peking University

What is MinerU2.5?

MinerU2.5 is a decoupled vision-language model developed jointly by the Shanghai Artificial Intelligence Laboratory and a Peking University team for efficient parsing of high-resolution document images. Its core innovation is a two-stage design, "global layout detection followed by local content recognition": the first stage quickly locates the document structure and reading order on a low-resolution thumbnail, and the second stage crops the key regions at native resolution and recognizes their content accurately. The model has only 1.2B parameters yet maintains high accuracy on 8K documents, and its measured throughput on a single RTX 4090 reaches 2.12 pages per second, significantly faster than comparable solutions. It also includes dedicated optimizations for complex elements such as tables and formulas: the OTSL intermediate language shortens the sequences that would otherwise be emitted as HTML, and an atomic formula decomposition-and-recombination technique addresses structural hallucination in long formulas.
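To make the OTSL idea concrete, here is a minimal Python sketch of how a compact table-structure sequence can be expanded into HTML. It assumes the five-token vocabulary from the original OTSL proposal (`C` for a new cell, `L` for "merged with the cell to the left", `U` for "merged with the cell above", `X` for "merged in both directions", `NL` for end of row); the article does not state which token set MinerU2.5 uses, and the function names are illustrative only.

```python
from typing import List


def parse_otsl(tokens: List[str]) -> List[List[str]]:
    """Split a flat OTSL token sequence into rows at each NL token."""
    rows, current = [], []
    for tok in tokens:
        if tok == "NL":
            rows.append(current)
            current = []
        else:
            current.append(tok)
    if current:  # tolerate a missing trailing NL
        rows.append(current)
    return rows


def otsl_to_html(tokens: List[str], texts: List[str]) -> str:
    """Expand OTSL structure tokens plus cell texts into an HTML table.

    Assumed token meanings (not confirmed for MinerU2.5):
    C starts a cell, L extends the cell to its left (colspan),
    U extends the cell above it (rowspan), X is covered by both.
    """
    grid = parse_otsl(tokens)
    text_iter = iter(texts)
    out = ["<table>"]
    for r, row in enumerate(grid):
        out.append("<tr>")
        for c, tok in enumerate(row):
            if tok != "C":
                continue  # L, U and X positions belong to an earlier C cell
            # Count L tokens to the right to get the column span.
            colspan = 1
            while c + colspan < len(row) and row[c + colspan] == "L":
                colspan += 1
            # Count U tokens below to get the row span.
            rowspan = 1
            while (r + rowspan < len(grid)
                   and c < len(grid[r + rowspan])
                   and grid[r + rowspan][c] == "U"):
                rowspan += 1
            attrs = ""
            if colspan > 1:
                attrs += f' colspan="{colspan}"'
            if rowspan > 1:
                attrs += f' rowspan="{rowspan}"'
            out.append(f"<td{attrs}>{next(text_iter, '')}</td>")
        out.append("</tr>")
    out.append("</table>")
    return "".join(out)


# A 2x3 table: a header spanning two columns and a side cell spanning two rows.
example_tokens = ["C", "L", "C", "NL", "C", "C", "U", "NL"]
example_html = otsl_to_html(example_tokens, ["Header", "Side", "a", "b"])
print(example_html)
```

In this example the structure is carried by eight short tokens, while the equivalent HTML needs opening and closing tags plus span attributes for every cell, which is the kind of sequence-length saving the paragraph above refers to.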

Features of MinerU2.5

Efficient two-stage parsing architecture: A coarse-to-fine decoupling strategy is used. The first stage analyzes the global layout on a downsampled image to quickly identify text blocks, tables, formulas, and other structural elements; the second stage performs fine-grained content recognition only on the detected regions at native resolution, balancing computational cost against detail retention.
Superior accuracy and performance: Although it has only 1.2B parameters, its overall parsing accuracy on authoritative benchmarks such as OmniDocBench and olmOCR-bench exceeds that of top general-purpose multimodal large models such as Gemini 2.5 Pro, GPT-4o, and Qwen2.5-VL-72B, and is significantly ahead of specialized document parsing tools such as dots.ocr and MonkeyOCR.
Strong adaptability to complex scenarios: The multimodal fusion architecture deeply integrates text recognition with visual layout analysis, so it handles cases where traditional OCR fails, such as missing table lines, skewed text, and complex formulas. Performance remains stable under difficult conditions such as multi-column layouts, interfering illustrations, blur and distortion, and low-resolution scans, and the model recognizes 20+ languages, including Chinese, English, Japanese, and Korean.
Practical and efficient deployment: The model is small and easy to integrate, and parses 1.7 to 2 pages per second on consumer graphics cards such as the RTX 3090 or 4090, making it well suited to real-world deployments such as building RAG (retrieval-augmented generation) knowledge bases and large-scale document extraction.
Comprehensive task support with structured outputs: Layout analysis is reformulated as a multi-task problem that predicts the position, category, rotation angle, and reading order of document elements in a single inference pass. Parsing results can be exported to Markdown, JSON, and other structured formats for downstream processing.
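The coarse-to-fine flow described in the first feature can be sketched in a few lines of Python. This is only an illustration of the coordinate handling, not MinerU2.5's implementation: the image is a plain nested list, the downsampling is simple striding, and `detect_layout_stub` is a hypothetical stand-in for the stage-one layout model. Stage-two recognition is left out; the sketch stops at producing the native-resolution crops that would be passed to it.

```python
from dataclasses import dataclass
from typing import List

Image = List[List[int]]  # grayscale image as rows of pixel values


@dataclass(frozen=True)
class Box:
    left: int
    top: int
    right: int   # exclusive
    bottom: int  # exclusive


def make_thumbnail(image: Image, step: int) -> Image:
    """Downsample by keeping every `step`-th pixel in both directions."""
    return [row[::step] for row in image[::step]]


def to_native(box: Box, step: int, width: int, height: int) -> Box:
    """Map a thumbnail-space box back to native pixels, clamped to the page."""
    return Box(
        left=min(box.left * step, width),
        top=min(box.top * step, height),
        right=min(box.right * step, width),
        bottom=min(box.bottom * step, height),
    )


def crop(image: Image, box: Box) -> Image:
    """Cut the region covered by `box` out of the full-resolution image."""
    return [row[box.left:box.right] for row in image[box.top:box.bottom]]


def detect_layout_stub(thumbnail: Image) -> List[Box]:
    """Hypothetical stage 1: pretend the top-left quadrant is one text block."""
    height, width = len(thumbnail), len(thumbnail[0])
    return [Box(0, 0, width // 2, height // 2)]


def two_stage_crops(image: Image, step: int) -> List[Image]:
    """Stage 1 runs on the thumbnail; the crops are taken at native resolution."""
    height, width = len(image), len(image[0])
    thumbnail = make_thumbnail(image, step)
    boxes = detect_layout_stub(thumbnail)
    return [crop(image, to_native(b, step, width, height)) for b in boxes]


# An 8x8 "page" whose pixel value encodes its position (row * 8 + column).
page = [[r * 8 + c for c in range(8)] for r in range(8)]
crops = two_stage_crops(page, step=2)
print(len(crops), "crop(s); first crop is",
      len(crops[0]), "x", len(crops[0][0]), "pixels")
```

The point of the design is that the expensive model only ever sees the small thumbnail and the individual crops, never the full high-resolution page at once.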

Core Benefits of MinerU2.5

Advanced two-stage parsing architecture: A decoupled strategy is used. The first stage performs efficient global layout analysis on a downsampled image to identify the document's structural elements; the second stage performs fine-grained content recognition on the detected regions at native resolution, balancing computational cost against detail retention.
Excellent performance: On authoritative benchmarks such as OmniDocBench and olmOCR-bench, its overall parsing accuracy exceeds that of top general-purpose multimodal large models such as Gemini 2.5 Pro, GPT-4o, and Qwen2.5-VL-72B, and is significantly ahead of specialized document parsing tools such as dots.ocr, MonkeyOCR, and PP-StructureV3.
Enhanced multi-task paradigm: Layout analysis is redefined as a multi-task problem that predicts the position, category, rotation angle, and reading order of document elements in a single inference pass, which addresses difficult cases such as parsing rotated elements.
Practical and efficient: The model is compact and easy to integrate, and reaches 1.7 pages per second on consumer graphics cards, making it well suited to practical applications such as building RAG (retrieval-augmented generation) knowledge bases and large-scale document extraction.
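As an illustration of the multi-task layout output, the sketch below models one predicted element as a record carrying the four properties named above (position, category, rotation angle, reading order) and shows how such records could be turned into Markdown or JSON. The field names, category labels, and normalized bounding-box convention are assumptions made for this example; the article does not specify MinerU2.5's actual output schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Tuple


@dataclass
class LayoutElement:
    bbox: Tuple[float, float, float, float]  # assumed normalized (x0, y0, x1, y1)
    category: str                            # assumed labels: "title", "text", "equation"
    rotation: int                            # rotation angle in degrees
    order: int                               # position in the reading order
    content: str = ""                        # filled in by content recognition


def in_reading_order(elements: List[LayoutElement]) -> List[LayoutElement]:
    """Sort elements by their predicted reading-order index."""
    return sorted(elements, key=lambda e: e.order)


def to_markdown(elements: List[LayoutElement]) -> str:
    """Render elements as Markdown, one block per element."""
    parts = []
    for e in in_reading_order(elements):
        if e.category == "title":
            parts.append(f"# {e.content}")
        elif e.category == "equation":
            parts.append(f"$$\n{e.content}\n$$")
        else:
            parts.append(e.content)
    return "\n\n".join(parts)


def to_json(elements: List[LayoutElement]) -> str:
    """Serialize elements, in reading order, as a JSON array."""
    records = [asdict(e) for e in in_reading_order(elements)]
    return json.dumps(records, ensure_ascii=False, indent=2)


# Elements arrive in arbitrary order; the predicted index restores the flow.
elements = [
    LayoutElement((0.1, 0.3, 0.9, 0.5), "text", 0, 1, "Body text."),
    LayoutElement((0.1, 0.1, 0.9, 0.2), "title", 0, 0, "Report"),
    LayoutElement((0.2, 0.6, 0.8, 0.7), "equation", 0, 2, "E = mc^2"),
]
print(to_markdown(elements))
```

Keeping rotation and reading order on the same record as the box and category means a downstream step can de-rotate a crop and place its text correctly without a second layout pass.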

Official links for MinerU2.5

HuggingFace Model Library: https://huggingface.co/opendatalab/MinerU2.5-2509-1.2B
arXiv Technical Paper: https://arxiv.org/pdf/2509.22186

Who is MinerU2.5 for?

Enterprise digitization and knowledge management teams: Suited to enterprises that need to digitize large volumes of contracts, reports, archives, and other paper documents. It efficiently parses unstructured data such as scanned documents and PDFs and loads the results into storage, which significantly speeds up building a RAG (retrieval-augmented generation) knowledge base.
Developers and AI engineering teams: The model is fully open source, has a small parameter count (1.2B), and can be deployed on consumer graphics cards such as the RTX 4090, making it a good fit for developers and engineering teams who want to integrate high-performance OCR into their products without relying on large closed-source APIs.
Research institutions and academia: Provides a strong open-source baseline for academic research in areas such as document understanding and multimodal large models, on which researchers can build further experiments, fine-tuning, or method comparisons.
Financial, legal, and government institutions: These organizations handle large numbers of structurally complex tables, contracts, and forms. MinerU2.5 performs well on complex layouts and on tables with missing lines, meeting their need for high-precision, structured information extraction.