DeepOCR - Open source replica project based on the DeepSeek-OCR model

Latest AI Resources4mos agorelease AI Sharing Circle

23.7K 00

What is DeepOCR

DeepOCR is an open source replication project that implements the DeepSeek-OCR The core architecture of the program efficiently processes textual information by means of optical compression techniques. The core is DeepEncoder, which consists of a SAM-base (to process high-resolution images), a 16× convolutional compressor (to reduce the number of token DeepOCR uses a two-stage training process: the first stage uses the LLaVA-CC3M dataset for visual-linguistic alignment. This design significantly reduces the activation memory and token count while maintaining high resolution processing power.DeepOCR employs a two-stage training process: the first stage uses the LLaVA-CC3M dataset for vision-lingual alignment training; the second stage uses the olmOCR OCR-specific pre-training is performed on the dataset. With this training approach, DeepOCR performs well in the OmniDocBench and olmOCR benchmarks, especially in the English text recognition and table parsing tasks, validating the effectiveness of optical compression.

Features of DeepOCR

optical compression: Efficient compression of textual information by rendering text as images and processing it with visual coders such as SAM and CLIP, achieving compression rates of up to 7-20 times.
High Resolution Processing: Supports 1024×1024 and higher resolution image inputs, and efficiently manages activation memory through the window attention mechanism and convolutional compression technology.
multimodal fusion: The local features of SAM and the global semantic features of CLIP are spliced to generate 2048-dimensional fusion features that provide rich information for downstream tasks.
Two-stage training: Visual-linguistic alignment training in the first phase and pre-training for OCR tasks in the second phase ensures that the model performs well in text recognition and document parsing tasks.
low-computing-power friendly: By freezing the DeepEncoder (SAM + CLIP), the graphics memory requirement is dramatically reduced, allowing the model to complete training on limited GPU resources (e.g., 2×H200).
open source implementation: Fully open source based on the VILA framework, providing the research community with an accessible platform for exploring optical context compression mechanisms.
benchmarking: The model's performance is validated in the OmniDocBench and olmOCR benchmarks, and it performs particularly well in the English text recognition and table parsing tasks.

Core Benefits of DeepOCR

Efficient compression::Optical compression techniques, where text is rendered as an image and processed using a visual encoder, significantly reduce the number of text tokens by up to 7-20 times. This makes the model more efficient in processing long text and reduces computational resource requirements.
High-resolution processing capability::It supports high-resolution inputs (e.g. 1024×1024), and efficiently manages activation memory to avoid memory explosion through the Window Attention Mechanism (SAM) and convolutional compression techniques. This enables DeepOCR to handle complex document layouts and high-resolution images.
multimodal fusion::The local features of SAM are fused with the global semantic features of CLIP to generate 2048-dimensional rich features. This multimodal fusion provides more comprehensive information for downstream tasks and improves the performance of the model.
low-computing-power friendly::During the training process, DeepEncoder (SAM + CLIP) is frozen, dramatically reducing the graphics memory requirement. This allows the model to complete training on limited GPU resources (e.g., 2×H200), lowering the hardware threshold and making it suitable for small and medium-sized teams.

What is DeepOCR's official website

Project website:: https://pkulium.github.io/DeepOCR_website/
Github repository:: https://github.com/pkulium/DeepOCR

Who DeepOCR is for

Developers in document processing and OCR::Long text and complex document layouts need to be processed efficiently, and DeepOCR's optical compression and high-resolution processing capabilities can significantly improve document parsing efficiency.
Small and medium-sized teams and independent developers::The low-computer-friendly nature of DeepOCR makes it suitable for running on limited hardware resources, lowering the development threshold.
Open Source Community Contributors::Members of the open source community can participate in code contributions, improvements, and extensions to advance the technology.
Academic researchers interested in innovative technologies::We hope to explore the application of optical compression in different fields, such as image understanding and UI element detection.
Businesses and organizations that need efficient text processing::The efficient compression and processing capabilities of DeepOCR can be leveraged to optimize internal document processing and improve work efficiency.