GOT-OCR2.0: end-to-end multimodal OCR model based on QWen2 0.5B

Latest AI Resources11mos agoupdate AI Sharing Circle

2.1K 00

General Introduction

GOT-OCR2.0 is a StepStar co-presented de Open Source Optical Character Recognition (OCR) model, which aims to drive OCR technology towards OCR-2.0 through a unified end-to-end model. The model supports a wide range of OCR tasks, including plain text recognition, formatted text recognition, fine-grained OCR, multi-crop OCR, and multi-page OCR.GOT-OCR2.0 is designed with the goal of providing a versatile and efficient solution for a wide range of complex OCR application scenarios.

Based on QWen2 0.5 B model. Called OCR 2.0, the end-to-end OCR model with 580M parameters got a BLEU score of 0.972. Online experience at https://huggingface.co/spaces/ucaslcl/GOT_online

Function List

Ordinary Text Recognition: Recognize ordinary text content in images.
Formatted Text Recognition: Recognizes and retains formatting information of text, such as tables, paragraphs, etc.
Fine-grained OCR: Recognize fine text in images and text against complex backgrounds.
Multi-crop OCR: Supports multiple cropping of images and recognizes the text in each cropped area.
Multi-page OCR: Supports OCR of multi-page documents.

Using Help

Installation process

Clone the project code:

git clone https://github.com/Ucas-HaoranWei/GOT-OCR2.0.git
cd GOT-OCR2.0

Create and activate a virtual environment:

conda create -n got python=3.10 -y
conda activate got

Install project dependencies:
```
pip install -e .
```

Install Flash-Attention:

pip install ninja
pip install flash-attn --no-build-isolation

Obtaining GOT model weights

Huggingface
Google Drive
Baidu cloud(Extraction code: OCR2)

Usage Process

Prepare input data: Place the image or document to be OCR'd in the specified input directory.

Run the OCR model:

python3 GOT/demo/run_ocr_2.0.py --model-name /GOT_weights/ --image-file /an/image/file.png --type ocr

View Output Results: The OCR processed text will be saved in the specified output directory, and users can further process it as needed.

Functional operation details

Plain text recognition: Recognizes and outputs ordinary text content in images as plain text files, suitable for simple text extraction tasks.
Formatted Text Recognition: Preserve formatting information, such as tables, paragraphs, etc., while recognizing text, for scenarios where the original formatting of the document needs to be preserved.
Fine-grained OCR: Recognize fine text in complex backgrounds, suitable for scenes requiring high-precision text extraction.
Multi-crop OCR: Crops the image multiple times and recognizes the text in each cropped region, suitable for scenarios that require multi-region recognition of images.
Multi-page OCR: Supports OCR of multi-page documents, suitable for scenarios where long documents or multi-page PDF files are processed.

With the above steps, users can easily install and use the GOT-OCR2.0 model for various OCR tasks. The model provides a rich set of functional modules that can meet the OCR needs in different scenarios.