Ollama OCR: Extracting Text from Images Using Visual Models in Ollama

Published 7 months ago by AI Sharing Circle in Latest AI Resources

General Introduction

Ollama OCR is an Optical Character Recognition (OCR) toolkit that uses vision-language models served by the Ollama platform to extract text from images. The project is available both as a Python package and as a Streamlit web application. It supports several vision models, including LLaVA 7B for real-time processing and the higher-accuracy Llama 3.2 Vision model for complex documents. Ollama OCR also offers multiple output formats (Markdown, plain text, JSON, structured data, and key-value pairs) and batch processing. The tool is particularly suited to developers and researchers who need to extract and structure text data from images.

Features

Supports multiple vision-language models (LLaVA 7B and Llama 3.2 Vision)
Provides several output formats (Markdown, plain text, JSON, structured data, key-value pairs)
Supports batch processing, with multiple images processed in parallel
Includes built-in image preprocessing (resizing, normalization, etc.)
Provides progress tracking and processing statistics
Includes a Streamlit web interface
Supports drag-and-drop image upload and real-time processing
Lets you download the extracted text
Shows an image preview and detailed image information

Usage Guide

1. Installation steps

Install the Ollama platform first:
- Visit the official Ollama website and download the installer for your system.
- Complete the basic Ollama installation.

Pull the required vision model:

ollama pull llama3.2-vision:11b

Install the Ollama OCR package:

pip install ollama-ocr
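
Before running OCR, it can help to confirm that the model pull succeeded. The following is a minimal sketch, not part of Ollama OCR itself. It assumes Ollama is serving on its default local address (http://localhost:11434) and queries the Ollama REST endpoint /api/tags, which lists locally available models. The helper names (installed_models, has_model, fetch_tags) are hypothetical and exist only for this illustration.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def installed_models(tags_payload: dict) -> list:
    """Return the model names found in a parsed /api/tags response."""
    return [m["name"] for m in tags_payload.get("models", [])]


def has_model(tags_payload: dict, wanted: str) -> bool:
    """Check whether a model tag such as 'llama3.2-vision:11b' is installed."""
    return wanted in installed_models(tags_payload)


def fetch_tags(url: str = OLLAMA_TAGS_URL, timeout: float = 5.0) -> dict:
    """Fetch and parse the list of local models from the Ollama server."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    try:
        payload = fetch_tags()
    except (OSError, ValueError) as exc:
        # Connection failures surface as OSError; malformed JSON as ValueError.
        print(f"Could not query Ollama: {exc}")
    else:
        print("Model installed:", has_model(payload, "llama3.2-vision:11b"))
```

If the check reports that the model is missing, rerun the ollama pull command above.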

2. Python package usage

2.1 Single Image Processing

from ollama_ocr import OCRProcessor

# Initialize the OCR processor
ocr = OCRProcessor(model_name='llama3.2-vision:11b')

# Process a single image
result = ocr.process_image(
    image_path="path/to/image.png",
    format_type="markdown"  # options: markdown, text, json, structured, key_value
)
print(result)

2.2 Batch Processing Images

from ollama_ocr import OCRProcessor

# Initialize the OCR processor and set the number of parallel workers
ocr = OCRProcessor(model_name='llama3.2-vision:11b', max_workers=4)

# Process a folder of images
batch_results = ocr.process_batch(
    input_path="path/to/image_folder",
    format_type="markdown",
    recursive=True,  # search subdirectories
    preprocess=True  # enable image preprocessing
)

# Inspect the results
for file_path, text in batch_results['results'].items():
    print(f"\nFile: {file_path}")
    print(f"Extracted text: {text}")

# Inspect the processing statistics
print(f"Total images: {batch_results['statistics']['total']}")
print(f"Successfully processed: {batch_results['statistics']['successful']}")
print(f"Failed: {batch_results['statistics']['failed']}")
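
Batch results are often more useful on disk than on the console. The sketch below writes each extracted text to its own file. It relies only on the result shape shown above (a 'results' mapping from file path to text); save_results and its parameters are hypothetical names introduced for this example and are not part of the ollama-ocr package.

```python
from pathlib import Path


def save_results(batch_results: dict, out_dir: str, suffix: str = ".md") -> list:
    """Write each extracted text to out_dir/<image stem><suffix>.

    Returns the list of written paths, in the order of the results mapping.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for file_path, text in batch_results["results"].items():
        # Name the output after the source image, e.g. scan1.png -> scan1.md
        target = out / (Path(file_path).stem + suffix)
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

For example, save_results(batch_results, "ocr_output") would place one Markdown file per image in ocr_output. Because only the file stem is used, images with the same name in different subdirectories would overwrite each other; when using recursive=True, a naming scheme that includes the parent folder may be safer.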

3. How to use the Streamlit web application

Clone the code repository:

git clone https://github.com/imanoop7/Ollama-OCR.git
cd Ollama-OCR

Install the dependencies:

pip install -r requirements.txt

Launch the web application:

cd src/ollama_ocr
streamlit run app.py

4. Description of output formats

Markdown: preserves text formatting, including headings and lists
Plain text: provides clean, unformatted text extraction
JSON: returns the extracted content as structured data
Structured: extracts tables and other organized data
Key-value pairs: extracts labeled information
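
When requesting the JSON format, the result usually needs to be parsed before downstream use. The sketch below assumes, without confirmation from the package documentation, that the result arrives as a string and that the model may wrap the JSON in a Markdown code fence, as language models often do. parse_json_output is a hypothetical helper, not an ollama-ocr function.

```python
import json
import re

# Build the fence marker programmatically to keep this example readable.
_TICKS = "`" * 3
_FENCE = re.compile(_TICKS + r"(?:json)?\s*(.*?)" + _TICKS, re.DOTALL)


def parse_json_output(raw: str):
    """Parse model output as JSON, tolerating a surrounding Markdown code fence.

    Returns the parsed object, or None if the text is not valid JSON.
    """
    match = _FENCE.search(raw)
    candidate = match.group(1) if match else raw
    try:
        return json.loads(candidate.strip())
    except json.JSONDecodeError:
        return None
```

Returning None instead of raising lets a batch job log the failure and continue with the next image; callers that prefer strict behavior can raise on None.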

5. Cautions

The LLaVA model may occasionally produce incorrect output; Llama 3.2 Vision is recommended for important use cases.
Image preprocessing can improve recognition accuracy.
When batch processing, set the number of parallel workers (max_workers) conservatively to avoid excessive memory consumption.
Enable progress tracking when processing a large number of images.
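
The last two cautions can be illustrated with a generic pattern: a bounded thread pool that reports progress as each image finishes. This is a sketch using Python's standard concurrent.futures module and is not how ollama-ocr implements batch processing internally. run_with_progress is a hypothetical name, and process_fn stands in for any per-image function, such as a wrapper around ocr.process_image.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_with_progress(paths, process_fn, max_workers=2, report=print):
    """Apply process_fn to each path on a bounded thread pool.

    Returns (results, errors): results maps path -> output,
    errors maps path -> error message for items that raised.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # At most max_workers calls run at once, which caps concurrent load.
        futures = {pool.submit(process_fn, p): p for p in paths}
        for done, future in enumerate(as_completed(futures), start=1):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                # Record the failure and keep processing the remaining items.
                errors[path] = str(exc)
            report(f"{done}/{len(futures)} processed")
    return results, errors
```

A small max_workers value is a reasonable starting point when a single large vision model serves all requests; raise it only if memory and latency remain acceptable.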