General Introduction
SmolDocling is a vision-language model (VLM) developed by IBM Research's ds4sd (Docling) team in collaboration with Hugging Face, based on SmolVLM-256M and hosted on the Hugging Face platform. With only 256M parameters it is one of the smallest VLMs available. Its core function is to extract text from images, recognize layout, code, formulas, and charts, and generate structured documents in the DocTags format. SmolDocling runs efficiently on ordinary hardware with low resource consumption. The development team has open-sourced the model in the hope of helping more people with document-processing tasks. It is part of the SmolVLM family, specializes in document conversion, and is suited to users who need to process complex documents quickly.
Function List
- Text Extraction (OCR): recognizes and extracts text from images, with support for multiple languages.
- Layout Recognition: analyzes the structure of the document in an image, such as the positions of headings, paragraphs, and tables.
- Code Recognition: extracts code blocks and preserves indentation and formatting.
- Formula Recognition: detects mathematical formulas and converts them to editable text.
- Chart Recognition: parses charts in the image and extracts their data.
- Table Processing: recognizes table structure and retains row and column information.
- DocTags Output: converts the results into a uniform markup format for easy downstream use.
- High-Resolution Image Processing: supports higher-resolution image input to improve recognition accuracy.
Usage Guide
The use of SmolDocling is divided into two parts: installation and operation. Below are detailed steps to help users get started quickly.
Installation Process
- Preparing the environment
- Make sure your computer has Python 3.8 or later installed.
- Install the dependency libraries by entering the following command in the terminal:
pip install torch transformers docling_core
- If you have a GPU, it is recommended to install PyTorch with CUDA support for faster inference. To check whether a GPU is available:
import torch
print("GPU available" if torch.cuda.is_available() else "Using CPU")
- Load the model
- SmolDocling does not need to be downloaded manually; the code loads it directly from Hugging Face.
- Make sure you have an internet connection; the first run will automatically download the model files.
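As an optional sanity check before moving on, the short sketch below (not part of the official instructions) verifies that the key packages import cleanly and reports whether a GPU was detected.
# Optional sanity check: confirm the dependencies import and report the device.
import torch
import transformers
import docling_core  # used later, when converting DocTags to Markdown

print("transformers:", transformers.__version__)
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())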
Usage Steps
- Prepare the image
- Find an image that contains text, such as a scanned document or a screenshot.
- Load the image with code:
from transformers.image_utils import load_image
image = load_image("path/to/your/image.jpg")
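If the image is not a local file, load_image also accepts a URL, and opening a file with PIL works just as well; the snippet below is a small optional sketch in which the URL and file name are placeholders.
# Alternative ways to obtain `image` (URL and file name are placeholders).
from transformers.image_utils import load_image

image = load_image("https://example.com/sample_page.png")  # from a URL
# or open a local file with PIL instead:
# from PIL import Image
# image = Image.open("scan.png").convert("RGB")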
- Initialize the model and processor
- Load SmolDocling's processor and model:
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained("ds4sd/SmolDocling-256M-preview")
model = AutoModelForVision2Seq.from_pretrained(
    "ds4sd/SmolDocling-256M-preview",
    torch_dtype=torch.bfloat16,
).to(DEVICE)
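As an optional check that ties back to the 256M-parameter figure from the introduction, the short sketch below counts the parameters of the loaded model.
# Optional: report the model size to confirm the compact checkpoint loaded.
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.0f}M parameters")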
- Generate DocTags
- Set up the inputs and run the model:
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(DEVICE)
generated_ids = model.generate(**inputs, max_new_tokens=8192)
doctags = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=False,
)[0].lstrip()
print(doctags)
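For readers curious about the speed figures quoted in the FAQ below, generation can be timed on your own hardware with the standard library; this optional sketch simply re-runs the generate call above under a timer.
# Optional: time a single generation run on your own hardware.
import time

start = time.perf_counter()
_ = model.generate(**inputs, max_new_tokens=8192)
print(f"Generation took {time.perf_counter() - start:.1f} s")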
- Convert to common formats
- Convert DocTags to Markdown or other formats:
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument

# Wrap the generated tags and the source image into a DocTagsDocument,
# then populate a DoclingDocument from it and export to Markdown.
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
doc = DoclingDocument(name="My Document")
doc.load_from_doctags(doctags_doc)
print(doc.export_to_markdown())
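To keep the result, the Markdown export can simply be written to disk; in the minimal sketch below the file name output.md is an arbitrary choice.
# Write the Markdown export to a file ("output.md" is an arbitrary name).
from pathlib import Path

Path("output.md").write_text(doc.export_to_markdown(), encoding="utf-8")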
- Advanced Usage (optional)
- Handle multi-page documents: process each page image in a loop and then merge the resulting DocTags (a sketch follows this list).
- Optimize performance: setting torch_dtype=torch.bfloat16 saves memory, and GPU users can enable flash_attention_2 for faster inference:
model = AutoModelForVision2Seq.from_pretrained(
    "ds4sd/SmolDocling-256M-preview",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)
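Below is a minimal sketch of the multi-page loop mentioned above. It assumes processor, model, and DEVICE are already set up as in the earlier steps, uses the placeholder file names page_1.png to page_3.png, and joins the per-page DocTags with newlines as one simple merging choice rather than an official API.
# Sketch: convert several page images in a loop and collect their DocTags.
from transformers.image_utils import load_image

pages = [load_image(f"page_{i}.png") for i in range(1, 4)]  # placeholder file names
all_doctags = []

for page in pages:
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Convert this page to docling."},
    ]}]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, images=[page], return_tensors="pt").to(DEVICE)
    generated_ids = model.generate(**inputs, max_new_tokens=8192)
    doctags = processor.batch_decode(
        generated_ids[:, inputs.input_ids.shape[1]:],
        skip_special_tokens=False,
    )[0].lstrip()
    all_doctags.append(doctags)

merged_doctags = "\n".join(all_doctags)  # one simple way to merge the pages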
Usage Tips
- Image requirements: images should be clear and the text legible; higher resolution generally gives better results.
- Adjust parameters: if the output is incomplete, increase max_new_tokens (set to 8192 in the example above).
- Batch processing: multiple images can be passed in as a list, e.g. images=[image1, image2].
- Debugging: print intermediate results, e.g. print(inputs), to check that the input is correct (a sketch follows this list).
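As a small illustration of the debugging tip above, the sketch below inspects what the processor produced before generation; it assumes inputs was built as in the earlier steps.
# Sketch: inspect the processor output before calling model.generate.
print(inputs.keys())                 # e.g. input_ids, attention_mask, pixel_values
print(inputs["input_ids"].shape)     # prompt length in tokens
print(inputs["pixel_values"].shape)  # shape of the preprocessed image tensor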
Caveats
- Internet access is required for the first run, after which it can be used offline.
- Very large images may cause out-of-memory errors; cropping or downscaling them before processing is recommended (a sketch follows this list).
- If you encounter an error, check that your Python version and the dependency libraries are installed correctly.
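One simple way to shrink an oversized scan before running the model is shown below; the 2048-pixel cap and the file name large_scan.png are illustrative choices, not official limits.
# Downscale an oversized image while keeping its aspect ratio.
from PIL import Image

image = Image.open("large_scan.png").convert("RGB")
image.thumbnail((2048, 2048))  # resizes in place only if the image exceeds the cap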
With the steps above, users can turn images into structured documents using SmolDocling. The whole process is simple and suits both beginners and professional users.
Application Scenarios
- Academic research: convert scanned papers to text and extract formulas and tables for easy editing and citation.
- Programming documentation organization: convert manual images containing code to Markdown, preserving code formatting for developers.
- Office automation: process scanned contracts, reports, and similar documents, recognizing layout and content to improve efficiency.
- Educational support: turn textbook images into editable documents to help teachers and students organize their notes.
FAQ
- What is the difference between SmolDocling and SmolVLM?
SmolDocling is based on SmolVLM-256M and optimized to focus on document processing, outputting the DocTags format, while SmolVLM is more general-purpose and supports tasks such as image description.
- What operating systems are supported?
Windows, macOS, and Linux are supported; SmolDocling runs wherever Python and the dependency libraries are installed.
- Is processing fast?
Processing an image takes only a few seconds on an ordinary computer, and for GPU users it is even faster, usually under one second.
- Can it handle handwritten text?
Yes, but results depend on the clarity of the handwriting; printed-text images are recommended for best results.