General Introduction
SmolDocling is a vision-language model (VLM) developed by IBM Research's ds4sd (Docling) team in collaboration with Hugging Face, based on SmolVLM-256M and hosted on the Hugging Face platform. With only 256M parameters it is one of the smallest VLMs available. Its core function is to extract text from images, recognize layout, code, formulas, and charts, and generate structured documents in the DocTags format. SmolDocling runs efficiently on ordinary hardware with low resource consumption. The development team has open-sourced the model in the hope of helping more people with document-processing tasks. It is part of the SmolVLM family, specializes in document conversion, and is suited to users who need to process complex documents quickly.
Function List
- Text Extraction (OCR): recognizes and extracts text from images, with support for multiple languages.
- Layout Recognition: analyzes the structure of the document in an image, such as the positions of headings, paragraphs, and tables.
- Code Recognition: extracts code blocks and preserves indentation and formatting.
- Formula Recognition: detects mathematical formulas and converts them to editable text.
- Chart Recognition: parses charts in the image and extracts their data.
- Table Processing: recognizes table structure and retains row and column information.
- DocTags Output: converts the results into a uniform markup format for easy downstream use.
- High-Resolution Image Processing: supports higher-resolution image input to improve recognition accuracy.
Usage Guide
The use of SmolDocling is divided into two parts: installation and operation. Below are detailed steps to help users get started quickly.
Installation Process
- Preparing the environment
- Make sure your computer has Python 3.8 or later installed.
- Install the dependency libraries by entering the following command in the terminal:
pip install torch transformers docling_core
- If you have a GPU, it is recommended to install PyTorch with CUDA support for faster inference. To check whether a GPU is available:
import torch
print("GPU available" if torch.cuda.is_available() else "Using CPU")
- Load the model
- SmolDocling does not need to be downloaded manually; the code loads it directly from Hugging Face.
- Make sure you have an internet connection; the first run will automatically download the model files.
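As an optional sanity check before moving on, the short sketch below (not part of the official instructions) verifies that the key packages import cleanly and reports whether a GPU was detected.
# Optional sanity check: confirm the dependencies import and report the device.
import torch
import transformers
import docling_core  # used later, when converting DocTags to Markdown

print("transformers:", transformers.__version__)
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())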
Usage Steps
- Prepare the image
- Find an image that contains text, such as a scanned document or a screenshot.
- Load the image with code:
from transformers.image_utils import load_image
image = load_image("path/to/your/image.jpg")
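If the image is not a local file, load_image also accepts a URL, and opening a file with PIL works just as well; the snippet below is a small optional sketch in which the URL and file name are placeholders.
# Alternative ways to obtain `image` (URL and file name are placeholders).
from transformers.image_utils import load_image

image = load_image("https://example.com/sample_page.png")  # from a URL
# or open a local file with PIL instead:
# from PIL import Image
# image = Image.open("scan.png").convert("RGB")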
- Initialize the model and processor
- Load SmolDocling's processor and model:
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained("ds4sd/SmolDocling-256M-preview")
model = AutoModelForVision2Seq.from_pretrained(
    "ds4sd/SmolDocling-256M-preview",
    torch_dtype=torch.bfloat16,
).to(DEVICE)
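As an optional check that ties back to the 256M-parameter figure from the introduction, the short sketch below counts the parameters of the loaded model.
# Optional: report the model size to confirm the compact checkpoint loaded.
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.0f}M parameters")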
- Generate DocTags
- Set up the inputs and run the model:
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(DEVICE)
generated_ids = model.generate(**inputs, max_new_tokens=8192)
doctags = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=False,
)[0].lstrip()
print(doctags)
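For readers curious about the speed figures quoted in the FAQ below, generation can be timed on your own hardware with the standard library; this optional sketch simply re-runs the generate call above under a timer.
# Optional: time a single generation run on your own hardware.
import time

start = time.perf_counter()
_ = model.generate(**inputs, max_new_tokens=8192)
print(f"Generation took {time.perf_counter() - start:.1f} s")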
- Convert to common formats
- Convert DocTags to Markdown or other formats:
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument

# Wrap the generated tags and the source image into a DocTagsDocument,
# then populate a DoclingDocument from it and export to Markdown.
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
doc = DoclingDocument(name="My Document")
doc.load_from_doctags(doctags_doc)
print(doc.export_to_markdown())
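To keep the result, the Markdown export can simply be written to disk; in the minimal sketch below the file name output.md is an arbitrary choice.
# Write the Markdown export to a file ("output.md" is an arbitrary name).
from pathlib import Path

Path("output.md").write_text(doc.export_to_markdown(), encoding="utf-8")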
- Advanced Usage (optional)
- Handle multi-page documents: process each page image in a loop and then merge the resulting DocTags (a sketch follows this list).
- Optimize performance: setting torch_dtype=torch.bfloat16 saves memory, and GPU users can enable flash_attention_2 for faster inference:
model = AutoModelForVision2Seq.from_pretrained(
    "ds4sd/SmolDocling-256M-preview",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)
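Below is a minimal sketch of the multi-page loop mentioned above. It assumes processor, model, and DEVICE are already set up as in the earlier steps, uses the placeholder file names page_1.png to page_3.png, and joins the per-page DocTags with newlines as one simple merging choice rather than an official API.
# Sketch: convert several page images in a loop and collect their DocTags.
from transformers.image_utils import load_image

pages = [load_image(f"page_{i}.png") for i in range(1, 4)]  # placeholder file names
all_doctags = []

for page in pages:
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Convert this page to docling."},
    ]}]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, images=[page], return_tensors="pt").to(DEVICE)
    generated_ids = model.generate(**inputs, max_new_tokens=8192)
    doctags = processor.batch_decode(
        generated_ids[:, inputs.input_ids.shape[1]:],
        skip_special_tokens=False,
    )[0].lstrip()
    all_doctags.append(doctags)

merged_doctags = "\n".join(all_doctags)  # one simple way to merge the pages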
Usage Tips
- Image requirements: images should be clear and the text legible; higher resolution generally gives better results.
- Adjust parameters: if the output is incomplete, increase max_new_tokens (set to 8192 in the example above).
- Batch processing: multiple images can be passed in as a list, e.g. images=[image1, image2].
- Debugging: print intermediate results, e.g. print(inputs), to check that the input is correct (a sketch follows this list).
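As a small illustration of the debugging tip above, the sketch below inspects what the processor produced before generation; it assumes inputs was built as in the earlier steps.
# Sketch: inspect the processor output before calling model.generate.
print(inputs.keys())                 # e.g. input_ids, attention_mask, pixel_values
print(inputs["input_ids"].shape)     # prompt length in tokens
print(inputs["pixel_values"].shape)  # shape of the preprocessed image tensor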
Caveats
- Internet access is required for the first run, after which it can be used offline.
- Very large images may cause out-of-memory errors; cropping or downscaling them before processing is recommended (a sketch follows this list).
- If you encounter an error, check that your Python version and the dependency libraries are installed correctly.
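One simple way to shrink an oversized scan before running the model is shown below; the 2048-pixel cap and the file name large_scan.png are illustrative choices, not official limits.
# Downscale an oversized image while keeping its aspect ratio.
from PIL import Image

image = Image.open("large_scan.png").convert("RGB")
image.thumbnail((2048, 2048))  # resizes in place only if the image exceeds the cap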
With the steps above, users can turn images into structured documents using SmolDocling. The whole process is simple and suits both beginners and professional users.
Application Scenarios
- Academic research: convert scanned papers to text and extract formulas and tables for easy editing and citation.
- Programming documentation organization: convert manual images containing code to Markdown, preserving code formatting for developers.
- Office automation: process scanned contracts, reports, and similar documents, recognizing layout and content to improve efficiency.
- Educational support: turn textbook images into editable documents to help teachers and students organize their notes.
FAQ
- What is the difference between SmolDocling and SmolVLM?
SmolDocling is based on SmolVLM-256M and optimized to focus on document processing, outputting the DocTags format, while SmolVLM is more general-purpose and supports tasks such as image description.
- What operating systems are supported?
Windows, macOS, and Linux are supported; SmolDocling runs wherever Python and the dependency libraries are installed.
- Is processing fast?
Processing an image takes only a few seconds on an ordinary computer, and for GPU users it is even faster, usually under one second.
- Can it handle handwritten text?
Yes, but results depend on the clarity of the handwriting; printed-text images are recommended for best results.