SmolDocling: a compact visual language model for efficient document processing

General Introduction

SmolDocling is a Visual Language Model (VLM) developed by the ds4sd team in collaboration with IBM. It is based on SmolVLM-256M and hosted on the Hugging Face platform. With only 256M parameters, it is presented as the world's smallest VLM. Its core function is to extract text from images, recognize layouts, code, formulas, and charts, and generate structured documents in the DocTags format. SmolDocling runs efficiently on ordinary devices with low resource consumption. The development team has released the model as open source to help more people handle document tasks. It belongs to the SmolVLM family, specializes in document conversion, and suits users who need to process complex documents quickly.

Feature List

  • Text Extraction (OCR): Recognizes and extracts text from images, with multi-language support.
  • Layout Recognition: Analyzes the structure of a document image, such as the position of headings, paragraphs, and tables.
  • Code Recognition: Extracts code blocks and preserves indentation and formatting.
  • Formula Recognition: Detects math formulas and converts them to editable text.
  • Chart Recognition: Parses charts in the image and extracts their data.
  • Table Processing: Recognizes the structure of a table and retains row and column information.
  • DocTags Output: Converts processing results into a uniform markup format for easy downstream use.
  • High Resolution Image Processing: Supports higher-resolution image input to improve recognition accuracy.
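
To make the DocTags idea more concrete, here is a minimal sketch that pulls elements out of a DocTags-like string with a regular expression. The sample string is hand-written for illustration and is not real model output. The tag names, the four `<loc_N>` tokens per element, and their reading as (x1, y1, x2, y2) are assumptions. For real conversions, the docling_core library is the better tool.

```python
import re

# Hypothetical sample, written by hand for illustration; real output may differ.
SAMPLE = (
    "<doctag>"
    "<section_header_level_1><loc_50><loc_40><loc_450><loc_60>"
    "Introduction</section_header_level_1>"
    "<text><loc_50><loc_70><loc_450><loc_120>"
    "SmolDocling converts pages.</text>"
    "</doctag>"
)

# One element: opening tag, four location tokens, content, matching closing tag.
ELEMENT = re.compile(
    r"<(?P<tag>\w+)>"
    r"<loc_(\d+)><loc_(\d+)><loc_(\d+)><loc_(\d+)>"
    r"(?P<body>.*?)"
    r"</(?P=tag)>",
    re.DOTALL,
)


def parse_doctags(doctags):
    """Return a list of (tag, (x1, y1, x2, y2), text) tuples."""
    items = []
    for match in ELEMENT.finditer(doctags):
        # Groups 2..5 hold the four location values (assumed order: x1, y1, x2, y2).
        box = tuple(int(match.group(i)) for i in range(2, 6))
        items.append((match.group("tag"), box, match.group("body").strip()))
    return items


for tag, box, text in parse_doctags(SAMPLE):
    print(tag, box, text)
```

This sketch only handles flat elements that carry a location prefix. Nested structures such as tables would need a proper parser.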

 

Usage Guide

Using SmolDocling involves two parts: installation and operation. The detailed steps below help users get started quickly.

Installation process

  1. Preparing the environment
    • Make sure your computer has Python 3.8 or later installed.
    • Install the dependency libraries by entering the following command in the terminal:
      pip install torch transformers docling_core
      
    • If you have a GPU, installing PyTorch with CUDA support is recommended for faster inference. To check whether a GPU is visible:
      import torch
      # Report which device PyTorch will be able to use
      print("GPU available" if torch.cuda.is_available() else "Using CPU")
      
  2. Load the model
    • SmolDocling does not need to be downloaded manually; the code fetches it directly from Hugging Face.
    • Make sure you have a working network connection: the first run downloads the model files automatically.
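
The environment requirements above can be checked with a short script before running the model. This is a minimal sketch using only the standard library; the package list mirrors the pip command above, and the function name `check_environment` is made up for this example.

```python
import importlib.util
import sys

# Import names of the packages from the install command above.
REQUIRED = ("torch", "transformers", "docling_core")


def check_environment(min_version=(3, 8), packages=REQUIRED):
    """Return a list of problems found; an empty list means everything looks fine."""
    problems = []
    if sys.version_info < min_version:
        problems.append(
            "Python %d.%d+ required, found %d.%d"
            % (min_version + tuple(sys.version_info[:2]))
        )
    for name in packages:
        # find_spec locates a top-level package without importing it
        if importlib.util.find_spec(name) is None:
            problems.append("missing package: " + name)
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment looks OK"]:
        print(line)
```

Running the script prints one line per problem, or a single confirmation line when nothing is missing.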

Usage steps

  1. Prepare the image
    • Find an image that contains text, such as a scanned document or a screenshot.
    • Load the image with code:
      from transformers.image_utils import load_image
      # Replace with the path (or URL) of your own image
      image = load_image("path/to/your_image.jpg")
      
  2. Initialize the model and processor
    • Load SmolDocling's processor and model:
      import torch
      from transformers import AutoProcessor, AutoModelForVision2Seq
      # Use the GPU when one is available, otherwise fall back to the CPU
      DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
      processor = AutoProcessor.from_pretrained("ds4sd/SmolDocling-256M-preview")
      model = AutoModelForVision2Seq.from_pretrained(
          "ds4sd/SmolDocling-256M-preview",
          torch_dtype=torch.bfloat16
      ).to(DEVICE)
      
  3. Generate DocTags
    • Build the prompt and run the model:
      # One user turn: the image placeholder followed by the instruction text
      messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Convert this page to docling."}]}]
      prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
      inputs = processor(text=prompt, images=[image], return_tensors="pt").to(DEVICE)
      generated_ids = model.generate(**inputs, max_new_tokens=8192)
      # Drop the prompt tokens and decode only the newly generated ones
      doctags = processor.batch_decode(generated_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=False)[0].lstrip()
      print(doctags)
      
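  4. Convert to common formats
    • Convert DocTags to Markdown or other formats:
      from docling_core.types.doc import DoclingDocument
      from docling_core.types.doc.document import DocTagsDocument
      # Pair the DocTags string with its source image before loading it.
      # Note: the exact loading call may differ between docling_core versions.
      doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
      doc = DoclingDocument(name="My Document")
      doc.load_from_doctags(doctags_doc)
      print(doc.export_to_markdown())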
      
  5. Advanced usage (optional)
    • Handling multi-page documents: Process the page images in a loop and then merge the resulting DocTags.
    • Optimizing performance: Setting torch_dtype=torch.bfloat16 saves memory, and GPU users can enable flash_attention_2 for acceleration (this requires the flash-attn package):
      model = AutoModelForVision2Seq.from_pretrained(
          "ds4sd/SmolDocling-256M-preview",
          torch_dtype=torch.bfloat16,
          _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager"
      ).to(DEVICE)
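
The multi-page loop mentioned in step 5 might look like the following minimal sketch. The single-page conversion is passed in as a function so that the loop can be read and tested on its own. `convert_pages` and `fake_convert` are hypothetical names for this example; in practice, `convert_one` would wrap the load, prompt, and generate steps shown earlier.

```python
from typing import Callable, Iterable, List


def convert_pages(image_paths: Iterable[str],
                  convert_one: Callable[[str], str]) -> List[str]:
    """Run convert_one on each page in order and collect the DocTags strings."""
    results = []
    for index, path in enumerate(image_paths, start=1):
        try:
            results.append(convert_one(path))
        except Exception as exc:
            # Keep going so that one bad page does not stop the whole batch;
            # an empty string marks the failed page and preserves page order.
            print(f"page {index} ({path}) failed: {exc}")
            results.append("")
    return results


# Stand-in converter for demonstration only; it does not call the model.
def fake_convert(path: str) -> str:
    return f"<doctag><text>{path}</text></doctag>"


pages = convert_pages(["page1.png", "page2.png"], fake_convert)
print(len(pages), "pages converted")
```

The list keeps one entry per page in the original order, so a later merge step can match each DocTags string to its image.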
      

Tips

  • Image requirements: Images should be clear with legible text; higher resolution generally gives better results.
  • Adjusting parameters: If the output is incomplete, increase max_new_tokens (set to 8192 in the example above).
  • Batch processing: Multiple images can be passed in as a list, e.g. images=[image1, image2].
  • Debugging: Print intermediate results, e.g. print(inputs), to check that the input is correct.
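
One way to spot incomplete output is to check whether the generated string was closed properly. The sketch below assumes that a complete result contains a closing `</doctag>` tag; this is an assumption about the output format, so adjust `closing_tag` if your output looks different. The function name `looks_truncated` is made up for this example.

```python
def looks_truncated(doctags: str, closing_tag: str = "</doctag>") -> bool:
    """Return True when the closing tag is absent, which suggests that
    generation stopped before the document was finished."""
    return closing_tag not in doctags


# Hand-written strings for illustration, not real model output.
complete = "<doctag><text>Hello</text></doctag>"
cut_off = "<doctag><text>Hel"

print(looks_truncated(complete))
print(looks_truncated(cut_off))
```

When the check returns True, rerunning the page with a larger max_new_tokens value is the first thing to try.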

Notes

  • Internet access is required for the first run; after that, the model can be used offline.
  • Very large images may exhaust memory, so crop or downscale them before processing.
  • If you encounter an error, check that the Python version is supported and that the dependencies are installed correctly.
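
For the memory note above, a small helper can compute a reduced size that keeps the aspect ratio. This is a minimal sketch. The 2048-pixel limit is an arbitrary assumption rather than a value recommended for SmolDocling, and `fit_within` is a hypothetical helper name.

```python
def fit_within(width: int, height: int, max_side: int = 2048):
    """Scale (width, height) down so that the longer side is at most max_side.

    The aspect ratio is preserved; sizes that already fit are returned unchanged.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(4000, 3000, max_side=2000))
print(fit_within(800, 600))
```

With a PIL image, such as the one returned by load_image, the result can be applied with `image = image.resize(fit_within(*image.size))`.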

With the above steps, users can turn images into structured documents with SmolDocling. The whole process is simple and suitable for beginners and professional users.

 

Application Scenarios

  1. Academic research
    Convert scanned papers to text and extract formulas and tables for easy editing and citation.
  2. Programming documentation
    Convert images of manuals that contain code to Markdown, preserving code formatting for developers.
  3. Office automation
    Process scanned contracts, reports, and similar documents, recognizing layout and content to improve efficiency.
  4. Educational support
    Turn textbook images into editable documents to help teachers and students organize their notes.

 

QA

  1. What is the difference between SmolDocling and SmolVLM?
    SmolDocling is an optimized version based on SmolVLM-256M that focuses on document processing and outputs the DocTags format, while SmolVLM is more general and supports tasks such as image description.
  2. What operating systems are supported?
    Windows, macOS, and Linux are supported; it runs wherever Python and the required libraries are installed.
  3. Is the processing fast?
    Processing an image takes only a few seconds on a regular computer and is faster on a GPU, usually less than 1 second.
  4. Can it handle handwritten text?
    Yes, but the results depend on how clear the handwriting is; printed text gives the best results.
May not be reproduced without permission: Chief AI Sharing Circle » SmolDocling: a compact visual language model for efficient document processing
