SmolVLM is a small, 2-billion-parameter multimodal model that accepts any combination of images and text as input and generates text output.
After launching the lightweight SmolLM language model in July, AI development platform Hugging Face this week released SmolVLM, a compact multimodal model that pairs a small footprint with high performance, adding to its lineup of small language models.
SmolVLM is a small multimodal model with 2 billion parameters that the team describes as state-of-the-art (SOTA) in its class. It accepts any combination of images and text as input but, as a lightweight model, generates only text output. SmolVLM can answer questions about images, describe the content of an image, tell a story grounded in multiple images, or serve as a pure language model. According to the development team, its lightweight architecture makes it well suited to running on devices while still performing well on multimodal tasks.
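To make the interaction style concrete, here is a minimal sketch of image question answering with the instruction-tuned checkpoint via the Transformers library. The checkpoint name and chat-message format follow Hugging Face's published usage for SmolVLM, but treat the details (image file, prompt, generation settings) as illustrative:

```python
# Minimal sketch: ask SmolVLM a question about an image with Transformers.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

image = Image.open("photo.jpg")  # any local image; placeholder path

# Images and text can be interleaved freely within a user turn.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What is happening in this picture?"},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

generated = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```

Passing several images with multiple `{"type": "image"}` entries in the same turn is what enables the multi-image storytelling use case mentioned above.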
SmolVLM's architecture is based on IDEFICS 3, the vision-language model Hugging Face introduced previously, and it even shares the same Transformers implementation. Hugging Face did, however, make several changes relative to IDEFICS 3. First, the language-model backbone was switched from Llama 3.1 8B to SmolLM2 1.7B. Second, SmolVLM compresses visual information more aggressively, using a pixel shuffle strategy and larger patches for visual token encoding, which improves encoding efficiency, speeds up inference, and reduces memory usage.
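The pixel shuffle idea is easy to see in code: an r×r neighbourhood of patch embeddings is folded into a single token with r² times the channels, cutting the visual sequence length by r². The sketch below illustrates the general space-to-depth operation; the grid size, embedding dimension, and ratio are illustrative rather than SmolVLM's exact configuration:

```python
# Sketch of pixel shuffle as visual-token compression (space-to-depth).
import torch

def pixel_shuffle(x: torch.Tensor, r: int) -> torch.Tensor:
    """x: (batch, H*W, dim) patch embeddings on a square H x W grid, H divisible by r."""
    b, n, d = x.shape
    h = w = int(n ** 0.5)                    # assume a square patch grid
    x = x.view(b, h // r, r, w // r, r, d)   # split the grid into r x r blocks
    x = x.permute(0, 1, 3, 2, 4, 5)          # (b, h/r, w/r, r, r, d)
    return x.reshape(b, (h // r) * (w // r), r * r * d)

tokens = torch.randn(1, 27 * 27, 768)        # 729 visual tokens
compressed = pixel_shuffle(tokens, r=3)
print(compressed.shape)                      # torch.Size([1, 81, 6912]): 9x fewer tokens
```

A projection layer then maps the widened channels back to the language model's hidden size, so the language model sees far fewer visual tokens per image, which is where the inference-speed and memory savings come from.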
Hugging Face emphasized SmolVLM's efficiency and memory-usage advantages and published benchmark results comparing it with models of similar parameter count. SmolVLM outperforms models such as InternVL2, PaliGemma, MM1.5, moondream, and MiniCPM-V-2 on multimodal understanding, reasoning, math, and text comprehension, and it also beats most of them on GPU memory efficiency. Compared with Alibaba's Qwen2-VL, SmolVLM delivers 3.3 to 4.5 times faster prefill throughput and 7.5 to 16 times higher generation throughput.
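For readers unfamiliar with the two throughput figures: prefill throughput measures how fast the model processes the full prompt (text plus visual tokens) in one forward pass, while generation throughput measures how fast it decodes new tokens afterwards. The sketch below shows one rough way to measure both; `model` and `inputs` are assumed to come from an example like the one above, and this wall-clock approach is illustrative, not the benchmark setup Hugging Face used:

```python
# Rough sketch: separate prefill and generation throughput measurements.
import time
import torch

def measure_throughput(model, inputs, max_new_tokens=128):
    with torch.no_grad():
        # Prefill: a single forward pass over the whole prompt.
        t0 = time.perf_counter()
        model(**inputs)
        prefill_s = time.perf_counter() - t0

        # Generation: greedy decoding (includes one prefill, so this is approximate).
        t0 = time.perf_counter()
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        gen_s = time.perf_counter() - t0

    n_prompt = inputs["input_ids"].shape[1]
    n_new = out.shape[1] - n_prompt
    print(f"prefill: {n_prompt / prefill_s:.1f} tok/s, "
          f"generation: {n_new / gen_s:.1f} tok/s")
```

Because SmolVLM's pixel shuffle shrinks the visual token count, the prompt the model has to prefill is much shorter, which is consistent with the large prefill-throughput gap reported against Qwen2-VL.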
Hugging Face has released three versions of the SmolVLM family: SmolVLM-Base for downstream fine-tuning, SmolVLM-Synthetic fine-tuned on synthetic datasets, and the instruction-tuned SmolVLM Instruct, which is ready for direct end-user interaction. All SmolVLM model checkpoints, training datasets, training recipes, and tools are released under the Apache 2.0 open source license.