Hugging Face Introduces SmolVLM, a Small Multimodal Model that Runs on End Devices

SmolVLM is a small multimodal model with 2 billion parameters that accepts any combination of image and text inputs and generates text output.

After launching the SmolLM lightweight language model in July, AI development platform Hugging Face this week released SmolVLM, a multimodal model that emphasizes both light weight and high performance, adding to its lineup of small language models.

SmolVLM is a small multimodal model with 2 billion parameters that is reported to be the state-of-the-art (SOTA) performer in its class. It accepts any combination of images and text as input but, as a lightweight model, generates only text output. SmolVLM can answer questions about images, describe the content of an image, tell a story based on multiple images, or be used as a pure language model. According to the development team, SmolVLM's lightweight architecture is well suited to running on-device while still performing multimodal tasks well.
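
For readers who want to try this themselves, below is a minimal sketch of image-plus-text inference using the standard Transformers vision-to-text API. The model ID `HuggingFaceTB/SmolVLM-Instruct` and the chat-template message format follow the published model card; the image URL is only a placeholder, and the exact usage should be checked against the model card.

```python
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Load the instruction-tuned checkpoint (model ID as published on the Hub).
model_id = "HuggingFaceTB/SmolVLM-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Any RGB image works; this URL is just an example.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build a chat-style prompt that mixes an image and a text question.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```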

SmolVLM's architecture is based on IDEFICS 3, the vision-language model Hugging Face introduced earlier, and even shares the same Transformers implementation. However, Hugging Face has made several improvements over IDEFICS 3. First, the language-model backbone was switched from Llama 3.1 8B to SmolLM2 1.7B. Second, SmolVLM compresses visual information more aggressively, using a pixel shuffle strategy and larger patches to encode visual tokens, which improves encoding efficiency, speeds up inference, and reduces memory usage.
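
To illustrate the pixel shuffle idea, the sketch below shows the common space-to-depth formulation used by vision-language models in this family: a scale-by-scale neighborhood of visual tokens is folded into the channel dimension, cutting the token count by scale squared. This is a generic illustration of the technique under those assumptions, not SmolVLM's exact code.

```python
import torch

def pixel_shuffle_tokens(x: torch.Tensor, scale: int = 2) -> torch.Tensor:
    """Reduce the number of visual tokens by folding each scale x scale
    spatial neighborhood into the channel dimension (space-to-depth).

    x: (batch, seq_len, dim), where seq_len is a perfect square.
    Returns: (batch, seq_len / scale**2, dim * scale**2).
    """
    b, n, d = x.shape
    h = w = int(n ** 0.5)
    assert h * w == n, "token sequence must form a square grid"
    x = x.view(b, h, w, d)
    # Merge `scale` adjacent columns into the channel dimension.
    x = x.view(b, h, w // scale, d * scale)
    x = x.permute(0, 2, 1, 3)
    # Merge `scale` adjacent rows into the channel dimension.
    x = x.reshape(b, w // scale, h // scale, d * scale * scale)
    x = x.permute(0, 2, 1, 3)
    return x.reshape(b, n // (scale * scale), d * scale * scale)

tokens = torch.randn(1, 16, 768)    # e.g. 16 visual tokens from a ViT encoder
out = pixel_shuffle_tokens(tokens)  # -> (1, 4, 3072): 4x fewer, wider tokens
```

The trade-off is that each remaining token carries more channels, so the language model sees fewer but denser visual tokens, which is what shortens the sequence and lowers memory use.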

Hugging Face emphasized SmolVLM's efficiency and memory advantages and published benchmark data comparing it with models of similar size. SmolVLM outperforms models such as InternVL2, PaliGemma, MM1.5, moondream, and MiniCPM-V-2 on multimodal understanding, reasoning, math, and text comprehension, and it also beats most of them on GPU memory efficiency. Compared with Alibaba's Qwen2-VL, SmolVLM delivers 3.3 to 4.5 times higher prefill throughput and 7.5 to 16 times higher generation throughput.
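
On memory-constrained devices, the footprint can be shrunk further by loading the model with 4-bit quantization, as sketched below. This uses the standard Transformers/bitsandbytes pattern rather than anything SmolVLM-specific, and the throughput figures above were not measured this way.

```python
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

# 4-bit quantized load (requires the bitsandbytes package and a CUDA GPU).
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",  # model ID as published on the Hub
    quantization_config=quant_config,
    device_map="auto",
)
```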

Hugging Face has released three versions of the SmolVLM model: SmolVLM-Base, for downstream fine-tuning; SmolVLM-Synthetic, fine-tuned on synthetic datasets; and SmolVLM-Instruct, the instruction-tuned version ready for direct end-user interaction. All SmolVLM model checkpoints, training datasets, training recipes, and tools are released under the Apache 2.0 open-source license.
