Hugging Face Introduces SmolVLM, a Small Multimodal Model that Runs on End Devices

SmolVLM is a small multimodal model with 2 billion parameters that accepts any combination of images and text as input and generates text output.

After launching the lightweight SmolLM language model in July, AI development platform Hugging Face this week released SmolVLM, a multimodal model that focuses on small size and high performance, adding to its lineup of small language models.

SmolVLM is a small multimodal model with 2 billion parameters that the team describes as the performance leader in its class (state of the art, SOTA). It accepts any combination of images and text as input but, as a lightweight model, generates only text output. SmolVLM can answer questions about images, describe the content of an image, tell a story based on multiple images, or be used as a pure language model. According to the development team, SmolVLM's lightweight architecture makes it well suited to running on end devices while still performing multimodal tasks well.
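For readers who want to try the instruction-tuned checkpoint, a minimal sketch using the standard Transformers vision-to-text pattern might look like the following. The checkpoint name `HuggingFaceTB/SmolVLM-Instruct` and the exact chat-template format are assumptions based on Hugging Face's usual conventions and should be checked against the model card.

```python
# Minimal sketch: asking SmolVLM-Instruct a question about a single image.
# Assumes the checkpoint name "HuggingFaceTB/SmolVLM-Instruct" and that the
# model is exposed through transformers' AutoModelForVision2Seq class.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-Instruct"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

image = Image.open("example.jpg")  # any local image

# Build a chat-style prompt with one image placeholder and one question.
messages = [
    {"role": "user",
     "content": [
         {"type": "image"},
         {"type": "text", "text": "Describe this image."},
     ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

# Generate a text-only answer (SmolVLM does not produce images).
generated = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```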

SmolVLM's architecture is based on IDEFICS 3, the vision-language model Hugging Face introduced earlier, and it even reuses the same Transformers implementation. However, Hugging Face made several changes to the IDEFICS 3 recipe. First, the core language model was switched from Llama 3.1 8B to SmolLM2 1.7B. Second, SmolVLM compresses visual information more aggressively, using a pixel-shuffle strategy and larger image patches for visual token encoding, which improves encoding efficiency, speeds up inference, and reduces memory usage.
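To illustrate the idea behind pixel shuffle (also called space-to-depth) for visual tokens, here is a small, self-contained sketch. It is not SmolVLM's actual code, just a demonstration of how trading spatial resolution for channel depth cuts the number of visual tokens passed to the language model by the square of the shuffle ratio.

```python
# Illustrative sketch of pixel shuffle for visual tokens (not SmolVLM's code).
# A grid of H x W patch embeddings with C channels is rearranged into a
# (H/r) x (W/r) grid with C*r*r channels, reducing the token count by r**2.
import torch

def pixel_shuffle_tokens(features: torch.Tensor, ratio: int = 2) -> torch.Tensor:
    """features: (batch, H, W, C) patch embeddings -> (batch, H//r, W//r, C*r*r)."""
    b, h, w, c = features.shape
    assert h % ratio == 0 and w % ratio == 0
    x = features.reshape(b, h // ratio, ratio, w // ratio, ratio, c)
    x = x.permute(0, 1, 3, 2, 4, 5)          # group each r x r block of patches
    return x.reshape(b, h // ratio, w // ratio, c * ratio * ratio)

vision_features = torch.randn(1, 32, 32, 768)                # 32*32 = 1024 tokens
compressed = pixel_shuffle_tokens(vision_features, ratio=2)
print(vision_features.shape[1] * vision_features.shape[2])   # 1024 tokens before
print(compressed.shape[1] * compressed.shape[2])              # 256 tokens after
```

Fewer, wider visual tokens mean the language model attends over a much shorter image sequence, which is where the faster inference and lower memory usage come from.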

Hugging Face emphasized SmolVLM's efficiency and memory advantages and published benchmark results comparing it with models of similar parameter count. SmolVLM outperforms models such as InternVL2, PaliGemma, MM1.5, moondream, and MiniCPM-V-2 in multimodal understanding, reasoning, math, and text comprehension, and it is also among the most efficient in GPU memory usage. Compared with Alibaba's Qwen2-VL, SmolVLM delivers 3.3 to 4.5 times faster prefill throughput and 7.5 to 16 times higher generation throughput.

Hugging Face has released three versions of SmolVLM: SmolVLM-Base for downstream fine-tuning, SmolVLM-Synthetic, which is fine-tuned on synthetic datasets, and the instruction-fine-tuned SmolVLM Instruct, which is ready for direct end-user interaction. All SmolVLM model checkpoints, training datasets, training recipes, and tools are released under the Apache 2.0 open-source license.
