SmolVLM is a small, 2-billion-parameter multimodal model that accepts any combination of images and text as input and generates text output.
After launching the lightweight SmolLM language model in July, AI development platform Hugging Face this week released SmolVLM, a compact multimodal model that pairs a small footprint with high performance, adding to its lineup of small language models.
SmolVLM is a small multimodal model with 2 billion parameters that the team describes as state-of-the-art (SOTA) in its class. It accepts any combination of images and text as input but, as a lightweight model, generates only text output. SmolVLM can answer questions about images, describe the content of an image, tell a story grounded in multiple images, or serve as a pure language model. According to the development team, its lightweight architecture makes it well suited to running on devices while still performing well on multimodal tasks.
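To make the interaction style concrete, here is a minimal sketch of image question answering with the instruction-tuned checkpoint via the Transformers library. The checkpoint name and chat-message format follow Hugging Face's published usage for SmolVLM, but treat the details (image file, prompt, generation settings) as illustrative:

```python
# Minimal sketch: ask SmolVLM a question about an image with Transformers.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

image = Image.open("photo.jpg")  # any local image; placeholder path

# Images and text can be interleaved freely within a user turn.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What is happening in this picture?"},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

generated = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```

Passing several images with multiple `{"type": "image"}` entries in the same turn is what enables the multi-image storytelling use case mentioned above.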
SmolVLM's architecture is based on IDEFICS 3, the vision-language model Hugging Face introduced previously, and it even shares the same Transformers implementation. Hugging Face did, however, make several changes relative to IDEFICS 3. First, the language-model backbone was switched from Llama 3.1 8B to SmolLM2 1.7B. Second, SmolVLM compresses visual information more aggressively, using a pixel shuffle strategy and larger patches for visual token encoding, which improves encoding efficiency, speeds up inference, and reduces memory usage.
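The pixel shuffle idea is easy to see in code: an r×r neighbourhood of patch embeddings is folded into a single token with r² times the channels, cutting the visual sequence length by r². The sketch below illustrates the general space-to-depth operation; the grid size, embedding dimension, and ratio are illustrative rather than SmolVLM's exact configuration:

```python
# Sketch of pixel shuffle as visual-token compression (space-to-depth).
import torch

def pixel_shuffle(x: torch.Tensor, r: int) -> torch.Tensor:
    """x: (batch, H*W, dim) patch embeddings on a square H x W grid, H divisible by r."""
    b, n, d = x.shape
    h = w = int(n ** 0.5)                    # assume a square patch grid
    x = x.view(b, h // r, r, w // r, r, d)   # split the grid into r x r blocks
    x = x.permute(0, 1, 3, 2, 4, 5)          # (b, h/r, w/r, r, r, d)
    return x.reshape(b, (h // r) * (w // r), r * r * d)

tokens = torch.randn(1, 27 * 27, 768)        # 729 visual tokens
compressed = pixel_shuffle(tokens, r=3)
print(compressed.shape)                      # torch.Size([1, 81, 6912]): 9x fewer tokens
```

A projection layer then maps the widened channels back to the language model's hidden size, so the language model sees far fewer visual tokens per image, which is where the inference-speed and memory savings come from.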
Hugging Face emphasized SmolVLM's efficiency and memory-usage advantages and published benchmark results comparing it with models of similar parameter count. SmolVLM outperforms models such as InternVL2, PaliGemma, MM1.5, moondream, and MiniCPM-V-2 on multimodal understanding, reasoning, math, and text comprehension, and it also beats most of them on GPU memory efficiency. Compared with Alibaba's Qwen2-VL, SmolVLM delivers 3.3 to 4.5 times faster prefill throughput and 7.5 to 16 times higher generation throughput.
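For readers unfamiliar with the two throughput figures: prefill throughput measures how fast the model processes the full prompt (text plus visual tokens) in one forward pass, while generation throughput measures how fast it decodes new tokens afterwards. The sketch below shows one rough way to measure both; `model` and `inputs` are assumed to come from an example like the one above, and this wall-clock approach is illustrative, not the benchmark setup Hugging Face used:

```python
# Rough sketch: separate prefill and generation throughput measurements.
import time
import torch

def measure_throughput(model, inputs, max_new_tokens=128):
    with torch.no_grad():
        # Prefill: a single forward pass over the whole prompt.
        t0 = time.perf_counter()
        model(**inputs)
        prefill_s = time.perf_counter() - t0

        # Generation: greedy decoding (includes one prefill, so this is approximate).
        t0 = time.perf_counter()
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        gen_s = time.perf_counter() - t0

    n_prompt = inputs["input_ids"].shape[1]
    n_new = out.shape[1] - n_prompt
    print(f"prefill: {n_prompt / prefill_s:.1f} tok/s, "
          f"generation: {n_new / gen_s:.1f} tok/s")
```

Because SmolVLM's pixel shuffle shrinks the visual token count, the prompt the model has to prefill is much shorter, which is consistent with the large prefill-throughput gap reported against Qwen2-VL.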
Hugging Face has released three versions of the SmolVLM family: SmolVLM-Base for downstream fine-tuning, SmolVLM-Synthetic fine-tuned on synthetic datasets, and the instruction-tuned SmolVLM Instruct, which is ready for direct end-user interaction. All SmolVLM model checkpoints, training datasets, training recipes, and tools are released under the Apache 2.0 open source license.