Recently, NVIDIA, together with the Massachusetts Institute of Technology and Tsinghua University, released an open-source image generation model called SANA. SANA can efficiently generate images at resolutions up to 4096 × 4096, and it does so very quickly.
SANA's performance
SANA's defining trait is speed. SANA-0.6B generates a 1024×1024 image in less than a second, 25 times faster than Flux-Dev, and it generates 4096×4096 images 106 times faster than Flux-Dev.
In terms of generation quality, SANA matches Flux on the DPG-Bench benchmark and scores only slightly lower than Flux on the GenEval metric.
SANA's core design
SANA's results rest on four core designs:
1. Deep compression autoencoder (DC-AE)
Conventional autoencoders (AEs) typically compress images by a factor of 8. SANA introduces a deep compression autoencoder that raises the compression factor to 32. This sharply reduces the number of latent tokens, which lets SANA generate ultra-high-resolution images (e.g., 4K) efficiently while significantly cutting the computational cost of both training and generation.
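To make the effect of the compression factor concrete, the following minimal sketch counts latent positions for a square image under 8× and 32× spatial compression. It assumes each latent position becomes one token; the function name `latent_tokens` and the `patch_size` parameter are hypothetical and only illustrate the arithmetic, not SANA's actual code.

```python
def latent_tokens(height, width, compression, patch_size=1):
    """Count latent tokens after downsampling by `compression` per side.

    `patch_size` is a hypothetical extra grouping factor (1 = one token
    per latent position); it is an assumption for illustration only.
    """
    stride = compression * patch_size
    if height % stride or width % stride:
        raise ValueError("image size must be divisible by compression * patch_size")
    return (height // stride) * (width // stride)


if __name__ == "__main__":
    for side in (1024, 4096):
        f8 = latent_tokens(side, side, compression=8)
        f32 = latent_tokens(side, side, compression=32)
        print(f"{side}x{side}: 8x -> {f8} tokens, 32x -> {f32} tokens "
              f"({f8 // f32}x fewer)")
```

Under these assumptions, moving from 8× to 32× compression leaves 16 times fewer latent positions at any resolution, since the factor applies to both height and width.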
2. Linear DiT (Diffusion Transformer)
SANA replaces the traditional quadratic attention mechanism with linear attention, reducing the complexity from O(N²) to O(N) in the number of tokens N. This makes high-resolution image generation more efficient. The design also dispenses with positional encoding, making SANA, according to its authors, the first DiT that requires no positional embeddings.
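The drop from O(N²) to O(N) comes from reordering the attention computation. The sketch below is a generic kernelized linear attention in pure Python, not SANA's implementation: the ReLU feature map, the `eps` stabilizer, and all function names are assumptions chosen for illustration. It compares a reference version that forms every pairwise weight with a linear version that first accumulates a small d × d_v summary of the keys and values.

```python
import random


def relu(vec):
    return [max(0.0, x) for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def quadratic_attention(Q, K, V, eps=1e-6):
    """Reference: computes all N x N pairwise weights, O(N^2) in N."""
    dv = len(V[0])
    out = []
    for q in Q:
        fq = relu(q)
        weights = [dot(fq, relu(k)) for k in K]
        norm = sum(weights) + eps
        out.append([sum(w * v[c] for w, v in zip(weights, V)) / norm
                    for c in range(dv)])
    return out


def linear_attention(Q, K, V, eps=1e-6):
    """Same result, but keys/values are summarized once, O(N) in N."""
    d, dv = len(K[0]), len(V[0])
    kv = [[0.0] * dv for _ in range(d)]  # sum over j of relu(k_j) v_j^T
    k_sum = [0.0] * d                    # sum over j of relu(k_j)
    for k, v in zip(K, V):
        fk = relu(k)
        for a in range(d):
            k_sum[a] += fk[a]
            for c in range(dv):
                kv[a][c] += fk[a] * v[c]
    out = []
    for q in Q:
        fq = relu(q)
        norm = dot(fq, k_sum) + eps
        out.append([sum(fq[a] * kv[a][c] for a in range(d)) / norm
                    for c in range(dv)])
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    N, d = 16, 4
    Q, K, V = ([[rng.gauss(0, 1) for _ in range(d)] for _ in range(N)]
               for _ in range(3))
    ref, lin = quadratic_attention(Q, K, V), linear_attention(Q, K, V)
    max_diff = max(abs(x - y) for r, l in zip(ref, lin) for x, y in zip(r, l))
    print("outputs agree:", max_diff < 1e-9)
```

Both functions produce the same outputs up to floating-point rounding; only the order of multiplication differs, which is why the cost of the linear version grows with N rather than N².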
3. Small decoder-only LLMs as text encoders
SANA uses a small decoder-only language model (such as Gemma 2) as its text encoder, replacing the traditional CLIP or T5 models. Gemma offers stronger text understanding and instruction following, which, combined with carefully designed human instructions, significantly improves text-image alignment.
4. Efficient training and inference strategies
SANA proposes an automatic captioning and training strategy: multiple vision-language models (VLMs) generate different re-captions for each image, and high-quality captions are selected based on CLIPScore. This accelerates model convergence and improves text-image alignment. In addition, SANA introduces Flow-DPM-Solver, which drastically reduces the number of inference steps and further improves generation efficiency.
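As a rough illustration of CLIPScore-based caption selection, the sketch below turns per-caption scores into sampling probabilities with a temperature-scaled softmax. The text above says only that captions are selected by CLIPScore, so the softmax rule, the temperature parameter, the function names, and the example captions and scores are all assumptions made for illustration.

```python
import math
import random


def caption_probabilities(clip_scores, temperature=1.0):
    """Softmax over scores; lower temperature favors the best caption more."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in clip_scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_caption(captions, clip_scores, temperature=1.0, rng=random):
    """Pick one caption at random, weighted by its softmax probability."""
    probs = caption_probabilities(clip_scores, temperature)
    return rng.choices(captions, weights=probs, k=1)[0]


if __name__ == "__main__":
    # Hypothetical re-captions from different VLMs with made-up scores.
    captions = [
        "a red bicycle leaning against a brick wall",
        "a bike near a wall",
        "an outdoor scene",
    ]
    scores = [0.31, 0.27, 0.18]
    for t in (1.0, 0.05):
        probs = caption_probabilities(scores, temperature=t)
        print(t, [round(p, 3) for p in probs])
    print(sample_caption(captions, scores, temperature=0.05,
                         rng=random.Random(0)))
```

With a high temperature the captions are sampled almost uniformly, which keeps training text diverse; with a low temperature the highest-scoring caption dominates.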
Low-cost deployment and open source
Another highlight of SANA is how cheaply it can be deployed. SANA-0.6B can run on a 16GB laptop GPU, generating a 1024×1024 image in less than 1 second, and 22GB of video memory is enough for 4096×4096 images. This makes SANA suitable not only for high-end hardware but also for ordinary users' laptops. NVIDIA has also announced that it will publicly release SANA's code and models, which should further broaden the adoption of text-to-image generation.
Using SANA
NVIDIA has set up a web demo, served by eight 3090 GPUs, that anyone can try for free. Notably, SANA accepts Chinese prompts directly.
Prompts containing icon symbols such as emoji also work, which is likely thanks to the Gemma 2 2B language model used as the text encoder.
With the ComfyUI_ExtraModels plugin, SANA is easy to use in a local ComfyUI installation. The plugin is simple to install: there are no dependencies to configure by hand, and the required model files are downloaded automatically on the first run after installation.
With its deep compression autoencoder, linear DiT, small decoder-only LLM text encoder, and efficient training and inference strategies, SANA generates ultra-high-resolution images efficiently while offering strong text-image alignment and low deployment cost. For anyone who needs to produce images quickly, SANA is a good choice, although its ecosystem cannot yet compare with Flux's.
Project page:
github.com/NVlabs/Sana
Web demo:
nv-sana.mit.edu
ComfyUI plugin:
github.com/Efficient-Large-Model/ComfyUI_ExtraModels