General Introduction
NVIDIA Cosmos is a world foundation model platform designed to help physical AI developers build their systems better and faster. The platform offers a range of pre-trained models, including diffusion-based and autoregressive world foundation models, as well as tokenizers for efficient video processing. Cosmos supports Text2World and Video2World generation, which produce visual simulations from text prompts or video input. It is released as open source: the model training and fine-tuning scripts are under the Apache 2 license, and the pre-trained models are under the NVIDIA Open Model License. The platform is optimized for understanding and generating physical scenes, which makes it a strong foundation for robotics and autonomous driving.
What is NVIDIA Cosmos?
NVIDIA Cosmos™ is a state-of-the-art generative World Foundation Model (WFM) platform that includes advanced tokenizers, guardrails, and an accelerated data processing and curation pipeline, designed to speed up the development of physical AI systems such as autonomous vehicles (AVs) and robots. At its core is a family of pre-trained models, built specifically for physical AI development, that generate physics-aware video and world states.
Feature List
- Diffusion-based world foundation models with Text2World and Video2World generation
- Autoregressive world foundation models with Video2World generation
- Efficient video tokenizers that convert video into continuous or discrete tokens
- Post-training scripts for adapting pre-trained models to different physical AI scenarios
- Video data curation pipeline (coming soon)
- Complete training scripts for building custom world foundation models
- Built-in guardrail system that checks the safety of generated content
- Multiple model sizes (4B/5B/12B/13B parameters) to suit different hardware configurations
- Flexible model offloading strategies for running with limited GPU memory
Usage Guide
1. Environment setup
First, set up the Docker environment by following the installation guide. All of the commands below must be run inside the Docker container.
2. Model download
- Generate a Hugging Face access token with "Read" permission.
- Log in to Hugging Face with the following command:
huggingface-cli login
- Download Cosmos model weights:
PYTHONPATH=$(pwd) python cosmos1/scripts/download_autoregressive.py --model_sizes 4B 5B 12B 13B
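If you do not need all four checkpoints, the same script can be driven from Python. The sketch below is a minimal wrapper, not part of Cosmos: build_download_command and run_download are hypothetical helper names, and it assumes that --model_sizes accepts any subset of the four sizes shown above and that you run it from the repository root.

```python
import os
import subprocess

# Sizes and script path are taken from the download command in this guide.
KNOWN_SIZES = ("4B", "5B", "12B", "13B")
SCRIPT = "cosmos1/scripts/download_autoregressive.py"


def build_download_command(sizes):
    """Return the argv list for the download script (hypothetical helper)."""
    unknown = [s for s in sizes if s not in KNOWN_SIZES]
    if not sizes or unknown:
        raise ValueError(
            f"expected a non-empty subset of {KNOWN_SIZES}, got {list(sizes)}"
        )
    return ["python", SCRIPT, "--model_sizes", *sizes]


def run_download(sizes, repo_root="."):
    """Run the script from the repo root with PYTHONPATH set, like the shell command."""
    env = dict(os.environ, PYTHONPATH=os.path.abspath(repo_root))
    subprocess.run(build_download_command(sizes), cwd=repo_root, env=env, check=True)
```

Calling run_download(["4B"]) would then fetch only the smallest checkpoint, assuming the script accepts a single size.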
3. Model types and use cases
The autoregressive models downloaded in step 2 come in two variants:
Base model
- Model sizes: 4B and 12B parameters
- Main feature: generates world simulations from image or video input
- Use case: extending and predicting a scene from existing visual content
Video2World model
- Model sizes: 5B and 13B parameters
- Main feature: generates world simulations from text combined with image or video input
- Use case: steering or modifying generated visual content with a text description
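The choice between the two variants comes down to whether you have a text prompt and how large a model you can afford. The sketch below encodes the table above as data; Variant and pick_variant are hypothetical names used only for illustration and are not part of the Cosmos codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    family: str         # "base" or "video2world"
    params: str         # parameter count as listed in this guide
    accepts_text: bool  # whether a text prompt can be combined with the visual input


# Ordered from smallest to largest within each family.
VARIANTS = [
    Variant("base", "4B", accepts_text=False),
    Variant("base", "12B", accepts_text=False),
    Variant("video2world", "5B", accepts_text=True),
    Variant("video2world", "13B", accepts_text=True),
]


def pick_variant(has_text_prompt, prefer_large=False):
    """Pick the family by input type, then the size by preference."""
    candidates = [v for v in VARIANTS if v.accepts_text == has_text_prompt]
    return candidates[-1] if prefer_large else candidates[0]


print(pick_variant(has_text_prompt=True))
print(pick_variant(has_text_prompt=False, prefer_large=True))
```

In practice the size would also depend on available GPU memory and acceptable latency.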
4. Generation capabilities and performance
- Generates video sequences of up to 33 frames
- Accepts a single image or 9 video frames as input
- Output resolution is fixed at 1024x640
- Inference time on an H100 GPU:
  - 4B model: approximately 62 seconds
  - 12B model: approximately 119 seconds
  - 5B Video2World model: approximately 73 seconds
  - 13B Video2World model: approximately 150 seconds
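To compare the models on a common scale, the timings above can be turned into a rough per-frame cost. This is back-of-the-envelope arithmetic only, and it assumes each quoted time covers one full 33-frame sequence, which the figures above do not state explicitly.

```python
# Approximate H100 inference times in seconds, as quoted in this guide.
INFERENCE_SECONDS = {
    "4B": 62,
    "12B": 119,
    "5B Video2World": 73,
    "13B Video2World": 150,
}
MAX_FRAMES = 33  # assumption: each timing is for a full 33-frame sequence


def seconds_per_frame(model):
    """Average wall-clock seconds per frame under the assumption above."""
    return INFERENCE_SECONDS[model] / MAX_FRAMES


def slowdown(model, baseline="4B"):
    """How many times slower a model is than the baseline."""
    return INFERENCE_SECONDS[model] / INFERENCE_SECONDS[baseline]


for name in INFERENCE_SECONDS:
    print(f"{name}: {seconds_per_frame(name):.2f} s/frame, "
          f"{slowdown(name):.2f}x the 4B time")
```

On these numbers, tripling the parameter count from 4B to 12B roughly doubles the inference time rather than tripling it.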
5. Memory optimization
Cosmos provides several model offloading strategies that reduce GPU memory usage:
- No offloading: the 4B model requires 31.3 GB and the 12B model requires 47.5 GB
- Full offloading: usage drops to 18.7 GB for the 4B model and 27.4 GB for the 12B model
- The Video2World models offer similar offloading options
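The figures above are enough to decide which base model and offloading level fit a given GPU. The sketch below is illustrative only: pick_config is a hypothetical helper, the strategy labels "none" and "full" are placeholders rather than real Cosmos flags, and it assumes you prefer the largest model that fits and that offloading should be avoided when it is not needed.

```python
# Peak GPU memory in GB from this guide; "none" and "full" are placeholder labels.
MEMORY_GB = {
    "12B": {"none": 47.5, "full": 27.4},
    "4B": {"none": 31.3, "full": 18.7},
}


def pick_config(gpu_memory_gb):
    """Return (model, strategy) for the largest model that fits, or None."""
    for model, strategies in MEMORY_GB.items():      # largest model first
        for strategy in ("none", "full"):            # prefer no offloading
            if strategies[strategy] <= gpu_memory_gb:
                return model, strategy
    return None


def savings_percent(model):
    """Memory saved by full offloading, as a percentage of the unoptimized figure."""
    usage = MEMORY_GB[model]
    return 100 * (usage["none"] - usage["full"]) / usage["none"]


for gpu in (16, 24, 32, 80):
    print(f"{gpu} GB GPU -> {pick_config(gpu)}")
print(f"4B saves {savings_percent('4B'):.1f}%, 12B saves {savings_percent('12B'):.1f}%")
```

A real deployment would leave some headroom below the card's nominal capacity, since the listed figures leave no margin for other processes.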
6. Safety features
- Built-in guardrail system that cannot be disabled
- Automatic detection and blurring of faces
- Content safety filtering so that generated results comply with safety standards