NVIDIA Cosmos: a world foundation model platform for building AI models of the physical world

Latest AI Resources | Posted 7 months ago by AI Sharing Circle


General Introduction

NVIDIA Cosmos is a world foundation model platform designed to help physical AI developers build their systems better and faster. The platform offers a range of pre-trained models, including diffusion-based and autoregressive world foundation models, as well as tokenizers for efficient video processing. NVIDIA Cosmos supports Text2World and Video2World generation, which produce visual simulations from text prompts or video input. The model training and fine-tuning scripts are released as open source under the Apache 2 license, and the pre-trained models are released under the NVIDIA Open Model License. The platform is optimized for understanding and generating physical scenes, providing a foundation model for robotics and autonomous driving.

What is NVIDIA Cosmos?
NVIDIA Cosmos™ is a generative World Foundation Model (WFM) platform that includes advanced tokenizers, guardrails, and accelerated data processing and curation pipelines, designed to speed up the development of physical AI systems such as autonomous vehicles (AVs) and robots. It provides a family of pre-trained models, built specifically for physical AI development, that generate physically-aware video and world states.

Try it online: https://build.nvidia.com/explore/discover

Feature List

Diffusion-based world foundation models with Text2World and Video2World generation support
Autoregressive world foundation models with Video2World generation support
Efficient video tokenizers that convert video into continuous or discrete tokens
Post-training scripts for adapting the pre-trained models to different physical AI scenarios
Video dataset curation pipeline (coming soon)
Complete training scripts for building customized world foundation models
Built-in safety guardrail system that screens generated content
Multiple model sizes (4B/5B/12B/13B parameters) to accommodate different hardware configurations
Flexible model offloading strategies for running with limited GPU memory

Usage Guide

1. Environment setup

First, set up the Docker environment by following the installation guide. All commands below must be run inside the Docker container.

2. Downloading the models

Generate a Hugging Face access token with "Read" permission.
Log in to Hugging Face with the following command:

huggingface-cli login

Download Cosmos model weights:

PYTHONPATH=$(pwd) python cosmos1/scripts/download_autoregressive.py --model_sizes 4B 5B 12B 13B
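
If only some model sizes are needed, the same command can be assembled programmatically. The sketch below is a minimal illustration: the helper name build_download_command is hypothetical, and it assumes (not confirmed in this article) that --model_sizes accepts any subset of the four sizes shown above.

```python
import shlex

# Sizes taken from the download command shown in this article.
VALID_SIZES = ("4B", "5B", "12B", "13B")
DOWNLOAD_SCRIPT = "cosmos1/scripts/download_autoregressive.py"


def build_download_command(sizes):
    """Return the argv list for downloading the given model sizes.

    Hypothetical helper: assumes --model_sizes accepts a subset of sizes.
    """
    sizes = list(sizes)
    if not sizes:
        raise ValueError("at least one model size is required")
    unknown = [s for s in sizes if s not in VALID_SIZES]
    if unknown:
        raise ValueError(f"unsupported model size(s): {unknown}")
    return ["python", DOWNLOAD_SCRIPT, "--model_sizes", *sizes]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_download_command(["4B", "5B"])))
```

To actually run the command, pass the list to subprocess.run from the repository root with PYTHONPATH set to the current directory, mirroring the PYTHONPATH=$(pwd) prefix above.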

3. Model types and use cases

The autoregressive Cosmos models downloaded above come in two main types:

Base models

Model versions: 4B and 12B parameters
Main features: generate world simulations from image/video input
Use cases: extending and predicting scenes based on existing visual content

Video2World models

Model versions: 5B and 13B parameters
Main features: generate world simulations from text combined with image/video input
Use cases: steering the generation or modification of visual content with a text description
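
The choice between the two model types can be summarized as a small decision rule. The following is a sketch only: choose_model is a hypothetical helper, and the returned strings are descriptive labels from this article, not official checkpoint names.

```python
def choose_model(has_text_prompt: bool, prefer_large: bool = False) -> str:
    """Pick a model type and size from the inputs that are available.

    Video2World (5B/13B) takes text plus image/video input;
    the base models (4B/12B) take image/video input only.
    """
    if has_text_prompt:
        return "13B Video2World" if prefer_large else "5B Video2World"
    return "12B base" if prefer_large else "4B base"


if __name__ == "__main__":
    # Visual input only: a base model is sufficient.
    print(choose_model(has_text_prompt=False))
    # Text prompt plus visual input: a Video2World model is needed.
    print(choose_model(has_text_prompt=True, prefer_large=True))
```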

4. Generation limits and performance

Generates video sequences of up to 33 frames
Accepts a single image or 9 frames of video as input
Resolution is fixed at 1024x640
Inference time on an H100 GPU:
- 4B base model: approximately 62 seconds
- 12B base model: approximately 119 seconds
- 5B Video2World model: approximately 73 seconds
- 13B Video2World model: approximately 150 seconds
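
The limits and timings above can be turned into a quick pre-flight check and a rough time budget. This is a sketch under stated assumptions: the helper names are hypothetical, the timings are treated as the cost of one generation, and clips are assumed to run one after another on a single H100.

```python
# Figures copied from the list above.
H100_SECONDS = {"4B": 62, "12B": 119, "5B": 73, "13B": 150}
ALLOWED_INPUT_FRAMES = (1, 9)  # a single image or 9 frames of video
MAX_OUTPUT_FRAMES = 33
RESOLUTION = (1024, 640)


def check_request(input_frames, output_frames, width, height):
    """Return a list of problems; an empty list means the request fits the limits."""
    problems = []
    if input_frames not in ALLOWED_INPUT_FRAMES:
        problems.append(f"input must be 1 or 9 frames, got {input_frames}")
    if not 1 <= output_frames <= MAX_OUTPUT_FRAMES:
        problems.append(f"output is limited to {MAX_OUTPUT_FRAMES} frames")
    if (width, height) != RESOLUTION:
        problems.append(f"resolution is fixed at {RESOLUTION[0]}x{RESOLUTION[1]}")
    return problems


def estimate_batch_seconds(model_size, num_clips):
    """Rough total time, assuming sequential generation on one H100."""
    return H100_SECONDS[model_size] * num_clips


if __name__ == "__main__":
    print(check_request(input_frames=9, output_frames=33, width=1024, height=640))
    print(estimate_batch_seconds("12B", num_clips=10), "seconds")
```

Actual timings depend on the hardware, the offloading strategy, and the software version, so the estimate is only a planning aid.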

5. Memory optimization strategy

Cosmos offers several model offloading strategies that reduce GPU memory usage:

No offloading: the 4B model requires 31.3GB and the 12B model requires 47.5GB
Full offloading: usage drops to 18.7GB for the 4B model and 27.4GB for the 12B model
The Video2World models offer similar optimization options
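
These figures make it easy to check whether a given GPU can host a model and whether offloading is required. The sketch below uses only the numbers quoted above; pick_offloading is a hypothetical helper, and the labels "none" and "full" are illustrative names, not actual command-line flags.

```python
# GPU memory requirements in GB, copied from the figures above.
MEMORY_GB = {
    "4B": {"none": 31.3, "full": 18.7},
    "12B": {"none": 47.5, "full": 27.4},
}


def pick_offloading(model_size, gpu_memory_gb):
    """Return "none", "full", or None if the model does not fit at all.

    Assumes the quoted figures are peak usage; real deployments
    usually need some extra headroom on top of them.
    """
    need = MEMORY_GB[model_size]
    if gpu_memory_gb >= need["none"]:
        return "none"
    if gpu_memory_gb >= need["full"]:
        return "full"
    return None


if __name__ == "__main__":
    for size in MEMORY_GB:
        print(size, "on a 24GB GPU:", pick_offloading(size, 24))
```

Intermediate strategies that offload only some components would fall between the two extremes, but their memory figures are not listed here.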

6. Safety features

Built-in safety guardrail system that cannot be disabled
Automatic detection and blurring of faces
Content safety filtering ensures that generated results comply with safety standards