Video generation models as world simulators
We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions, and aspect ratios. We use a Transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, can generate one minute of high-fidelity video. Our results suggest that scaling up video generation models is a promising path toward building general-purpose simulators of the physical world.
This technical report focuses on (1) our method for turning visual data of all types into a unified representation that enables large-scale training of generative models, and (2) a qualitative evaluation of Sora's capabilities and limitations. Model and implementation details are not included in this report.
Much prior work has studied generative modeling of video data using a variety of methods, including recurrent networks [1,2,3], generative adversarial networks [4,5,6,7], autoregressive Transformers [8,9], and diffusion models [10,11,12]. These works often focus on a narrow category of visual data, on shorter videos, or on videos of a fixed size. Sora is a generalist model of visual data: it can generate videos and images spanning diverse durations, aspect ratios, and resolutions, up to a full minute of high-definition video.
Turning visual data into patches
We take inspiration from large language models (LLMs), which acquire generalist capabilities by training on Internet-scale data [13,14]. The success of the LLM paradigm is enabled in part by tokens, which elegantly unify diverse modalities of text: code, mathematics, and various natural languages. In this work, we consider how generative models of visual data can inherit such benefits. Whereas LLMs have text tokens, Sora has visual patches. Patches have previously been shown to be an effective representation for models of visual data [15,16,17,18]. We find that patches are a highly scalable and effective representation for training generative models on diverse types of videos and images.
At a high level, we turn videos into patches by first compressing them into a lower-dimensional latent space [19] and then decomposing the representation into spacetime patches.
Video compression network
We train a network that reduces the dimensionality of visual data [20]. This network takes raw video as input and outputs a latent representation that is compressed both temporally and spatially. Sora is trained on, and subsequently generates videos within, this compressed latent space. We also train a corresponding decoder model that maps generated latents back to pixel space.
Spacetime latent patches
Given a compressed input video, we extract a sequence of spacetime patches that act as Transformer tokens. This scheme also works for images, since an image is just a video with a single frame. Our patch-based representation enables Sora to train on videos and images of variable resolutions, durations, and aspect ratios. At inference time, we can control the size of generated videos by arranging randomly initialized patches in an appropriately sized grid.
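As a rough illustration of the patch extraction described above, the sketch below splits a latent video array into non-overlapping spacetime patches and flattens each one into a token. The latent shape, the patch sizes, and the function name `to_spacetime_patches` are assumptions made for this example; the report does not specify them.

```python
import numpy as np

def to_spacetime_patches(latent, pt, ph, pw):
    """Split a latent video of shape (T, H, W, C) into flattened spacetime patches.

    Returns an array of shape (num_patches, pt * ph * pw * C). The patch sizes
    pt, ph, pw are illustrative; the report does not state the values used.
    """
    T, H, W, C = latent.shape
    if T % pt or H % ph or W % pw:
        raise ValueError("latent dimensions must be divisible by the patch size")
    # Expose the patch grid: (T/pt, pt, H/ph, ph, W/pw, pw, C).
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Bring the three grid axes to the front, then the within-patch axes.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # One row per patch: this sequence plays the role of Transformer tokens.
    return x.reshape(-1, pt * ph * pw * C)

# A video latent with 8 frames, and an image treated as a one-frame video.
video_latent = np.zeros((8, 32, 32, 4), dtype=np.float32)
image_latent = np.zeros((1, 32, 32, 4), dtype=np.float32)

video_tokens = to_spacetime_patches(video_latent, pt=2, ph=2, pw=2)
image_tokens = to_spacetime_patches(image_latent, pt=1, ph=2, pw=2)
print(video_tokens.shape)  # (1024, 32)
print(image_tokens.shape)  # (256, 16)
```

The same function handles both inputs; only the number of tokens changes, which is what lets a single sequence model consume videos and images of different sizes.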
Scaling Transformers for video generation
Sora is a diffusion model [21,22,23,24,25]: given noisy input patches (and conditioning information such as text prompts), it is trained to predict the original "clean" patches. Importantly, Sora is a diffusion Transformer. Transformers have demonstrated remarkable scaling properties across a variety of domains, including language modeling [13,14], computer vision [15,16,17,18], and image generation [27,28,29].
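To make the training setup concrete, here is a minimal sketch of one denoising training example: clean patches are corrupted with Gaussian noise, and the model's prediction is scored against the clean patches. The variance-preserving noising rule, the mean-squared-error loss, and the placeholder `identity_model` are assumptions for illustration; the report does not describe Sora's noise schedule, loss, or conditioning mechanism.

```python
import numpy as np

def add_noise(clean, noise_level, rng):
    """Corrupt clean patches. noise_level in [0, 1]: 0 = clean, 1 = pure noise.

    Uses a variance-preserving mix (an assumption, not taken from the report).
    """
    noise = rng.standard_normal(clean.shape)
    signal_scale = np.sqrt(1.0 - noise_level)
    noise_scale = np.sqrt(noise_level)
    return signal_scale * clean + noise_scale * noise

def denoising_loss(model, clean, text_embedding, noise_level, rng):
    """Mean squared error between the model's prediction and the clean patches."""
    noisy = add_noise(clean, noise_level, rng)
    predicted_clean = model(noisy, text_embedding, noise_level)
    return float(np.mean((predicted_clean - clean) ** 2))

# Placeholder "model": returns its noisy input unchanged. A real model would be
# a Transformer over patch tokens conditioned on the text embedding.
def identity_model(noisy_patches, text_embedding, noise_level):
    return noisy_patches

rng = np.random.default_rng(0)
clean_patches = rng.standard_normal((1024, 32))  # (num_patches, patch_dim)
text_embedding = np.zeros(16)                    # stand-in for a prompt encoding

low = denoising_loss(identity_model, clean_patches, text_embedding, 0.1, rng)
high = denoising_loss(identity_model, clean_patches, text_embedding, 0.9, rng)
print(low < high)  # the untrained placeholder does worse at higher noise levels
```

Training would minimize this loss over many videos, prompts, and noise levels; the placeholder only shows where the model sits in the computation.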
In this work, we find that diffusion Transformers also scale effectively as video models. Below, we compare video samples generated with fixed seeds and inputs as training progresses; sample quality improves markedly as training compute increases.
Variable durations, resolutions, and aspect ratios
Past approaches to image and video generation typically resize, crop, or trim videos to a standard size, e.g., 4-second videos at 256x256 resolution. We find that training on data at its native size instead provides several benefits.
Sampling flexibility
Sora can sample widescreen 1920x1080p videos, vertical 1080x1920 videos, and everything in between. This lets Sora create content for different devices directly at their native aspect ratios. It also lets us quickly prototype content at smaller sizes before generating at full resolution, all with the same model.
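The arithmetic behind choosing an output size can be sketched as follows: the target width, height, and frame count determine how large a grid of noise patches to lay out. The compression factors and patch sizes in `PatchConfig` are hypothetical values chosen for illustration; the report does not disclose them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PatchConfig:
    # All four values are hypothetical; the report does not disclose them.
    spatial_downsample: int = 8    # pixels per latent cell along height/width
    temporal_downsample: int = 4   # frames per latent step
    patch_hw: int = 2              # latent cells per patch along height/width
    patch_t: int = 1               # latent steps per patch

def patch_grid(width, height, frames, cfg=PatchConfig()):
    """Return the (time, rows, cols) grid of noise patches for a target size."""
    step_hw = cfg.spatial_downsample * cfg.patch_hw
    step_t = cfg.temporal_downsample * cfg.patch_t
    # Round up so the grid covers the requested size; the decoded video would
    # then be cropped back to the exact target (an assumption of this sketch).
    rows = -(-height // step_hw)
    cols = -(-width // step_hw)
    steps = -(-frames // step_t)
    return steps, rows, cols

for name, (w, h) in {
    "widescreen": (1920, 1080),
    "vertical": (1080, 1920),
    "square preview": (512, 512),
}.items():
    t, r, c = patch_grid(w, h, frames=64)
    print(f"{name}: grid {t} x {r} x {c} = {t * r * c} patches")
```

Under these assumed settings, the widescreen and vertical targets use the same number of patches in transposed layouts, while the smaller preview uses far fewer, which is why prototyping at a smaller size is cheaper.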
Improved framing and composition
We find empirically that training on videos at their native aspect ratios improves composition and framing. We compare Sora against a version of our model that crops all training videos to be square, which is common practice when training generative models. The model trained on square crops (left) sometimes generates videos in which the subject is only partially in view. In comparison, videos from Sora (right) have improved framing.
Language understanding
Training text-to-video generation systems requires a large number of videos with corresponding text captions. We apply the re-captioning technique introduced in DALL-E 3 [30] to videos. We first train a highly descriptive captioner model and then use it to produce text captions for all videos in our training set. We find that training on highly descriptive video captions improves text fidelity as well as the overall quality of the videos.
As with DALL-E 3, we also use GPT to turn short user prompts into longer, detailed captions that are sent to the video model. This enables Sora to generate high-quality videos that accurately follow user prompts.
Examples of language understanding
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-woman-wearing-blue-jeans-and-a-white-t-shirt- taking-a-pleasant-stroll-in-mumbai-india-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-woman-wearing-blue-jeans-and-a-white-t-shirt- taking-a-pleasant-stroll-in-mumbai-india-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-woman-wearing-blue-jeans-and-a-white-t-shirt- taking-a-pleasant-stroll-in-mumbai-india-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-woman-wearing-blue-jeans-and-a-white-t-shirt- taking-a-pleasant-stroll-in-johannesburg-south-africa-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-woman-wearing-blue-jeans-and-a-white-t-shirt- taking-a-pleasant-stroll-in-johannesburg-south-africa-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-woman-wearing-blue-jeans-and-a-white-t-shirt- taking-a-pleasant-stroll-in-johannesburg-south-africa-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-woman-wearing-blue-jeans-and-a-white-t-shirt- taking-a-pleasant-stroll-in-antarctica-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-woman-wearing-blue-jeans-and-a-white-t-shirt- taking-a-pleasant-stroll-in-antarctica-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-woman-wearing-blue-jeans-and-a-white-t-shirt- taking-a-pleasant-stroll-in-antarctica-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-woman-wearing-a-green-dress-and-a-sun-hat- taking-a-pleasant-stroll-in-mumbai-india-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-woman-wearing-a-green-dress-and-a-sun-hat- taking-a-pleasant-stroll-in-mumbai-india-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-woman-wearing-a-green-dress-and-a-sun-hat- taking-a-pleasant-stroll-in-mumbai-india-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-woman-wearing-a-green-dress-and-a-sun-hat- taking-a-pleasant-stroll-in-johannesburg-south-africa-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-woman-wearing-a-green-dress-and-a-sun-hat- taking-a-pleasant-stroll-in-johannesburg-south-africa-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-woman-wearing-a-green-dress-and-a-sun-hat- taking-a-pleasant-stroll-in-johannesburg-south-africa-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-woman-wearing-a-green-dress-and-a-sun-hat- taking-a-pleasant-stroll-in-antarctica-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-woman-wearing-a-green-dress-and-a-sun-hat- taking-a-pleasant-stroll-in-antarctica-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-woman-wearing-a-green-dress-and-a-sun-hat- taking-a-pleasant-stroll-in-antarctica-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-woman-wearing-purple-overalls-and-cowboy-boots -taking-a-pleasant-stroll-in-mumbai-india-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-woman-wearing-purple-overalls-and-cowboy-boots -taking-a-pleasant-stroll-in-mumbai-india-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-woman-wearing-purple-overalls-and-cowboy-boots -taking-a-pleasant-stroll-in-mumbai-india-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-woman-wearing-purple-overalls-and-cowboy-boots -taking-a-pleasant-stroll-in-johannesburg-south-africa-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-woman-wearing-purple-overalls-and-cowboy-boots -taking-a-pleasant-stroll-in-johannesburg-south-africa-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-woman-wearing-purple-overalls-and-cowboy-boots -taking-a-pleasant-stroll-in-johannesburg-south-africa-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-woman-wearing-purple-overalls-and-cowboy-boots -taking-a-pleasant-stroll-in-antarctica-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-woman-wearing-purple-overalls-and-cowboy-boots -taking-a-pleasant-stroll-in-antarctica-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-woman-wearing-purple-overalls-and-cowboy-boots -taking-a-pleasant-stroll-in-antarctica-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-old-man-wearing-blue-jeans-and-a-white-t- shirt-taking-a-pleasant-stroll-in-mumbai-india-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-old-man-wearing-blue-jeans-and-a-white-t- shirt-taking-a-pleasant-stroll-in-mumbai-india-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-old-man-wearing-blue-jeans-and-a-white-t- shirt-taking-a-pleasant-stroll-in-mumbai-india-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-old-man-wearing-blue-jeans-and-a-white-t- shirt-taking-a-pleasant-stroll-in-johannesburg-south-africa-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-old-man-wearing-blue-jeans-and-a-white-t- shirt-taking-a-pleasant-stroll-in-Johannesburg-South-Africa-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-old-man-wearing-blue-jeans-and-a-white-t- shirt-taking-a-pleasant-stroll-in-johannesburg-south-africa-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-old-man-wearing-blue-jeans-and-a-white-t- shirt-taking-a-pleasant-stroll-in-antarctica-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-old-man-wearing-blue-jeans-and-a-white-t- shirt-taking-a-pleasant-stroll-in-antarctica-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-old-man-wearing-blue-jeans-and-a-white-t- shirt-taking-a-pleasant-stroll-in-antarctica-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-old-man-wearing-a-green-dress-and-a-sun-hat- taking-a-pleasant-stroll-in-mumbai-india-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-old-man-wearing-a-green-dress-and-a-sun-hat- taking-a-pleasant-stroll-in-mumbai-india-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-old-man-wearing-a-green-dress-and-a-sun-hat- taking-a-pleasant-stroll-in-mumbai-india-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-old-man-wearing-a-green-dress-and-a-sun-hat- taking-a-pleasant-stroll-in-johannesburg-south-africa-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-old-man-wearing-a-green-dress-and-a-sun-hat- taking-a-pleasant-stroll-in-johannesburg-south-africa-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-old-man-wearing-a-green-dress-and-a-sun-hat- taking-a-pleasant-stroll-in-johannesburg-south-africa-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-old-man-wearing-a-green-dress-and-a-sun-hat- taking-a-pleasant-stroll-in-antarctica-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-old-man-wearing-a-green-dress-and-a-sun-hat- taking-a-pleasant-stroll-in-antarctica-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-old-man-wearing-a-green-dress-and-a-sun-hat- taking-a-pleasant-stroll-in-antarctica-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-old-man-wearing-purple-overalls-and-cowboy- boots-taking-a-pleasant-stroll-in-mumbai-india-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-old-man-wearing-purple-overalls-and-cowboy- boots-taking-a-pleasant-stroll-in-mumbai-india-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-old-man-wearing-purple-overalls-and-cowboy- boots-taking-a-pleasant-stroll-in-mumbai-india-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-old-man-wearing-purple-overalls-and-cowboy- boots-taking-a-pleasant-stroll-in-johannesburg-south-africa-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-old-man-wearing-purple-overalls-and-cowboy- boots-taking-a-pleasant-stroll-in-johannesburg-south-africa-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-old-man-wearing-purple-overalls-and-cowboy- boots-taking-a-pleasant-stroll-in-johannesburg-south-africa-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-old-man-wearing-purple-overalls-and-cowboy- boots-taking-a-pleasant-stroll-in-antarctica-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-old-man-wearing-purple-overalls-and-cowboy- boots-taking-a-pleasant-stroll-in-antarctica-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-old-man-wearing-purple-overalls-and-cowboy- boots-taking-a-pleasant-stroll-in-antarctica-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-toy-robot-wearing-blue-jeans-and-a-white-t- shirt-taking-a-pleasant-stroll-in-mumbai-india-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-toy-robot-wearing-blue-jeans-and-a-white-t- shirt-taking-a-pleasant-stroll-in-mumbai-india-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-toy-robot-wearing-blue-jeans-and-a-white-t- shirt-taking-a-pleasant-stroll-in-mumbai-india-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-toy-robot-wearing-blue-jeans-and-a-white-t- shirt-taking-a-pleasant-stroll-in-johannesburg-south-africa-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-toy-robot-wearing-blue-jeans-and-a-white-t- shirt-taking-a-pleasant-stroll-in-Johannesburg-South-Africa-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-toy-robot-wearing-blue-jeans-and-a-white-t- shirt-taking-a-pleasant-stroll-in-johannesburg-south-africa-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-toy-robot-wearing-blue-jeans-and-a-white-t- shirt-taking-a-pleasant-stroll-in-antarctica-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-toy-robot-wearing-blue-jeans-and-a-white-t- shirt-taking-a-pleasant-stroll-in-antarctica-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-toy-robot-wearing-blue-jeans-and-a-white-t- shirt-taking-a-pleasant-stroll-in-antarctica-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-toy-robot-wearing-a-green-dress-and-a-sun-hat- taking-a-pleasant-stroll-in-mumbai-india-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-toy-robot-wearing-a-green-dress-and-a-sun-hat- taking-a-pleasant-stroll-in-mumbai-india-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-toy-robot-wearing-a-green-dress-and-a-sun-hat- taking-a-pleasant-stroll-in-mumbai-india-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-toy-robot-wearing-a-green-dress-and-a-sun-hat- taking-a-pleasant-stroll-in-johannesburg-south-africa-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-toy-robot-wearing-a-green-dress-and-a-sun-hat- taking-a-pleasant-stroll-in-johannesburg-south-africa-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-toy-robot-wearing-a-green-dress-and-a-sun-hat- taking-a-pleasant-stroll-in-johannesburg-south-africa-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-toy-robot-wearing-a-green-dress-and-a-sun-hat- taking-a-pleasant-stroll-in-antarctica-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-toy-robot-wearing-a-green-dress-and-a-sun-hat- taking-a-pleasant-stroll-in-antarctica-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-toy-robot-wearing-a-green-dress-and-a-sun-hat- taking-a-pleasant-stroll-in-antarctica-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-toy-robot-wearing-purple-overalls-and-cowboy- boots-taking-a-pleasant-stroll-in-mumbai-india-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-toy-robot-wearing-purple-overalls-and-cowboy- boots-taking-a-pleasant-stroll-in-mumbai-india-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-toy-robot-wearing-purple-overalls-and-cowboy- boots-taking-a-pleasant-stroll-in-mumbai-india-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-toy-robot-wearing-purple-overalls-and-cowboy- boots-taking-a-pleasant-stroll-in-johannesburg-south-africa-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-toy-robot-wearing-purple-overalls-and-cowboy- boots-taking-a-pleasant-stroll-in-Johannesburg-South-Africa-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-toy-robot-wearing-purple-overalls-and-cowboy- boots-taking-a-pleasant-stroll-in-johannesburg-south-africa-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-toy-robot-wearing-purple-overalls-and-cowboy- boots-taking-a-pleasant-stroll-in-antarctica-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-toy-robot-wearing-purple-overalls-and-cowboy- boots-taking-a-pleasant-stroll-in-antarctica-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/a-toy-robot-wearing-purple-overalls-and-cowboy- boots-taking-a-pleasant-stroll-in-antarctica-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-adorable-kangaroo-wearing-blue-jeans-and-a- white-t-shirt-taking-a-pleasant-stroll-in-mumbai-india-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-adorable-kangaroo-wearing-blue-jeans-and-a- white-t-shirt-taking-a-pleasant-stroll-in-mumbai-india-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-adorable-kangaroo-wearing-blue-jeans-and-a- white-t-shirt-taking-a-pleasant-stroll-in-mumbai-india-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-adorable-kangaroo-wearing-blue-jeans-and-a- white-t-shirt-taking-a-pleasant-stroll-in-johannesburg-south-africa-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-adorable-kangaroo-wearing-blue-jeans-and-a- white-t-shirt-taking-a-pleasant-stroll-in-Johannesburg-South-Africa-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-adorable-kangaroo-wearing-blue-jeans-and-a- white-t-shirt-taking-a-pleasant-stroll-in-Johannesburg-South-Africa-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-adorable-kangaroo-wearing-blue-jeans-and-a- white-t-shirt-taking-a-pleasant-stroll-in-antarctica-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-adorable-kangaroo-wearing-blue-jeans-and-a- white-t-shirt-taking-a-pleasant-stroll-in-antarctica-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-adorable-kangaroo-wearing-blue-jeans-and-a- white-t-shirt-taking-a-pleasant-stroll-in-antarctica-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-adorable-kangaroo-wearing-a-green-dress-and-a -sun-hat-taking-a-pleasant-stroll-in-mumbai-india-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-adorable-kangaroo-wearing-a-green-dress-and-a -sun-hat-taking-a-pleasant-stroll-in-mumbai-india-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-adorable-kangaroo-wearing-a-green-dress-and-a -sun-hat-taking-a-pleasant-stroll-in-mumbai-india-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-adorable-kangaroo-wearing-a-green-dress-and-a -sun-hat-taking-a-pleasant-stroll-in-johannesburg-south-africa-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-adorable-kangaroo-wearing-a-green-dress-and-a -sun-hat-taking-a-pleasant-stroll-in-johannesburg-south-africa-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-adorable-kangaroo-wearing-a-green-dress-and-a -sun-hat-taking-a-pleasant-stroll-in-johannesburg-south-africa-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-adorable-kangaroo-wearing-a-green-dress-and-a -sun-hat-taking-a-pleasant-stroll-in-antarctica-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-adorable-kangaroo-wearing-a-green-dress-and-a -sun-hat-taking-a-pleasant-stroll-in-antarctica-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-adorable-kangaroo-wearing-a-green-dress-and-a -sun-hat-taking-a-pleasant-stroll-in-antarctica-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-adorable-kangaroo-wearing-purple-overalls-and -cowboy-boots-taking-a-pleasant-stroll-in-mumbai-india-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-adorable-kangaroo-wearing-purple-overalls-and -cowboy-boots-taking-a-pleasant-stroll-in-mumbai-india-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-adorable-kangaroo-wearing-purple-overalls-and -cowboy-boots-taking-a-pleasant-stroll-in-mumbai-india-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-adorable-kangaroo-wearing-purple-overalls-and -cowboy-boots-taking-a-pleasant-stroll-in-johannesburg-south-africa-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-adorable-kangaroo-wearing-purple-overalls-and -cowboy-boots-taking-a-pleasant-stroll-in-johannesburg-south-africa-during-a-winter-storm.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-adorable-kangaroo-wearing-purple-overalls-and -cowboy-boots-taking-a-pleasant-stroll-in-johannesburg-south-africa-during-a-colorful-festival.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-adorable-kangaroo-wearing-purple-overalls-and -cowboy-boots-taking-a-pleasant-stroll-in-antarctica-during-a-beautiful-sunset.mp4[/videopack]
[videopack width="640" height="360" downloadlink="true"]https://cdn.openai.com/tmp/s/an-adorable-kangaroo-wearing-purple-overalls-and -cowboy-boots-taking-a-pleasant-stroll-in-antarctica-during-a-winter-storm.mp4[/videopack]
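The clips listed above appear to follow a grid of prompt variations: a subject, an outfit, a location, and a condition. A minimal sketch of how such a grid of prompts can be enumerated is shown below. The phrase lists are read off the file names above; the helper names `prompt_grid` and `slug` and the slug format are assumptions made for illustration.

```python
from itertools import product

SUBJECTS = ["a woman", "an old man", "a toy robot", "an adorable kangaroo"]
OUTFITS = [
    "blue jeans and a white t-shirt",
    "a green dress and a sun hat",
    "purple overalls and cowboy boots",
]
LOCATIONS = ["mumbai india", "johannesburg south africa", "antarctica"]
CONDITIONS = ["a beautiful sunset", "a winter storm", "a colorful festival"]

def prompt_grid():
    """Yield every subject/outfit/location/condition combination as a prompt."""
    for subject, outfit, location, condition in product(
        SUBJECTS, OUTFITS, LOCATIONS, CONDITIONS
    ):
        yield (
            f"{subject} wearing {outfit} taking a pleasant stroll "
            f"in {location} during {condition}"
        )

def slug(prompt):
    """Turn a prompt into a hyphenated, lowercase file-name stem."""
    return prompt.lower().replace(" ", "-")

prompts = list(prompt_grid())
print(len(prompts))      # 4 * 3 * 3 * 3 = 108 combinations
print(slug(prompts[0]))
```

Varying one factor at a time in this way makes it easy to see whether a change in the prompt produces the corresponding change in the video. The list above contains 107 of the 108 combinations.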
Prompting with images and videos
All of the results above and on our website are text-to-video samples. However, Sora can also be prompted with other inputs, such as existing images or videos. This capability enables Sora to perform a wide range of image and video editing tasks, such as creating seamlessly looping videos, animating static images, and extending videos forward or backward in time.
Animating DALL-E images
Given an image and a prompt as input, Sora can generate videos. Below are example videos generated from DALL-E 2 [31] and DALL-E 3 [30] images.
Extending generated videos
Sora is also capable of extending videos, either forward or backward in time. Below are four videos that were all extended backward in time starting from the same segment of a generated video. As a result, each of the four videos begins differently, yet all four lead to the same ending.
We can use this method to extend a video both forward and backward to produce a seamless infinite loop.
Video-to-video editing
Diffusion models have enabled a wide range of methods for editing images and videos from text prompts. Below, we apply one of these methods, SDEdit [32], to Sora. This technique enables Sora to transform the styles and environments of input videos zero-shot, that is, without any task-specific examples.
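The core idea of SDEdit can be sketched in a few lines: instead of starting generation from pure noise, start from a partially noised copy of the input and denoise it under the new prompt. The code below is a toy illustration under stated assumptions: the `strength` parameter, the variance-preserving noising rule, and the `make_toy_step` stand-in for a text-conditioned sampler step are all hypothetical, and the report does not describe how SDEdit is integrated with Sora.

```python
import numpy as np

def sdedit(input_latent, denoise_step, strength, num_steps, rng):
    """Edit an input by partially noising it, then denoising under a new prompt.

    strength in (0, 1]: higher values add more noise, so the result follows the
    prompt more and the input less. denoise_step(x, level, next_level) is a
    placeholder for one step of a text-conditioned diffusion sampler.
    """
    if not 0.0 < strength <= 1.0:
        raise ValueError("strength must be in (0, 1]")
    # Noise levels run from `strength` down to 0 (clean).
    levels = np.linspace(strength, 0.0, num_steps + 1)
    noise = rng.standard_normal(input_latent.shape)
    # Start from a noised copy of the input instead of from pure noise.
    x = np.sqrt(1.0 - strength) * input_latent + np.sqrt(strength) * noise
    for level, next_level in zip(levels[:-1], levels[1:]):
        x = denoise_step(x, level, next_level)
    return x

def make_toy_step(target):
    """Toy stand-in for a sampler step: nudges x toward a fixed target."""
    def step(x, level, next_level):
        return x + (level - next_level) * (target - x)
    return step

rng = np.random.default_rng(0)
source = np.ones((256, 16))  # stand-in for latent patches of the input video
toy_step = make_toy_step(np.zeros_like(source))

light = sdedit(source, toy_step, strength=0.2, num_steps=10, rng=rng)
heavy = sdedit(source, toy_step, strength=0.8, num_steps=10, rng=rng)
light_change = float(np.mean((light - source) ** 2))
heavy_change = float(np.mean((heavy - source) ** 2))
print(light_change < heavy_change)  # a lower strength stays closer to the input
```

The strength setting expresses the trade-off behind this kind of editing: less noise preserves more of the original video's structure, while more noise gives the prompt more freedom to change style and environment.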
Connecting videos
We can also use Sora to gradually interpolate between two input videos, creating seamless transitions between videos with entirely different subjects and scene compositions. In the examples below, the video in the center interpolates between the corresponding videos on the left and right.
Image generation capabilities
Sora is also capable of generating images. We do this by arranging patches of Gaussian noise in a spatial grid with a temporal extent of one frame. The model can generate images of variable sizes, up to a resolution of 2048x2048.
A close-up portrait of a lady in the middle of an autumn day with amazing detail and a surprisingly shallow depth of field.
A vibrant coral reef with colorful fish and marine life weaving in and out.
Digital art of a young tiger under an apple tree, in a matte painting style with exquisite detail.
A snow-covered mountain village with cozy cottages and the Northern Lights, in exquisite detail, as if shot on a DSLR with a 50mm f/1.2 lens.
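The one-frame noise grid described for image generation can be illustrated with a short sketch. It only shows the shape bookkeeping: an image is treated as a video whose temporal extent is a single frame. The patch size of 16, the token dimension, and the function names are hypothetical, since the report gives none of these, and the sketch ignores any compression into a latent space that may happen before patching.

```python
import random


def noise_grid_shape(height, width, frames=1, patch=16):
    """Return the (frames, rows, cols) patch grid for a sample of the given size.

    `patch` is a hypothetical spatial patch size; the report does not state one.
    """
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of the patch size")
    return (frames, height // patch, width // patch)


def sample_noise_patches(shape, token_dim, seed=0):
    """Fill a patch grid with independent standard Gaussian noise tokens."""
    rng = random.Random(seed)
    frames, rows, cols = shape
    return [[[[rng.gauss(0.0, 1.0) for _ in range(token_dim)]
              for _ in range(cols)]
             for _ in range(rows)]
            for _ in range(frames)]


# An image is a "video" with a temporal extent of one frame.
max_image_grid = noise_grid_shape(2048, 2048, frames=1)  # (1, 128, 128)

# A small grid is sampled here so the example stays cheap to run.
small_grid = noise_grid_shape(64, 32, frames=1)
noise = sample_noise_patches(small_grid, token_dim=8)
```

Changing `height` and `width` changes only the number of rows and columns in the grid, which is how a single model could in principle serve many image sizes and aspect ratios.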
Emerging simulation capabilities
We find that video models exhibit a number of interesting emergent capabilities when trained at scale. These capabilities enable Sora to simulate some aspects of people, animals, and environments from the physical world. They emerge without any explicit inductive biases for 3D, objects, and so on; they are purely phenomena of scale.
3D consistency. Sora can generate videos with dynamic camera motion. As the camera shifts and rotates, people and scene elements move consistently through three-dimensional space.
Long-range coherence and object permanence. Maintaining temporal consistency has been a significant challenge when sampling long videos. We find that Sora is often, though not always, able to model both short- and long-range dependencies effectively. For example, our model can keep people, animals, and objects present even when they are occluded or leave the frame. Likewise, it can generate multiple shots of the same character in a single sample, maintaining that character's appearance throughout the video.
Interacting with the world. Sora can sometimes simulate actions that affect the state of the world in simple ways. For example, a painter can leave new strokes on a canvas that persist over time, or a person can eat a burger and leave bite marks in it.
Simulating digital worlds. Sora can also simulate artificial processes, such as video games. It can control the player in Minecraft with a basic policy while simultaneously rendering the world and its dynamics in high fidelity. These capabilities can be elicited zero-shot by prompting Sora with captions that mention "Minecraft."
These capabilities suggest that continued scaling of video models is a promising path toward highly capable simulators of the physical and digital worlds, and of the objects, animals, and people that live within them.
Discussion
As a simulator, Sora currently has many limitations. For example, it does not accurately model the physics of many basic interactions, such as glass shattering. Other interactions, such as eating food, do not always produce the correct changes in an object's state. Other common failure modes of the model, such as incoherencies that develop in long-duration samples or the spontaneous appearance of objects, are detailed on the OpenAI Sora introduction page.
We believe that Sora's current capabilities demonstrate that continuing to scale video models is a promising path toward developing capable simulators of the physical and digital worlds and of the objects, animals, and people within them.
References
1. Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." International Conference on Machine Learning. PMLR, 2015.
2. Chiappa, Silvia, et al. "Recurrent environment simulators." arXiv preprint arXiv:1704.02254 (2017).
3. Ha, David, and Jürgen Schmidhuber. "World models." arXiv preprint arXiv:1803.10122 (2018).
4. Vondrick, Carl, Hamed Pirsiavash, and Antonio Torralba. "Generating videos with scene dynamics." Advances in Neural Information Processing Systems 29 (2016).
5. Tulyakov, Sergey, et al. "MoCoGAN: Decomposing motion and content for video generation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
6. Clark, Aidan, Jeff Donahue, and Karen Simonyan. "Adversarial video generation on complex datasets." arXiv preprint arXiv:1907.06571 (2019).
7. Brooks, Tim, et al. "Generating long videos of dynamic scenes." Advances in Neural Information Processing Systems 35 (2022): 31769-31781.
8. Yan, Wilson, et al. "VideoGPT: Video generation using VQ-VAE and transformers." arXiv preprint arXiv:2104.10157 (2021).
9. Wu, Chenfei, et al. "NÜWA: Visual synthesis pre-training for neural visual world creation." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022.
10. Ho, Jonathan, et al. "Imagen Video: High definition video generation with diffusion models." arXiv preprint arXiv:2210.02303 (2022).
11. Blattmann, Andreas, et al. "Align your latents: High-resolution video synthesis with latent diffusion models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
12. Gupta, Agrim, et al. "Photorealistic video generation with diffusion models." arXiv preprint arXiv:2312.06662 (2023).
13. Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017).
14. Brown, Tom, et al. "Language models are few-shot learners." Advances in Neural Information Processing Systems 33 (2020): 1877-1901.
15. Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).
16. Arnab, Anurag, et al. "ViViT: A video vision transformer." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
17. He, Kaiming, et al. "Masked autoencoders are scalable vision learners." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
18. Dehghani, Mostafa, et al. "Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution." arXiv preprint arXiv:2307.06304 (2023).
19. Rombach, Robin, et al. "High-resolution image synthesis with latent diffusion models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
20. Kingma, Diederik P., and Max Welling. "Auto-encoding variational Bayes." arXiv preprint arXiv:1312.6114 (2013).
21. Sohl-Dickstein, Jascha, et al. "Deep unsupervised learning using nonequilibrium thermodynamics." International Conference on Machine Learning. PMLR, 2015.
22. Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in Neural Information Processing Systems 33 (2020): 6840-6851.
23. Nichol, Alexander Quinn, and Prafulla Dhariwal. "Improved denoising diffusion probabilistic models." International Conference on Machine Learning. PMLR, 2021.
24. Dhariwal, Prafulla, and Alexander Quinn Nichol. "Diffusion models beat GANs on image synthesis." Advances in Neural Information Processing Systems. 2021.
25. Karras, Tero, et al. "Elucidating the design space of diffusion-based generative models." Advances in Neural Information Processing Systems 35 (2022): 26565-26577.
26. Peebles, William, and Saining Xie. "Scalable diffusion models with transformers." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
27. Chen, Mark, et al. "Generative pretraining from pixels." International Conference on Machine Learning. PMLR, 2020.
28. Ramesh, Aditya, et al. "Zero-shot text-to-image generation." International Conference on Machine Learning. PMLR, 2021.
29. Yu, Jiahui, et al. "Scaling autoregressive models for content-rich text-to-image generation." arXiv preprint arXiv:2206.10789 (2022).
30. Betker, James, et al. "Improving image generation with better captions." Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf (2023).
31. Ramesh, Aditya, et al. "Hierarchical text-conditional image generation with CLIP latents." arXiv preprint arXiv:2204.06125 (2022).
32. Meng, Chenlin, et al. "SDEdit: Guided image synthesis and editing with stochastic differential equations." arXiv preprint arXiv:2108.01073 (2021).
Authors
- Tim Brooks
- Bill Peebles
- Connor Holmes
- Will DePue
- Yufei Guo
- Li Jing
- David Schnurr
- Joe Taylor
- Troy Luhman
- Eric Luhman
- Clarence Wing Yin Ng
- Ricky Wang
- Aditya Ramesh
Citation
Please cite as OpenAI et al., and use the following bibtex for citation. https://openai.com/bibtex/videoworldsimulators2024.bib