General Introduction
StreamingT2V is an open-source project developed by the Picsart AI Research team for generating coherent, dynamic, and extendable long videos from text descriptions. It uses an autoregressive approach that produces temporally consistent video closely following the descriptive text while maintaining high per-frame image quality. It can generate videos of up to 1200 frames (about two minutes in length), with the potential to scale to longer durations. The technique is not tied to any specific Text2Video base model, so improvements in base models will further enhance the quality of the generated video.
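To make the autoregressive idea concrete, here is a minimal conceptual sketch of chunk-by-chunk generation, where each new chunk is conditioned on the last frames of the video produced so far. This is an illustration of the general technique only; `generate_chunk` and all of its parameters are hypothetical stand-ins, not StreamingT2V's actual API.

```python
from typing import List, Optional

Frame = List[float]  # placeholder type standing in for a video frame

def generate_chunk(prompt: str, num_frames: int,
                   cond: Optional[List[Frame]] = None) -> List[Frame]:
    # Stub standing in for a text-to-video diffusion model call.
    # The real system injects `cond` (frames from the previous chunk)
    # through a conditioning mechanism; here we return dummy frames.
    return [[0.0] for _ in range(num_frames)]

def generate_long_video(prompt: str, total_frames: int,
                        chunk_frames: int = 16,
                        cond_frames: int = 8) -> List[Frame]:
    # Autoregressive loop: condition each new chunk on the tail of the
    # video generated so far, so motion and appearance carry across
    # chunk boundaries and the result stays temporally consistent.
    video = generate_chunk(prompt, chunk_frames)
    while len(video) < total_frames:
        context = video[-cond_frames:]              # conditioning frames
        video += generate_chunk(prompt, chunk_frames, cond=context)
    return video[:total_frames]
```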
StreamingT2V Online Demo
Feature List
Supports generation of videos up to 1200 frames, about two minutes in length
Maintains temporal consistency and high per-frame image quality
Generates dynamic video that closely follows the text description
Supports multiple base models to enhance the quality of the generated video
Supports both Text-to-Video and Image-to-Video generation
Provides a Gradio online demo
Usage Instructions
Clone the project repository and install the required environment
Download the model weights and place them in the correct directory
Run the sample code for text-to-video or image-to-video generation (see the sketch after this list)
View the project page for detailed results and demos
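The following Python sketch shows what this flow might look like end to end. The repository URL is the official one, but the requirements file, the t2v_enhanced working directory, and the inference.py flags are assumptions based on common usage; consult the project README for the exact commands.

```python
# Sketch of the setup-and-run flow. The repository URL is real, but the
# paths and CLI flags below are assumptions, not a verified interface.
import subprocess

# 1. Clone the repository and install its Python dependencies.
subprocess.run(
    ["git", "clone", "https://github.com/Picsart-AI-Research/StreamingT2V.git"],
    check=True,
)
subprocess.run(
    ["pip", "install", "-r", "requirements.txt"],  # assumed requirements file
    cwd="StreamingT2V/t2v_enhanced",               # assumed project layout
    check=True,
)

# 2. Run text-to-video inference (flag names are illustrative).
subprocess.run(
    ["python", "inference.py",
     "--prompt=A cat running on the street",
     "--num_frames=24"],
    cwd="StreamingT2V/t2v_enhanced",
    check=True,
)
```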
Inference Time
ModelscopeT2V as a base model
Frames | Fast preview inference time (256×256) | Final result inference time (720×720)
---|---|---
24 frames | 40 seconds | 165 seconds
56 frames | 75 seconds | 360 seconds
80 frames | 110 seconds | 525 seconds
240 frames | 340 seconds | 1610 seconds (about 27 minutes)
600 frames | 860 seconds | 5128 seconds (about 85 minutes)
1200 frames | 1710 seconds (about 28 minutes) | 10225 seconds (about 170 minutes)
AnimateDiff as a base model
Frames | Fast preview inference time (256×256) | Final result inference time (720×720)
---|---|---
24 frames | 50 seconds | 180 seconds
56 frames | 85 seconds | 370 seconds
80 frames | 120 seconds | 535 seconds
240 frames | 350 seconds | 1620 seconds (about 27 minutes)
600 frames | 870 seconds | 5138 seconds (about 85 minutes)
1200 frames | 1720 seconds (about 28 minutes) | 10235 seconds (about 170 minutes)
SVD as a base model
Frames | Fast preview inference time (256×256) | Final result inference time (720×720)
---|---|---
24 frames | 80 seconds | 210 seconds
56 frames | 115 seconds | 400 seconds
80 frames | 150 seconds | 565 seconds
240 frames | 380 seconds | 1650 seconds (about 27 minutes)
600 frames | 900 seconds | 5168 seconds (about 86 minutes)
1200 frames | 1750 seconds (about 29 minutes) | 10265 seconds (about 171 minutes)
All measurements were taken on an NVIDIA A100 (80 GB) GPU. Randomized blending is used when the number of frames exceeds 80; for randomized blending, the `chunk_size` and `overlap_size` values are set to 112 and 32, respectively.
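To illustrate what these two parameters control, here is a minimal sketch of chunked processing with randomized blending: the video is split into chunks of `chunk_size` frames that share `overlap_size` frames, and when the processed chunks are stitched back together a random cut point inside each overlap decides where one chunk hands over to the next, masking the seam. This is a simplified illustration under those assumptions, not StreamingT2V's actual implementation (which applies blending inside the diffusion denoising loop).

```python
import random
from typing import List

def split_into_chunks(frames: List, chunk_size: int = 112,
                      overlap_size: int = 32) -> List[List]:
    # Split a long frame sequence into chunks of `chunk_size` frames,
    # where consecutive chunks share `overlap_size` frames.
    stride = chunk_size - overlap_size
    return [frames[i:i + chunk_size]
            for i in range(0, max(len(frames) - overlap_size, 1), stride)]

def randomized_blend(chunks: List[List], overlap_size: int = 32) -> List:
    # Stitch processed chunks back together. In each shared overlap, a
    # random cut point decides how many frames come from the earlier
    # chunk and how many from the later one, hiding the chunk boundary.
    video = list(chunks[0])
    for chunk in chunks[1:]:
        cut = random.randint(0, overlap_size)    # random hand-over point
        video = video[:len(video) - (overlap_size - cut)] + chunk[cut:]
    return video

# Example: 240 frames -> overlapping 112-frame chunks -> blended back.
frames = list(range(240))
chunks = split_into_chunks(frames)
print(len(chunks), len(randomized_blend(chunks)))  # 3 240
```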