General Introduction
STAR (Spatial-Temporal Augmentation with Text-to-Video Models) is a video super-resolution framework jointly developed by Nanjing University, ByteDance and Southwest University. The project tackles key problems in real-world video super-resolution and achieves high-quality enhancement of video frames by exploiting the prior knowledge of text-to-video (T2V) diffusion models. The distinguishing feature of STAR is its ability to maintain spatial detail fidelity and temporal consistency at the same time, a balance that traditional GAN-based approaches often struggle to achieve. The project ships two implementations: models based on I2VGen-XL for light and heavy degradation, and a model based on CogVideoX-5B for heavy degradation, covering video-enhancement needs in different scenarios.
Feature List
- Supports super-resolution reconstruction for multiple types of video degradation (light and heavy)
- Automatic prompt generation, with support for tools such as Pllava to produce video descriptions
- Online demo platform available (HuggingFace Spaces)
- Supports 720x480 resolution video input
- Provides complete inference code and pre-trained models
- Integrates a Local Information Enhancement Module (LIEM) to improve reconstruction of fine image detail
- Supports batch video processing
- Offers flexible choices of model weights
Usage Guide
1. Environment configuration
First, set up the runtime environment as follows:
- Clone the code repository:
git clone https://github.com/NJU-PCALab/STAR.git
cd STAR
- Create and activate the conda environment:
conda create -n star python=3.10
conda activate star
pip install -r requirements.txt
sudo apt-get update && sudo apt-get install -y ffmpeg libsm6 libxext6
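As an optional sanity check (this assumes requirements.txt installs PyTorch, as the standard setup does), confirm that ffmpeg is on the PATH and that PyTorch can see the GPU:
ffmpeg -version | head -n 1
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"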
2. Model selection and downloading
STAR offers two versions of the model:
- I2VGen-XL-based models:
  - light_deg.pt: for lightly degraded videos
  - heavy_deg.pt: for heavily degraded videos
- CogVideoX-5B-based model:
  - Specialized for heavily degraded videos
  - Supports 720x480 resolution input only
Download the appropriate model weights from HuggingFace and place them in the pretrained_weight/ directory.
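For example, the weights can be fetched with the huggingface_hub CLI. The repository id below is an assumption based on the project's HuggingFace page; verify it against the links in the README:
pip install -U "huggingface_hub[cli]"
# repo id is illustrative; replace it with the one given in the STAR README
huggingface-cli download SherryX/STAR --local-dir pretrained_weight/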
3. Video processing flow
- Prepare test data:
  - Place the videos to be processed in the input/video/ directory
- Prompt preparation (three options; a sample prompt file is sketched at the end of this step):
  - No prompt
  - Automatic generation with a tool such as Pllava
  - Manually written video descriptions
- Configure processing parameters:
  - Modify the path configuration in video_super_resolution/scripts/inference_sr.sh (see the sketch after this list):
    - video_folder_path: input video path
    - txt_file_path: prompt file path
    - model_path: model weight path
    - save_dir: output save path
- Start inference:
bash video_super_resolution/scripts/inference_sr.sh
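The four variables above are ordinary shell variables set near the top of inference_sr.sh. A configured block might look like the following sketch (the variable names come from the script; the paths are illustrative only):
video_folder_path='input/video'
txt_file_path='input/text/prompt.txt'
model_path='pretrained_weight/light_deg.pt'
save_dir='results'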
Note: If you run into an out-of-memory (OOM) error, reduce the frame_length parameter in inference_sr.sh.
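If you write prompts manually, the file referenced by txt_file_path is plain text; the sketch below assumes a one-description-per-video layout, which this guide does not confirm — check the repository's sample inputs for the exact format expected:
a man walking along a rainy street at night, neon reflections on the wet pavement
a close-up of ocean waves breaking on a rocky shore in the early morning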
4. CogVideoX-5B-specific configuration
If using the CogVideoX-5B model, additional steps are required:
- Create a dedicated environment:
conda create -n star_cog python=3.10
conda activate star_cog
cd cogvideox-based/sat
pip install -r requirements.txt
- Download additional dependencies:
  - The VAE and the T5 encoder are required (a download sketch follows below)
- Update the path configuration in cogvideox-based/sat/configs/cogvideox_5b/cogvideox_5b_infer_sr.yaml
- Replace the transformer.py file
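One possible source for the VAE and T5 encoder is the THUDM/CogVideoX-5b repository on HuggingFace. The repo id, the folder names, and whether STAR's SAT code accepts this layout are all assumptions here; the download links in the STAR README are authoritative:
# repo id and folder names are illustrative; prefer the links in the STAR README
huggingface-cli download THUDM/CogVideoX-5b --include "vae/*" "text_encoder/*" --local-dir pretrained/cogvideox-5b
After downloading, point the corresponding path entries in cogvideox_5b_infer_sr.yaml at these folders.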