ConsisID: a portrait reference map to generate character-consistent video, rapid multi-terminal integration

Latest AI Resources8mos agorelease AI Sharing Circle

2.1K 00

General Introduction

ConsisID is an open source project developed by Yuan Rong's group at Peking University, aiming to achieve identity-consistent text-to-video generation (IPT2V) through frequency decomposition techniques. The core of the project is a model based on DiT (Diffusion Transformer), which is able to maintain the identity consistency of characters when generating videos.The ConsisID project not only provides the complete code and dataset, but also includes detailed installation and usage guidelines to facilitate users to get started quickly. This project is of great significance in the field of video generation, especially in application scenarios where character consistency needs to be maintained, such as film and television production, virtual reality, and so on.

Function List

Identity Consistent Video Generation: A frequency decomposition technique is used to generate videos that are consistent with the description of the input text and maintain the identity of the characters.
Open source code and datasets: Complete code and partial datasets are provided to facilitate secondary development and research.
Multi-platform support: Support for running on Windows and Linux systems , providing Jupyter Notebook and ComfyUI extensions .
Optimization of high-quality prompt words: Optimize the input of text prompt words using GPT-4o to improve the quality of the generated video.
GPU memory optimization: Provides multiple GPU memory optimization options to adapt to different hardware configurations.
Community Contributions: Support for community-developed plug-ins and extensions that enhance functionality and usage experience.

Using Help

Environment Configuration

Clone the project code:

   git clone --depth=1 https://github.com/PKU-YuanGroup/ConsisID.git
cd ConsisID

Create and activate a virtual environment:

   conda create -n consisid python=3.11.0
conda activate consisid

Install the dependencies:

   pip install -r requirements.txt

Download model weights

Download weights from HuggingFace:

   huggingface-cli download --repo-type model BestWishYsh/ConsisID-preview --local-dir ckpts

Or download it from WiseModel:

   git lfs install
git clone https://www.wisemodel.cn/SHYuanBest/ConsisID-Preview.git

running example

Run the Web UI example:

   python app.py

Run command line reasoning:

   python infer.py --model_path BestWishYsh/ConsisID-preview

Cue word optimization

Use GPT-4o to optimize input text prompt words, e.g. Original prompt word: "A man is playing the guitar." Optimized prompt word: "The video shows a man standing next to an airplane, talking on his cell phone. He is wearing sunglasses, a black top, and a serious expression. The plane has a green stripe down the side and a big engine in the back."

GPU memory optimization

If you do not have multiple GPUs or enough GPU memory, you can enable the following options:

pipe.enable_model_cpu_offload()
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

Note: Enabling these options increases inference time and may reduce generation quality.

Data preprocessing

Please refer to the data preprocessing guide in the project for the data needed to train ConsisID. If you need to train text-to-image and video generation models, you need to organize the dataset into the following format:

datasets/
├── captions/
│   ├── dataname_1.json
│   ├── dataname_2.json
├── dataname_1/
│   ├── refine_bbox_jsons/
│   ├── track_masks_data/
│   ├── videos/
├── dataname_2/
│   ├── refine_bbox_jsons/
│   ├── track_masks_data/
│   ├── videos/
├── ...
├── total_train_data.txt