
JoyGen: Audio-Driven 3D Depth-Aware Talking Face Video Editing Tool

General Introduction

JoyGen is an innovative two-stage talking face video generation framework focused on audio-driven facial expression generation. Developed by a team at JD Technology, the project uses 3D reconstruction and audio feature extraction to accurately capture the speaker's identity features and expression coefficients, enabling high-quality lip synchronization and visual synthesis. The framework consists of two main stages: audio-based lip-motion generation, followed by visual appearance synthesis. By combining audio features with facial depth maps, it provides comprehensive supervision for accurate lip synchronization. The project supports both Chinese and English audio and ships with a complete training and inference pipeline, making it a powerful open-source tool.
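The minimal Python sketch below illustrates the dataflow of the two stages. All function names and array dimensions here are hypothetical placeholders for illustration, not the project's actual API; the real commands are shown in the usage section below.

import numpy as np

def audio_to_expression(audio_feats):
    # Stage 1 (placeholder): map per-frame audio features to 3DMM
    # expression coefficients (64 dims is an assumption for illustration).
    return np.zeros((audio_feats.shape[0], 64))

def render_depth_maps(identity_coeffs, expression_coeffs, size=256):
    # Placeholder for 3D reconstruction plus depth rendering (the real
    # project renders with nvdiffrast); one depth map per frame.
    return np.zeros((expression_coeffs.shape[0], size, size))

def synthesize_frames(frames, audio_feats, depth_maps):
    # Stage 2 (placeholder): audio features and depth maps jointly
    # supervise lip-region synthesis over the original frames.
    return frames  # identity pass-through in this sketch

# Dataflow: audio features -> expression coeffs -> depth maps -> new frames.
audio_feats = np.zeros((125, 1024))    # e.g. 5 s of per-frame audio features
frames = np.zeros((125, 256, 256, 3))  # matching video frames
identity = np.zeros(80)                # identity coefficients (illustrative dim)
exp = audio_to_expression(audio_feats)
depth = render_depth_maps(identity, exp)
out = synthesize_frames(frames, audio_feats, depth)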


Function List

  • Audio-driven 3D facial expression generation and editing
  • Precise audio-to-lip synchronization
  • Support for Chinese and English audio input
  • 3D depth-aware visual synthesis
  • Facial identity preservation
  • High-quality video generation and editing
  • Complete training and inference framework
  • Pre-trained models for rapid deployment
  • Support for training on custom datasets
  • Detailed data preprocessing tools

 

Usage Guide

1. Environment setup

1.1 Basic environment requirements

  • Supported GPUs: V100, A800
  • Python version: 3.8.19
  • System dependencies: ffmpeg
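
You can quickly verify the GPU and ffmpeg prerequisites with standard commands (assuming an NVIDIA driver is already installed):

nvidia-smi
ffmpeg -version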

1.2 Installation steps

  1. Create and activate the conda environment:
conda create -n joygen python=3.8.19 ffmpeg
conda activate joygen
pip install -r requirements.txt
  2. Install the Nvdiffrast library:
git clone https://github.com/NVlabs/nvdiffrast
cd nvdiffrast
pip install .
  3. Download the pre-trained models:
    Download the pre-trained models from the provided download link and place them in the ./pretrained_models/ directory following the required structure (see the layout sketch after this list).
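
Based on the paths referenced by the commands in this guide, the directory should look roughly like the following. This layout is inferred from those commands and is not an exhaustive listing; in particular, the exact checkpoint file names inside each folder may differ:

./pretrained_models/
  BFM/                          (Basel Face Model files)
  audio2motion/
    240210_real3dportrait_orig/audio2secc_vae/
    hubert/
  face_recon_feat0.2_augment/   (3D face reconstruction checkpoint)
  joygen/                       (UNet weights)
  sd-vae-ft-mse/                (VAE weights)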

2. Usage workflow

2.1 Inference process

Execute the full inference pipeline:

bash scripts/inference_pipeline.sh <audio_file> <video_file> <result_dir>
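
For example, with the demo assets used in the step-by-step commands below (argument order assumed to match the template above):

bash scripts/inference_pipeline.sh ./demo/xinwen_5s.mp3 ./demo/example_5s.mp4 ./results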

Or execute the inference process step by step:

  1. Extract facial expression coefficients from the audio (the snippet after this list shows how to inspect the result):
python inference_audio2motion.py --a2m_ckpt ./pretrained_models/audio2motion/240210_real3dportrait_orig/audio2secc_vae --hubert_path ./pretrained_models/audio2motion/hubert --drv_aud ./demo/xinwen_5s.mp3 --seed 0 --result_dir ./results/a2m --exp_file xinwen_5s.npy
  2. Render depth maps frame by frame using the new expression coefficients:
python -u inference_edit_expression.py --name face_recon_feat0.2_augment --epoch=20 --use_opengl False --checkpoints_dir ./pretrained_models --bfm_folder ./pretrained_models/BFM --infer_video_path ./demo/example_5s.mp4 --infer_exp_coeff_path ./results/a2m/xinwen_5s.npy --infer_result_dir ./results/edit_expression
  3. Generate the facial animation from the audio features and facial depth maps:
CUDA_VISIBLE_DEVICES=0 python -u inference_joygen.py --unet_model_path pretrained_models/joygen --vae_model_path pretrained_models/sd-vae-ft-mse --intermediate_dir ./results/edit_expression --audio_path demo/xinwen_5s.mp3 --video_path demo/example_5s.mp4 --enable_pose_driven --result_dir results/talk --img_size 256 --gpu_id 0
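
To sanity-check the intermediate output of step 1, the coefficient file can be loaded as a regular NumPy array. The path below is taken from the --result_dir and --exp_file arguments above; the exact array layout depends on the model:

import numpy as np

# Expression coefficients predicted from audio in step 1.
# Add allow_pickle=True if the file stores a pickled object.
coeffs = np.load("./results/a2m/xinwen_5s.npy")
print(coeffs.shape)  # typically (num_frames, coeff_dim)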

2.2 Training process

  1. Preprocess the data:
python -u preprocess_dataset.py --checkpoints_dir ./pretrained_models --name face_recon_feat0.2_augment --epoch=20 --use_opengl False --bfm_folder ./pretrained_models/BFM --video_dir ./demo --result_dir ./results/preprocessed_dataset
  2. Check the preprocessed data and generate the training list:
python -u preprocess_dataset_extra.py data_dir
  3. Start training:
    Modify the config.yaml file, then run (see the note after this list if you still need an Accelerate config):
accelerate launch --main_process_port 29501 --config_file config/accelerate_config.yaml train_joygen.py
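
If config/accelerate_config.yaml does not exist yet, the standard Hugging Face Accelerate CLI can generate one interactively; answer the prompts to match your GPU setup, and note that the port is set by the --main_process_port flag on the launch command above:

accelerate config --config_file config/accelerate_config.yaml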