General Introduction
Seed-VC is an open-source project on GitHub developed by Plachtaa. Given a reference audio clip of 1 to 30 seconds, it can quickly convert speech or singing to the target voice without any additional training. The project supports real-time voice conversion with latency as low as roughly 400 milliseconds, making it suitable for online meetings, gaming, or live streaming. Seed-VC offers three modes: voice conversion (VC), singing voice conversion (SVC), and real-time conversion. It uses Whisper and BigVGAN to keep the output clear. The code is free and open to the public, and users can download and run it locally. The project is actively maintained, well documented, and supported by an active community.
Function List
- Zero-shot conversion: mimics a target speaking or singing voice from a short audio clip, with no training required.
- Real-time voice processing: microphone input is instantly converted to the target voice.
- Singing voice conversion: converts any song to the voice of a specified singer.
- Audio length adjustment: speeds up or slows down speech to control pacing.
- Pitch adjustment: shifts pitch automatically or manually to fit the target voice.
- Web interface: provides a simple graphical interface for ease of use.
- Custom training: fine-tunes on a specific voice with a small amount of data.
- Open source: users can modify or extend the code.
Usage Guide
Installation Process
To use Seed-VC locally, you first need to set up the environment. Detailed steps for Windows, Mac (M-series chips), and Linux follow.
- Prepare the environment
- Install Python 3.10; download it from the official website.
- Install Git: Windows users can search for "Git for Windows"; on Mac, run brew install git.
- GPU users need CUDA 12.4 and matching drivers; a CPU also works, just more slowly.
- Install FFmpeg for audio processing: download it from the official site on Windows, run brew install ffmpeg on Mac, or use the package manager on Linux.
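Before moving on, you can confirm that each prerequisite is on the PATH; every command below should print a version string (nvidia-smi applies only to machines with an NVIDIA GPU):
python --version    # should report Python 3.10.x
git --version
ffmpeg -version
nvidia-smi          # GPU machines only: shows the driver and supported CUDA version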
- Download the code
- Open a command line (CMD or Anaconda Prompt on Windows, Terminal on Mac/Linux).
- Type git clone https://github.com/Plachtaa/seed-vc.git to download the project.
- Enter the directory: cd seed-vc
- Set up a virtual environment
- Type python -m venv venv to create an isolated environment.
- Activate the environment:
- Windows: venv\Scripts\activate
- Mac/Linux: source venv/bin/activate
- The prompt shows (venv) when activation succeeds.
- Install dependencies
- On Windows/Linux, enter pip install -r requirements.txt
- On Mac M-series, enter pip install -r requirements-mac.txt
- If model downloads from Hugging Face fail due to network problems, set a mirror endpoint when running the apps, e.g. HF_ENDPOINT=https://hf-mirror.com python app_vc.py (the HF_ENDPOINT variable affects Hugging Face model downloads, not pip).
- Run the programs
- Voice conversion: python app_vc.py
- Song conversion: python app_svc.py
- Real-time conversion: python real-time-gui.py
- Once a program is running, open http://localhost:7860 in a browser to use the interface.
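Putting the installation together, a complete first run on Mac/Linux looks roughly like this; a sketch of the steps above, not a replacement for them:
git clone https://github.com/Plachtaa/seed-vc.git
cd seed-vc
python -m venv venv
source venv/bin/activate          # Windows: venv\Scripts\activate
pip install -r requirements.txt   # Mac M-series: requirements-mac.txt
python app_vc.py                  # then open http://localhost:7860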
Main Functions
1. Voice conversion (VC)
- Steps:
- Run python app_vc.py and open http://localhost:7860 in your browser.
- Upload the original audio (Source Audio) and the reference audio (Reference Audio, 1-30 seconds).
- Set Diffusion Steps: the default is 25; 30-50 gives better sound quality.
- Set Length Adjust: values below 1 speed the audio up, values above 1 slow it down.
- Click Submit, wait a few seconds, and download the converted result.
- Notes:
- The first run automatically downloads the seed-uvit-whisper-small-wavenet model.
- Reference audio longer than 30 seconds is truncated.
2. Song Voice Conversion (SVC)
- Steps:
- Run python app_svc.py to open the web interface.
- Upload the song audio and the singer's reference audio.
- Check f0-condition to follow the pitch of the source song.
- Optionally check auto-f0-adjust to adjust the pitch automatically.
- Set diffusion steps to 30-50 and click Submit.
- Tips:
- Use clear reference audio without background noise for best results.
- This mode downloads the seed-uvit-whisper-base model by default.
3. Real-time conversion
- Steps:
- Run python real-time-gui.py to open the interface.
- Upload the reference audio and connect a microphone.
- Set the parameters: diffusion steps 4-10, Block Time 0.18 s.
- Click "Start"; your voice is converted in real time as you speak.
- Use VB-CABLE to route the output to a virtual microphone (a device-listing sketch follows this section).
- Requirements:
- A GPU is recommended (e.g. RTX 3060); latency is then about 430 milliseconds.
- Running on a CPU gives much higher latency.
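Routing audio through VB-CABLE requires picking the right input and output devices in the GUI. Assuming the real-time GUI uses the sounddevice package for audio I/O (an assumption; check requirements.txt to confirm), you can list every device the system exposes with:
python -m sounddevice   # prints an indexed device list; look for "CABLE Input (VB-Audio Virtual Cable)"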
4. Command-line operations
- Speech conversion example:
python inference.py --source input.wav --target ref.wav --output ./out --diffusion-steps 25 --length-adjust 1.0 --fp16 True
- Song conversion example:
python inference.py --source song.wav --target singer.wav --output ./out --diffusion-steps 50 --f0-condition True --semi-tone-shift 0 --fp16 True
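Because inference.py is a plain command-line tool, it is easy to script. A minimal sketch that converts every WAV file in a folder with the same reference voice (the ./inputs folder and ref.wav are illustrative names):
# Convert each WAV in ./inputs to the voice in ref.wav
for f in ./inputs/*.wav; do
    python inference.py --source "$f" --target ref.wav \
        --output ./out --diffusion-steps 25 --length-adjust 1.0 --fp16 True
done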
5. Custom training
- Steps:
- Prepare a folder of 1-30 second audio files (.wav/.mp3, etc.; a quick duration check is sketched after this section).
- Run training:
python train.py --config configs/presets/config_dit_mel_seed_uvit_whisper_base_f0_44k.yml --dataset-dir ./data --run-name myrun --max-steps 1000
- After training, the checkpoint is saved to ./runs/myrun/ft_model.pth
- Run inference with the custom model:
python app_svc.py --checkpoint ./runs/myrun/ft_model.pth --config configs/presets/config_dit_mel_seed_uvit_whisper_base_f0_44k.yml
- Note: training needs at least 1 audio sample; 100 steps take about 2 minutes on a T4 GPU.
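Each training clip should be 1 to 30 seconds long. One quick way to audit a dataset folder before training, assuming WAV files in ./data and FFmpeg installed (ffprobe ships with it):
# Print the duration in seconds of every clip in ./data
for f in ./data/*.wav; do
    echo "$f: $(ffprobe -v error -show_entries format=duration -of csv=p=0 "$f") s"
done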
Supplementary Notes
- Model selection:
- For real-time use, choose seed-uvit-tat-xlsr-tiny (25M parameters).
- For offline voice conversion, choose seed-uvit-whisper-small-wavenet (98M parameters).
- For singing voices, choose seed-uvit-whisper-base (200M parameters, 44 kHz).
- Troubleshooting:
- If you see ModuleNotFoundError, check that the dependencies installed correctly.
- On Mac, the real-time GUI may require a Python build with Tkinter installed.
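A one-line check for the Tkinter issue: the command below exits silently when Tkinter is available and raises ModuleNotFoundError when it is not.
python -c "import tkinter"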
Application Scenarios
- Entertainment dubbing: turn a voice into a cartoon character's to make funny videos.
- Music production: transform ordinary vocals into a professional singer's tone to generate song demos.
- Live streaming: hosts change their voice in real time to make the show more fun.
- Language learning: imitate native speakers' speech to practice pronunciation.
FAQ
- Do I need a lot of data?
No. A single short clip is enough for conversion, and training needs only 1 sample.
- Does it support Chinese audio?
Yes. As long as the reference audio is in Chinese, the converted output is also clear.
- What if latency is high?
Use a GPU and set a low number of diffusion steps (4-10).
- What if the sound quality is poor?
Increase diffusion steps to 50, or use cleaner reference audio.