VoiceCraft: open-source zero-shot voice cloning and text-to-speech tool

Latest AI Resources | Updated 10 months ago | AI Sharing Circle

General Introduction

VoiceCraft is an open-source tool for speech editing and zero-shot text-to-speech (TTS), built on a neural codec language model. It generates codec token sequences in a way that allows insertion, deletion, and replacement within an existing recording, so the edited speech stays natural and coherent. VoiceCraft also supports zero-shot speech synthesis: it can reproduce a speaker's voice without any speaker-specific fine-tuning. The project reports strong results on several speech processing tasks, outperforming the state-of-the-art models it was compared against.
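
The three edit operations can be pictured at the transcript level. The sketch below is a minimal illustration that assumes an edit is specified as an original transcript plus a target transcript. It uses Python's standard `difflib` to label each changed word span; the helper name `word_edits` is hypothetical and is not part of VoiceCraft.

```python
from difflib import SequenceMatcher


def word_edits(original: str, target: str):
    """Return (operation, original_words, target_words) for each changed span.

    operation is one of "insert", "delete", or "replace".
    """
    src, dst = original.split(), target.split()
    matcher = SequenceMatcher(a=src, b=dst, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only the spans that actually change
            edits.append((tag, src[i1:i2], dst[j1:j2]))
    return edits


# Insertion: a word is added to the target transcript.
print(word_edits("the cat sat on the mat", "the cat sat on the red mat"))
# Deletion: a word is dropped from the original transcript.
print(word_edits("I really like green tea", "I like green tea"))
# Replacement: one word is swapped for another.
print(word_edits("see you on Monday", "see you on Friday"))
```

Only the changed span needs to be regenerated; the audio on either side of it provides the context that keeps the edit coherent.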

Function List

Speech editing: supports insertion, deletion, and replacement, producing natural, fluent edited speech.
Zero-shot speech synthesis: generates the target speaker's voice without additional fine-tuning.
Transformer-based architecture: uses causal masking and delayed stacking to improve generation quality.
Open source models: available for free download and use on Hugging Face and AI Express.
Interactive UI: a Gradio integration lets users control and test the models interactively.
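
To make causal masking and delayed stacking concrete, here is a toy sketch of the general ideas as they are usually described for codec language models. It assumes a single masked span and represents each audio frame as a tuple with one token per codebook. The names `MASK`, `PAD`, `causal_mask_rearrange`, and `delay_stack` are hypothetical, and this is not VoiceCraft's actual implementation.

```python
MASK = "<M1>"  # hypothetical mask-token name
PAD = None     # hypothetical placeholder for empty slots


def causal_mask_rearrange(frames, start, end):
    """Move frames[start:end] to the end of the sequence.

    A mask token marks where the span was removed, and a second copy
    introduces the span at the end. A left-to-right model that generates
    the span then sees the context on both sides of the gap.
    """
    return frames[:start] + [MASK] + frames[end:] + [MASK] + frames[start:end]


def delay_stack(frames, num_codebooks):
    """Shift codebook k to the right by k steps.

    frames is a list of tuples, one token per codebook. After the shift,
    column t holds codebook 0 of frame t, codebook 1 of frame t-1, and so on.
    """
    length = len(frames) + num_codebooks - 1
    rows = []
    for k in range(num_codebooks):
        row = [PAD] * length
        for t, frame in enumerate(frames):
            row[t + k] = frame[k]
        rows.append(row)
    return rows


print(causal_mask_rearrange(list("abcdef"), 2, 4))
# ['a', 'b', '<M1>', 'e', 'f', '<M1>', 'c', 'd']
print(delay_stack([(1, 10), (2, 20), (3, 30)], 2))
# [[1, 2, 3, None], [None, 10, 20, 30]]
```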

Usage Guide

Installation process

Clone the project repository to a local directory:

```
git clone git@github.com:jasonppy/VoiceCraft.git
cd VoiceCraft
```

Make sure Docker and the NVIDIA Container Toolkit are installed on your system (on Windows, this support is built into the NVIDIA driver):
```
sudo apt-get install -y nvidia-container-toolkit-base
```
Build the Docker image:
```
docker build --tag "voicecraft" .
```
Start an existing container or create a new one and pass in all GPUs:
```
./start-jupyter.sh  # Linux
start-jupyter.bat   # Windows
```
Open a browser and go to the URL shown at the bottom of the container logs:
```
docker logs jupyter
```
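
If the log output is long, the URL can be pulled out programmatically. The following is a small sketch that assumes the container is named `jupyter`, as in the command above, and that the address appears in the logs as an ordinary `http://` or `https://` URL. The helper names are hypothetical.

```python
import re
import subprocess

# Matches http(s) URLs up to the next whitespace character.
URL_PATTERN = re.compile(r"https?://\S+")


def find_urls(log_text):
    """Return every URL found in the log text, in order, without duplicates."""
    seen = []
    for url in URL_PATTERN.findall(log_text):
        if url not in seen:
            seen.append(url)
    return seen


def jupyter_urls(container="jupyter"):
    """Collect URLs from `docker logs <container>`, reading stdout and stderr."""
    result = subprocess.run(
        ["docker", "logs", container],
        capture_output=True,
        text=True,
        check=False,
    )
    return find_urls(result.stdout + result.stderr)
```

Calling `jupyter_urls()` on the host requires the `docker` command to be on the PATH; `find_urls` works on any text you paste into it.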

Optional: open a shell inside the container from another terminal:

```
docker exec -it jupyter /bin/bash
export USER=(your_linux_username_used_above)
export HOME=/home/$USER
sudo apt-get update
```

Verify that the graphics card is visible in the container:
```
nvidia-smi
```
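
The same check can be scripted, for example at the top of a notebook. This is a minimal sketch that relies on `nvidia-smi -L`, which prints one line per GPU; the function name `visible_gpus` is hypothetical.

```python
import shutil
import subprocess


def visible_gpus():
    """Return the GPU lines printed by `nvidia-smi -L`, or [] if none are visible."""
    if shutil.which("nvidia-smi") is None:
        return []  # the tool is not installed or not on the PATH
    result = subprocess.run(
        ["nvidia-smi", "-L"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []  # the driver is not reachable from this environment
    return [
        line.strip()
        for line in result.stdout.splitlines()
        if line.startswith("GPU ")
    ]


gpus = visible_gpus()
print(f"{len(gpus)} GPU(s) visible")
```

An empty list inside the container usually means the GPUs were not passed through when the container was started.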
In your browser, open inference_tts.ipynb and run the cells one at a time.

Environment setup

Create and activate a virtual environment:

```
conda create -n voicecraft python=3.9.16
conda activate voicecraft
```

Install the required dependencies:

```
pip install -e git+https://github.com/facebookresearch/audiocraft.git@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraft
pip install xformers==0.0.22
pip install torchaudio==2.0.2 torch==2.0.1
apt-get install ffmpeg
apt-get install espeak-ng
pip install tensorboard==2.16.2
pip install phonemizer==3.2.1
pip install datasets==2.16.0
pip install torchmetrics==0.11.1
pip install huggingface_hub==0.22.2
conda install -c conda-forge montreal-forced-aligner=2.2.17 openfst=1.8.2 kaldi=5.5.1068
mfa model download dictionary english_us_arpa
mfa model download acoustic english_us_arpa
conda install -n voicecraft ipykernel --no-deps --force-reinstall
```
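
Because the versions above are pinned exactly, it can help to confirm what actually ended up in the environment. The sketch below uses the standard `importlib.metadata` module; the pins are copied from the pip commands above, and the helper names `check_pins` and `mismatches` are hypothetical. It assumes that a local build suffix such as `+cu117` should be ignored when comparing versions.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the pip commands in this section.
PINNED = {
    "torch": "2.0.1",
    "torchaudio": "2.0.2",
    "xformers": "0.0.22",
    "tensorboard": "2.16.2",
    "phonemizer": "3.2.1",
    "datasets": "2.16.0",
    "torchmetrics": "0.11.1",
    "huggingface_hub": "0.22.2",
}


def check_pins(pins):
    """Map each package to (expected, installed); installed is None if missing."""
    report = {}
    for name, expected in pins.items():
        try:
            # Drop any local build suffix, e.g. "2.0.1+cu117" becomes "2.0.1".
            installed = version(name).split("+")[0]
        except PackageNotFoundError:
            installed = None
        report[name] = (expected, installed)
    return report


def mismatches(report):
    """Keep only the packages whose installed version differs from the pin."""
    return {name: pair for name, pair in report.items() if pair[0] != pair[1]}


for name, (expected, installed) in mismatches(check_pins(PINNED)).items():
    print(f"{name}: expected {expected}, found {installed}")
```

Run it inside the activated `voicecraft` environment; no output means every pinned package matches.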

Inference examples

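Data preparation (phonemization and Encodec encoding; the script name and arguments indicate that this is a preprocessing step rather than speech editing inference itself):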

```
python phonemize_encodec_encode_hf.py \
  --dataset_size xs \
  --download_to path/to/store_huggingface_downloads \
  --save_dir path/to/store_extracted_codes_and_phonemes \
  --encodec_model_path path/to/encodec_model \
  --mega_batch_size 120 \
  --batch_size 32 \
  --max_len 30000
```

Zero-shot speech synthesis inference (the command below lists the available options):
```
python tts_demo.py -h
```

Gradio

Running in Colab: use the "Open in Colab" link provided by the project.

Running locally:

```
apt-get install -y espeak espeak-data libespeak1 libespeak-dev
apt-get install -y festival*
apt-get install -y build-essential
apt-get install -y flac libasound2-dev libsndfile1-dev vorbis-tools
apt-get install -y libxml2-dev libxslt-dev zlib1g-dev
pip install -r gradio_requirements.txt
python gradio_app.py
```

Common problems

How can I improve the naturalness of the generated speech? Make sure the input text matches the style and context of the target speech sample.
What should I do if the generated audio is noisy? Try a higher-quality speech sample, or adjust the model parameters.
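
Before swapping in a different speech sample, it can be useful to look at a few basic properties of the one you have. The sketch below is a rough sanity check, not a quality metric: it assumes a 16-bit PCM WAV file and reports the sample rate, channel count, duration, peak level, and fraction of clipped samples using only the standard library. The function name `describe_wav` is hypothetical, and what counts as an acceptable value is left to your judgment.

```python
import array
import sys
import wave


def describe_wav(path):
    """Return basic properties of a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM audio")
        rate = wav.getframerate()
        channels = wav.getnchannels()
        n_frames = wav.getnframes()
        samples = array.array("h", wav.readframes(n_frames))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    peak = max((abs(s) for s in samples), default=0)
    clipped = sum(1 for s in samples if abs(s) >= 32767)
    return {
        "sample_rate": rate,
        "channels": channels,
        "duration_s": n_frames / rate,
        "peak": peak / 32768,  # 1.0 means full scale
        "clipped_fraction": clipped / len(samples) if samples else 0.0,
    }
```

A nonzero clipped fraction or a very low peak level suggests the recording itself is the problem, in which case re-recording is likely to help more than parameter changes.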