General Introduction
VoiceCraft is an open-source tool for speech editing and zero-shot speech synthesis, built on a neural codec language model. It uses a codec token sequence generation method that supports insertion, deletion, and replacement operations on existing speech, so that the edited speech stays natural and coherent. VoiceCraft also supports zero-shot speech synthesis, which removes the need for speaker-specific fine-tuning. According to the project, the tool performs well on several speech processing tasks and outperforms earlier state-of-the-art models.
Feature List
- Speech editing: supports insertion, deletion, and replacement operations and produces natural, smooth edited speech.
- Zero-shot speech synthesis: generates speech in the target speaker's voice without additional fine-tuning.
- Transformer-based architecture: uses causal masking and delayed stacking to improve generation quality.
- Open-source models: available for free download and use on Hugging Face and AI Express.
- Interactive UI: a Gradio-based interface lets users control and test the models interactively.
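The causal masking and delayed stacking mentioned in the feature list can be illustrated with a small sketch. This is a simplified illustration on plain Python lists, not VoiceCraft's actual implementation: the function names, the `EMPTY` placeholder, and the mask id are all assumptions made for this example.

```python
from typing import List, Tuple

EMPTY = -1  # hypothetical placeholder id for padded positions


def causal_mask_rearrange(tokens: List[int], span: Tuple[int, int], mask_id: int) -> List[int]:
    """Replace tokens[start:end] with a mask token and move the span to the end.

    The model can then generate the moved span last, conditioned on both the
    context before it and the context after it.
    """
    start, end = span
    return tokens[:start] + [mask_id] + tokens[end:] + [mask_id] + tokens[start:end]


def delay_stack(codes: List[List[int]], pad: int = EMPTY) -> List[List[int]]:
    """Shift codebook k to the right by k steps, padding the gaps.

    `codes` holds one row per codebook and one column per time step.
    """
    num_codebooks = len(codes)
    num_steps = len(codes[0])
    width = num_steps + num_codebooks - 1
    return [
        [pad] * k + list(row) + [pad] * (width - k - num_steps)
        for k, row in enumerate(codes)
    ]


def undo_delay_stack(delayed: List[List[int]]) -> List[List[int]]:
    """Invert delay_stack by cutting each row back to its original window."""
    num_codebooks = len(delayed)
    num_steps = len(delayed[0]) - num_codebooks + 1
    return [row[k:k + num_steps] for k, row in enumerate(delayed)]
```

With two codebooks and three time steps, `delay_stack` turns each column into a mix of neighbouring time steps, and `undo_delay_stack` recovers the original grid.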
Usage Guide
Installation process
- Clone the project repository to a local directory:
git clone git@github.com:jasonppy/VoiceCraft.git
cd VoiceCraft
- Make sure Docker and the NVIDIA Container Toolkit are installed on your system (on Windows, this support is built into the driver):
sudo apt-get install -y nvidia-container-toolkit-base
- Build the Docker image:
docker build --tag "voicecraft" .
- Start the existing container, or create a new one, with all GPUs passed in:
./start-jupyter.sh    # Linux
start-jupyter.bat     # Windows
- Look up the Jupyter URL in the container logs, then open it in a browser:
docker logs jupyter
- Optional: open a shell inside the container from another terminal:
docker exec -it jupyter /bin/bash
export USER=(your_linux_username_used_above)
export HOME=/home/$USER
sudo apt-get update
- Verify that the graphics card is visible in the container:
nvidia-smi
- Open inference_tts.ipynb in your browser and run each cell step by step.
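Before working through the installation steps, it can help to confirm that the command-line tools they rely on are on the PATH. The following is a minimal sketch using only the Python standard library; the helper name `find_missing_tools` and the tool list are assumptions for this example, and `nvidia-smi` may only be available inside the container on some setups.

```python
import shutil
from typing import Iterable, List

# Tools referenced by the installation steps (an assumed list for this sketch).
REQUIRED_TOOLS = ["git", "docker", "nvidia-smi"]


def find_missing_tools(tools: Iterable[str] = REQUIRED_TOOLS) -> List[str]:
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found.")
```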
Environment setup
- Create and activate a virtual environment:
conda create -n voicecraft python=3.9.16
conda activate voicecraft
- Install the required dependencies:
pip install -e git+https://github.com/facebookresearch/audiocraft.git@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraft
pip install xformers==0.0.22
pip install torchaudio==2.0.2 torch==2.0.1
apt-get install ffmpeg
apt-get install espeak-ng
pip install tensorboard==2.16.2
pip install phonemizer==3.2.1
pip install datasets==2.16.0
pip install torchmetrics==0.11.1
pip install huggingface_hub==0.22.2
conda install -c conda-forge montreal-forced-aligner=2.2.17 openfst=1.8.2 kaldi=5.5.1068
mfa model download dictionary english_us_arpa
mfa model download acoustic english_us_arpa
conda install -n voicecraft ipykernel --no-deps --force-reinstall
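Because the dependencies are pinned to exact versions, a quick check of what is actually installed can catch mismatches early. The sketch below compares installed package versions against the pip pins listed in this section using `importlib.metadata` from the standard library; the helper name `check_pins` is an assumption for this example, and local build suffixes such as `+cu117` are ignored when comparing.

```python
from importlib import metadata
from typing import Dict

# Versions taken from the pip commands in this section.
PINNED = {
    "xformers": "0.0.22",
    "torch": "2.0.1",
    "torchaudio": "2.0.2",
    "tensorboard": "2.16.2",
    "phonemizer": "3.2.1",
    "datasets": "2.16.0",
    "torchmetrics": "0.11.1",
    "huggingface_hub": "0.22.2",
}


def check_pins(pins: Dict[str, str] = PINNED) -> Dict[str, str]:
    """Return a {package: problem} mapping for missing or mismatched packages."""
    problems = {}
    for name, wanted in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = f"not installed (expected {wanted})"
            continue
        # Drop a local build suffix such as "+cu117" before comparing.
        if found.split("+")[0] != wanted:
            problems[name] = f"found {found}, expected {wanted}"
    return problems


if __name__ == "__main__":
    for package, problem in check_pins().items():
        print(f"{package}: {problem}")
```

Running it inside the activated voicecraft environment prints one line per package that is missing or has a different version, and nothing if all pins match.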
Inference examples
- Data preparation for voice editing (phonemize the transcripts and extract EnCodec codes):
python phonemize_encodec_encode_hf.py \
  --dataset_size xs \
  --download_to path/to/store_huggingface_downloads \
  --save_dir path/to/store_extracted_codes_and_phonemes \
  --encodec_model_path path/to/encodec_model \
  --mega_batch_size 120 \
  --batch_size 32 \
  --max_len 30000
- Zero-shot speech synthesis inference (the -h flag lists the available options):
python tts_demo.py -h
Gradio
- Running in Colab: use the "Open in Colab" link provided by the project.
- Running locally:
apt-get install -y espeak espeak-data libespeak1 libespeak-dev
apt-get install -y festival*
apt-get install -y build-essential
apt-get install -y flac libasound2-dev libsndfile1-dev vorbis-tools
apt-get install -y libxml2-dev libxslt-dev zlib1g-dev
pip install -r gradio_requirements.txt
python gradio_app.py
Common Problems
- How can I make the generated speech sound more natural? Make sure the input text matches the style and context of the target speech sample.
- What should I do if the generated audio is noisy? Try a higher-quality speech sample, or adjust the model parameters.
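One way to judge whether a speech sample is of reasonable quality is to inspect its basic properties before using it as a prompt. The sketch below reads a WAV file with the Python standard library and reports the sample rate, channel count, bit depth, duration, and the fraction of clipped samples. The helper name `describe_wav` is an assumption for this example, and what counts as "good enough" depends on the model and is not specified here.

```python
import struct
import wave
from typing import Dict, Optional, Union


def describe_wav(path: str) -> Dict[str, Optional[Union[int, float]]]:
    """Summarize a PCM WAV file; clipping is measured for 16-bit audio only."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        width = wav.getsampwidth()
        frames = wav.getnframes()
        raw = wav.readframes(frames)

    info = {
        "sample_rate": rate,
        "channels": channels,
        "bits_per_sample": width * 8,
        "duration_s": frames / rate if rate else 0.0,
        "clipped_fraction": None,
    }
    if width == 2 and raw:
        # WAV stores 16-bit samples as little-endian signed integers.
        samples = struct.unpack(f"<{len(raw) // 2}h", raw)
        clipped = sum(1 for s in samples if abs(s) >= 32767)
        info["clipped_fraction"] = clipped / len(samples)
    return info
```

A high clipped fraction or a very short duration suggests trying a cleaner or longer recording of the target speaker.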