VoiceCraft: open source zero-shot speech cloning and text-to-speech tool

General Introduction

VoiceCraft is an open source speech editing and zero-shot speech synthesis tool built on a neural codec language model. It rearranges the codec token sequence during generation so that insertion, deletion, and replacement edits to an existing recording produce natural, coherent speech. VoiceCraft also supports zero-shot speech synthesis, so no additional fine-tuning is needed for a specific speaker. The tool is reported to perform well across several speech processing tasks, outperforming previous state-of-the-art (SOTA) models.
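
To make the three edit types concrete, the sketch below compares an original transcript with an edited one and reports which word spans were inserted, deleted, or replaced. It is a text-level illustration only, built on Python's standard `difflib`; the function name `edit_operations` is hypothetical, and VoiceCraft itself applies edits to codec tokens rather than to words.

```python
import difflib


def edit_operations(original: str, edited: str):
    """Return (operation, original_words, edited_words) for each changed span."""
    a, b = original.split(), edited.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only insert / delete / replace spans
    ]


print(edit_operations("the quick brown fox", "the slow brown fox"))
```

Each returned tuple names the operation and the affected words, which is the kind of information an editing workflow would need in order to decide which part of the recording to regenerate.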

Feature List

  • Speech editing: supports insertion, deletion, and replacement operations and generates natural, smooth edited speech.
  • Zero-shot speech synthesis: generates the target speaker's voice without additional fine-tuning.
  • Transformer-based architecture: uses causal masking and delayed stacking to improve generation quality.
  • Open source models: available for free download and use on Hugging Face and AI Express.
  • Interactive UI: integrates with the Gradio library so users can control and test the models intuitively.
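
The two techniques named in the list can be pictured with a toy example. The sketch below is a simplified reading of those terms, not the project's code: `move_span_to_end` shows the idea behind causal masking (the span to be regenerated is replaced by a mask token and moved to the end, so a left-to-right model sees the context on both sides before generating it), and `delay_stack` shows delayed stacking (codebook k is shifted right by k steps). All function names and token ids here are hypothetical.

```python
PAD = -1     # hypothetical padding id for positions created by the delay
MASK = -100  # hypothetical id standing in for a mask token


def move_span_to_end(tokens, start, end, mask=MASK):
    """Replace tokens[start:end] with a mask token and append the masked
    span, prefixed by the same mask token, at the end of the sequence."""
    if not 0 <= start <= end <= len(tokens):
        raise ValueError("span out of range")
    return tokens[:start] + [mask] + tokens[end:] + [mask] + tokens[start:end]


def delay_stack(codes, pad=PAD):
    """Shift the k-th codebook row right by k steps, padding both sides so
    every row ends up with the same length."""
    k = len(codes)
    return [[pad] * i + list(row) + [pad] * (k - 1 - i)
            for i, row in enumerate(codes)]


def undo_delay(delayed):
    """Invert delay_stack and recover the time-aligned codebook rows."""
    k = len(delayed)
    t = len(delayed[0]) - (k - 1)
    return [row[i:i + t] for i, row in enumerate(delayed)]


print(move_span_to_end([10, 11, 12, 13, 14], 1, 3))
print(delay_stack([[1, 2, 3], [4, 5, 6]]))
```

With the delay applied, the tokens in one column belong to different time steps, so a model predicting column by column has already seen the lower-numbered codebooks for a given time step before it predicts the higher-numbered ones.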

 

Usage Guide

Installation process

  1. Clone the project repository to a local directory:
    git clone git@github.com:jasonppy/VoiceCraft.git
    cd VoiceCraft
    
  2. Ensure that Docker and the NVIDIA Container Toolkit are installed on your system (on Windows, this support is built into the driver):
    sudo apt-get install -y nvidia-container-toolkit-base
    
  3. Build the Docker image:
    docker build --tag "voicecraft" .
    
  4. Start the existing container, or create a new one, with all GPUs passed in:
    ./start-jupyter.sh  # Linux
    start-jupyter.bat   # Windows
    
  5. Find the Jupyter URL in the container logs and open it in a browser:
    docker logs jupyter
    
  6. Optional: open a shell inside the container from another terminal:
    docker exec -it jupyter /bin/bash
    export USER=(your_linux_username_used_above)
    export HOME=/home/$USER
    sudo apt-get update
    
  7. Verify that the graphics card is visible in the container:
    nvidia-smi
    
  8. Open inference_tts.ipynb in your browser and run each cell step by step.
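
Steps 5 and 7 can also be scripted. The sketch below is a convenience helper, not part of VoiceCraft: `find_jupyter_urls` pulls tokenized URLs out of log text, assuming the usual Jupyter `?token=` URL format, and `gpu_visible` checks whether `nvidia-smi -L` lists any device. Both function names are hypothetical.

```python
import re
import shutil
import subprocess

# Assumes Jupyter prints URLs of the form http://host:port/path?token=<token>.
TOKEN_URL = re.compile(r"https?://\S+\?token=[0-9a-zA-Z]+")


def find_jupyter_urls(log_text: str):
    """Return the unique tokenized URLs found in the log text, in order."""
    return list(dict.fromkeys(TOKEN_URL.findall(log_text)))


def gpu_visible() -> bool:
    """True if nvidia-smi is on PATH and lists at least one device."""
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    return result.returncode == 0 and bool(result.stdout.strip())


sample_log = "[I ServerApp] http://127.0.0.1:8888/lab?token=abc123\n"
print(find_jupyter_urls(sample_log))
print(gpu_visible())
```

To use it on the real container, capture the output of `docker logs jupyter` with `subprocess.run(..., capture_output=True, text=True)` and pass both `stdout` and `stderr` to `find_jupyter_urls`, since the server may log to either stream.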

Environment setup

  1. Create and activate a virtual environment:
    conda create -n voicecraft python=3.9.16
    conda activate voicecraft
    
  2. Install the required dependencies:
    pip install -e git+https://github.com/facebookresearch/audiocraft.git@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraft
    pip install xformers==0.0.22
    pip install torchaudio==2.0.2 torch==2.0.1
    apt-get install ffmpeg
    apt-get install espeak-ng
    pip install tensorboard==2.16.2
    pip install phonemizer==3.2.1
    pip install datasets==2.16.0
    pip install torchmetrics==0.11.1
    pip install huggingface_hub==0.22.2
    conda install -c conda-forge montreal-forced-aligner=2.2.17 openfst=1.8.2 kaldi=5.5.1068
    mfa model download dictionary english_us_arpa
    mfa model download acoustic english_us_arpa
    conda install -n voicecraft ipykernel --no-deps --force-reinstall
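
Because the install commands pin exact versions, a quick check after installation can catch a mismatched environment early. The following is a minimal sketch using the standard `importlib.metadata` module; the helper name `version_mismatches` is hypothetical, and the distribution names are assumed to match the names used in the pip commands.

```python
from importlib import metadata

# Pins copied from the pip commands in this section.
PINNED = {
    "torch": "2.0.1",
    "torchaudio": "2.0.2",
    "xformers": "0.0.22",
    "tensorboard": "2.16.2",
    "phonemizer": "3.2.1",
    "datasets": "2.16.0",
    "torchmetrics": "0.11.1",
    "huggingface_hub": "0.22.2",
}


def version_mismatches(pinned=PINNED):
    """Map package -> (expected, installed or None) for each unmet pin."""
    problems = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Compare the public version only, ignoring local build tags
        # such as "+cu118" that some wheels append.
        if installed is None or installed.split("+")[0] != expected:
            problems[name] = (expected, installed)
    return problems


for package, (expected, installed) in version_mismatches().items():
    print(f"{package}: expected {expected}, found {installed}")
```

Run it inside the activated `voicecraft` environment; no output means every pinned package is installed at the expected version.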
    

Inference examples

  1. Speech editing inference (as its flags indicate, this command phonemizes a dataset and extracts Encodec codes):
    python phonemize_encodec_encode_hf.py --dataset_size xs --download_to path/to/store_huggingface_downloads --save_dir path/to/store_extracted_codes_and_phonemes --encodec_model_path path/to/encodec_model --mega_batch_size 120 --batch_size 32 --max_len 30000
    
  2. Zero-shot speech synthesis inference:
    python tts_demo.py -h
    
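
Long commands like the first one are easy to break with a stray space in a path. One way to guard against that is to build the command as an argument list and let `shlex.join` handle quoting. The sketch below assumes the script name is `phonemize_encodec_encode_hf.py` and reuses only the flags and default values shown in step 1; the helper `phonemize_command` is hypothetical.

```python
import shlex


def phonemize_command(download_to, save_dir, encodec_model_path,
                      dataset_size="xs", mega_batch_size=120,
                      batch_size=32, max_len=30000):
    """Build the argument list for the command shown in step 1."""
    return [
        "python", "phonemize_encodec_encode_hf.py",
        "--dataset_size", dataset_size,
        "--download_to", str(download_to),
        "--save_dir", str(save_dir),
        "--encodec_model_path", str(encodec_model_path),
        "--mega_batch_size", str(mega_batch_size),
        "--batch_size", str(batch_size),
        "--max_len", str(max_len),
    ]


cmd = phonemize_command("path/to/store_huggingface_downloads",
                        "path/to/store_extracted_codes_and_phonemes",
                        "path/to/encodec_model")
print(shlex.join(cmd))  # shell-safe string for copy and paste
```

The same list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting altogether.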

Gradio

  1. Running in Colab:
    Use the project's "Open in Colab" notebook link.
    
  2. Running locally:
    apt-get install -y espeak espeak-data libespeak1 libespeak-dev
    apt-get install -y 'festival*'
    apt-get install -y build-essential
    apt-get install -y flac libasound2-dev libsndfile1-dev vorbis-tools
    apt-get install -y libxml2-dev libxslt-dev zlib1g-dev
    pip install -r gradio_requirements.txt
    python gradio_app.py
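
Before launching the app, it can help to confirm that the system tools are actually on the PATH. This is a small sketch with a hypothetical helper name; the executable names `espeak`, `festival`, and `flac` are assumptions about what the packages above install.

```python
import shutil

# Assumed executable names provided by the apt packages listed above.
EXPECTED_TOOLS = ["espeak", "festival", "flac"]


def missing_tools(tools=EXPECTED_TOOLS):
    """Return the tools from the list that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All expected tools found.")
```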
    

Common problems

  • How can I make the generated speech sound more natural? Make sure the input text matches the style and context of the target speech sample.
  • What should I do if the generated audio is noisy? Try a higher-quality speech sample or adjust the model parameters.

May not be reproduced without permission: Chief AI Sharing Circle » VoiceCraft: open source zero-shot speech cloning and text-to-speech tool