General Introduction
GPT-SoVITS is an open-source speech synthesis and voice-conversion tool that combines a GPT-style model with SoVITS voice-conversion technology. It supports zero-shot and few-shot text-to-speech, and can clone a voice's style from as little as a 5-second audio sample. Features include cross-language support, built-in vocal/accompaniment separation, and other utilities that make it easy even for beginners to create personalized voice models. It works with English, Japanese, and Chinese, and ships with a WebUI toolset that assists the whole pipeline from data preprocessing to model training. Whether you are new to AI or a professional, you can experience the appeal of speech technology here.
Function List
- Zero-shot TTS: provide a 5-second speech sample and get text-to-speech output immediately.
- Few-shot TTS: fine-tune the model with as little as 1 minute of training data to improve voice similarity and realism.
- Cross-language support: inference in languages different from the training set, currently including English, Japanese, Korean, Cantonese, and Mandarin.
- WebUI tools: integrated vocal/accompaniment separation, automatic training-set segmentation, Chinese ASR, and text annotation help beginners create training datasets and GPT/SoVITS models.
Using Help
Installation process
Windows users
- Download the integrated package.
- Double-click go-webui.bat to start the GPT-SoVITS-WebUI.
- Follow the interface prompts.
Linux users
- Create a virtual environment:
conda create -n GPTSoVits python=3.9
- Activate the virtual environment:
conda activate GPTSoVits
- Install the dependencies:
bash install.sh
macOS users
- Install the Xcode command line tool:
xcode-select --install
- Install FFmpeg:
brew install ffmpeg
- Create a virtual environment and install dependencies:
conda create -n GPTSoVits python=3.9
conda activate GPTSoVits
pip install -r requirements.txt
Usage Process
- Data preparation: prepare a speech sample of at least 5 seconds and upload it in the WebUI interface.
- Model training: select zero-shot or few-shot mode and upload the corresponding training data.
- Speech synthesis: enter the text content, select the target voice sample, and click the Convert button.
- Results export: after conversion completes, download the generated audio file.
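Besides the WebUI, the project also ships an HTTP API server (api.py) that can drive the same synthesis flow programmatically. The sketch below only assembles a request URL for such a server; the host, port, and parameter names shown are assumptions based on a typical zero-shot request, so check the project's api.py for the actual interface before relying on them.

```python
# Hypothetical client sketch for a locally running GPT-SoVITS API server.
# The endpoint, port, and parameter names are assumptions -- verify them
# against the project's api.py before use.
import urllib.parse


def build_tts_request(base_url, ref_wav, prompt_text, prompt_lang, text, text_lang):
    """Assemble the query URL for a zero-shot TTS request."""
    params = {
        "refer_wav_path": ref_wav,     # 5-second reference sample (assumed name)
        "prompt_text": prompt_text,    # transcript of the reference sample
        "prompt_language": prompt_lang,
        "text": text,                  # text to synthesize
        "text_language": text_lang,
    }
    return base_url + "?" + urllib.parse.urlencode(params)


url = build_tts_request(
    "http://127.0.0.1:9880",          # assumed default host/port
    "samples/ref.wav", "Hello there.", "en",
    "Text to be spoken.", "en",
)
print(url)
```

The returned URL can then be fetched with any HTTP client; the server is expected to respond with the generated audio bytes.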
Feature details
- Zero-shot TTS: upload a 5-second voice sample in the WebUI interface, enter the text content, and click the Convert button to generate the corresponding audio file.
- Few-shot TTS: upload at least 1 minute of training data to fine-tune the model and improve the similarity and realism of the generated speech.
- Cross-language support: enter text in a language different from the reference sample; the system handles the language switch and generates the speech.
- WebUI tools: simplify data processing and model training with built-in features such as vocal/accompaniment separation, automatic training-set segmentation, Chinese ASR, and text labeling.
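To make the "automatic training-set segmentation" idea concrete: the tool splits long recordings into short clips at stretches of silence so each clip can be transcribed and used as a training sample. The snippet below is a minimal illustrative sketch of silence-based splitting using frame energy, not the tool's actual implementation; the threshold and frame sizes are arbitrary assumptions.

```python
# Illustrative sketch of silence-based audio splitting -- the idea behind
# automatic training-set segmentation, not GPT-SoVITS's actual code.
import numpy as np


def split_on_silence(signal, sr, threshold=0.01, min_silence_s=0.3):
    """Return (start, end) sample indices of non-silent segments."""
    frame = int(0.02 * sr)                  # 20 ms analysis frames
    n_frames = len(signal) // frame
    # Root-mean-square energy per frame.
    energy = np.array([
        np.sqrt(np.mean(signal[i * frame:(i + 1) * frame] ** 2))
        for i in range(n_frames)
    ])
    voiced = energy > threshold
    min_gap = int(min_silence_s / 0.02)     # silent frames needed to split
    segments, start, gap = [], None, 0
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # Close the segment at the last voiced frame.
                segments.append((start * frame, (i - gap + 1) * frame))
                start, gap = None, 0
    if start is not None:
        segments.append((start * frame, n_frames * frame))
    return segments
```

For example, a recording of two spoken phrases separated by half a second of silence would come back as two (start, end) index pairs, each of which could then be saved as its own clip.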