General Introduction
GPT-SoVITS is an open-source speech synthesis and voice-conversion tool that combines a GPT-style model with SoVITS voice-conversion technology. It supports zero-shot and few-shot text-to-speech, and can clone a voice's style from as little as a 5-second audio sample. Features include cross-language support, built-in vocal/accompaniment separation, and other utilities that make it easy even for beginners to create personalized voice models. It works with English, Japanese, and Chinese, and ships with a WebUI toolset that assists the whole pipeline from data preprocessing to model training. Whether you are new to AI or a professional, you can experience the appeal of speech technology here.
Function List
- Zero-shot TTS: provide a 5-second speech sample and get text-to-speech output immediately.
- Few-shot TTS: fine-tune the model with as little as 1 minute of training data to improve voice similarity and realism.
- Cross-language support: inference in languages different from the training set, currently including English, Japanese, Korean, Cantonese, and Mandarin.
- WebUI tools: integrated vocal/accompaniment separation, automatic training-set segmentation, Chinese ASR, and text annotation help beginners create training datasets and GPT/SoVITS models.
Using Help
Installation process
Windows users
- Download the integrated package.
- Double-click go-webui.bat to start the GPT-SoVITS-WebUI.
- Follow the interface prompts.
Linux users
- Create a virtual environment:
conda create -n GPTSoVits python=3.9
- Activate the virtual environment:
conda activate GPTSoVits
- Install the dependencies:
bash install.sh
macOS users
- Install the Xcode command line tool:
xcode-select --install
- Install FFmpeg:
brew install ffmpeg
- Create a virtual environment and install dependencies:
conda create -n GPTSoVits python=3.9
conda activate GPTSoVits
pip install -r requirements.txt
Usage Process
- Data preparation: prepare a speech sample of at least 5 seconds and upload it in the WebUI interface.
- Model training: select zero-shot or few-shot mode and upload the corresponding training data.
- Speech synthesis: enter the text content, select the target voice sample, and click the Convert button.
- Results export: after conversion completes, download the generated audio file.
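Besides the WebUI, the project also ships an HTTP API server (api.py) that can drive the same synthesis flow programmatically. The sketch below only assembles a request URL for such a server; the host, port, and parameter names shown are assumptions based on a typical zero-shot request, so check the project's api.py for the actual interface before relying on them.

```python
# Hypothetical client sketch for a locally running GPT-SoVITS API server.
# The endpoint, port, and parameter names are assumptions -- verify them
# against the project's api.py before use.
import urllib.parse


def build_tts_request(base_url, ref_wav, prompt_text, prompt_lang, text, text_lang):
    """Assemble the query URL for a zero-shot TTS request."""
    params = {
        "refer_wav_path": ref_wav,     # 5-second reference sample (assumed name)
        "prompt_text": prompt_text,    # transcript of the reference sample
        "prompt_language": prompt_lang,
        "text": text,                  # text to synthesize
        "text_language": text_lang,
    }
    return base_url + "?" + urllib.parse.urlencode(params)


url = build_tts_request(
    "http://127.0.0.1:9880",          # assumed default host/port
    "samples/ref.wav", "Hello there.", "en",
    "Text to be spoken.", "en",
)
print(url)
```

The returned URL can then be fetched with any HTTP client; the server is expected to respond with the generated audio bytes.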
Feature details
- Zero-shot TTS: upload a 5-second voice sample in the WebUI interface, enter the text content, and click the Convert button to generate the corresponding audio file.
- Few-shot TTS: upload at least 1 minute of training data to fine-tune the model and improve the similarity and realism of the generated speech.
- Cross-language support: enter text in a language different from the reference sample; the system handles the language switch and generates the speech.
- WebUI tools: simplify data processing and model training with built-in features such as vocal/accompaniment separation, automatic training-set segmentation, Chinese ASR, and text labeling.
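To make the "automatic training-set segmentation" idea concrete: the tool splits long recordings into short clips at stretches of silence so each clip can be transcribed and used as a training sample. The snippet below is a minimal illustrative sketch of silence-based splitting using frame energy, not the tool's actual implementation; the threshold and frame sizes are arbitrary assumptions.

```python
# Illustrative sketch of silence-based audio splitting -- the idea behind
# automatic training-set segmentation, not GPT-SoVITS's actual code.
import numpy as np


def split_on_silence(signal, sr, threshold=0.01, min_silence_s=0.3):
    """Return (start, end) sample indices of non-silent segments."""
    frame = int(0.02 * sr)                  # 20 ms analysis frames
    n_frames = len(signal) // frame
    # Root-mean-square energy per frame.
    energy = np.array([
        np.sqrt(np.mean(signal[i * frame:(i + 1) * frame] ** 2))
        for i in range(n_frames)
    ])
    voiced = energy > threshold
    min_gap = int(min_silence_s / 0.02)     # silent frames needed to split
    segments, start, gap = [], None, 0
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # Close the segment at the last voiced frame.
                segments.append((start * frame, (i - gap + 1) * frame))
                start, gap = None, 0
    if start is not None:
        segments.append((start * frame, n_frames * frame))
    return segments
```

For example, a recording of two spoken phrases separated by half a second of silence would come back as two (start, end) index pairs, each of which could then be saved as its own clip.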