MegaTTS3: A Lightweight Model for Synthesizing Chinese and English Speech

General Introduction

MegaTTS3 is an open-source speech synthesis tool developed by ByteDance in cooperation with Zhejiang University, focused on generating high-quality Chinese and English speech. Its core model has only 0.45B parameters, making it lightweight and efficient, and it supports mixed Chinese-English speech generation and voice cloning. The project is hosted on GitHub, with code and pre-trained models available for free download. MegaTTS3 can mimic a target voice from a few seconds of audio samples, and it also supports adjusting accent strength. It is suitable for academic research, content creation, and the development of speech applications, with pronunciation and duration control features planned for the future.


Function List

  • Generates Chinese, English, and mixed Chinese-English speech with natural, fluent output.
  • Achieves high-quality voice cloning from a small amount of audio, mimicking a specific timbre.
  • Supports accent strength adjustment to generate speech with an accent or with standard pronunciation.
  • Uses acoustic latents to improve model training efficiency.
  • Built-in high-quality WaveVAE vocoder for enhanced speech intelligibility and realism.
  • Provides Aligner and Grapheme-to-Phoneme submodules to support speech analysis.
  • Open-source code and pre-trained models for customized development.

Using Help

Using MegaTTS3 requires some basic programming experience, especially with Python and deep-learning environments. Detailed installation and usage instructions follow.

Installation process

  1. Build the environment
    MegaTTS3 recommends Python 3.9. You can create a virtual environment with Conda:

    conda create -n megatts3-env python=3.9
    conda activate megatts3-env
    

    After activation, all operations are performed in this environment.

  2. Download the code
    Run the following commands in a terminal to clone the GitHub repository:

    git clone https://github.com/bytedance/MegaTTS3.git
    cd MegaTTS3
    
  3. Install dependencies
    The project provides a requirements.txt; run the following command to install the required libraries:

    pip install -r requirements.txt
    

    Installation time varies with your network and device; it usually completes in a few minutes.
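
    If requirements.txt does not resolve a GPU-enabled PyTorch build on your system, you may need to install one matching your CUDA version first. A minimal sketch, assuming CUDA 12.1 (check pytorch.org for the index URL that matches your setup):

    # Hypothetical extra step: install a CUDA 12.1 build of PyTorch and torchaudio.
    # Adjust the index URL to match your local CUDA version.
    pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121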

  4. Get the model
    Pre-trained models can be downloaded from Google Drive or Hugging Face (see the links in the official README). Download and unzip them into the ./checkpoints/ folder. For example:

    • Put model.pth into ./checkpoints/model.pth.
    • Pre-extracted latents files need to be downloaded from the specified link into the same directory.
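
    After downloading, the working directory might look like the following (an illustrative sketch only; the exact file names depend on the release you download):

    # Illustrative layout; actual file names may differ.
    checkpoints/
    └── model.pth                 # main pre-trained model
    assets/
    ├── Chinese_prompt.wav        # sample reference audio
    └── Chinese_prompt.npy        # pre-extracted latents for that audio (downloaded separately)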
  5. Test the installation
    Run a simple test command to verify the environment:

    python tts/infer_cli.py --input_wav 'assets/Chinese_prompt.wav' --input_text "测试" --output_dir ./gen
    

    If no errors are reported, the installation was successful.

Main Functions

Speech Synthesis

Generating speech is the core function of MegaTTS3. It requires input text and a reference audio clip:

  • Prepare the files
    Place the reference audio (e.g. Chinese_prompt.wav) and its latents file (e.g. Chinese_prompt.npy) in the assets/ folder. If you do not have a latents file, the official pre-extracted files are required.
  • Run the command
    Enter:

    CUDA_VISIBLE_DEVICES=0 python tts/infer_cli.py --input_wav 'assets/Chinese_prompt.wav' --input_text "你好,这是一段测试语音" --output_dir ./gen
    
    • --input_wav is the reference audio path.
    • --input_text is the text to be synthesized.
    • --output_dir is the output folder.
  • View the results
    The generated speech is saved as ./gen/output.wav and can be played directly.
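
To synthesize several sentences in one run, the command above can be wrapped in a small shell loop. A minimal sketch in bash, assuming infer_cli.py creates the output directory when it does not exist (a separate folder per sentence avoids overwriting output.wav):

    # Illustrative batch loop (bash); the flags are those documented above.
    sentences=("你好，世界" "今天天气不错" "This is a mixed 中英文 sentence.")
    for i in "${!sentences[@]}"; do
      CUDA_VISIBLE_DEVICES=0 python tts/infer_cli.py \
        --input_wav 'assets/Chinese_prompt.wav' \
        --input_text "${sentences[$i]}" \
        --output_dir "./gen/batch_$i"    # one output folder per sentence
    done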

Voice Cloning

Only a few seconds of sample audio are needed to mimic a specific voice (a concrete example follows this list):

  • Prepare clear reference audio (5-10 seconds recommended).
  • Use the synthesis command above, specifying your sample as --input_wav.
  • The output voice will match the reference timbre as closely as possible.
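
A minimal sketch, assuming you have a recording assets/my_voice.wav together with its matching pre-extracted latents file assets/my_voice.npy (both file names are hypothetical; the latents cannot be extracted locally, see the WaveVAE notes below):

    CUDA_VISIBLE_DEVICES=0 python tts/infer_cli.py --input_wav 'assets/my_voice.wav' --input_text "用我的声音说这句话" --output_dir ./gen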

Accent Control

Adjust accent strength via the parameters p_w and t_w:

  • To retain the accent of an accented English reference:
    CUDA_VISIBLE_DEVICES=0 python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text "这是一条有口音的音频" --output_dir ./gen --p_w 1.0 --t_w 3.0
    
  • p_w close to 1.0 retains the original accent; increasing it pushes pronunciation toward the standard.
  • t_w controls timbre similarity and is usually set 0-3 points higher than p_w.
  • To generate more standard pronunciation:
    CUDA_VISIBLE_DEVICES=0 python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text "这条音频发音标准一些" --output_dir ./gen --p_w 2.5 --t_w 2.5
    

Web UI Operations

MegaTTS3 also supports operation through a web interface:

  • Launch the server:
    CUDA_VISIBLE_DEVICES=0 python tts/gradio_api.py
    
  • Open your browser at the default address (localhost:7860), then upload audio and enter text to generate speech. Generation takes roughly 30 seconds in a CPU-only environment.
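
If the server runs on a remote GPU machine, a standard SSH tunnel lets you reach the interface from a local browser. A sketch, with user@gpu-server as a hypothetical remote host:

    # Forward the remote Gradio port 7860 to your local machine.
    ssh -L 7860:localhost:7860 user@gpu-server
    # Then open http://localhost:7860 in your local browser.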

Submodule Usage

Aligner

  • Function: Aligns speech with text.
  • Usage: Run the example code in tts/frontend_function.py for speech segmentation or phoneme recognition.

Grapheme-to-Phoneme

  • Function: Converts text to phonemes.
  • Usage: See how it is invoked in tts/infer_cli.py; it can be used for pronunciation analysis.

WaveVAE

  • Function: Compresses audio into latents and reconstructs it.
  • Limitation: The encoder parameters are not released, so it can only be used with pre-extracted latents.

Notes

  • The WaveVAE encoder parameters are not released for security reasons; only the official latents files can be used.
  • The project was released on March 22, 2025 and is still under development; pronunciation and duration control features are planned.
  • GPU acceleration is recommended; it runs on a CPU, but slowly.

Application Scenarios

  1. Academic research
    Researchers can use MegaTTS3 to test speech synthesis techniques and analyze the effect of latents.
  2. Educational aids
    Convert textbooks into speech and generate audiobooks to enhance the learning experience.
  3. Content creation
    Generate narration for videos or podcasts, saving manual recording costs.
  4. Voice interaction
    Developers can integrate it into devices to enable voice conversations in Chinese and English.

FAQ

  1. What languages are supported?
    Chinese, English, and mixed Chinese-English speech are supported, with possible expansion to other languages in the future.
  2. Is a GPU required?
    No. It can run on a CPU, but slowly; a GPU is recommended.
  3. How do I handle installation failures?
    Update pip (pip install --upgrade pip), check your network, or submit an issue on GitHub.
  4. Why is the WaveVAE encoder not included?
    It is withheld for security reasons; the official pre-extracted latents are required.