General Introduction
MegaTTS3 is an open-source speech synthesis tool developed by ByteDance in cooperation with Zhejiang University that focuses on generating high-quality Chinese and English speech. Its core model has only 0.45B parameters, which keeps it lightweight and efficient, and it supports mixed Chinese-English speech generation and voice cloning. The project is hosted on GitHub, where the code and pre-trained models are free to download. MegaTTS3 can mimic a target voice from a few seconds of audio and also supports adjusting accent intensity. It is suitable for academic research, content creation, and speech application development; pronunciation and duration control features are planned for a future release.
Function List
- Generate Chinese, English and mixed speech with natural and smooth output.
- Clone a specific timbre with high quality from a small amount of reference audio.
- Supports accent strength adjustment to generate speech with accent or standard pronunciation.
- Use acoustic latents to improve model training efficiency.
- Built-in high-quality WaveVAE vocoder for enhanced speech intelligibility and realism.
- Aligner and Grapheme-to-Phoneme submodules are provided to support speech analysis.
- Open source code and pre-trained models for customized development.
Using Help
MegaTTS3 requires some basic programming experience, especially with Python and deep learning environments. The following are detailed installation and usage instructions.
Installation process
- Build the environment
Python 3.9 is recommended for MegaTTS3. Create and activate a virtual environment with Conda:
conda create -n megatts3-env python=3.9
conda activate megatts3-env
After activation, all subsequent operations are performed in this environment.
- Download the code
Run the following commands in a terminal to clone the GitHub repository:
git clone https://github.com/bytedance/MegaTTS3.git
cd MegaTTS3
- Install dependencies
The project provides a requirements.txt file. Run the following command to install the required libraries:
pip install -r requirements.txt
Installation time varies by network and device and usually takes a few minutes.
- Get the models
Pre-trained models can be downloaded from Google Drive or Hugging Face (see the official README for links). After downloading, extract them into the ./checkpoints/ folder. For example, place model.pth at ./checkpoints/model.pth. The pre-extracted latents files need to be downloaded from the specified link into the same directory.
- Test the installation
Run a simple test command to verify the environment:
python tts/infer_cli.py --input_wav 'assets/Chinese_prompt.wav' --input_text "测试" --output_dir ./gen
If no errors are reported, the installation was successful.
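The installation steps above can also be checked with a short preflight script before running inference. This is a minimal sketch and not part of the project: the function name `preflight` is hypothetical, and it checks only what this guide mentions (Python 3.9, a `requirements.txt` file, and a non-empty `./checkpoints/` folder).

```python
import sys
from pathlib import Path


def preflight(repo_root, version=None):
    """Return a list of problems; an empty list means the layout looks fine.

    Hypothetical helper: it checks only what this guide describes.
    """
    root = Path(repo_root)
    major, minor = (version or sys.version_info)[:2]
    problems = []

    # The guide recommends Python 3.9; other versions are simply flagged.
    if (major, minor) != (3, 9):
        problems.append(f"Python {major}.{minor} in use; 3.9 is recommended")

    # requirements.txt ships with the repository, so its absence suggests a wrong folder.
    if not (root / "requirements.txt").is_file():
        problems.append("requirements.txt not found; is this the MegaTTS3 folder?")

    # Pre-trained models are expected under ./checkpoints/.
    ckpt = root / "checkpoints"
    if not ckpt.is_dir() or not any(ckpt.iterdir()):
        problems.append("checkpoints/ is missing or empty; download the models first")

    return problems
```

Calling `preflight(".")` from the cloned folder returns the list of problems found, so an empty list suggests the test command is worth trying.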
Main Functions
Speech synthesis
Generating speech is the core function of MegaTTS3. It requires the input of text and reference audio:
- Prepare the files
Place the reference audio (e.g. Chinese_prompt.wav) and its latents file (e.g. Chinese_prompt.npy) in the assets/ folder. If you do not have a latents file, use the official pre-extracted ones.
- Run the command
CUDA_VISIBLE_DEVICES=0 python tts/infer_cli.py --input_wav 'assets/Chinese_prompt.wav' --input_text "你好,这是一段测试语音" --output_dir ./gen
--input_wav is the reference audio path.
--input_text is the text to be synthesized.
--output_dir is the output folder.
- View the results
The generated speech is saved to ./gen/output.wav and can be played directly.
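The same synthesis step can be driven from Python when many texts need to be generated. The sketch below only wraps the command shown above with the three flags this guide documents. The helper names are hypothetical, and the assumption that the latents file sits next to the WAV with the same name and a .npy extension is inferred from the example file names, not confirmed by the project.

```python
import subprocess
import sys
from pathlib import Path


def latents_path_for(wav_path):
    """Assumed convention: the latents file sits next to the WAV with a .npy suffix."""
    return Path(wav_path).with_suffix(".npy")


def build_infer_command(input_wav, input_text, output_dir="./gen"):
    """Build the argument list for tts/infer_cli.py using only the flags shown above."""
    return [
        sys.executable, "tts/infer_cli.py",
        "--input_wav", str(input_wav),
        "--input_text", input_text,
        "--output_dir", str(output_dir),
    ]


def synthesize(input_wav, input_text, output_dir="./gen"):
    """Run the CLI from the repository root; fail early if the latents file is absent."""
    latents = latents_path_for(input_wav)
    if not latents.is_file():
        raise FileNotFoundError(f"missing pre-extracted latents: {latents}")
    # check=True raises CalledProcessError if the CLI exits with a non-zero status.
    subprocess.run(build_infer_command(input_wav, input_text, output_dir), check=True)
```

Because the child process inherits the caller's environment, setting CUDA_VISIBLE_DEVICES before starting Python has the same effect as the prefix in the shell command.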
Voice cloning
Only a few seconds of sample audio are needed to mimic a specific voice:
- Prepare clear reference audio (5-10 seconds recommended).
- Run the synthesis command above, passing the reference audio via --input_wav.
- The output voice will be as close as possible to the reference timbre.
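The recommended clip length can be checked before cloning. The following is a minimal sketch using Python's standard `wave` module, which reads uncompressed PCM WAV files; other formats would need a different reader. The function names are hypothetical, and the 5-10 second bounds come from the recommendation above.

```python
import wave


def wav_duration_seconds(path):
    """Duration of a PCM WAV file, computed from its frame count and sample rate."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_recommended_length(path, low=5.0, high=10.0):
    """True if the clip falls inside the recommended 5-10 second range."""
    return low <= wav_duration_seconds(path) <= high
```

A clip outside the range may still work; the check only flags reference audio that departs from the recommendation.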
Accent control
Accent strength is adjusted through the parameters p_w and t_w:
- Use English reference audio with an accent and retain it:
CUDA_VISIBLE_DEVICES=0 python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text "这是一条有口音的音频" --output_dir ./gen --p_w 1.0 --t_w 3.0
When p_w is close to 1.0, the original accent is retained; increasing it moves the output toward standard pronunciation. t_w controls timbre similarity and is usually set 0-3 higher than p_w.
- Generate standardized pronunciation:
CUDA_VISIBLE_DEVICES=0 python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text "这条音频发音标准一些" --output_dir ./gen --p_w 2.5 --t_w 2.5
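The relationship between the two weights can be captured in a small helper that produces the extra command-line arguments. This is a sketch under stated assumptions: `accent_args` and the two preset names are hypothetical, the preset values are taken from the two commands above, and the 0-3 gap is this guide's rule of thumb, so the check here is stricter than anything the tool itself is known to enforce.

```python
# Presets copied from the two example commands in this section: (p_w, t_w).
KEEP_ACCENT = (1.0, 3.0)
STANDARD_PRONUNCIATION = (2.5, 2.5)


def accent_args(p_w, t_w):
    """Return CLI arguments for --p_w and --t_w after checking the rule of thumb.

    The 0-3 gap is guidance from this guide, not a documented limit of the
    tool, so the ValueError exists only in this hypothetical helper.
    """
    gap = t_w - p_w
    if not 0.0 <= gap <= 3.0:
        raise ValueError(f"t_w should be 0 to 3 above p_w, got a gap of {gap:g}")
    return ["--p_w", str(p_w), "--t_w", str(t_w)]
```

The returned list can be appended to an argument list for tts/infer_cli.py, for example `accent_args(*KEEP_ACCENT)`.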
Web UI Operations
Operation through a web interface is also supported:
- Run:
CUDA_VISIBLE_DEVICES=0 python tts/gradio_api.py
- Open your browser, go to the default address localhost:7860, and upload the audio and text to generate speech. In a CPU environment, generation takes approximately 30 seconds.
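Since the interface may take a while to come up, a script can wait until the port accepts connections before opening the browser. The sketch below uses only the Python standard library; the function names and the retry settings are hypothetical, and the default port 7860 is the one stated above.

```python
import socket
import time


def is_port_open(host="localhost", port=7860, timeout=1.0):
    """True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def wait_for_ui(host="localhost", port=7860, attempts=30, delay=1.0):
    """Poll the port up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if is_port_open(host, port):
            return True
        time.sleep(delay)
    return False
```

Once `wait_for_ui()` returns True, the page can be opened manually or with the standard `webbrowser.open("http://localhost:7860")` call.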
Submodule usage
Aligner
- Functionality: Aligns speech and text.
- Usage: Run the example code in tts/frontend_function.py for speech segmentation or phoneme recognition.
Grapheme-to-Phoneme
- Functionality: Converts text to phonemes.
- Usage: See tts/infer_cli.py; it can be used for pronunciation analysis.
WaveVAE
- Functionality: Compresses audio into latents and reconstructs it.
- Limitation: The encoder parameters are not released, so it can only be used with pre-extracted latents.
Caveats
- The WaveVAE encoder parameters are not released for security reasons, so only the official latents files can be used.
- The project was released on March 22, 2025 and is still under development; pronunciation and duration adjustment features are planned.
- GPU acceleration is recommended. It runs on a CPU, but slowly.
Application scenarios
- Academic research
Researchers can test speech synthesis technology with MegaTTS3 and analyze the effect of latents.
- Educational aids
Convert textbooks to speech and generate audiobooks to enhance the learning experience.
- Content creation
Generate narration for videos or podcasts and save on manual recording costs.
- Voice interaction
Developers can integrate it into their devices to enable voice conversations in English and Chinese.
QA
- What languages are supported?
Chinese, English, and mixed speech are supported, with the possibility of expanding to other languages in the future.
- Is a GPU required?
No. It can run on a CPU, but slowly; a GPU is recommended.
- How do I handle installation failures?
Update pip (pip install --upgrade pip), check your network, or submit an issue on GitHub.
- Why is the WaveVAE encoder missing?
It is not released for security reasons; the official pre-extracted latents must be used.