MegaTTS3: A Lightweight Model for Synthesizing Chinese and English Speech

General Introduction

MegaTTS3 is an open-source speech synthesis tool developed by ByteDance in cooperation with Zhejiang University, focused on generating high-quality Chinese and English speech. Its core model has only 0.45B parameters, making it lightweight and efficient, and it supports mixed Chinese-English speech generation and voice cloning. The project is hosted on GitHub, with code and pre-trained models available for free download. MegaTTS3 can mimic a target voice from a few seconds of audio samples, and it also supports adjusting accent strength. It is suitable for academic research, content creation, and the development of speech applications, with pronunciation and duration control features planned for the future.


Function List

  • Generates Chinese, English, and mixed Chinese-English speech with natural, fluent output.
  • Achieves high-quality voice cloning from a small amount of audio, mimicking a specific timbre.
  • Supports accent strength adjustment to generate speech with an accent or with standard pronunciation.
  • Uses acoustic latents to improve model training efficiency.
  • Built-in high-quality WaveVAE vocoder for enhanced speech intelligibility and realism.
  • Provides Aligner and Grapheme-to-Phoneme submodules to support speech analysis.
  • Open-source code and pre-trained models for customized development.

Using Help

Using MegaTTS3 requires some basic programming experience, especially with Python and deep-learning environments. Detailed installation and usage instructions follow.

Installation process

  1. Build the environment
    MegaTTS3 recommends Python 3.9. You can create a virtual environment with Conda:

    conda create -n megatts3-env python=3.9
    conda activate megatts3-env
    

    After activation, all operations are performed in this environment.

  2. Download the code
    Run the following commands in a terminal to clone the GitHub repository:

    git clone https://github.com/bytedance/MegaTTS3.git
    cd MegaTTS3
    
  3. Install dependencies
    The project provides a requirements.txt; run the following command to install the required libraries:

    pip install -r requirements.txt
    

    Installation time varies with your network and device; it usually completes in a few minutes.
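
    If requirements.txt does not resolve a GPU-enabled PyTorch build on your system, you may need to install one matching your CUDA version first. A minimal sketch, assuming CUDA 12.1 (check pytorch.org for the index URL that matches your setup):

    # Hypothetical extra step: install a CUDA 12.1 build of PyTorch and torchaudio.
    # Adjust the index URL to match your local CUDA version.
    pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121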

  4. Get the model
    Pre-trained models can be downloaded from Google Drive or Hugging Face (see the links in the official README). Download and unzip them into the ./checkpoints/ folder. For example:

    • Put model.pth into ./checkpoints/model.pth.
    • Pre-extracted latents files need to be downloaded from the specified link into the same directory.
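
    After downloading, the working directory might look like the following (an illustrative sketch only; the exact file names depend on the release you download):

    # Illustrative layout; actual file names may differ.
    checkpoints/
    └── model.pth                 # main pre-trained model
    assets/
    ├── Chinese_prompt.wav        # sample reference audio
    └── Chinese_prompt.npy        # pre-extracted latents for that audio (downloaded separately)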
  5. Test the installation
    Run a simple test command to verify the environment:

    python tts/infer_cli.py --input_wav 'assets/Chinese_prompt.wav' --input_text "测试" --output_dir ./gen
    

    If no errors are reported, the installation was successful.

Main Functions

Speech Synthesis

Generating speech is the core function of MegaTTS3. It requires input text and a reference audio clip:

  • Prepare the files
    Place the reference audio (e.g. Chinese_prompt.wav) and its latents file (e.g. Chinese_prompt.npy) in the assets/ folder. If you do not have a latents file, the official pre-extracted files are required.
  • Run the command
    Enter:

    CUDA_VISIBLE_DEVICES=0 python tts/infer_cli.py --input_wav 'assets/Chinese_prompt.wav' --input_text "你好,这是一段测试语音" --output_dir ./gen
    
    • --input_wav is the reference audio path.
    • --input_text is the text to be synthesized.
    • --output_dir is the output folder.
  • View the results
    The generated speech is saved as ./gen/output.wav and can be played directly.
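
To synthesize several sentences in one run, the command above can be wrapped in a small shell loop. A minimal sketch in bash, assuming infer_cli.py creates the output directory when it does not exist (a separate folder per sentence avoids overwriting output.wav):

    # Illustrative batch loop (bash); the flags are those documented above.
    sentences=("你好，世界" "今天天气不错" "This is a mixed 中英文 sentence.")
    for i in "${!sentences[@]}"; do
      CUDA_VISIBLE_DEVICES=0 python tts/infer_cli.py \
        --input_wav 'assets/Chinese_prompt.wav' \
        --input_text "${sentences[$i]}" \
        --output_dir "./gen/batch_$i"    # one output folder per sentence
    done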

Voice Cloning

Only a few seconds of sample audio are needed to mimic a specific voice (a concrete example follows this list):

  • Prepare clear reference audio (5-10 seconds recommended).
  • Use the synthesis command above, specifying your sample as --input_wav.
  • The output voice will match the reference timbre as closely as possible.
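
A minimal sketch, assuming you have a recording assets/my_voice.wav together with its matching pre-extracted latents file assets/my_voice.npy (both file names are hypothetical; the latents cannot be extracted locally, see the WaveVAE notes below):

    CUDA_VISIBLE_DEVICES=0 python tts/infer_cli.py --input_wav 'assets/my_voice.wav' --input_text "用我的声音说这句话" --output_dir ./gen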

Accent Control

Adjust accent strength via the parameters p_w and t_w:

  • To retain the accent of an accented English reference:
    CUDA_VISIBLE_DEVICES=0 python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text "这是一条有口音的音频" --output_dir ./gen --p_w 1.0 --t_w 3.0
    
  • p_w close to 1.0 retains the original accent; increasing it pushes pronunciation toward the standard.
  • t_w controls timbre similarity and is usually set 0-3 points higher than p_w.
  • To generate more standard pronunciation:
    CUDA_VISIBLE_DEVICES=0 python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text "这条音频发音标准一些" --output_dir ./gen --p_w 2.5 --t_w 2.5
    

Web UI Operations

MegaTTS3 also supports operation through a web interface:

  • Launch the server:
    CUDA_VISIBLE_DEVICES=0 python tts/gradio_api.py
    
  • Open your browser at the default address (localhost:7860), then upload audio and enter text to generate speech. Generation takes roughly 30 seconds in a CPU-only environment.
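
If the server runs on a remote GPU machine, a standard SSH tunnel lets you reach the interface from a local browser. A sketch, with user@gpu-server as a hypothetical remote host:

    # Forward the remote Gradio port 7860 to your local machine.
    ssh -L 7860:localhost:7860 user@gpu-server
    # Then open http://localhost:7860 in your local browser.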

Submodule Usage

Aligner

  • Function: Aligns speech with text.
  • Usage: Run the example code in tts/frontend_function.py for speech segmentation or phoneme recognition.

Grapheme-to-Phoneme

  • Function: Converts text to phonemes.
  • Usage: See how it is invoked in tts/infer_cli.py; it can be used for pronunciation analysis.

WaveVAE

  • Function: Compresses audio into latents and reconstructs it.
  • Limitation: The encoder parameters are not released, so it can only be used with pre-extracted latents.

Notes

  • The WaveVAE encoder parameters are not released for security reasons; only the official latents files can be used.
  • The project was released on March 22, 2025 and is still under development; pronunciation and duration control features are planned.
  • GPU acceleration is recommended; it runs on a CPU, but slowly.

Application Scenarios

  1. Academic research
    Researchers can use MegaTTS3 to test speech synthesis techniques and analyze the effect of latents.
  2. Educational aids
    Convert textbooks into speech and generate audiobooks to enhance the learning experience.
  3. Content creation
    Generate narration for videos or podcasts, saving manual recording costs.
  4. Voice interaction
    Developers can integrate it into devices to enable voice conversations in Chinese and English.

FAQ

  1. What languages are supported?
    Chinese, English, and mixed Chinese-English speech are supported, with possible expansion to other languages in the future.
  2. Is a GPU required?
    No. It can run on a CPU, but slowly; a GPU is recommended.
  3. How do I handle installation failures?
    Update pip (pip install --upgrade pip), check your network, or submit an issue on GitHub.
  4. Why is the WaveVAE encoder not included?
    It is withheld for security reasons; the official pre-extracted latents are required.