General Introduction
MaskGCT (Masked Generative Codec Transformer) is a fully non-autoregressive Text-to-Speech (TTS) model jointly introduced by Quwan Technology and The Chinese University of Hong Kong. The model eliminates the need for explicit text-to-speech alignment information and adopts a two-stage generation approach: it first predicts semantic tokens from text, then generates acoustic tokens from the semantic tokens. MaskGCT performs well on zero-shot TTS, producing speech with high quality, strong similarity to the prompt speaker, and good intelligibility.
Public beta product: Quwan Qianyin, a voice cloning and multilingual video translation tool
Paper: https://arxiv.org/abs/2409.00750

Online demo: https://huggingface.co/spaces/amphion/maskgct
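To make the two-stage generation described above concrete, here is a schematic sketch of the data flow. The function names and bodies below are placeholders for illustration only, not the real Amphion API; the actual pipeline objects appear in the inference script later in this article.

# Schematic only: placeholder stages standing in for MaskGCT's real models.
def text_to_semantic_tokens(text: str) -> list[int]:
    # Stage 1 (text-to-semantic): a masked generative transformer predicts
    # discrete semantic tokens from text plus a prompt, with no explicit alignment.
    return [hash(ch) % 8192 for ch in text]  # dummy tokens

def semantic_to_acoustic_tokens(semantic_tokens: list[int]) -> list[int]:
    # Stage 2 (semantic-to-acoustic): another masked generative model predicts
    # acoustic codec tokens conditioned on the semantic tokens.
    return [t % 1024 for t in semantic_tokens]  # dummy tokens

def decode_waveform(acoustic_tokens: list[int]) -> list[float]:
    # The acoustic codec decoder reconstructs the audio waveform from codec tokens.
    return [t / 1024.0 for t in acoustic_tokens]  # dummy samples

waveform = decode_waveform(semantic_to_acoustic_tokens(text_to_semantic_tokens("你好")))
print(len(waveform))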
Feature List
- Text-to-speech (TTS): converts input text into speech output.
- Semantic encoding: converts speech into semantic tokens for subsequent processing.
- Acoustic encoding: converts semantic tokens into acoustic tokens and reconstructs the audio waveform.
- Zero-shot learning: high-quality speech synthesis without explicit alignment information.
- Pre-trained models: a range of pre-trained models is provided to support rapid deployment and use.
Usage Guide
Installation Process
- Clone the project:
git clone https://github.com/open-mmlab/Amphion.git
- Create an environment and install dependencies:
bash ./models/tts/maskgct/env.sh
Usage Process
- Download pre-trained models: the required pre-trained models can be downloaded from Hugging Face:
from huggingface_hub import hf_hub_download

# Download the semantic codec model
semantic_code_ckpt = hf_hub_download("amphion/MaskGCT", filename="semantic_codec/model.safetensors")

# Download the acoustic codec models
codec_encoder_ckpt = hf_hub_download("amphion/MaskGCT", filename="acoustic_codec/model.safetensors")
codec_decoder_ckpt = hf_hub_download("amphion/MaskGCT", filename="acoustic_codec/model_1.safetensors")

# Download the text-to-semantic (T2S) model
t2s_model_ckpt = hf_hub_download("amphion/MaskGCT", filename="t2s_model/model.safetensors")
- Generate Speech: Use the following code to generate speech from text:
# Import the necessary libraries
from amphion.models.tts.maskgct import MaskGCT

# Initialize the model
model = MaskGCT()

# Input text
text = "你好,欢迎使用MaskGCT模型。"

# Generate speech
audio = model.text_to_speech(text)

# Save the generated speech
with open("output.wav", "wb") as f:
    f.write(audio)
- Model training: if you need to train your own model, refer to the training scripts and configuration files in the project for data preparation and training.
Caveats
- Environment configuration: make sure all required dependencies are installed and environment variables are configured correctly.
- Data preparation: train with high-quality speech data for better synthesis results.
- Model optimization: adjust model parameters and training strategies for your specific application scenario to achieve optimal performance.
Local deployment tutorial (with local one-click installer)
A few days ago, another non-autoregressive text-to-speech AI model, MaskGCT, was open-sourced. Like the F5-TTS model, which is also non-autoregressive, MaskGCT is trained on the 100,000-hour Emilia dataset and can synthesize speech across six languages: Chinese, English, Japanese, Korean, French, and German. Emilia is one of the largest and most diverse high-quality multilingual speech datasets in the world.
This time, we share how to deploy the MaskGCT project locally to get your graphics card firing again.
Installation of basic dependencies
First, make sure Python 3.11 is installed locally; you can download the installer from Python's official site:
python.org
Then clone the official repository:
git clone https://github.com/open-mmlab/Amphion.git
The project officially provides a Linux-based installation shell script:
pip install setuptools ruamel.yaml tqdm
pip install tensorboard tensorboardX torch==2.0.1
pip install transformers===4.41.1
pip install -U encodec
pip install black==24.1.1
pip install oss2
sudo apt-get install espeak-ng
pip install phonemizer
pip install g2p_en
pip install accelerate==0.31.0
pip install funasr zhconv zhon modelscope
# pip install git+https://github.com/lhotse-speech/lhotse
pip install timm
pip install jieba cn2an
pip install unidecode
pip install -U cos-python-sdk-v5
pip install pypinyin
pip install jiwer
pip install omegaconf
pip install pyworld
pip install py3langid==0.2.2 LangSegment
pip install onnxruntime
pip install pyopenjtalk
pip install pykakasi
pip install -U openai-whisper
Here I have converted it into a requirements.txt dependency file for Windows:
setuptools
ruamel.yaml
tqdm
transformers===4.41.1
encodec
black==24.1.1
oss2
phonemizer
g2p_en
accelerate==0.31.0
funasr
zhconv
zhon
modelscope
timm
jieba
cn2an
unidecode
cos-python-sdk-v5
pypinyin
jiwer
omegaconf
pyworld
py3langid==0.2.2
LangSegment
onnxruntime
pyopenjtalk
pykakasi
openai-whisper
json5
Run the command:
pip3 install -r requirements.txt
to install the dependencies.
Install onnxruntime-gpu:
pip3 install onnxruntime-gpu
Then install the torch trio (torch, torchvision, torchaudio) with CUDA 11.8 support:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
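After these steps, it is worth a quick sanity check that the CUDA build of torch is active and that onnxruntime-gpu exposes the CUDA provider. A minimal check, assuming both packages installed cleanly:

import torch
import onnxruntime as ort

# Should print the torch version and True if the CUDA build is working.
print(torch.__version__, torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))

# "CUDAExecutionProvider" should appear in this list if onnxruntime-gpu is installed.
print(ort.get_available_providers())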
Configuring espeak-ng on Windows
Since the MaskGCT project's backend relies on the espeak software, it needs to be configured locally. eSpeak is a compact, open-source text-to-speech (TTS) synthesizer that supports multiple languages and accents. It uses a formant synthesis approach, which keeps it small while covering many languages. Its speech is clear and can be used at high speaking rates, but it is not as natural or smooth as larger synthesizers based on recordings of human speech; MaskGCT builds its own inference on top of espeak's output.
First run the command to install espeak:
winget install espeak
If you can't install it, you can also download the installer and install it manually:
https://sourceforge.net/projects/espeak/files/espeak/espeak-1.48/setup_espeak-1.48.04.exe/download
Then download the espeak-ng installer:
https://github.com/espeak-ng/espeak-ng/releases
Download and double-click to install.
Then copy C:\Program Files\eSpeak NG\libespeak-ng.dll to the C:\Program Files (x86)\eSpeak\command_line directory.
Then rename libespeak-ng.dll to espeak-ng.dll.
Finally, add the C:\Program Files (x86)\eSpeak\command_line directory to the PATH environment variable.
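Before moving on, you can verify that phonemizer actually finds espeak-ng. A minimal test, assuming phonemizer is already installed; if the library is still not picked up from PATH, pointing phonemizer at the DLL explicitly (the commented-out line) usually works:

from phonemizer import phonemize
from phonemizer.backend.espeak.wrapper import EspeakWrapper

# Uncomment and adjust the path if espeak-ng is not found via the PATH variable.
# EspeakWrapper.set_library(r"C:\Program Files\eSpeak NG\libespeak-ng.dll")

# If everything is configured correctly, this prints the phoneme sequence for "hello world".
print(phonemize("hello world", language="en-us", backend="espeak"))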
MaskGCT Local Inference
With all that configured, write the inference script local_test.py:
from models.tts.maskgct.maskgct_utils import *
from huggingface_hub import hf_hub_download
import safetensors
import soundfile as sf
import os
import argparse
os.environ['HF_HOME'] = os.path.join(os.path.dirname(__file__), 'hf_download')
print(os.path.join(os.path.dirname(__file__), 'hf_download'))
parser = argparse.ArgumentParser(description="MaskGCT local inference")
parser.add_argument("-p", "--prompt_text", type=str, default="说得好像您带我以来我考好过几次一样")
parser.add_argument("-a", "--audio", type=str, default="./说得好像您带我以来我考好过几次一样.wav")
parser.add_argument("-t", "--text", type=str, default="你好")
parser.add_argument("-l", "--language", type=str, default="zh")
parser.add_argument("-lt", "--target_language", type=str, default="zh")
args = parser.parse_args()
if __name__ == "__main__":
    # download semantic codec ckpt
    semantic_code_ckpt = hf_hub_download("amphion/MaskGCT", filename="semantic_codec/model.safetensors")
    # download acoustic codec ckpt
    codec_encoder_ckpt = hf_hub_download("amphion/MaskGCT", filename="acoustic_codec/model.safetensors")
    codec_decoder_ckpt = hf_hub_download("amphion/MaskGCT", filename="acoustic_codec/model_1.safetensors")
    # download t2s model ckpt
    t2s_model_ckpt = hf_hub_download("amphion/MaskGCT", filename="t2s_model/model.safetensors")
    # download s2a model ckpt
    s2a_1layer_ckpt = hf_hub_download("amphion/MaskGCT", filename="s2a_model/s2a_model_1layer/model.safetensors")
    s2a_full_ckpt = hf_hub_download("amphion/MaskGCT", filename="s2a_model/s2a_model_full/model.safetensors")

    # build model
    device = torch.device("cuda")
    cfg_path = "./models/tts/maskgct/config/maskgct.json"
    cfg = load_config(cfg_path)
    # 1. build semantic model (w2v-bert-2.0)
    semantic_model, semantic_mean, semantic_std = build_semantic_model(device)
    # 2. build semantic codec
    semantic_codec = build_semantic_codec(cfg.model.semantic_codec, device)
    # 3. build acoustic codec
    codec_encoder, codec_decoder = build_acoustic_codec(cfg.model.acoustic_codec, device)
    # 4. build t2s model
    t2s_model = build_t2s_model(cfg.model.t2s_model, device)
    # 5. build s2a model
    s2a_model_1layer = build_s2a_model(cfg.model.s2a_model.s2a_1layer, device)
    s2a_model_full = build_s2a_model(cfg.model.s2a_model.s2a_full, device)

    # load semantic codec
    safetensors.torch.load_model(semantic_codec, semantic_code_ckpt)
    # load acoustic codec
    safetensors.torch.load_model(codec_encoder, codec_encoder_ckpt)
    safetensors.torch.load_model(codec_decoder, codec_decoder_ckpt)
    # load t2s model
    safetensors.torch.load_model(t2s_model, t2s_model_ckpt)
    # load s2a model
    safetensors.torch.load_model(s2a_model_1layer, s2a_1layer_ckpt)
    safetensors.torch.load_model(s2a_model_full, s2a_full_ckpt)

    # inference
    prompt_wav_path = args.audio
    save_path = "output.wav"
    prompt_text = args.prompt_text
    target_text = args.text
    # Specify the target duration (in seconds). If target_len = None, we use a simple rule to predict the target duration.
    target_len = None

    maskgct_inference_pipeline = MaskGCT_Inference_Pipeline(
        semantic_model,
        semantic_codec,
        codec_encoder,
        codec_decoder,
        t2s_model,
        s2a_model_1layer,
        s2a_model_full,
        semantic_mean,
        semantic_std,
        device,
    )

    recovered_audio = maskgct_inference_pipeline.maskgct_inference(
        prompt_wav_path, prompt_text, target_text, args.language, args.target_language, target_len=target_len
    )
    sf.write(save_path, recovered_audio, 24000)
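With the script saved in the root of the Amphion repository, a typical run looks like this (the reference audio, reference text, and target text below are simply the script's defaults; substitute your own):

python local_test.py -a "./说得好像您带我以来我考好过几次一样.wav" -p "说得好像您带我以来我考好过几次一样" -t "你好,欢迎使用MaskGCT模型。" -l zh -lt zh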
The first inference will download about 10 GB of models into the hf_download directory.
Inference takes about 11 GB of video memory. If you have less than 11 GB of VRAM, be sure to enable the system memory fallback policy in the NVIDIA Control Panel so that video memory can be supplemented with system RAM.
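If you are not sure how much VRAM your card has, a quick check with torch (assuming the CUDA build installed earlier) is:

import torch

# Total VRAM of the first GPU, reported in GiB.
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1024**3:.1f} GiB")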
If you wish, you can also write a simple web UI based on gradio. app.py:
import os
import gc
import re
import gradio as gr
import numpy as np
import subprocess
os.environ['HF_HOME'] = os.path.join(os.path.dirname(__file__), 'hf_download')
# Set the HF_ENDPOINT environment variable (use the hf-mirror endpoint)
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
reference_wavs = ["请选择参考音频或者自己上传"]
for name in os.listdir("./参考音频/"):
    reference_wavs.append(name)

def change_choices():
    reference_wavs = ["请选择参考音频或者自己上传"]
    for name in os.listdir("./参考音频/"):
        reference_wavs.append(name)
    return {"choices": reference_wavs, "__type__": "update"}

def change_wav(audio_path):
    text = audio_path.replace(".wav", "").replace(".mp3", "").replace(".WAV", "")
    # text = replace_speaker(text)
    return f"./参考音频/{audio_path}", text

def do_cloth(gen_text_input, ref_audio_input, model_choice_text, model_choice_re, ref_text_input):
    cmd = fr'.\py311_cu118\python.exe local_test.py -t "{gen_text_input}" -p "{ref_text_input}" -a "{ref_audio_input}" -l {model_choice_re} -lt {model_choice_text} '
    print(cmd)
    res = subprocess.Popen(cmd)
    res.wait()
    return "output.wav"
with gr.Blocks() as app_demo:
    gr.Markdown(
        """
        项目地址:https://github.com/open-mmlab/Amphion/tree/main/models/tts/maskgct
        整合包制作:刘悦的技术博客 https://space.bilibili.com/3031494
        """
    )
    gen_text_input = gr.Textbox(label="生成文本", lines=4)
    model_choice_text = gr.Radio(
        choices=["zh", "en"], label="生成文本语种", value="zh", interactive=True
    )
    wavs_dropdown = gr.Dropdown(label="参考音频列表", choices=reference_wavs, value="请选择参考音频或者自己上传", interactive=True)
    refresh_button = gr.Button("刷新参考音频")
    refresh_button.click(fn=change_choices, inputs=[], outputs=[wavs_dropdown])
    ref_audio_input = gr.Audio(label="Reference Audio", type="filepath")
    ref_text_input = gr.Textbox(
        label="Reference Text",
        info="Leave blank to automatically transcribe the reference audio. If you enter text it will override automatic transcription.",
        lines=2,
    )
    model_choice_re = gr.Radio(
        choices=["zh", "en"], label="参考音频语种", value="zh", interactive=True
    )
    wavs_dropdown.change(change_wav, [wavs_dropdown], [ref_audio_input, ref_text_input])
    generate_btn = gr.Button("Synthesize", variant="primary")
    audio_output = gr.Audio(label="Synthesized Audio")
    generate_btn.click(do_cloth, [gen_text_input, ref_audio_input, model_choice_text, model_choice_re, ref_text_input], [audio_output])
def main():
    global app_demo
    print(f"Starting app...")
    app_demo.launch(inbrowser=True)

if __name__ == "__main__":
    main()
And of course, don't forget to install the gradio dependency:
pip3 install -U gradio
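Then launch the web UI from the project root. Note that do_cloth above shells out to .\py311_cu118\python.exe, the interpreter bundled with the one-click package; if you are running in your own environment, change that path to your own Python executable.

python app.py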
The running result looks like this:
Concluding Remarks
The strength of MaskGCT is its timbre and prosody, which are outstanding and comparable to a real human voice. The drawbacks are just as obvious: inference cost is high, and engineering-level optimization is lacking. The MaskGCT project homepage links to a commercial version of the model, so presumably the official team will not put too much effort into the open-source version. Finally, a one-click integration package is shared for everyone to enjoy: