General Introduction
MaskGCT (Masked Generative Codec Transformer) is a fully non-autoregressive Text-to-Speech (TTS) model jointly introduced by Funky Maru Technology and The Chinese University of Hong Kong. The model eliminates the need for explicit text-to-speech alignment information and adopts a two-stage generation approach: it first predicts semantic codes from text, and then generates acoustic codes from those semantic codes. MaskGCT performs well on the zero-shot TTS task, producing speech that is high-quality, similar to the prompt speaker, and highly intelligible.
Public beta product: Funmaru Chiyo, a voice-cloning and multilingual video translation tool.
Paper: https://arxiv.org/abs/2409.00750
Feature List
- Text-to-speech (TTS): converts input text into speech output.
- Semantic encoding: converts speech into semantic codes for subsequent processing.
- Acoustic encoding: converts semantic codes into acoustic codes and reconstructs the audio waveform.
- Zero-shot learning: high-quality speech synthesis without explicit alignment information.
- Pre-trained models: a wide range of pre-trained models are available to support rapid deployment and use.
Usage Guide
Installation process
- Clone the project:
git clone https://github.com/open-mmlab/Amphion.git
- Create an environment and install dependencies:
bash ./models/tts/maskgct/env.sh
Usage Process
- Download pre-trained models: the required pre-trained models can be downloaded from Hugging Face:
from huggingface_hub import hf_hub_download

# Download the semantic codec model
semantic_code_ckpt = hf_hub_download("amphion/MaskGCT", filename="semantic_codec/model.safetensors")

# Download the acoustic codec models
codec_encoder_ckpt = hf_hub_download("amphion/MaskGCT", filename="acoustic_codec/model.safetensors")
codec_decoder_ckpt = hf_hub_download("amphion/MaskGCT", filename="acoustic_codec/model_1.safetensors")

# Download the T2S (text-to-semantic) model
t2s_model_ckpt = hf_hub_download("amphion/MaskGCT", filename="t2s_model/model.safetensors")
- Generate Speech: Use the following code to generate speech from text:
# Import the necessary libraries
from amphion.models.tts.maskgct import MaskGCT

# Initialize the model
model = MaskGCT()

# Input text
text = "Hello, welcome to the MaskGCT model."

# Generate speech
audio = model.text_to_speech(text)

# Save the generated speech
with open("output.wav", "wb") as f:
    f.write(audio)
- Model training: if you need to train your own model, refer to the training scripts and configuration files in the project for data preparation and model training.
Caveats
- Environment configuration: ensure that all necessary dependencies are installed and that environment variables are configured correctly.
- Data preparation: train with high-quality speech data to obtain better speech synthesis.
- Model optimization: adjust model parameters and training strategies according to the specific application scenario to achieve optimal performance.
Local deployment tutorial (with local one-click installer)
A few days ago, another non-autoregressive text-to-speech AI model, MaskGCT, was open-sourced. Like the F5-TTS model, which is also non-autoregressive, MaskGCT is trained on the 100,000-hour Emilia dataset and is capable of cross-lingual synthesis in six languages: Chinese, English, Japanese, Korean, French, and German. Emilia is one of the largest and most diverse high-quality multilingual speech datasets in the world.
This time, we share how to deploy the MaskGCT project locally to get your graphics card firing again.
Installation of basic dependencies
First of all, make sure Python 3.11 is installed locally; you can download it from the official Python website:
python.org
Then clone the official repository:
git clone https://github.com/open-mmlab/Amphion.git
The project provides a Linux-based installation shell script:
pip install setuptools ruamel.yaml tqdm
pip install tensorboard tensorboardX torch==2.0.1
pip install transformers===4.41.1
pip install -U encodec
pip install black==24.1.1
pip install oss2
sudo apt-get install espeak-ng
pip install phonemizer
pip install g2p_en
pip install accelerate==0.31.0
pip install funasr zhconv zhon modelscope
# pip install git+https://github.com/lhotse-speech/lhotse
pip install timm
pip install jieba cn2an
pip install unidecode
pip install -U cos-python-sdk-v5
pip install pypinyin
pip install jiwer
pip install omegaconf
pip install pyworld
pip install py3langid==0.2.2 LangSegment
pip install onnxruntime
pip install pyopenjtalk
pip install pykakasi
pip install -U openai-whisper
Here the author has converted it into a requirements.txt dependency file for Windows:
setuptools
ruamel.yaml
tqdm
transformers===4.41.1
encodec
black==24.1.1
oss2
phonemizer
g2p_en
accelerate==0.31.0
funasr
zhconv
zhon
modelscope
timm
jieba
unidecode
cos-python-sdk-v5
pypinyin
jiwer
omegaconf
cn2an
pyworld
py3langid==0.2.2
LangSegment
onnxruntime
pyopenjtalk
pykakasi
openai-whisper
json5
Run command:
pip3 install -r requirements.txt
This installs all of the dependencies.
Install onnxruntime-gpu:
pip3 install onnxruntime-gpu
Then install the PyTorch trio (torch, torchvision, torchaudio) built for CUDA 11.8:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
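After the install finishes, it is worth confirming that the CUDA build of torch is actually being used; a quick check (assuming an NVIDIA GPU and driver are present) looks like this:
import torch

# Should print True, the CUDA version the wheel was built against (11.8 here), and the GPU name.
print(torch.cuda.is_available())
print(torch.version.cuda)
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))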
Configuring espeak-ng on Windows
Since the MaskGCT backend relies on the espeak software, it needs to be configured locally. eSpeak is a compact, open-source text-to-speech (TTS) synthesizer that supports many languages and accents. It uses a formant synthesis approach, which lets it cover many languages in a small footprint. The speech is clear and can be used at high speaking rates, but it is not as natural and smooth as larger synthesizers based on recordings of human speech; MaskGCT performs its own secondary inference on top of espeak's output.
First run the command to install espeak:
winget install espeak
If you can't install it, you can also download the installer and install it manually:
https://sourceforge.net/projects/espeak/files/espeak/espeak-1.48/setup_espeak-1.48.04.exe/download
Then download the espeak-ng installer:
https://github.com/espeak-ng/espeak-ng/releases
Download and double-click to install.
Then copy C:\Program Files\eSpeak NG\libespeak-ng.dll to the C:\Program Files (x86)\eSpeak\command_line directory.
Then rename libespeak-ng.dll to espeak-ng.dll
Finally, add the C:\Program Files (x86)\eSpeak\command_line directory to the PATH environment variable.
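To confirm that phonemizer can find the espeak backend after these steps, a minimal check along these lines should work (the PHONEMIZER_ESPEAK_LIBRARY variable is only an assumption for the case where the DLL is still not found automatically; adjust the path to your install):
import os

# Optional: point phonemizer directly at the espeak-ng DLL if it is not located automatically.
os.environ.setdefault("PHONEMIZER_ESPEAK_LIBRARY", r"C:\Program Files\eSpeak NG\libespeak-ng.dll")

from phonemizer import phonemize

# If espeak-ng is configured correctly, this prints the phonemes for the sentence.
print(phonemize("Hello, welcome to MaskGCT.", language="en-us", backend="espeak"))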
MaskGCT Local Inference
With all that configured, write the inference script local_test.py:
from models.tts.maskgct.maskgct_utils import *
from huggingface_hub import hf_hub_download
import safetensors
import soundfile as sf
import os
import argparse
os.environ['HF_HOME'] = os.path.join(os.path.dirname(__file__), 'hf_download')
print(os.path.join(os.path.dirname(__file__), 'hf_download'))
parser = argparse.ArgumentParser(description="MaskGCT api")
parser.add_argument("-p", "--prompt_text", type=str, default="Speak as if I've done well on the exam a few times since you took me")
parser.add_argument("-a", "--audio", type=str, default="./said as if I've done well a few times since you took me.wav")
parser.add_argument("-t", "--text", type=str, default="Hello")
parser.add_argument("-l", "--language", type=str, default="zh")
parser.add_argument("-lt", "--target_language", type=str, default="zh")
args = parser.parse_args()
if __name__ == "__main__":
    # download semantic codec ckpt
    semantic_code_ckpt = hf_hub_download("amphion/MaskGCT", filename="semantic_codec/model.safetensors")
    # download acoustic codec ckpt
    codec_encoder_ckpt = hf_hub_download("amphion/MaskGCT", filename="acoustic_codec/model.safetensors")
    codec_decoder_ckpt = hf_hub_download("amphion/MaskGCT", filename="acoustic_codec/model_1.safetensors")
    # download t2s model ckpt
    t2s_model_ckpt = hf_hub_download("amphion/MaskGCT", filename="t2s_model/model.safetensors")
    # download s2a model ckpt
    s2a_1layer_ckpt = hf_hub_download("amphion/MaskGCT", filename="s2a_model/s2a_model_1layer/model.safetensors")
    s2a_full_ckpt = hf_hub_download("amphion/MaskGCT", filename="s2a_model/s2a_model_full/model.safetensors")

    # build model
    device = torch.device("cuda")
    cfg_path = "./models/tts/maskgct/config/maskgct.json"
    cfg = load_config(cfg_path)
    # 1. build semantic model (w2v-bert-2.0)
    semantic_model, semantic_mean, semantic_std = build_semantic_model(device)
    # 2. build semantic codec
    semantic_codec = build_semantic_codec(cfg.model.semantic_codec, device)
    # 3. build acoustic codec
    codec_encoder, codec_decoder = build_acoustic_codec(cfg.model.acoustic_codec, device)
    # 4. build t2s model
    t2s_model = build_t2s_model(cfg.model.t2s_model, device)
    # 5. build s2a model
    s2a_model_1layer = build_s2a_model(cfg.model.s2a_model.s2a_1layer, device)
    s2a_model_full = build_s2a_model(cfg.model.s2a_model.s2a_full, device)

    # load semantic codec
    safetensors.torch.load_model(semantic_codec, semantic_code_ckpt)
    # load acoustic codec
    safetensors.torch.load_model(codec_encoder, codec_encoder_ckpt)
    safetensors.torch.load_model(codec_decoder, codec_decoder_ckpt)
    # load t2s model
    safetensors.torch.load_model(t2s_model, t2s_model_ckpt)
    # load s2a model
    safetensors.torch.load_model(s2a_model_1layer, s2a_1layer_ckpt)
    safetensors.torch.load_model(s2a_model_full, s2a_full_ckpt)

    # inference
    prompt_wav_path = args.audio
    save_path = "output.wav"
    prompt_text = args.prompt_text
    target_text = args.text
    # Specify the target duration (in seconds). If target_len = None, a simple rule is used to predict the target duration.
    target_len = None

    maskgct_inference_pipeline = MaskGCT_Inference_Pipeline(
        semantic_model,
        semantic_codec,
        codec_encoder,
        codec_decoder,
        t2s_model,
        s2a_model_1layer,
        s2a_model_full,
        semantic_mean,
        semantic_std,
        device,
    )

    recovered_audio = maskgct_inference_pipeline.maskgct_inference(
        prompt_wav_path, prompt_text, target_text, args.language, args.target_language, target_len=target_len
    )
    sf.write(save_path, recovered_audio, 24000)
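Based on the argparse flags defined at the top of the script, a typical invocation looks like this (substitute your own prompt audio file, its transcript, and the text you want synthesized); the result is written to output.wav:
python local_test.py -a "./prompt.wav" -p "transcript of the prompt audio" -t "text to synthesize" -l zh -lt zh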
The first inference will download about 10 GB of models into the hf_download directory.
The inference process takes up about 11 GB of VRAM:
If you have less than 11 GB of VRAM, be sure to enable the system memory fallback policy in the NVIDIA Control Panel so that system RAM can supplement your video memory:
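If you are not sure how much VRAM your card has, you can query it with torch before running inference (a small check, assuming the CUDA build of torch installed earlier):
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    # total_memory is reported in bytes; MaskGCT inference needs roughly 11 GB.
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("CUDA not available")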
If you wish, you can also write a simple Gradio-based WebUI, app.py:
import os
import gc
import re
import gradio as gr
import numpy as np
import subprocess
os.environ['HF_HOME'] = os.path.join(os.path.dirname(__file__), 'hf_download')
# Setting the HF_ENDPOINT environment variable
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
reference_wavs = ["Please select reference audio or upload it yourself"]
for name in os.listdir("./Reference_Audio/"):
    reference_wavs.append(name)

def change_choices():
    reference_wavs = ["Please select reference audio or upload it yourself"]
    for name in os.listdir("./Reference_Audio/"):
        reference_wavs.append(name)
    return {"choices": reference_wavs, "__type__": "update"}

def change_wav(audio_path):
    text = audio_path.replace(".wav", "").replace(".mp3", "").replace(".WAV", "")
    # text = replace_speaker(text)
    return f"./Reference_Audio/{audio_path}", text

def do_cloth(gen_text_input, ref_audio_input, model_choice_text, model_choice_re, ref_text_input):
    cmd = fr'.\py311_cu118\python.exe local_test.py -t "{gen_text_input}" -p "{ref_text_input}" -a "{ref_audio_input}" -l {model_choice_re} -lt {model_choice_text}'
    print(cmd)
    res = subprocess.Popen(cmd)
    res.wait()
    return "output.wav"

with gr.Blocks() as app_demo:
    gr.Markdown(
        """
        Project address: https://github.com/open-mmlab/Amphion/tree/main/models/tts/maskgct
        Integration package production: Liu Yue's technology blog https://space.bilibili.com/3031494
        """
    )
    gen_text_input = gr.Textbox(label="Generate Text", lines=4)
    model_choice_text = gr.Radio(
        choices=["zh", "en"], label="Generate Text Language", value="zh", interactive=True
    )
    wavs_dropdown = gr.Dropdown(label="Reference Audio List", choices=reference_wavs, value="Please select reference audio or upload it yourself", interactive=True)
    refresh_button = gr.Button("Refresh Reference Audio")
    refresh_button.click(fn=change_choices, inputs=[], outputs=[wavs_dropdown])
    ref_audio_input = gr.Audio(label="Reference Audio", type="filepath")
    ref_text_input = gr.Textbox(
        label="Reference Text",
        info="Leave blank to automatically transcribe the reference audio. If you enter text it will override automatic transcription.",
        lines=2,
    )
    model_choice_re = gr.Radio(
        choices=["zh", "en"], label="Reference Audio Language", value="zh", interactive=True
    )
    wavs_dropdown.change(change_wav, [wavs_dropdown], [ref_audio_input, ref_text_input])
    generate_btn = gr.Button("Synthesize", variant="primary")
    audio_output = gr.Audio(label="Synthesized Audio")
    generate_btn.click(do_cloth, [gen_text_input, ref_audio_input, model_choice_text, model_choice_re, ref_text_input], [audio_output])

def main():
    global app_demo
    print("Starting app...")
    app_demo.launch(inbrowser=True)

if __name__ == "__main__":
    main()
And of course, don't forget to install the gradio dependency:
pip3 install -U gradio
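Before launching, create the ./Reference_Audio/ folder that the script scans (use whatever directory name you put in app.py) and drop a few reference .wav files into it, then start the WebUI:
python app.py
Because launch(inbrowser=True) is set, the interface opens in the browser automatically.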
The running interface looks like this:
Concluding Remarks
The strength of the MaskGCT model is its outstanding timbre and prosody, which are comparable to a real human voice. Its drawbacks are equally obvious: the running cost is high and the engineering-level optimization is insufficient. The MaskGCT project homepage links to a commercial version of the model, so presumably the authors will not put too much effort into the open-source version. Finally, here is a one-click integration package to share with everyone: