General Introduction
Llasa-3B is an open-source text-to-speech (TTS) model developed by the audio lab of the Hong Kong University of Science and Technology (HKUST-Audio). The model is built on the Llama 3.2 3B architecture and has been carefully tuned to deliver high-quality speech generation that supports multiple languages, emotional expression, and personalized voice cloning. Llasa-3B has drawn the attention of many researchers and developers for the expressiveness and flexibility of its natural speech synthesis.
Function List
- Text-to-speech: converts text into natural, fluent speech.
- Voice cloning: a roughly 15-second audio clip is enough to clone a specific voice, including its timbre and emotion.
- Multi-language support: Chinese and English are currently supported, with plans to expand to more languages.
- Emotional expression: emotion can be injected into the generated speech, making it sound more lifelike.
- Multiple model sizes: models at the 1B and 3B parameter scales are available, with an 8B model planned for the future.
- Open weights: all models are released with open weights that can be used directly or further fine-tuned by developers, and both the Transformers and vLLM frameworks are supported (see the vLLM sketch after this list).
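Because Llasa-3B is a standard Llama-style causal language model, the open weights can also be loaded with vLLM. The snippet below is only a minimal sketch of loading the model through vLLM's LLM class; the sampling values are illustrative assumptions, not settings recommended by the authors.

from vllm import LLM, SamplingParams

# Minimal sketch: load the open Llasa-3B weights with vLLM (assumes vLLM is installed)
llm = LLM(model="HKUST-Audio/Llasa-3B")

# Illustrative sampling settings; prompts must still follow the chat/speech-token
# format described in the text-to-speech section below.
sampling_params = SamplingParams(temperature=0.8, top_p=1.0, max_tokens=2048)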
Usage Guide
Installation and Environment Preparation
To use the Llasa-3B model, you first need to prepare the following environment:
Python environment: Python 3.9 or above is recommended.
Related libraries: the torch, transformers, and xcodec2 libraries need to be installed.
conda create -n xcodec2 python=3.9
conda activate xcodec2
pip install transformers torch xcodec2==0.1.3
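After installation, it can be useful to confirm that the key packages import correctly and that a GPU is visible. This is only an optional sanity check, not part of the official setup:

# Optional sanity check: confirm the key packages import and a GPU is available
import torch
import transformers
import xcodec2

print("CUDA available:", torch.cuda.is_available())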
Model Download and Loading
Visit the Llasa-3B page on Hugging Face, or download and load the model directly through Hugging Face's transformers library:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import soundfile as sf
llasa_3b = 'HKUST-Audio/Llasa-3B'
tokenizer = AutoTokenizer.from_pretrained(llasa_3b)
model = AutoModelForCausalLM.from_pretrained(llasa_3b)
model.eval()
model.to('cuda') # If there is a GPU
Text-to-Speech Process
- Prepare the text:
  - Enter the text you wish to convert to speech.
- Text preprocessing:
  - Wrap the text in a specific format to guide the model's speech generation, for example:
input_text = "This is a test text, please convert to speech." formatted_text = f"{input_text}"
- Generate speech:
  - Convert the text into tokens the model can understand:
chat = [ {"role": "user", "content": "Convert the text to speech:" + formatted_text}, {"role": "assistant", "content": ""} ] input_ids = tokenizer.apply_chat_template(chat, tokenize=True, return_tensors='pt', continue_final_message=True) input_ids = input_ids.to('cuda')
  - Generate the speech tokens:
speech_end_id = tokenizer.convert_tokens_to_ids('<|SPEECH_GENERATION_END|>')
outputs = model.generate(
    input_ids,
    max_length=2048,
    eos_token_id=speech_end_id,
    do_sample=True,
    top_p=1,
    temperature=0.8
)
- Speech decoding:
  - Convert the generated tokens back to audio:
from xcodec2.modeling_xcodec2 import XCodec2Model

model_path = "HKUST-Audio/xcodec2"
Codec_model = XCodec2Model.from_pretrained(model_path).eval().cuda()

# Keep only the newly generated tokens, dropping the prompt and the end marker
generated_ids = outputs[0][input_ids.shape[1]:-1]
speech_tokens = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
# Each speech token has the form '<|s_12345|>'; extract the integer codec id
speech_ids = [int(token[4:-2]) for token in speech_tokens if token.startswith('<|s_')]
speech_tokens_tensor = torch.tensor(speech_ids).cuda().unsqueeze(0).unsqueeze(0)
gen_wav = Codec_model.decode_code(speech_tokens_tensor)
sf.write("output.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)
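For convenience, the steps above can be collected into a single helper function. The sketch below only restates the code already shown, assuming tokenizer, model, and Codec_model have been loaded as in the earlier sections; the function name text_to_speech is an illustrative choice, not part of an official API.

def text_to_speech(input_text, output_path="output.wav"):
    # Wrap the text in Llasa's text-understanding markers
    formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"
    chat = [
        {"role": "user", "content": "Convert the text to speech:" + formatted_text},
        {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"}
    ]
    input_ids = tokenizer.apply_chat_template(
        chat, tokenize=True, return_tensors='pt', continue_final_message=True
    ).to('cuda')
    speech_end_id = tokenizer.convert_tokens_to_ids('<|SPEECH_GENERATION_END|>')
    outputs = model.generate(input_ids, max_length=2048, eos_token_id=speech_end_id,
                             do_sample=True, top_p=1, temperature=0.8)
    # Drop the prompt and the trailing end marker, then map tokens back to codec ids
    generated_ids = outputs[0][input_ids.shape[1]:-1]
    speech_tokens = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    speech_ids = [int(token[4:-2]) for token in speech_tokens if token.startswith('<|s_')]
    speech_tokens_tensor = torch.tensor(speech_ids).cuda().unsqueeze(0).unsqueeze(0)
    gen_wav = Codec_model.decode_code(speech_tokens_tensor)
    sf.write(output_path, gen_wav[0, 0, :].cpu().numpy(), 16000)

text_to_speech("This is a test text, please convert it to speech.")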
Voice Cloning
- Record or prepare about 15 seconds of reference audio:
  - Use a recording device or provide an existing audio file (a sketch for checking and resampling the reference audio follows this list).
- Voice cloning process:
  - Encode the reference audio into codec tokens the model can use:
prompt_wav = sf.read("your_source_audio.wav")[0]  # must be at a 16kHz sample rate
vq_code_prompt = Codec_model.encode_code(
    torch.from_numpy(prompt_wav).float().unsqueeze(0).unsqueeze(0).cuda()
)
  - Add the audio prompt to the text generation process:
speech_ids_prefix = [f"<|s_{id}|>" for id in vq_code_prompt[0, 0, :].tolist()]
chat = [
    {"role": "user", "content": "Convert the text to speech:" + formatted_text},
    {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>" + ''.join(speech_ids_prefix)}
]
# The subsequent steps are the same as for text-to-speech
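Because the codec expects 16 kHz audio (see the notes below), it can help to check and, if necessary, resample the reference clip before encoding it. This is only an optional preprocessing sketch and assumes the librosa library is installed, which Llasa-3B itself does not require:

import librosa
import soundfile as sf

# Load the reference clip as mono and resample to 16 kHz if needed
wav, sr = librosa.load("your_source_audio.wav", sr=None, mono=True)
if sr != 16000:
    wav = librosa.resample(wav, orig_sr=sr, target_sr=16000)
sf.write("your_source_audio_16k.wav", wav, 16000)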
Notes
- Make sure the audio input format is correct; Llasa-3B only supports 16 kHz audio.
- The performance of the model is directly affected by the quality of the input text and audio, so make sure both are of good quality.