General Introduction
Llasa-3B is an open-source text-to-speech (TTS) model developed by the audio lab of the Hong Kong University of Science and Technology (HKUST-Audio). The model is built on the Llama 3.2 3B architecture and has been carefully tuned to deliver high-quality speech generation that supports multiple languages, emotional expression, and personalized voice cloning. Llasa-3B has drawn the attention of many researchers and developers for the expressiveness and flexibility of its natural speech synthesis.
Function List
- Text-to-speech: converts text into natural, fluent speech.
- Voice cloning: a roughly 15-second audio clip is enough to clone a specific voice, including its timbre and emotion.
- Multi-language support: Chinese and English are currently supported, with plans to expand to more languages.
- Emotional expression: emotion can be injected into the generated speech, making it sound more lifelike.
- Multiple model sizes: models at the 1B and 3B parameter scales are available, with an 8B model planned for the future.
- Open weights: all models are released with open weights that can be used directly or further fine-tuned by developers, and both the Transformers and vLLM frameworks are supported (see the vLLM sketch after this list).
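Because Llasa-3B is a standard Llama-style causal language model, the open weights can also be loaded with vLLM. The snippet below is only a minimal sketch of loading the model through vLLM's LLM class; the sampling values are illustrative assumptions, not settings recommended by the authors.

from vllm import LLM, SamplingParams

# Minimal sketch: load the open Llasa-3B weights with vLLM (assumes vLLM is installed)
llm = LLM(model="HKUST-Audio/Llasa-3B")

# Illustrative sampling settings; prompts must still follow the chat/speech-token
# format described in the text-to-speech section below.
sampling_params = SamplingParams(temperature=0.8, top_p=1.0, max_tokens=2048)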
Usage Guide
Installation and Environment Preparation
To use the Llasa-3B model, you first need to prepare the following environment:
Python environment: Python 3.9 or above is recommended.
Related libraries: the torch, transformers, and xcodec2 libraries need to be installed.
conda create -n xcodec2 python=3.9
conda activate xcodec2
pip install transformers torch xcodec2==0.1.3
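After installation, it can be useful to confirm that the key packages import correctly and that a GPU is visible. This is only an optional sanity check, not part of the official setup:

# Optional sanity check: confirm the key packages import and a GPU is available
import torch
import transformers
import xcodec2

print("CUDA available:", torch.cuda.is_available())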
Model Download and Loading
Visit the Llasa-3B page on Hugging Face, or download and load the model directly through Hugging Face's transformers library:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import soundfile as sf
llasa_3b = 'HKUST-Audio/Llasa-3B'
tokenizer = AutoTokenizer.from_pretrained(llasa_3b)
model = AutoModelForCausalLM.from_pretrained(llasa_3b)
model.eval()
model.to('cuda') # If there is a GPU
Text-to-Speech Process
- Prepare the text:
  - Enter the text you wish to convert to speech.
- Text preprocessing:
  - Wrap the text in a specific format to guide the model's speech generation, for example:
input_text = "This is a test text, please convert to speech." formatted_text = f"{input_text}"
- Generate speech:
  - Convert the text into tokens the model can understand:
chat = [ {"role": "user", "content": "Convert the text to speech:" + formatted_text}, {"role": "assistant", "content": ""} ] input_ids = tokenizer.apply_chat_template(chat, tokenize=True, return_tensors='pt', continue_final_message=True) input_ids = input_ids.to('cuda')
  - Generate the speech tokens:
speech_end_id = tokenizer.convert_tokens_to_ids('<|SPEECH_GENERATION_END|>')
outputs = model.generate(
    input_ids,
    max_length=2048,
    eos_token_id=speech_end_id,
    do_sample=True,
    top_p=1,
    temperature=0.8
)
- Speech decoding:
  - Convert the generated tokens back to audio:
from xcodec2.modeling_xcodec2 import XCodec2Model

model_path = "HKUST-Audio/xcodec2"
Codec_model = XCodec2Model.from_pretrained(model_path).eval().cuda()

# Keep only the newly generated tokens, dropping the prompt and the end marker
generated_ids = outputs[0][input_ids.shape[1]:-1]
speech_tokens = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
# Each speech token has the form '<|s_12345|>'; extract the integer codec id
speech_ids = [int(token[4:-2]) for token in speech_tokens if token.startswith('<|s_')]
speech_tokens_tensor = torch.tensor(speech_ids).cuda().unsqueeze(0).unsqueeze(0)
gen_wav = Codec_model.decode_code(speech_tokens_tensor)
sf.write("output.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)
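For convenience, the steps above can be collected into a single helper function. The sketch below only restates the code already shown, assuming tokenizer, model, and Codec_model have been loaded as in the earlier sections; the function name text_to_speech is an illustrative choice, not part of an official API.

def text_to_speech(input_text, output_path="output.wav"):
    # Wrap the text in Llasa's text-understanding markers
    formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"
    chat = [
        {"role": "user", "content": "Convert the text to speech:" + formatted_text},
        {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"}
    ]
    input_ids = tokenizer.apply_chat_template(
        chat, tokenize=True, return_tensors='pt', continue_final_message=True
    ).to('cuda')
    speech_end_id = tokenizer.convert_tokens_to_ids('<|SPEECH_GENERATION_END|>')
    outputs = model.generate(input_ids, max_length=2048, eos_token_id=speech_end_id,
                             do_sample=True, top_p=1, temperature=0.8)
    # Drop the prompt and the trailing end marker, then map tokens back to codec ids
    generated_ids = outputs[0][input_ids.shape[1]:-1]
    speech_tokens = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    speech_ids = [int(token[4:-2]) for token in speech_tokens if token.startswith('<|s_')]
    speech_tokens_tensor = torch.tensor(speech_ids).cuda().unsqueeze(0).unsqueeze(0)
    gen_wav = Codec_model.decode_code(speech_tokens_tensor)
    sf.write(output_path, gen_wav[0, 0, :].cpu().numpy(), 16000)

text_to_speech("This is a test text, please convert it to speech.")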
Voice Cloning
- Record or prepare about 15 seconds of reference audio:
  - Use a recording device or provide an existing audio file (a sketch for checking and resampling the reference audio follows this list).
- Voice cloning process:
  - Encode the reference audio into codec tokens the model can use:
prompt_wav = sf.read("your_source_audio.wav")[0]  # must be at a 16kHz sample rate
vq_code_prompt = Codec_model.encode_code(
    torch.from_numpy(prompt_wav).float().unsqueeze(0).unsqueeze(0).cuda()
)
  - Add the audio prompt to the text generation process:
speech_ids_prefix = [f"<|s_{id}|>" for id in vq_code_prompt[0, 0, :].tolist()]
chat = [
    {"role": "user", "content": "Convert the text to speech:" + formatted_text},
    {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>" + ''.join(speech_ids_prefix)}
]
# The subsequent steps are the same as for text-to-speech
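Because the codec expects 16 kHz audio (see the notes below), it can help to check and, if necessary, resample the reference clip before encoding it. This is only an optional preprocessing sketch and assumes the librosa library is installed, which Llasa-3B itself does not require:

import librosa
import soundfile as sf

# Load the reference clip as mono and resample to 16 kHz if needed
wav, sr = librosa.load("your_source_audio.wav", sr=None, mono=True)
if sr != 16000:
    wav = librosa.resample(wav, orig_sr=sr, target_sr=16000)
sf.write("your_source_audio_16k.wav", wav, 16000)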
Notes
- Make sure the audio input format is correct; Llasa-3B only supports 16 kHz audio.
- The performance of the model is directly affected by the quality of the input text and audio, so make sure both are of good quality.