
Llasa 1~8B: an open-source text-to-speech model for high-quality speech generation and cloning

General Introduction

Llasa-3B is an open-source text-to-speech (TTS) model developed by the Audio Lab of the Hong Kong University of Science and Technology (HKUST Audio). Built on the Llama 3.2 architecture and carefully fine-tuned, the model delivers high-quality speech generation that supports multiple languages, emotional expression, and personalized voice cloning. Llasa-3B has attracted the attention of many researchers and developers for its expressiveness and flexibility in natural speech synthesis.


Online demo: https://huggingface.co/spaces/srinivasbilla/llasa-3b-tts

Feature List

  • Text-to-speech: converts text into natural, fluent speech.
  • Voice cloning: a 15-second audio clip is enough to clone a specific voice, including its timbre and emotion.
  • Multi-language support: Chinese and English are supported, with more languages planned.
  • Emotional expression: emotion can be injected into the generated speech, making it sound more authentic.
  • Multi-model support: models at 1B and 3B parameter scales are available, with an 8B model to come.
  • Open weights: all models ship with open weights that developers can use directly or fine-tune further, and both the Transformers and vLLM frameworks are supported (see the sketch below).
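
Since the released checkpoints follow the Llama architecture, serving one through vLLM should look roughly like the sketch below. This is a minimal, untested sketch rather than an official recipe: it assumes vLLM can load the HKUST-Audio/Llasa-3B checkpoint as an ordinary causal language model, and it uses the special tokens explained later in this article.

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Build the prompt with the model's own chat template, then sample with vLLM.
tok = AutoTokenizer.from_pretrained("HKUST-Audio/Llasa-3B")
chat = [
    {"role": "user", "content": "Convert the text to speech:<|TEXT_UNDERSTANDING_START|>Hello!<|TEXT_UNDERSTANDING_END|>"},
    {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"}
]
prompt = tok.apply_chat_template(chat, tokenize=False, continue_final_message=True)

llm = LLM(model="HKUST-Audio/Llasa-3B")
out = llm.generate([prompt], SamplingParams(temperature=0.8, top_p=1.0, max_tokens=2048))
# The output is a stream of speech tokens like <|s_12345|>; decoding them
# to audio still requires the XCodec2 codec described below.
print(out[0].outputs[0].text)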

 

Usage Guide

Installation and Environment Preparation

To use the Llasa-3B model, you first need to prepare the following environment:

  • Python environment: Python 3.9 or above is recommended.
  • Related libraries: the torch, transformers, and xcodec2 libraries need to be installed.

conda create -n xcodec2 python=3.9
conda activate xcodec2
pip install transformers torch xcodec2==0.1.3
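
After installation, a quick sanity check (a hypothetical one-liner, not part of the official instructions) confirms that the libraries import cleanly and a GPU is visible:

python -c "import torch, transformers, xcodec2; print(torch.cuda.is_available())"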

Model Download and Loading

Visit the Llasa-3B page on Hugging Face; the model can be downloaded and loaded directly with the transformers library:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import soundfile as sf

# Download and load the tokenizer and model from the Hugging Face Hub.
llasa_3b = 'HKUST-Audio/Llasa-3B'
tokenizer = AutoTokenizer.from_pretrained(llasa_3b)
model = AutoModelForCausalLM.from_pretrained(llasa_3b)
model.eval()
model.to('cuda')  # if a GPU is available
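
For GPUs with limited memory, the weights can also be loaded in half precision. torch_dtype is a standard transformers option; whether it measurably affects Llasa's audio quality is untested here:

# Optional: load in float16 to roughly halve GPU memory usage.
model = AutoModelForCausalLM.from_pretrained(llasa_3b, torch_dtype=torch.float16).eval().to('cuda')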

Text-to-Speech Process

  1. Prepare the text:
    • Enter the text you wish to convert to speech.
  2. Preprocess the text:
    • Wrap the text in the model's text-understanding markers to guide speech generation, for example:
      input_text = "This is a test text, please convert it to speech."
      formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"
      
  3. Generate speech:
    • Convert the text into tokens the model can understand:
      chat = [
          {"role": "user", "content": "Convert the text to speech:" + formatted_text},
          {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"}
      ]
      input_ids = tokenizer.apply_chat_template(chat, tokenize=True, return_tensors='pt', continue_final_message=True)
      input_ids = input_ids.to('cuda')
      
    • Generate speech tokens:
      speech_end_id = tokenizer.convert_tokens_to_ids('<|SPEECH_GENERATION_END|>')
      outputs = model.generate(input_ids, max_length=2048, eos_token_id=speech_end_id, do_sample=True, top_p=1, temperature=0.8)
      
  4. Decode the speech:
    • Convert the generated tokens back to audio (a consolidated sketch follows this list):
      from xcodec2.modeling_xcodec2 import XCodec2Model
      model_path = "HKUST-Audio/xcodec2"
      Codec_model = XCodec2Model.from_pretrained(model_path).eval().cuda()
      # Strip the prompt and the trailing end token from the output.
      generated_ids = outputs[0][input_ids.shape[1]:-1]
      speech_tokens = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
      # Each speech token looks like <|s_12345|>; strip the wrapper to get the integer ID.
      speech_ids = [int(token[4:-2]) for token in speech_tokens if token.startswith('<|s_') and token.endswith('|>')]
      speech_tokens_tensor = torch.tensor(speech_ids).cuda().unsqueeze(0).unsqueeze(0)
      gen_wav = Codec_model.decode_code(speech_tokens_tensor)
      sf.write("output.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)
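
Putting the four steps together, a small helper makes repeated generation easier. This is a convenience sketch assembled from the steps above, not code from the official model card; the function name text_to_speech is hypothetical, and tokenizer, model, and Codec_model are assumed to be loaded as shown earlier.

def text_to_speech(input_text, out_path="output.wav"):
    # Hypothetical wrapper around steps 1-4 above.
    formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"
    chat = [
        {"role": "user", "content": "Convert the text to speech:" + formatted_text},
        {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"}
    ]
    input_ids = tokenizer.apply_chat_template(
        chat, tokenize=True, return_tensors='pt', continue_final_message=True
    ).to('cuda')
    speech_end_id = tokenizer.convert_tokens_to_ids('<|SPEECH_GENERATION_END|>')
    with torch.no_grad():
        outputs = model.generate(input_ids, max_length=2048, eos_token_id=speech_end_id,
                                 do_sample=True, top_p=1, temperature=0.8)
    # Strip the prompt and the end token, then map <|s_N|> strings to integers.
    generated_ids = outputs[0][input_ids.shape[1]:-1]
    speech_tokens = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    speech_ids = [int(t[4:-2]) for t in speech_tokens if t.startswith('<|s_') and t.endswith('|>')]
    wav = Codec_model.decode_code(torch.tensor(speech_ids).cuda().unsqueeze(0).unsqueeze(0))
    sf.write(out_path, wav[0, 0, :].cpu().numpy(), 16000)

text_to_speech("This is a test text, please convert it to speech.")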
      

Voice Cloning

  • Record or prepare about 15 seconds of reference audio:
    • Use a recording device or provide an existing audio file.
  • Voice cloning process:
    • Encode the reference audio into speech tokens the model can use:
      prompt_wav = sf.read("your_source_audio.wav")[0]  # must be 16 kHz sample rate
      # soundfile returns float64; XCodec2 expects a float32 tensor.
      prompt_tensor = torch.from_numpy(prompt_wav).float().unsqueeze(0)
      vq_code_prompt = Codec_model.encode_code(prompt_tensor.cuda())
      
    • Prepend the reference audio's tokens to the generation prompt:
      speech_ids_prefix = [f"<|s_{i}|>" for i in vq_code_prompt[0, 0, :].tolist()]
      # formatted_text should contain the transcript of the reference audio
      # followed by the target text you want spoken in the cloned voice.
      chat = [
          {"role": "user", "content": "Convert the text to speech:" + formatted_text},
          {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>" + ''.join(speech_ids_prefix)}
      ]
      # The subsequent generation and decoding steps mirror text-to-speech (see the sketch below).
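
For completeness, the generation and decoding steps in the cloning case look roughly as follows. The prefix handling follows the official model card; input_ids and speech_end_id are assumed to be built from the cloning chat above, exactly as in step 3 of text-to-speech.

# input_ids comes from tokenizer.apply_chat_template(...) on the cloning chat.
with torch.no_grad():
    outputs = model.generate(input_ids, max_length=2048, eos_token_id=speech_end_id,
                             do_sample=True, top_p=1, temperature=0.8)
# Keep the prefix speech tokens in the decoded sequence so the codec sees one
# coherent stream, then cut the reference audio back out of the waveform.
generated_ids = outputs[0][input_ids.shape[1] - len(speech_ids_prefix):-1]
speech_tokens = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
speech_ids = [int(t[4:-2]) for t in speech_tokens if t.startswith('<|s_') and t.endswith('|>')]
gen_wav = Codec_model.decode_code(torch.tensor(speech_ids).cuda().unsqueeze(0).unsqueeze(0))
gen_wav = gen_wav[:, :, prompt_wav.shape[0]:]  # prompt_wav is the 1-D reference signal
sf.write("cloned_output.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)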
      

Caveats

  • Make sure the audio input format is correct; Llasa-3B only supports 16 kHz audio (a resampling sketch follows this list).
  • The model's output quality directly depends on the quality of the input text and audio, so keep both clean.
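
Reference audio recorded at another sample rate should be resampled to 16 kHz first. Below is a minimal sketch using torchaudio, an extra dependency not listed in the installation step above:

import torchaudio

# Load the reference audio and resample it to the 16 kHz rate Llasa expects.
waveform, sr = torchaudio.load("your_source_audio.wav")
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)
torchaudio.save("your_source_audio_16k.wav", waveform, 16000)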