General Introduction
CSM Voice Cloning is an open-source project developed by Isaiah Bjork and hosted on GitHub. It is based on the Sesame CSM-1B model and lets users clone their own voice and generate personalized speech from a single audio sample. The tool supports both local GPU runs and cloud runs via Modal, making it suitable for content creators, developers, or anyone interested in voice technology. Although the cloning is not perfect, the generated voice retains some characteristics of the target voice and is recognizable. Some technical background is required, such as installing Python and configuring the environment, but a detailed official guide is available. The project is completely free, and the community is welcome to contribute improvements.
Function List
- Voice cloning: upload an audio sample to generate speech that sounds similar to the sample.
- Text-to-speech: enter text and generate audio files in the cloned voice.
- Local runs: use your own GPU to process speech generation tasks.
- Cloud runs: accelerate generation with cloud GPUs via the Modal platform.
- Open source: the code is public and can be modified or optimized by users.
- Common audio formats: accepts MP3 or WAV files as samples.
- Parameter adjustment: lets users adjust model settings to accommodate audio of different lengths.
Usage Guide
Installation process
To use CSM Voice Cloning, you first need to set up the runtime environment. The detailed steps are below:
Local installation
- Check hardware and software requirements
- Python 3.10 or later is required.
- A CUDA-compatible NVIDIA graphics card with sufficient video memory is required for local runs.
- Make sure you have an internet connection to download the model and dependencies.
- Clone the code repository
- Open a terminal (CMD or PowerShell on Windows, Bash on Linux/Mac).
- Enter the commands:
git clone https://github.com/isaiahbjork/csm-voice-cloning.git
cd csm-voice-cloning
- Install dependencies
- Run in the terminal:
pip install -r requirements.txt
- This installs the necessary libraries, such as PyTorch and the Hugging Face packages.
Cloud Run Installation (Modal)
- Install Modal
- Run in the terminal:
pip install modal
- Configure Modal authentication
- Enter the command:
modal token new
- Follow the prompts to log in to your Modal account or create a new one.
Configuring a Hugging Face Account
- Register and get a token
- Visit the Hugging Face website and register or log in.
- On the Sesame CSM-1B model page, click "Access repository" and accept the terms.
- Generate an API token: click your avatar in the upper right corner -> Settings -> Tokens -> New Token.
- Set the token
- Method 1: type in the terminal:
export HF_TOKEN="Your token"
- Method 2: open the voice_clone.py file, find the os.environ["HF_TOKEN"] line, and fill in your token there.
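Putting the two methods together, the token lookup inside the script presumably follows a pattern like this (a minimal sketch; the exact structure of voice_clone.py may differ, and "hf_xxx" is a placeholder, not a real token):

```python
import os

# Prefer the HF_TOKEN environment variable (Method 1); otherwise fall back
# to a value edited directly into the script (Method 2).
token = os.environ.get("HF_TOKEN") or "hf_xxx"  # "hf_xxx" is a placeholder
os.environ["HF_TOKEN"] = token  # Hugging Face libraries read this variable
```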
Preparing audio samples
- Record audio
- Record a clear 2-3 minute clip, preferably without background noise.
- Save it in MP3 or WAV format, e.g. sample.mp3.
- Transcribe the text
- Use Whisper or another tool to transcribe the audio content, noting down the exact text (e.g., "Hello, this is my test audio").
Main Functions
Local voice cloning
- Edit parameters
- Open the voice_clone.py file and modify the following:
context_audio_path = "sample.mp3"  # audio path
context_text = "Hello, this is my test audio"  # transcribed text
text = "It's a beautiful day today"  # text to generate
output_filename = "output.wav"  # output file name
- Run the program
- Enter in the terminal:
python voice_clone.py
- The generated audio is saved in the project folder.
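Before launching the script, it can save a failed run to sanity-check the values filled in above. This is a standalone, hypothetical helper (not part of the project) using only the standard library:

```python
import os

def check_config(audio_path, context_text, text, output_filename):
    """Return a list of problems with the voice_clone.py settings."""
    problems = []
    if not os.path.isfile(audio_path):
        problems.append(f"audio sample not found: {audio_path}")
    if not context_text.strip():
        problems.append("context_text is empty")
    if not text.strip():
        problems.append("text to generate is empty")
    if not output_filename.lower().endswith(".wav"):
        problems.append("output_filename should end in .wav")
    return problems

# Example: report anything wrong before running the real script.
for problem in check_config("sample.mp3", "Hello, this is my test audio",
                            "It's a beautiful day today", "output.wav"):
    print("WARNING:", problem)
```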
Cloud-based voice cloning (Modal)
- Edit parameters
- Open the modal_voice_cloning.py file and set the same parameters as for the local run:
context_audio_path = "sample.mp3"
context_text = "Hello, this is my test audio"
text = "It's a beautiful day today"
output_filename = "output.wav"
- Run the program
- Enter in the terminal:
modal run modal_voice_cloning.py
- Modal will use a cloud GPU to process the task and download the output file when it's done.
Adjusting the model sequence length
- If the audio sample is long (more than 2-3 minutes), you may encounter tensor dimension errors.
- Solution:
- Open the models.py file, locate the llama3_2_1B() function, and increase the max_seq_len parameter:
def llama3_2_1B():
    return llama3_2.llama3_2(max_seq_len=4096, ...)
- Make sure llama3_2_100M() uses the same value, then save and re-run.
Special Features
Cloud acceleration (Modal)
- Modal provides cloud GPUs for users without powerful local hardware.
- Usage is simple: install Modal and run the corresponding script; for users with modest hardware this is faster than local processing.
Processing long audio
- The default settings support samples up to 2 minutes 50 seconds.
- Longer audio requires increasing max_seq_len (as described above), or clipping the sample to the recommended length.
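To know in advance whether a sample exceeds the default limit, you can measure its duration. This hypothetical check (not part of the project) uses only the standard-library wave module, so it works for WAV samples; MP3 files would need a third-party library such as mutagen or pydub:

```python
import wave

MAX_SAMPLE_SECONDS = 170  # about 2 minutes 50 seconds, the default limit

def wav_duration_seconds(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def needs_longer_seq_len(path, limit=MAX_SAMPLE_SECONDS):
    """True if the sample is long enough to require a bigger max_seq_len."""
    return wav_duration_seconds(path) > limit
```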
Frequently Asked Questions
- Tensor dimension error: increase the max_seq_len value, or shorten the audio sample.
- CUDA out of memory: use a shorter sample, or switch to a Modal cloud run.
- Model download failed: check your Hugging Face token and network connection, and make sure you have accepted the model terms.
Application Scenarios
- Content creation
Streamers can generate video narration in their own voice: upload a self-introduction recording, enter the script, and get the voiceover in a few minutes, with no need for repeated recordings.
- Educational support
Teachers can clone their own voice and input course lectures to generate teaching audio. Students can replay it at any time, which suits distance learning.
- Game development
Developers can voice game characters: record a few samples to generate multiple lines of dialogue and enhance character realism.
QA
- How long does the audio sample need to be?
2-3 minutes is recommended. Samples that are too short give poor results; samples that are too long require parameter adjustments.
- Why doesn't the generated voice sound much like me?
The model's cloning quality is limited: it preserves some voice characteristics but is not perfect. Make sure the sample is clear, and try several runs with different texts.
- What's the difference between Modal and running locally?
Modal uses cloud GPUs and is fast, which suits users without powerful hardware. Running locally is free but requires a good graphics card.