General Introduction
CSM Voice Cloning is an open-source project developed by Isaiah Bjork and hosted on GitHub. It is based on the Sesame CSM-1B model and lets users clone their own voice and generate personalized speech from a single audio sample. The tool supports both local GPU runs and cloud runs via Modal, making it suitable for content creators, developers, or anyone interested in voice technology. Although the cloning is not perfect, the generated voice retains some characteristics of the target voice and is recognizable. Some technical background is required, such as installing Python and configuring the environment, but a detailed official guide is available. The project is completely free, and the community is welcome to contribute improvements.
Function List
- Voice cloning: upload an audio sample to generate speech that sounds similar to the sample.
- Text-to-speech: enter text and generate audio files in the cloned voice.
- Local runs: use your own GPU to process speech generation tasks.
- Cloud runs: accelerate generation with cloud GPUs via the Modal platform.
- Open source: the code is public and can be modified or optimized by users.
- Common audio formats: accepts MP3 or WAV files as samples.
- Parameter adjustment: lets users adjust model settings to accommodate audio of different lengths.
Usage Guide
Installation process
To use CSM Voice Cloning, you first need to set up the runtime environment. The detailed steps are below:
Local installation
- Check hardware and software requirements
- Python 3.10 or later is required.
- A CUDA-compatible NVIDIA graphics card with sufficient video memory is required for local runs.
- Make sure you have an internet connection to download the model and dependencies.
- Clone the code repository
- Open a terminal (CMD or PowerShell on Windows, Bash on Linux/Mac).
- Enter the commands:
git clone https://github.com/isaiahbjork/csm-voice-cloning.git
cd csm-voice-cloning
- Install dependencies
- Run in the terminal:
pip install -r requirements.txt
- This installs the necessary libraries, such as PyTorch and the Hugging Face packages.
Cloud Run Installation (Modal)
- Install Modal
- Run in the terminal:
pip install modal
- Configure Modal authentication
- Enter the command:
modal token new
- Follow the prompts to log in to your Modal account or create a new one.
Configuring a Hugging Face Account
- Register and get a token
- Visit the Hugging Face website and register or log in.
- On the Sesame CSM-1B model page, click "Access repository" and accept the terms.
- Generate an API token: click your avatar in the upper right corner -> Settings -> Tokens -> New Token.
- Set the token
- Method 1: type in the terminal:
export HF_TOKEN="Your token"
- Method 2: open the voice_clone.py file, find the os.environ["HF_TOKEN"] line, and fill in your token there.
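Putting the two methods together, the token lookup inside the script presumably follows a pattern like this (a minimal sketch; the exact structure of voice_clone.py may differ, and "hf_xxx" is a placeholder, not a real token):

```python
import os

# Prefer the HF_TOKEN environment variable (Method 1); otherwise fall back
# to a value edited directly into the script (Method 2).
token = os.environ.get("HF_TOKEN") or "hf_xxx"  # "hf_xxx" is a placeholder
os.environ["HF_TOKEN"] = token  # Hugging Face libraries read this variable
```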
Preparing audio samples
- Record audio
- Record a clear 2-3 minute clip, preferably without background noise.
- Save it in MP3 or WAV format, e.g. sample.mp3.
- Transcribe the text
- Use Whisper or another tool to transcribe the audio content, noting down the exact text (e.g., "Hello, this is my test audio").
Main Functions
Local voice cloning
- Edit parameters
- Open the voice_clone.py file and modify the following:
context_audio_path = "sample.mp3"  # audio path
context_text = "Hello, this is my test audio"  # transcribed text
text = "It's a beautiful day today"  # text to generate
output_filename = "output.wav"  # output file name
- Run the program
- Enter in the terminal:
python voice_clone.py
- The generated audio is saved in the project folder.
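Before launching the script, it can save a failed run to sanity-check the values filled in above. This is a standalone, hypothetical helper (not part of the project) using only the standard library:

```python
import os

def check_config(audio_path, context_text, text, output_filename):
    """Return a list of problems with the voice_clone.py settings."""
    problems = []
    if not os.path.isfile(audio_path):
        problems.append(f"audio sample not found: {audio_path}")
    if not context_text.strip():
        problems.append("context_text is empty")
    if not text.strip():
        problems.append("text to generate is empty")
    if not output_filename.lower().endswith(".wav"):
        problems.append("output_filename should end in .wav")
    return problems

# Example: report anything wrong before running the real script.
for problem in check_config("sample.mp3", "Hello, this is my test audio",
                            "It's a beautiful day today", "output.wav"):
    print("WARNING:", problem)
```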
Cloud-based voice cloning (Modal)
- Edit parameters
- Open the modal_voice_cloning.py file and set the same parameters as for the local run:
context_audio_path = "sample.mp3"
context_text = "Hello, this is my test audio"
text = "It's a beautiful day today"
output_filename = "output.wav"
- Run the program
- Enter in the terminal:
modal run modal_voice_cloning.py
- Modal will use a cloud GPU to process the task and download the output file when it's done.
Adjusting the model sequence length
- If the audio sample is long (more than 2-3 minutes), you may encounter tensor dimension errors.
- Solution:
- Open the models.py file, locate the llama3_2_1B() function, and increase the max_seq_len parameter:
def llama3_2_1B():
    return llama3_2.llama3_2(max_seq_len=4096, ...)
- Make sure llama3_2_100M() uses the same value, then save and re-run.
Special Features
Cloud acceleration (Modal)
- Modal provides cloud GPUs for users without powerful local hardware.
- Usage is simple: install Modal and run the corresponding script; for users with modest hardware this is faster than local processing.
Processing long audio
- The default settings support samples up to 2 minutes 50 seconds.
- Longer audio requires increasing max_seq_len (as described above), or clipping the sample to the recommended length.
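To know in advance whether a sample exceeds the default limit, you can measure its duration. This hypothetical check (not part of the project) uses only the standard-library wave module, so it works for WAV samples; MP3 files would need a third-party library such as mutagen or pydub:

```python
import wave

MAX_SAMPLE_SECONDS = 170  # about 2 minutes 50 seconds, the default limit

def wav_duration_seconds(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def needs_longer_seq_len(path, limit=MAX_SAMPLE_SECONDS):
    """True if the sample is long enough to require a bigger max_seq_len."""
    return wav_duration_seconds(path) > limit
```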
Frequently Asked Questions
- Tensor dimension error: increase the max_seq_len value, or shorten the audio sample.
- CUDA out of memory: use a shorter sample, or switch to a Modal cloud run.
- Model download failed: check your Hugging Face token and network connection, and make sure you have accepted the model terms.
Application Scenarios
- Content creation
Streamers can generate video narration in their own voice: upload a self-introduction recording, enter the script, and get the voiceover in a few minutes, with no need for repeated recordings.
- Educational support
Teachers can clone their own voice and input course lectures to generate teaching audio. Students can replay it at any time, which suits distance learning.
- Game development
Developers can voice game characters: record a few samples to generate multiple lines of dialogue and enhance character realism.
QA
- How long does the audio sample need to be?
2-3 minutes is recommended. Samples that are too short give poor results; samples that are too long require parameter adjustments.
- Why doesn't the generated voice sound much like me?
The model's cloning quality is limited: it preserves some voice characteristics but is not perfect. Make sure the sample is clear, and try several runs with different texts.
- What's the difference between Modal and running locally?
Modal uses cloud GPUs and is fast, which suits users without powerful hardware. Running locally is free but requires a good graphics card.