General Introduction
csm-mlx brings the CSM (Conversational Speech Model) voice conversation model to Apple's MLX framework, optimized specifically for Apple Silicon. The project lets users run efficient speech generation and dialogue features on Apple devices in a simple way. Developer senstella released the project on March 15, 2025 with the goal of helping more people take advantage of the power of Apple hardware and explore speech technology. At its core, it is a lightweight, easy-to-use tool for generating natural speech and handling conversational scenarios.
Function List
- Speech generation: turn input text into natural-sounding voice audio.
- Conversation context support: generate coherent spoken replies based on the previous turns of the conversation.
- Apple device optimization: run the model efficiently on Apple Silicon via the MLX framework.
- Open-source model loading: download pre-trained models from Hugging Face (e.g. csm-1b).
- Adjustable parameters: sampler settings such as temperature (temp) and minimum probability (min_p) to control the generation quality.
Using Help
Installation process
To use csm-mlx locally, you first need to install a few dependencies and set up the environment. The detailed steps are below:
- Prepare the environment
  - Make sure you are on macOS with an Apple Silicon device (e.g. an M1 or M2 chip).
  - Install Python 3.10 or later, for example via Homebrew:
    brew install python@3.10
  - Install Git (skip this if it is already installed):
    brew install git
- Clone the project
  - Open a terminal and enter the following command to download csm-mlx:
    git clone https://github.com/senstella/csm-mlx.git
  - Enter the project folder:
    cd csm-mlx
- Create a virtual environment
  - In the project directory, create a Python virtual environment:
    python3.10 -m venv .venv
  - Activate it:
    source .venv/bin/activate
- Install dependencies
  - Install the Python packages the project needs:
    pip install -r requirements.txt
  - Note: make sure the MLX framework and the Hugging Face huggingface_hub library are installed. If you run into problems, you can install them separately with pip install mlx huggingface_hub.
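Once the packages are installed, you can optionally sanity-check that the two key libraries are importable before moving on. This small snippet is not part of the project, just a convenience check:

```python
import importlib

# Check that the dependencies mentioned above can be imported
status = {}
for name in ("mlx", "huggingface_hub"):
    try:
        importlib.import_module(name)
        status[name] = "OK"
    except ImportError:
        status[name] = "missing - try 'pip install " + name + "'"

for name, result in status.items():
    print(f"{name}: {result}")
```

If either line reports "missing", install that package before continuing.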
- Download the model
  - csm-mlx uses the pre-trained csm-1b-mlx model. Run the following command to download it automatically:
    python -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='senstella/csm-1b-mlx', filename='ckpt.safetensors')"
  - The model file is saved in the local cache directory (usually ~/.cache/huggingface/hub).
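hf_hub_download stores each repository under the cache directory in a folder named models--&lt;org&gt;--&lt;name&gt;. A small optional sketch to see whether the weights are already cached before triggering a fresh download (the folder name below just follows that naming convention):

```python
import os

# Hugging Face caches each repo under models--<org>--<name>
cache_root = os.path.expanduser("~/.cache/huggingface/hub")
repo_dir = os.path.join(cache_root, "models--senstella--csm-1b-mlx")

if os.path.isdir(repo_dir):
    print("Weights already cached at:", repo_dir)
else:
    print("Weights not cached yet; hf_hub_download will fetch them.")
```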
How to use
Once installed, you can run csm-mlx's speech generation feature with a Python script. Here are the steps to do so:
Basic Speech Generation
- Write the script
  - Create a file in the project directory, e.g. generate_audio.py, and enter the following code:

```python
from csm_mlx import CSM, csm_1b, generate
from mlx_lm.sample_utils import make_sampler
from huggingface_hub import hf_hub_download

# Initialize the model
csm = CSM(csm_1b())
weights = hf_hub_download(repo_id="senstella/csm-1b-mlx", filename="ckpt.safetensors")
csm.load_weights(weights)

# Generate audio
audio = generate(
    csm,
    text="Hello, this is csm-mlx.",
    speaker=0,
    context=[],
    max_audio_length_ms=10000,  # maximum audio length: 10 seconds
    sampler=make_sampler(temp=0.5, min_p=0.1),
)

# Save the audio (22050 is the sample rate)
import audiofile
audiofile.write("output.wav", audio, 22050)
```

  - Note: saving the audio requires the audiofile library; install it with pip install audiofile.
- Run the script
  - In the terminal, enter:
    python generate_audio.py
  - Running it creates an output.wav file in the current directory; double-click it to play.
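If you would rather avoid the extra audiofile dependency for the save step, Python's standard-library wave module can write 16-bit PCM WAV files directly. A minimal, self-contained sketch (it synthesizes a sine wave as a stand-in for real model output, since running the model requires the downloaded weights):

```python
import math
import struct
import wave

SAMPLE_RATE = 22050  # same sample rate used in the script above

# Stand-in for model output: one second of a 440 Hz sine wave in [-1, 1]
samples = [
    math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
    for n in range(SAMPLE_RATE)
]

# Convert floats in [-1, 1] to 16-bit PCM and write a mono WAV file
with wave.open("output.wav", "wb") as f:
    f.setnchannels(1)      # mono
    f.setsampwidth(2)      # 16-bit samples
    f.setframerate(SAMPLE_RATE)
    f.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in samples))
```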
Adding Context to a Conversation
- Modify the script to support context
  - If you want the model to generate replies based on the previous conversation, add the context parameter. The modified code is as follows:

```python
from csm_mlx import CSM, csm_1b, generate, Segment
import mlx.core as mx
from huggingface_hub import hf_hub_download

# Initialize the model
csm = CSM(csm_1b())
weights = hf_hub_download(repo_id="senstella/csm-1b-mlx", filename="ckpt.safetensors")
csm.load_weights(weights)

# Build the conversation context
context = [
    Segment(speaker=0, text="Hello, what's the weather like today?", audio=mx.array([...])),
    Segment(speaker=1, text="It's nice and sunny.", audio=mx.array([...])),
]

# Generate a reply
audio = generate(
    csm,
    text="Let's go for a walk then!",
    speaker=0,
    context=context,
    max_audio_length_ms=5000,
)

# Save the audio
import audiofile
audiofile.write("reply.wav", audio, 22050)
```

  - Note: audio=mx.array([...]) needs the audio data of the earlier turns. If you don't have it, generate the audio first with basic generation and fill it in with that result.
- Run and test
  - Execute
    python generate_audio.py
    to generate the context-aware speech file reply.wav.
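Conceptually, the context is just an ordered list of turns that grows as the conversation proceeds. The toy sketch below illustrates that pattern with a stand-in Turn type (not the real csm_mlx Segment class, which also carries MLX audio arrays):

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    # Stand-in for csm_mlx's Segment: who spoke, what they said, their audio
    speaker: int
    text: str
    audio: list = field(default_factory=list)

def add_turn(context, speaker, text, audio):
    # Append the newest turn so later generations see the full history
    context.append(Turn(speaker, text, audio))
    return context

context = []
add_turn(context, 0, "Hello, what's the weather like today?", [0.0])
add_turn(context, 1, "It's nice and sunny.", [0.0])

# The next generate() call would receive this two-turn history
print(len(context), context[-1].speaker)
```

Each new reply you generate can be appended the same way, so the model always sees the full conversation so far.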
Parameter Tuning
- Temperature (temp): controls the randomness of the speech. Smaller values (e.g. 0.5) give more stable speech; larger values (e.g. 1.0) give more varied speech.
- Maximum length (max_audio_length_ms): in milliseconds, e.g. 5000 means 5 seconds.
- How to adjust: change the parameters in make_sampler or generate, then re-run the script.
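To build intuition for what temp and min_p do, here is a small dependency-free sketch of the two ideas (a simplified illustration, not the actual mlx_lm sampler implementation):

```python
import math

def softmax_with_temperature(logits, temp):
    # Dividing logits by a smaller temp sharpens the distribution
    # (more stable output); a larger temp flattens it (more varied).
    scaled = [l / temp for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def min_p_mask(probs, min_p):
    # min_p filtering drops tokens whose probability falls below
    # min_p times the probability of the most likely token.
    threshold = min_p * max(probs)
    return [p >= threshold for p in probs]

logits = [2.0, 1.0, 0.2]
sharp = softmax_with_temperature(logits, 0.5)  # low temp: stable
flat = softmax_with_temperature(logits, 1.0)   # higher temp: more varied
print(min_p_mask(flat, 0.3))
```

At low temperature the most likely token captures more of the probability mass, which is why speech comes out more stable; min_p then discards the long tail of unlikely tokens before sampling.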
Notes
- If you run into memory problems while generating audio, try reducing max_audio_length_ms.
- Make sure you have a good internet connection: the first run needs to download the weights file, which is a few GB in size.
Application Scenarios
- Educational aids
  Users can use csm-mlx to generate spoken explanations of teaching material: for example, input text and generate natural speech for listening practice.
- Virtual assistant development
  Developers can use csm-mlx to build intelligent voice assistants. Combined with the conversation-context feature, an assistant can generate coherent replies to what the user says.
- Content creation
  Podcast producers can use it to convert scripts to speech, quickly generating audio clips and saving recording time.
FAQ
- Does csm-mlx support Chinese?
  Yes, it accepts Chinese input and can generate Chinese speech. The quality depends on the training data, so it is recommended to test specific utterances to confirm.
- How much disk space is required?
  The model files are about 1-2 GB; with the dependency libraries and generated files, it is recommended to reserve 5 GB.
- Does it work on Windows?
  No. csm-mlx is designed for Apple Silicon, relies on the MLX framework, and currently only supports macOS.