
csm-mlx: CSM Speech Generation Model for Apple Devices

General Introduction

csm-mlx is an implementation of the CSM (Conversational Speech Model) voice conversation model built on Apple's MLX framework and optimized for Apple silicon. The project lets users run efficient speech generation and dialogue features on Apple devices in a simple way. Developer senstella released the project on March 15, 2025, with the goal of letting more people take advantage of the power of Apple devices and explore speech technology. At its core, the project provides a lightweight, easy-to-use tool that supports generating natural speech and handling conversational scenarios.


Function List

  • Speech generation: generate natural, human-sounding audio from input text.
  • Conversation context support: generate coherent spoken replies based on the content of previous turns.
  • Apple device optimization: run the model efficiently on Apple silicon through the MLX framework.
  • Open-source model loading: download pre-trained models from Hugging Face (e.g. csm-1b).
  • Adjustable parameters: tune sampler settings such as temperature (temp) and minimum probability (min_p) to control the generated output.

 

Usage Help

Installation process

To use csm-mlx locally, you first need to install a few dependencies and set up the environment. The detailed steps are below:

  1. Preparing the environment
    • Make sure you're using macOS and that the device is powered by Apple silicon (e.g. M1, M2 chips).
    • Install Python 3.10 or later. You can install it via Homebrew with brew install python@3.10.
    • Install Git with brew install git (skip this step if it is already installed).
  2. Cloning the project
    • Open a terminal and enter the following command to download the csm-mlx project:
      git clone https://github.com/senstella/csm-mlx.git
      
    • Go to the project folder:
      cd csm-mlx
      
  3. Creating a Virtual Environment
    • Create a Python virtual environment in the project directory:
      python3.10 -m venv .venv
      
    • Activate the virtual environment:
      source .venv/bin/activate
      
  4. Installation of dependencies
    • Install the Python packages the project needs:
      pip install -r requirements.txt
      
    • Note: make sure the MLX framework and the Hugging Face huggingface_hub library are installed. If you run into problems, you can install them separately with pip install mlx huggingface_hub.
  5. Downloading the model
    • csm-mlx uses the pre-trained csm-1b-mlx model. Run the following command to download it automatically:
      python -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='senstella/csm-1b-mlx', filename='ckpt.safetensors')"
      
    • The model file is saved in the local Hugging Face cache directory (usually ~/.cache/huggingface/hub). Once the download completes, you can verify the setup with the sketch below.
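
If you want to confirm that everything is in place before writing a full script, the short check below loads the model once and prints where the weights were cached. It is only a sanity-check sketch using the same calls shown later in this guide; the file name verify_setup.py is illustrative and not part of the project.

      # verify_setup.py -- sanity check that the environment and model weights are usable
      import mlx.core as mx
      from huggingface_hub import hf_hub_download
      from csm_mlx import CSM, csm_1b

      # Download the weights, or reuse the cached copy if they were already fetched
      weight = hf_hub_download(repo_id="senstella/csm-1b-mlx", filename="ckpt.safetensors")
      print("Weights cached at:", weight)

      # Build the model and load the weights; if this finishes without errors, the setup works
      csm = CSM(csm_1b())
      csm.load_weights(weight)
      print("csm-1b loaded on", mx.default_device())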

How to use

Once installed, you can run csm-mlx's speech generation feature with a Python script. Here are the steps to do so:

Basic Speech Generation

  1. Writing scripts
    • Create a file in the project directory, for example generate_audio.py, and enter the following code:
      from csm_mlx import CSM, csm_1b, generate
      from mlx_lm.sample_utils import make_sampler
      from huggingface_hub import hf_hub_download
      import numpy as np
      import audiofile

      # Initialize the model
      csm = CSM(csm_1b())
      weight = hf_hub_download(repo_id="senstella/csm-1b-mlx", filename="ckpt.safetensors")
      csm.load_weights(weight)

      # Generate audio
      audio = generate(
          csm,
          text="你好,我是 csm-mlx。",
          speaker=0,
          context=[],
          max_audio_length_ms=10000,  # maximum audio length: 10 seconds
          sampler=make_sampler(temp=0.5, min_p=0.1)
      )

      # Save the audio; convert the MLX array to NumPy first, 22050 is the sample rate
      audiofile.write("output.wav", np.asarray(audio), 22050)
      
    • Note: saving audio requires the audiofile library; install it with pip install audiofile.
  2. Running Scripts
    • Enter it in the terminal:
      python generate_audio.py
      
    • Running the script creates an output.wav file in the current directory; double-click the file to play it. A sketch for generating several clips in one run follows this list.
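
Because loading the weights is the slow part, it can be convenient to keep one model in memory and synthesize several lines in a row. The sketch below does that with the same API used above; the file name batch_generate.py, the example sentences, and the output file names are all illustrative.

      # batch_generate.py -- reuse one loaded model to synthesize several lines of text
      import numpy as np
      import audiofile
      from huggingface_hub import hf_hub_download
      from mlx_lm.sample_utils import make_sampler
      from csm_mlx import CSM, csm_1b, generate

      csm = CSM(csm_1b())
      csm.load_weights(hf_hub_download(repo_id="senstella/csm-1b-mlx", filename="ckpt.safetensors"))

      lines = [
          "Welcome to the show.",
          "Today we are looking at speech generation on Apple silicon.",
      ]

      sampler = make_sampler(temp=0.5, min_p=0.1)
      for i, text in enumerate(lines):
          audio = generate(
              csm,
              text=text,
              speaker=0,
              context=[],                 # no prior conversation
              max_audio_length_ms=10000,  # cap each clip at 10 seconds
              sampler=sampler,
          )
          # Convert the MLX array to NumPy and write one file per line
          audiofile.write(f"line_{i}.wav", np.asarray(audio), 22050)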

Adding Context to a Conversation

  1. Modifying the script to support context
    • If you want the model to generate replies based on previous turns, pass the context parameter. The modified code is as follows:
      from csm_mlx import CSM, csm_1b, generate, Segment
      import mlx.core as mx
      from huggingface_hub import hf_hub_download
      import numpy as np
      import audiofile

      # Initialize the model
      csm = CSM(csm_1b())
      weight = hf_hub_download(repo_id="senstella/csm-1b-mlx", filename="ckpt.safetensors")
      csm.load_weights(weight)

      # Build the conversation context
      context = [
          Segment(speaker=0, text="你好,今天天气怎么样?", audio=mx.array([...])),
          Segment(speaker=1, text="很好,阳光明媚。", audio=mx.array([...]))
      ]

      # Generate the reply
      audio = generate(
          csm,
          text="那我们出去走走吧!",
          speaker=0,
          context=context,
          max_audio_length_ms=5000
      )

      # Save the audio; convert the MLX array to NumPy first
      audiofile.write("reply.wav", np.asarray(audio), 22050)
      
    • Note: audio=mx.array([...]) requires the audio data for the earlier turns. If you don't have recordings, you can first generate each turn's audio with the basic generation step and then pass those results in; see the sketch after this list.
  2. Run and test
    • Run python generate_audio.py to produce the context-aware speech file reply.wav.
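
The note above suggests bootstrapping the context from generated audio when no recordings are available. The sketch below shows one way to do that; it assumes the array returned by generate can be passed directly as a Segment's audio, and the file name context_from_generated.py is illustrative.

      # context_from_generated.py -- synthesize the earlier turns first, then reuse their
      # audio as conversation context (assumes generate's output is accepted as Segment audio)
      import numpy as np
      import audiofile
      from huggingface_hub import hf_hub_download
      from csm_mlx import CSM, csm_1b, generate, Segment

      csm = CSM(csm_1b())
      csm.load_weights(hf_hub_download(repo_id="senstella/csm-1b-mlx", filename="ckpt.safetensors"))

      # 1. Generate audio for each earlier turn with basic (context-free) generation
      turns = [
          (0, "你好,今天天气怎么样?"),
          (1, "很好,阳光明媚。"),
      ]
      context = []
      for speaker, text in turns:
          turn_audio = generate(csm, text=text, speaker=speaker, context=[], max_audio_length_ms=5000)
          context.append(Segment(speaker=speaker, text=text, audio=turn_audio))

      # 2. Generate the reply conditioned on the accumulated context
      reply = generate(csm, text="那我们出去走走吧!", speaker=0, context=context, max_audio_length_ms=5000)
      audiofile.write("reply.wav", np.asarray(reply), 22050)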

Parameter Adjustment

  • Temperature (temp): Controls the randomness of speech. The smaller the value (e.g. 0.5), the more stable the speech; the larger the value (e.g. 1.0), the more varied the speech.
  • Maximum length (max_audio_length_ms): The unit is milliseconds, e.g. 5000 for 5 seconds.
  • Adjustment method: change the parameters in the make_sampler or generate call and re-run the script, as shown in the sketch below.
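
The following is a minimal sketch of tweaking the sampler. The contrast between the two settings mirrors the description above, while the example text, the output file name, and the min_p value of 0.05 are illustrative choices rather than project defaults.

      import numpy as np
      import audiofile
      from huggingface_hub import hf_hub_download
      from mlx_lm.sample_utils import make_sampler
      from csm_mlx import CSM, csm_1b, generate

      csm = CSM(csm_1b())
      csm.load_weights(hf_hub_download(repo_id="senstella/csm-1b-mlx", filename="ckpt.safetensors"))

      # Lower temperature gives steadier speech, higher temperature gives more varied delivery
      stable_sampler = make_sampler(temp=0.5, min_p=0.1)
      varied_sampler = make_sampler(temp=1.0, min_p=0.05)

      audio = generate(
          csm,
          text="Testing different sampler settings.",
          speaker=0,
          context=[],
          max_audio_length_ms=5000,  # 5000 ms = 5 seconds
          sampler=varied_sampler,    # swap in stable_sampler to compare
      )
      audiofile.write("sampler_test.wav", np.asarray(audio), 22050)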

Caveats

  • If you run into memory problems while generating audio, try reducing max_audio_length_ms; for longer text, one workaround is sketched below.
  • Make sure you have a good internet connection, since the first run needs to download the weights file, which is a couple of GB in size.
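
One way to stay within memory limits on longer scripts is to synthesize the text in short pieces and join the audio afterwards. This is only an illustrative sketch under the assumption that generate returns a one-dimensional audio array at the 22050 Hz rate used above; the sentence splitting and file name are placeholders.

      import numpy as np
      import audiofile
      from huggingface_hub import hf_hub_download
      from csm_mlx import CSM, csm_1b, generate

      csm = CSM(csm_1b())
      csm.load_weights(hf_hub_download(repo_id="senstella/csm-1b-mlx", filename="ckpt.safetensors"))

      long_text = "First sentence. Second sentence. Third sentence."

      # Split into sentences and cap each generation at 5 seconds to limit memory use
      pieces = [s.strip() + "." for s in long_text.split(".") if s.strip()]
      chunks = []
      for piece in pieces:
          audio = generate(csm, text=piece, speaker=0, context=[], max_audio_length_ms=5000)
          chunks.append(np.asarray(audio))

      # Join the short clips and write a single file
      audiofile.write("long_output.wav", np.concatenate(chunks), 22050)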

 

Application Scenarios

  1. Educational Aids
    Users can generate spoken explanations of teaching content with csm-mlx. For example, input a text passage and generate natural speech for listening practice.
  2. Virtual Assistant Development
    Developers can utilize csm-mlx to build intelligent voice assistants. Combined with the dialog context feature, the assistant can generate coherent responses based on the user's words.
  3. Content Creation
    Podcast producers can use it to convert scripts to speech, quickly generating audio clips and saving recording time.

 

QA

  1. Does csm-mlx support Chinese?
    Yes, it supports Chinese input and generates Chinese speech. However, the effect depends on the training data, and it is recommended to test specific utterances to confirm the quality.
  2. How much hard disk space is required?
    The model files are about 1-2 GB; including the dependency libraries and generated files, it is recommended to reserve 5 GB of space.
  3. Will it work on Windows?
    No, csm-mlx is designed for Apple silicon, relies on the MLX framework, and currently only supports macOS.