
csm-mlx: csm speech generation model for Apple devices

General Introduction

csm-mlx is an implementation of the CSM (Conversation Speech Model) built on MLX, Apple's machine-learning framework, and optimized for Apple Silicon. The project lets users run efficient speech generation and dialogue features on Apple devices with minimal setup. Developer senstella released the project on March 15, 2025, with the goal of letting more people take advantage of Apple hardware to explore speech technology. At its core, it is a lightweight, easy-to-use tool that generates natural speech and handles conversational scenarios.



 

Function List

  • Speech generation: turn input text into natural-sounding speech audio.
  • Conversation context support: generate coherent spoken replies based on the preceding conversation.
  • Apple device optimization: run the model efficiently on Apple Silicon via the MLX framework.
  • Open-source model loading: download pre-trained models from Hugging Face (e.g. csm-1b).
  • Adjustable parameters: tune sampler parameters such as temperature (temp) and minimum probability (min_p) to control the generated output.

 

Usage Guide

Installation process

To use csm-mlx locally, you first need to set up a few tools and dependencies. The detailed steps:

  1. Preparing the environment
    • Make sure you are on macOS with an Apple Silicon chip (e.g. M1, M2).
    • Install Python 3.10 or later, e.g. via Homebrew: brew install python@3.10.
    • Install Git with brew install git (skip if already installed).
  2. Cloning the project
    • Open a terminal and enter the following command to download the csm-mlx project:
      git clone https://github.com/senstella/csm-mlx.git
      
    • Go to the project folder:
      cd csm-mlx
      
  3. Creating a Virtual Environment
    • Create a Python virtual environment in the project directory:
      python3.10 -m venv .venv
      
    • Activate the virtual environment:
      source .venv/bin/activate
      
  4. Installing dependencies
    • Install the Python packages the project needs:
      pip install -r requirements.txt
      
    • Note: make sure the MLX framework and the Hugging Face huggingface_hub library are installed. If you run into problems, install them separately with pip install mlx huggingface_hub.
  5. Downloading the model
    • csm-mlx uses the pre-trained csm-1b-mlx model. Run the following to download it automatically:
      python -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='senstella/csm-1b-mlx', filename='ckpt.safetensors')"
      
    • The model file is saved in the local Hugging Face cache (usually ~/.cache/huggingface/hub).

How to use

Once installed, you can run csm-mlx's speech generation feature with a Python script. Here are the steps to do so:

Basic Speech Generation

  1. Writing the script
    • Create a file in the project directory, e.g. generate_audio.py, with the following code:
      from csm_mlx import CSM, csm_1b, generate
      from mlx_lm.sample_utils import make_sampler
      from huggingface_hub import hf_hub_download
      import numpy as np
      import audiofile
      # Initialize the model and load the downloaded weights
      csm = CSM(csm_1b())
      weights = hf_hub_download(repo_id="senstella/csm-1b-mlx", filename="ckpt.safetensors")
      csm.load_weights(weights)
      # Generate audio
      audio = generate(
          csm,
          text="Hello, this is csm-mlx.",
          speaker=0,
          context=[],
          max_audio_length_ms=10000,  # maximum audio length: 10 seconds
          sampler=make_sampler(temp=0.5, min_p=0.1)
      )
      # Save the audio (24000 is the sample rate CSM outputs)
      audiofile.write("output.wav", np.asarray(audio), 24000)
      
    • Note: saving the audio requires the audiofile and numpy libraries; install them with pip install audiofile numpy.
  2. Running the script
    • In the terminal, enter:
      python generate_audio.py
      
    • This creates output.wav in the current directory; double-click it to play it, or run afplay output.wav in the terminal.

Adding Context to a Conversation

  1. Modifying the script to support context
    • If you want the model to generate replies based on earlier turns, add the context parameter. Modify the code as follows:
      from csm_mlx import CSM, csm_1b, generate, Segment
      import mlx.core as mx
      import numpy as np
      import audiofile
      from huggingface_hub import hf_hub_download
      # Initialize the model and load the downloaded weights
      csm = CSM(csm_1b())
      weights = hf_hub_download(repo_id="senstella/csm-1b-mlx", filename="ckpt.safetensors")
      csm.load_weights(weights)
      # Create the conversation context
      context = [
          Segment(speaker=0, text="Hello, what's the weather like today?", audio=mx.array([...])),
          Segment(speaker=1, text="It's nice and sunny.", audio=mx.array([...]))
      ]
      # Generate a reply conditioned on the context
      audio = generate(
          csm,
          text="Let's go for a walk then!",
          speaker=0,
          context=context,
          max_audio_length_ms=5000
      )
      # Save the audio
      audiofile.write("reply.wav", np.asarray(audio), 24000)
      
    • Note: audio=mx.array([...]) is a placeholder for real audio data from earlier turns. If you have none, generate audio with the basic script first and use its result here.
  2. Running and testing
    • Run python generate_audio.py to generate the context-aware speech file reply.wav.
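As noted above, the placeholder audio arrays should be filled with real output from a basic generation pass. A minimal sketch of that chaining, assuming the same csm-mlx API used in the scripts above (the model weights must already be downloaded, so this is network- and hardware-dependent):

```python
from csm_mlx import CSM, csm_1b, generate, Segment
from huggingface_hub import hf_hub_download

# Load the model as in the earlier scripts
csm = CSM(csm_1b())
weights = hf_hub_download(repo_id="senstella/csm-1b-mlx", filename="ckpt.safetensors")
csm.load_weights(weights)

# Turn 1: generate audio for the first utterance with no context
first_text = "Hello, what's the weather like today?"
first_audio = generate(csm, text=first_text, speaker=0, context=[],
                       max_audio_length_ms=5000)

# Wrap the generated audio in a Segment so it can serve as context
context = [Segment(speaker=0, text=first_text, audio=first_audio)]

# Turn 2: the reply is conditioned on the first turn
reply = generate(csm, text="It's nice and sunny.", speaker=1,
                 context=context, max_audio_length_ms=5000)
```

Each turn you generate can be appended to the context list the same way, so a longer conversation is just repeated generate-then-wrap steps.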

Parameter Tuning

  • Temperature (temp): Controls the randomness of speech. The smaller the value (e.g. 0.5), the more stable the speech; the larger the value (e.g. 1.0), the more varied the speech.
  • Maximum length (max_audio_length_ms): The unit is milliseconds, e.g. 5000 for 5 seconds.
  • How to adjust: change the parameters in the make_sampler or generate call, then re-run the script.
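The effect of temp can be seen without running the model: temperature simply rescales the logits before the softmax, so low values concentrate probability on the most likely token (stable speech) while high values flatten the distribution (more varied speech). A small illustration independent of csm-mlx (the helper function is hypothetical, not part of the library):

```python
import math

def softmax_with_temperature(logits, temp):
    """Rescale logits by 1/temp, then apply a numerically stable softmax."""
    scaled = [x / temp for x in logits]
    m = max(scaled)                       # subtract the max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
sharp = softmax_with_temperature(logits, temp=0.5)  # more deterministic
flat = softmax_with_temperature(logits, temp=1.5)   # more varied
# At low temperature the top token receives a larger share of the probability.
```

min_p works downstream of this: it discards tokens whose probability falls below a fraction of the top token's, pruning unlikely outputs regardless of temperature.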

Caveats

  • If you run into memory problems while generating audio, try lowering max_audio_length_ms.
  • Make sure you have a good internet connection: the first run downloads the model weights, a file of a few gigabytes.

 

Application Scenarios

  1. Educational aids
    Users can use csm-mlx to generate speech explanations for teaching content. For example, input the text and generate natural speech for listening practice.
  2. Virtual Assistant Development
    Developers can utilize csm-mlx to build intelligent voice assistants. Combined with the dialog context feature, the assistant can generate coherent responses based on the user's words.
  3. Content creation
    Podcast producers can use it to convert scripts to speech, quickly generate audio clips and save recording time.

 

FAQ

  1. Does csm-mlx support Chinese?
    Yes, it supports Chinese input and generates Chinese speech. However, the effect depends on the training data, and it is recommended to test specific utterances to confirm the quality.
  2. How much hard disk space is required?
    The model files are about 1-2 GB, plus the dependency libraries and generated files, it is recommended to reserve 5 GB of space.
  3. Will it work on Windows?
    No, csm-mlx is designed for Apple silicon, relies on the MLX framework, and currently only supports macOS.