ZipVoice - Xiaomi's open source speech synthesis model series

What is ZipVoice

ZipVoice is a series of text-to-speech (TTS) models released by Xiaomi and built on flow matching. The series includes ZipVoice, a zero-shot single-speaker speech synthesis model, and ZipVoice-Dialog, a zero-shot conversational speech synthesis model. The models use Zipformer, an architecture originally designed for automatic speech recognition (ASR), as the backbone network, which keeps them lightweight and accelerates inference. Compared with DiT-based speech synthesis models, ZipVoice has about 63% fewer parameters at similar performance. The series performs well on objective metrics such as speaker similarity, word error rate, and UTMOS, as well as on subjective metrics such as CMOS and SMOS, reaching state-of-the-art (SOTA) performance among zero-shot speech synthesis models.
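
As a rough illustration of the "about 63% fewer parameters" claim, the sketch below back-computes the baseline size it implies, using the 123M parameter count listed among the features. The function name is hypothetical, and the result is only an implied estimate, not a figure reported in this article.

```python
def implied_baseline_params(reduced_params: float, reduction: float) -> float:
    """Return the baseline size implied by a fractional parameter reduction.

    If a model has `reduced_params` parameters after shrinking a baseline
    by `reduction` (0.63 means 63% fewer), the baseline had
    reduced_params / (1 - reduction) parameters.
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return reduced_params / (1.0 - reduction)


# 123M parameters and a 63% reduction are the figures quoted in the article.
baseline = implied_baseline_params(123e6, 0.63)
print(f"Implied DiT-based baseline: roughly {baseline / 1e6:.0f}M parameters")
```

Under these figures, the comparison baseline would be a little over 330M parameters, roughly 2.7 times the size of ZipVoice.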

Features of ZipVoice

Lightweight and fast: With only 123M parameters, ZipVoice offers fast inference and is suitable for deployment in resource-constrained environments. The distilled version, ZipVoice-Distill, is faster still, with a slight drop in performance but still high-quality output.
High-quality speech cloning: It excels in speaker similarity, speech naturalness, and intelligibility, generating speech that is highly similar to the original speaker, and is suitable for a wide range of voice interaction scenarios.
Multi-language support: It supports both Chinese and English, covering speech synthesis needs in either language.
Multi-mode speech generation: ZipVoice-Dialog generates two-person conversations in mono, while ZipVoice-Dialog-Stereo generates stereo conversations by assigning a separate channel to each speaker, enhancing the naturalness and immersion of the conversation.
Open Source and Ease of Use: The model files, training code and inference code have been fully open-sourced to facilitate secondary development and customization by developers. Detailed installation and usage guidelines are provided to lower the threshold of use.
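
To make the difference between mono and stereo dialogue output concrete, here is a minimal sketch of the two channel layouts. It only illustrates how audio samples are arranged, assuming 16-bit integer PCM; it is not how ZipVoice-Dialog-Stereo produces audio, and all function names are hypothetical.

```python
import struct


def interleave_stereo(left, right):
    """Pair two speakers' samples into (left, right) stereo frames.

    The shorter track is padded with silence (0) so both channels
    have the same number of frames.
    """
    n = max(len(left), len(right))
    left = list(left) + [0] * (n - len(left))
    right = list(right) + [0] * (n - len(right))
    return list(zip(left, right))


def mix_to_mono(left, right):
    """Average the two speakers into one channel (the mono layout)."""
    return [(l + r) // 2 for l, r in interleave_stereo(left, right)]


def pack_pcm16(frames):
    """Serialize stereo frames as little-endian 16-bit PCM: L, R, L, R, ..."""
    return b"".join(struct.pack("<hh", l, r) for l, r in frames)


# Toy sample values standing in for two speakers' audio.
speaker_a = [1000, 2000, 3000]
speaker_b = [-500, -1500]

stereo = interleave_stereo(speaker_a, speaker_b)
mono = mix_to_mono(speaker_a, speaker_b)
print(stereo)  # one (speaker_a, speaker_b) pair per frame
print(mono)    # one averaged sample per frame
```

In the stereo layout each speaker stays recoverable from their own channel, whereas the mono mix merges both voices into a single signal.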

ZipVoice's Core Advantages

Efficient modeling based on Zipformer: For the first time, the Zipformer architecture, originally designed for Automatic Speech Recognition (ASR), has been introduced to the TTS task, an innovation that has enabled ZipVoice to dramatically reduce the number of model parameters and improve the efficiency of model training and inference while maintaining high quality speech synthesis.
Average upsampling strategy: An average upsampling approach to speech-text alignment simplifies the alignment process and improves its stability and convergence speed. This enhances the intelligibility and clarity of the speech, so the synthesized audio reflects the input text more accurately.
Flow distillation: Flow distillation further reduces the number of sampling steps and removes the inference overhead of classifier-free guidance, so the model reaches near real-time speed (RTF≈1) on a single CPU thread using plain PyTorch inference code, which greatly improves the real-time performance of speech synthesis.
Small and fast: With only 123M parameters, ZipVoice offers fast inference and is suitable for deployment in resource-constrained environments. Its distilled version, ZipVoice-Distill, is even faster, with a slight drop in performance but still high-quality output.
High-quality speech cloning: It excels in speaker similarity, speech naturalness, and intelligibility, and is able to generate speech that is highly similar to the original speaker, which is applicable to a wide range of voice interaction scenarios.
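
The average upsampling and real-time factor (RTF) points above can be sketched in a few lines. This is a simplified illustration, assuming that every text token is given an (almost) equal share of the output frames; how ZipVoice handles leftover frames may differ, and the function names are hypothetical rather than part of the ZipVoice codebase.

```python
def average_upsample(tokens, num_frames):
    """Stretch a token sequence to num_frames with near-equal durations."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    if num_frames < len(tokens):
        raise ValueError("need at least one frame per token")
    n = len(tokens)
    # Frame i is assigned to token floor(i * n / num_frames), so each token
    # covers either floor(T/N) or ceil(T/N) consecutive frames.
    return [tokens[i * n // num_frames] for i in range(num_frames)]


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Three tokens spread over seven frames: no learned duration model is needed.
print(average_upsample(["a", "b", "c"], 7))
# → ['a', 'a', 'a', 'b', 'b', 'c', 'c']

# RTF of 1.0 means 10 s of speech takes 10 s to generate.
print(real_time_factor(10.0, 10.0))
# → 1.0
```

An RTF below 1 means speech is generated faster than it plays back, which is why RTF≈1 on a single CPU thread is described as near real-time.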

What is the official website of ZipVoice

Project website: https://zipvoice.github.io/
GitHub repository: https://github.com/k2-fsa/ZipVoice
Hugging Face model library: https://huggingface.co/k2-fsa/ZipVoice
arXiv technical paper: https://arxiv.org/pdf/2506.13053

Who ZipVoice is for

Speech technology developers: ZipVoice's open-source model files, training code, and inference code give developers ample resources for secondary development and customization, making it a good fit for people with some technical background who want to build on speech synthesis.
AI researchers: The model's innovations in speech synthesis, such as adopting the Zipformer architecture and the flow distillation method, offer researchers new ideas and an experimental platform that can help advance speech synthesis technology.
Intelligent voice assistant developers: ZipVoice's high-quality speech synthesis capability and multi-language support can provide a more natural and smooth voice interaction experience for intelligent voice assistants, which is suitable for developing intelligent voice assistant applications for users of different languages.
Audio content creators: ZipVoice can quickly generate high-quality speech for audiobooks, radio dramas, and other audio content, helping creators raise production efficiency and cut production costs.
Education practitioners: In language learning and online education, ZipVoice can generate standard pronunciation demonstrations to support teaching and help students learn pronunciation and expression, which suits scenarios that need multilingual teaching support.
Enterprise application developers: ZipVoice's lightweight design and fast inference allow efficient deployment in enterprise environments, improving the user experience for companies that integrate speech synthesis into enterprise applications such as customer service systems and smart guided tours.