General Introduction
Zonos is an open-source speech synthesis and voice cloning tool developed by Zyphra. The Zonos-v0.1 release ships both a pure transformer model and a hybrid model, and generates high-quality speech output. The tool supports multiple languages, including English, Japanese, Chinese, French, and German, and offers fine-grained control over audio quality and emotion. Zonos' voice cloning feature generates highly natural-sounding speech from just a few seconds of reference audio. Model weights and sample code are available on GitHub, and the tool can be tried out on Huggingface.
Feature List
- Zero-shot TTS voice cloning: input text and a 10-30 second speaker sample to generate high-quality speech output.
- Audio Prefix Input: Add text and audio prefixes for richer speaker matching.
- Multi-language support: English, Japanese, Chinese, French and German are supported.
- Audio quality and emotion control: Provides fine-grained control over many aspects of the generated audio, including speaking speed, pitch variation, audio quality, and emotion (e.g., happiness, fear, sadness, and anger).
- Real-time speech generation: Supports real-time generation of high-fidelity speech.
Usage Guide
Installation process
- Clone the project: run the following commands in a terminal to clone the Zonos repository:
```bash
git clone https://github.com/Zyphra/Zonos.git
cd Zonos
```
- Install dependencies: use the following command to install the required Python dependencies:
```bash
pip install -r requirements.txt
```
- Download model weights: Download the required model weights from Huggingface and place them in the project directory.
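In practice, loading by Hugging Face model ID (as in the Usage section below) will usually download and cache the weights automatically on the first run, so a manual download is mainly useful for offline setups. A minimal sketch, assuming a CUDA-capable machine with a CPU fallback (which may be slow):

```python
import torch
from zonos.model import Zonos

# Loading by model ID fetches the weights from Huggingface and caches
# them locally on first use; subsequent runs reuse the cache.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device=device)
```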
Usage
- Loading Models: Load the Zonos model in the Python environment:
```python
import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")
```
- Generate Speech: Provide text and speaker samples to generate speech output:
```python
# Load a reference clip and build a speaker embedding from it
wav, sampling_rate = torchaudio.load("assets/exampleaudio.mp3")
speaker = model.make_speaker_embedding(wav, sampling_rate)

# Condition on the text, the speaker embedding, and the target language
cond_dict = make_cond_dict(text="Hello, world!", speaker=speaker, language="en-us")
conditioning = model.prepare_conditioning(cond_dict)

# Generate audio codes and decode them back into a waveform
codes = model.generate(conditioning)
wavs = model.autoencoder.decode(codes).cpu()
torchaudio.save("sample.wav", wavs[0], model.autoencoder.sampling_rate)
```
- Using the Gradio Interface: The Gradio interface is recommended for speech generation:
```bash
uv run gradio_interface.py
# or
python gradio_interface.py
```
This generates a `sample.wav` file saved in the project root directory.
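Once the script is running, Gradio prints a local URL in the terminal (typically http://127.0.0.1:7860) that you can open in a browser to interact with the model.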
Detailed feature workflow
- Zero-shot TTS voice cloning:
- Input the desired text and a 10-30 second speaker sample, and the model will generate high-quality speech output.
- Audio prefix input:
- Add text and audio prefixes for richer speaker matching. For example, a whispered audio prefix can be used to produce a whispering effect; see the first sketch after this list.
- Multi-language support:
- Select the desired language (e.g., English, Japanese, Chinese, French, or German) and the model will generate speech output in the appropriate language.
- Audio quality and emotion control:
- Use the model's conditioning settings to finely control aspects of the generated audio, including speaking speed, pitch variation, audio quality, and emotion (e.g., happiness, fear, sadness, and anger); see the second sketch after this list.
- Real-time speech generation:
- Use the Gradio interface or other real-time generation methods to quickly generate high-fidelity speech.
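The audio prefix flow can be sketched in code. This is a minimal sketch, not the project's documented API: the `assets/whisper_prefix.wav` file and the `audio_prefix_codes` argument are assumptions for illustration, so check the repository's sample code for the exact names. It continues from the variables defined in the Usage section.

```python
import torchaudio
import torchaudio.functional as F

# Hypothetical whispered clip used as the audio prefix
prefix_wav, prefix_sr = torchaudio.load("assets/whisper_prefix.wav")

# The autoencoder presumably expects audio at its own sampling rate,
# so resample the prefix before encoding it into codes.
prefix_wav = F.resample(prefix_wav, prefix_sr, model.autoencoder.sampling_rate)
prefix_codes = model.autoencoder.encode(prefix_wav.unsqueeze(0).to("cuda"))

# Passing the prefix codes nudges the generated speech toward the
# prefix's delivery style (here, whispering).
codes = model.generate(conditioning, audio_prefix_codes=prefix_codes)
```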
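Similarly, fine-grained audio and emotion control runs through the conditioning dictionary. The sketch below assumes `make_cond_dict` accepts `emotion`, `speaking_rate`, and `pitch_std` keys with the value ranges shown; consult `zonos.conditioning` for the exact names and defaults.

```python
# A minimal sketch of fine-grained conditioning. The key names and
# value ranges below are assumptions for illustration only.
cond_dict = make_cond_dict(
    text="I can't believe we won!",
    speaker=speaker,
    language="en-us",
    emotion=[0.8, 0.05, 0.05, 0.05, 0.05, 0.0, 0.0, 0.0],  # hypothetical per-emotion weights
    speaking_rate=12.0,  # hypothetical: lower => slower speech
    pitch_std=60.0,      # hypothetical: higher => more pitch variation
)
conditioning = model.prepare_conditioning(cond_dict)
codes = model.generate(conditioning)
wavs = model.autoencoder.decode(codes).cpu()
torchaudio.save("excited.wav", wavs[0], model.autoencoder.sampling_rate)
```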