General Introduction
Zonos is an open-source speech synthesis and voice cloning tool developed by Zyphra. The Zonos-v0.1 release ships both a pure transformer model and a hybrid model, and generates high-quality speech output. The tool supports multiple languages, including English, Japanese, Chinese, French, and German, and offers fine-grained control over audio quality and emotion. Zonos' voice cloning feature generates highly natural-sounding speech from just a few seconds of reference audio. Model weights and sample code are available on GitHub, and the tool can be tried out on Huggingface.
Feature List
- Zero-shot TTS voice cloning: input text and a 10-30 second speaker sample to generate high-quality speech output.
- Audio Prefix Input: Add text and audio prefixes for richer speaker matching.
- Multi-language support: English, Japanese, Chinese, French and German are supported.
- Audio quality and emotion control: Provides fine-grained control over many aspects of the generated audio, including speaking speed, pitch variation, audio quality, and emotion (e.g., happiness, fear, sadness, and anger).
- Real-time speech generation: Supports real-time generation of high-fidelity speech.
Usage Guide
Installation process
- Clone the project: run the following commands in a terminal to clone the Zonos repository:
```bash
git clone https://github.com/Zyphra/Zonos.git
cd Zonos
```
- Install dependencies: use the following command to install the required Python dependencies:
```bash
pip install -r requirements.txt
```
- Download model weights: Download the required model weights from Huggingface and place them in the project directory.
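In practice, loading by Hugging Face model ID (as in the Usage section below) will usually download and cache the weights automatically on the first run, so a manual download is mainly useful for offline setups. A minimal sketch, assuming a CUDA-capable machine with a CPU fallback (which may be slow):

```python
import torch
from zonos.model import Zonos

# Loading by model ID fetches the weights from Huggingface and caches
# them locally on first use; subsequent runs reuse the cache.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device=device)
```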
Usage
- Loading Models: Load the Zonos model in the Python environment:
```python
import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")
```
- Generate Speech: Provide text and speaker samples to generate speech output:
```python
# Load a reference clip and build a speaker embedding from it
wav, sampling_rate = torchaudio.load("assets/exampleaudio.mp3")
speaker = model.make_speaker_embedding(wav, sampling_rate)

# Condition on the text, the speaker embedding, and the target language
cond_dict = make_cond_dict(text="Hello, world!", speaker=speaker, language="en-us")
conditioning = model.prepare_conditioning(cond_dict)

# Generate audio codes and decode them back into a waveform
codes = model.generate(conditioning)
wavs = model.autoencoder.decode(codes).cpu()
torchaudio.save("sample.wav", wavs[0], model.autoencoder.sampling_rate)
```
- Using the Gradio Interface: The Gradio interface is recommended for speech generation:
```bash
uv run gradio_interface.py
# or
python gradio_interface.py
```
This generates a `sample.wav` file saved in the project root directory.
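Once the script is running, Gradio prints a local URL in the terminal (typically http://127.0.0.1:7860) that you can open in a browser to interact with the model.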
Detailed feature workflow
- Zero-shot TTS voice cloning:
- Input the desired text and a 10-30 second speaker sample, and the model will generate high-quality speech output.
- Audio prefix input:
- Add text and audio prefixes for richer speaker matching. For example, a whispered audio prefix can be used to produce a whispering effect; see the first sketch after this list.
- Multi-language support:
- Select the desired language (e.g., English, Japanese, Chinese, French, or German) and the model will generate speech output in the appropriate language.
- Audio quality and emotion control:
- Use the model's conditioning settings to finely control aspects of the generated audio, including speaking speed, pitch variation, audio quality, and emotion (e.g., happiness, fear, sadness, and anger); see the second sketch after this list.
- Real-time speech generation:
- Use the Gradio interface or other real-time generation methods to quickly generate high-fidelity speech.
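The audio prefix flow can be sketched in code. This is a minimal sketch, not the project's documented API: the `assets/whisper_prefix.wav` file and the `audio_prefix_codes` argument are assumptions for illustration, so check the repository's sample code for the exact names. It continues from the variables defined in the Usage section.

```python
import torchaudio
import torchaudio.functional as F

# Hypothetical whispered clip used as the audio prefix
prefix_wav, prefix_sr = torchaudio.load("assets/whisper_prefix.wav")

# The autoencoder presumably expects audio at its own sampling rate,
# so resample the prefix before encoding it into codes.
prefix_wav = F.resample(prefix_wav, prefix_sr, model.autoencoder.sampling_rate)
prefix_codes = model.autoencoder.encode(prefix_wav.unsqueeze(0).to("cuda"))

# Passing the prefix codes nudges the generated speech toward the
# prefix's delivery style (here, whispering).
codes = model.generate(conditioning, audio_prefix_codes=prefix_codes)
```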
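Similarly, fine-grained audio and emotion control runs through the conditioning dictionary. The sketch below assumes `make_cond_dict` accepts `emotion`, `speaking_rate`, and `pitch_std` keys with the value ranges shown; consult `zonos.conditioning` for the exact names and defaults.

```python
# A minimal sketch of fine-grained conditioning. The key names and
# value ranges below are assumptions for illustration only.
cond_dict = make_cond_dict(
    text="I can't believe we won!",
    speaker=speaker,
    language="en-us",
    emotion=[0.8, 0.05, 0.05, 0.05, 0.05, 0.0, 0.0, 0.0],  # hypothetical per-emotion weights
    speaking_rate=12.0,  # hypothetical: lower => slower speech
    pitch_std=60.0,      # hypothetical: higher => more pitch variation
)
conditioning = model.prepare_conditioning(cond_dict)
codes = model.generate(conditioning)
wavs = model.autoencoder.decode(codes).cpu()
torchaudio.save("excited.wav", wavs[0], model.autoencoder.sampling_rate)
```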