
Orpheus-TTS: Text-to-Speech Tool for Generating Natural Chinese Speech

General Introduction

Orpheus-TTS is an open-source text-to-speech (TTS) system built on the Llama-3b architecture with the goal of generating audio close to natural human speech. Released by the Canopy AI team, it supports multiple languages including English, Spanish, French, German, Italian, Portuguese, and Chinese. The system turns text input into speech with natural intonation, emotion, and rhythm, and supports non-verbal expressions such as laughter and sighs, making it suitable for real-time conversation, audiobook production, and intelligent-assistant development. Orpheus-TTS provides both pre-trained and fine-tuned models, allowing developers to customize voices or languages to their needs.



 

Feature List

  • Generate near-human speech: turns text into audio with natural intonation, emotion, and rhythm, outperforming some closed-source models.
  • Zero-shot voice cloning: mimics the timbre of a target voice without additional training.
  • Emotion and intonation control: tags such as <laugh> and <sigh> adjust vocal expression to enhance realism.
  • Low-latency streaming output: real-time generation latency is about 200ms, optimizable down to around 100ms.
  • Multi-language support: English, Spanish, French, German, Italian, Portuguese, Chinese and other languages.
  • Model fine-tuning: provides data processing scripts and example datasets to support developers in customizing speech styles or languages.
  • Local and cloud operation: runs locally or in the cloud via LM Studio, llama.cpp, or vLLM.
  • Audio Watermarking: Protect copyrights by watermarking generated audio with Silent Cipher technology.

 

Usage Guide

Installation Process

Orpheus-TTS requires a Python environment and basic hardware support (a GPU is recommended). Detailed installation steps follow:

  1. Clone the repository
    Download the Orpheus-TTS project using Git:

    git clone https://github.com/canopyai/Orpheus-TTS.git
    cd Orpheus-TTS
    
  2. Install dependencies
    Install the core package orpheus-speech, which relies on vLLM for fast inference:

    pip install orpheus-speech
    

    Note: the March 18, 2025 release of vLLM may contain bugs; installing a pinned stable version is recommended:

    pip install vllm==0.7.3
    

    Other dependencies include transformers, datasets, and torch, which can be installed as needed:

    pip install transformers datasets torch
    
  3. Check hardware requirements
    • Python version: 3.8-3.11 (3.12 is not supported).
    • GPU: an NVIDIA GPU with CUDA drivers and more than 12GB of video memory is recommended for smooth operation (a quick check script follows these steps).
    • CPU: supported via orpheus-cpp with lower performance; fine for light testing.
    • Network: the first run downloads the model, so a stable connection is recommended.
  4. Download a model
    Orpheus-TTS offers the following models, hosted on Hugging Face:

    • Fine-tuned model (canopylabs/orpheus-tts-0.1-finetune-prod): suited to everyday speech generation tasks.
    • Pre-trained model (canopylabs/orpheus-tts-0.1-pretrained): trained on 100,000 hours of English speech data, suited to advanced tasks such as voice cloning.
    • Multilingual models (canopylabs/orpheus-multilingual-research-release): 7 sets of pre-trained and fine-tuned models supporting multilingual research.
      Download method:
    huggingface-cli download canopylabs/orpheus-tts-0.1-finetune-prod
    
  5. Test the installation
    Run the following Python script to verify that the installation succeeded:

    import wave
    from orpheus_tts import OrpheusModel

    model = OrpheusModel(model_name="canopylabs/orpheus-tts-0.1-finetune-prod")
    prompt = "tara: 你好,这是一个测试!"
    audio = model.generate_speech(prompt)
    with wave.open("test_output.wav", "wb") as file:
        file.setnchannels(1)      # mono
        file.setsampwidth(2)      # 16-bit samples
        file.setframerate(24000)  # 24 kHz sampling rate
        file.writeframes(audio)
    

    After the script runs successfully, a WAV file containing the spoken version of the input text is generated.
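
As a supplement to step 3, the following minimal sketch checks whether PyTorch can see a CUDA GPU and whether its memory meets the recommended 12GB. It only assumes the torch package from the dependency step:

    import torch

    # Report the detected GPU and its video memory; warn if below the
    # recommended 12 GB or if no CUDA device is available at all.
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        vram_gb = props.total_memory / 1024**3
        print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")
        if vram_gb < 12:
            print("Warning: less than 12 GB of VRAM; generation may hit out-of-memory errors.")
    else:
        print("No CUDA GPU detected; consider the orpheus-cpp CPU path described below.")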

Main Functions

1. Generation of near-human speech

The core function of Orpheus-TTS is converting text into audio close to natural human speech. The system supports a range of voices; in the English model, tara sounds the most natural, while other voices include leah, jess, leo, dan, mia, zac, and zoe. For voices in the multilingual models, refer to the official documentation. Procedure:

  • Prepare a text prompt in the format {voice name}: {text}, for example tara: 你好,今天天气很好!
  • Call the generate function:
    audio = model.generate_speech(prompt="tara: 你好,今天天气很好!")
    
  • The output is audio in WAV format with a sampling rate of 24000 Hz in mono.
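
As a quick illustration, the sketch below generates the same sentence with several English voices and saves each to its own WAV file. It assumes, as in the test script above, that generate_speech returns 16-bit mono PCM at 24000 Hz:

    import wave
    from orpheus_tts import OrpheusModel

    model = OrpheusModel(model_name="canopylabs/orpheus-tts-0.1-finetune-prod")

    # Generate the same sentence with three different voices.
    for voice in ["tara", "leah", "leo"]:
        audio = model.generate_speech(prompt=f"{voice}: 你好,今天天气很好!")
        with wave.open(f"{voice}.wav", "wb") as f:
            f.setnchannels(1)      # mono
            f.setsampwidth(2)      # 16-bit samples
            f.setframerate(24000)  # 24 kHz sampling rate
            f.writeframes(audio)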

2. Emotion and tone control

Users can control the emotional expression of speech through tags to enhance speech realism. Supported tags include:

  • <laugh>: Laughter.
  • <sigh>: Sigh.
  • <cough>: Cough.
  • <sniffle>: Sniffles.
  • <groan>: Groan.
  • <yawn>: Yawn.
  • <gasp>: Gasp (surprise).
    Example:

    prompt = "tara: 这个消息太震撼了!<gasp> 你听说了吗?"
    audio = model.generate_speech(prompt)

The generated audio will include a gasp effect at the marked position. See the Hugging Face documentation for tag support in the multilingual models.

3. Zero-shot voice cloning

Orpheus-TTS supports zero-shot voice cloning, imitating a target voice directly without additional training. Steps:

  • Prepare a WAV file of the target voice, with a recommended duration of 10-30 seconds and a sampling rate of 24000 Hz.
  • Use the following script:
    audio_ref = "path/to/reference.wav"
    prompt = "tara: 这段话会模仿你的声音!"
    audio = model.generate_with_voice_clone(prompt, audio_ref)
    
  • The output audio will be close to the timbre of the reference speech. The pre-trained model performs better in the cloning task.
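
Before cloning, it can help to confirm that the reference file matches the recommended specs (about 10-30 seconds at 24000 Hz). A small helper using only the standard library, assuming an uncompressed PCM WAV input:

    import wave

    def check_reference(path):
        """Warn if a reference WAV deviates from the recommended cloning specs."""
        with wave.open(path, "rb") as f:
            rate = f.getframerate()
            duration = f.getnframes() / rate
        if rate != 24000:
            print(f"Warning: sampling rate is {rate} Hz; 24000 Hz is recommended.")
        if not 10 <= duration <= 30:
            print(f"Warning: duration is {duration:.1f} s; 10-30 s is recommended.")

    check_reference("path/to/reference.wav")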

4. Low-latency streaming output

Orpheus-TTS provides streaming speech generation for real-time conversational scenarios, with a latency of about 200ms, optimizable down to around 100ms. Example:

from flask import Flask, Response
from orpheus_tts import OrpheusModel

app = Flask(__name__)
model = OrpheusModel(model_name="canopylabs/orpheus-tts-0.1-finetune-prod")

@app.route("/stream")
def stream_audio():
    def generate():
        prompt = "tara: 这是实时语音测试!"
        # Yield audio chunks as they are produced for low-latency playback
        for chunk in model.stream_generate(prompt):
            yield chunk
    return Response(generate(), mimetype="audio/wav")

if __name__ == "__main__":
    app.run(port=5000)

After starting the Flask service, open http://localhost:5000/stream to hear the speech in real time. Optimizing latency requires enabling KV caching and input streaming.
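
On the client side, the stream can be consumed incrementally instead of waiting for the whole response. A minimal sketch using the requests library against the endpoint above; note that depending on what stream_generate yields, the chunks may be raw PCM that needs a WAV header before playback:

    import requests

    # Fetch the audio stream chunk by chunk and append to a file as it arrives.
    with requests.get("http://localhost:5000/stream", stream=True) as resp:
        resp.raise_for_status()
        with open("streamed_output.wav", "wb") as f:
            for chunk in resp.iter_content(chunk_size=4096):
                f.write(chunk)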

5. Model fine-tuning

Developers can fine-tune the model to support new languages or speech styles. The steps are as follows:

  • Prepare the dataset: follow the format of the Hugging Face example dataset (canopylabs/zac-sample-dataset). 50-300 samples per voice are recommended, with 300 samples giving the best results.
  • Preprocess the data: use the officially provided Colab notebook (1wg_CPCA-MzsWtsujwy-1Ovhv-tn8Q1nD); processing takes about 1 minute per thousand entries.
  • Configure training: edit finetune/config.yaml to set the dataset path and parameters.
  • Run training:
    huggingface-cli login
    wandb login
    accelerate launch train.py
    
  • The fine-tuned model can be used for specific tasks such as Chinese speech optimization.
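
The exact schema is defined by the example dataset, but in principle a fine-tuning dataset can be assembled with the datasets library. A hedged sketch, assuming audio and text column names like those in canopylabs/zac-sample-dataset (check the example dataset before training):

    from datasets import Dataset, Audio

    # Hypothetical paired samples; file paths and column names are placeholders.
    samples = {
        "audio": ["clips/sample_001.wav", "clips/sample_002.wav"],
        "text": ["你好,这是第一条样本。", "这是第二条样本。"],
    }

    ds = Dataset.from_dict(samples)
    ds = ds.cast_column("audio", Audio(sampling_rate=24000))  # decode at 24 kHz
    ds.push_to_hub("your-username/orpheus-finetune-data")     # path then goes in config.yaml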

6. Audio watermarking

Orpheus-TTS supports watermarking generated audio with Silent Cipher technology for copyright protection. Steps:

  • Refer to the official implementation script (additional_inference_options/watermark_audio).
  • Example:
    from orpheus_tts import add_watermark
    audio = model.generate_speech(prompt)
    watermarked_audio = add_watermark(audio, watermark_id="user123")
    

7. Inference without a GPU

Users without a GPU can run the model on a CPU via orpheus-cpp. Steps:

  • Set up the llama.cpp environment.
  • Refer to the official documentation (additional_inference_options/no_gpu/README.md).
  • Performance is lower than on a GPU; suitable for lighter tasks.

Notes

  • Model selection: fine-tuned models suit everyday tasks; pre-trained models suit advanced tasks such as voice cloning.
  • Hardware limits: less than 12GB of video memory may cause out-of-memory errors; check your hardware configuration first.
  • Multi-language support: non-English languages require the multilingual model documentation, and some languages may need fine-tuning.
  • Dependency issues: if vLLM reports an error (e.g. vllm._C), try a different version or check CUDA compatibility.

Supplementary Functions: Community Extensions

The open-source nature of Orpheus-TTS has attracted community contributions; the following implementations (not fully verified) are officially recommended:

  • LM Studio local client: run Orpheus-TTS via the LM Studio API (isaiahbjork/orpheus-tts-local).
  • OpenAI-compatible FastAPI: provides an OpenAI-style API interface (Lex-au/Orpheus-FastAPI).
  • Gradio WebUI: web interface with WSL and CUDA support (Saganaki22/OrpheusTTS-WebUI).
  • Hugging Face Space: an online demo built by community user MohamedRashad (MohamedRashad/Orpheus-TTS).

 

Application Scenarios

  1. Intelligent Customer Service Robot
    Orpheus-TTS generates natural speech for customer service systems, supporting real-time dialog and emotional expression to enhance the user experience.
    For example, e-commerce platforms can integrate Orpheus-TTS to add a friendly tone when responding to customer inquiries.
  2. Audiobook and podcast production
    Publishers can turn novels or articles into audiobooks, using multiple voices and emotion tags while reducing voiceover costs.
    Podcast creators can generate dynamic opening lines to increase the appeal of their programs.
  3. Language Learning Tools
    Educational apps can generate standard-pronunciation speech to help students practice listening and speaking.
    For example, Chinese learners can use the Chinese model to practice Mandarin pronunciation.
  4. Game Character Voiceover
    Game developers can generate dynamic dialog for NPCs, supporting multi-language and emotional expression for enhanced immersion.
    For example, RPGs can generate unique sounds for different characters.
  5. Accessibility aids
    Orpheus-TTS provides real-time reading support for visually impaired users by converting text to speech.
    For example, integrating into an e-book reader to read long articles aloud.

 

QA

  1. What languages does Orpheus-TTS support?
    English, Spanish, French, German, Italian, Portuguese and Chinese are supported. The multilingual model covers more languages, see the Hugging Face documentation.
  2. How to optimize real-time voice latency?
    Enable KV caching and input streaming to reduce latency from 200ms to about 100ms. Ensure sufficient GPU performance, with at least 12GB of video memory.
  3. What preparation does zero-shot voice cloning require?
    Provide 10-30 seconds of reference audio in WAV format at a 24000 Hz sampling rate; the pre-trained model performs better at cloning.
  4. Can the CPU run Orpheus-TTS?
    Yes, via orpheus-cpp. Performance is lower than on a GPU, making it suitable for testing or light tasks.
  5. How do I add a watermark to my audio?
    Use Silent Cipher technology by calling the add_watermark function; refer to the official script for implementation.
  6. How much data is needed for fine-tuning?
    50 samples give initial results; 300 samples per voice give optimal quality. Data must follow the Hugging Face dataset format.