Fish Agent: end-to-end AI voice cloning assistant, real-time voice conversation assistant, Fish Speech spin-off project

Latest AI Resources, released 7 months ago by AI Sharing Circle

General Introduction

Fish Agent, a project derived from Fish Speech, is an end-to-end AI voice cloning system built on the V0.1 3B model architecture. Its defining feature is a semantic-token-free architecture: it does not rely on traditional semantic encoders/decoders such as Whisper and performs speech-to-speech conversion directly. With latency as low as 150 ms, the system captures and generates ambient audio information accurately enough for near-real-time voice cloning. Pre-trained models are available for download, and Fish Agent supports local deployment and training as well as cloud service calls, giving developers and users flexible options. It integrates speech recognition and speech synthesis, along with precise tone control, to deliver natural, fluid voice interaction.

Highlights: end-to-end architecture, zero-shot voice cloning, a compact model with 3 billion parameters, multilingual support, and fast response. The training data includes 700,000 hours of multilingual audio, and the model is obtained by continued pre-training of Qwen-2.5-3B-Instruct. The model, named Fish Agent V0.1 3B, integrates the ASR and TTS components itself, so no external models are needed. This makes processing truly end-to-end, unlike the traditional three-stage pipeline (ASR + LLM + TTS).

Online demo: https://huggingface.co/spaces/fishaudio/fish-agent

Feature List

Ultra-low-latency voice cloning: response time as low as 150 ms, with support for real-time voice conversion
Semantic-token-free architecture: an end-to-end speech processing design with no separate semantic encoder/decoder
Precise tone control: tone is adjusted via reference audio
Ambient audio processing: high-fidelity reproduction of environmental sound information
Open pre-trained models: support for local deployment and training
Cloud service API: convenient access through cloud interface calls
Personalized training: support for training custom voice models

Usage Guide

1. System requirements

Python 3.8 or higher
NVIDIA GPU (recommended)
8GB or more of system memory
CUDA support (recommended)
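
The requirements above can be checked before installing anything. The following is a minimal sketch and not part of Fish Agent: it assumes the runtime is built on PyTorch, which this article does not state, and the function name check_environment is made up for illustration. It reports only the Python version and whether a CUDA device is visible; it does not check system memory.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 8)  # minimum version listed in the requirements above


def check_environment():
    """Return a dict describing whether this machine meets the listed requirements."""
    report = {
        "python_ok": sys.version_info[:2] >= MIN_PYTHON,
        # Assumption: PyTorch is the framework used to reach the GPU.
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        import torch  # imported lazily so the check still runs without PyTorch

        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name}: {value}")
```

Since the GPU and CUDA are listed as recommended rather than required, a False value for cuda_available is a warning and not necessarily a blocker.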

2. Installation steps

Environment setup

# Create a virtual environment
python -m venv fish-agent-env
source fish-agent-env/bin/activate  # Linux/Mac
# or
fish-agent-env\Scripts\activate  # Windows

Installing Fish Agent

# Install directly
pip install fish-agent
# Or install from source
git clone https://github.com/fishaudio/fish-agent
cd fish-agent
pip install -e .

3. Usage

3.1 Using the online service

You can now try our agent demo online for live English chat, or follow the documentation to run English and Chinese chat locally.

The demo is an early alpha release: inference speed still needs optimization and many bugs remain. If you find a bug or want to fix one, we welcome issues and pull requests.

https://fish.audio/zh-CN/demo/live/

3.2 Local deployment

Starting the service

from fish_agent import VoiceAgent
# Initialize Fish Agent
agent = VoiceAgent()
# Start the local service
agent.start_server(port=7860)

Voice cloning example

# Load the reference audio
reference_audio = "path/to/reference.wav"
agent.load_reference(reference_audio)
# Generate the cloned speech
text = "这是一段测试语音"  # "This is a test utterance"
output_path = "output.wav"
agent.generate_speech(text, output_path)

Real-time conversion settings

# Start real-time voice conversion
agent.start_realtime_conversion(
    input_device=0,   # input device ID
    output_device=1,  # output device ID
    reference_audio="path/to/reference.wav",
)

4. Advanced feature configuration

4.1 Tone Parameter Adjustment

Tone control parameters:
- Pitch: -12 to 12
- Speech speed: 0.5 to 2.0
- Emotion intensity: 0 to 1.0
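
As an illustration of the ranges above, the sketch below clamps out-of-range values before they are passed on. The class ToneSettings, its field names, and its default values are assumptions made for this example; the article does not show how Fish Agent actually accepts these parameters.

```python
from dataclasses import dataclass

# Ranges copied from the list above; names and defaults are illustrative only.
PITCH_RANGE = (-12.0, 12.0)
SPEED_RANGE = (0.5, 2.0)
EMOTION_RANGE = (0.0, 1.0)


def _clamp(value, bounds):
    """Force value into the closed interval given by bounds."""
    low, high = bounds
    return max(low, min(high, value))


@dataclass(frozen=True)
class ToneSettings:
    pitch: float = 0.0              # assumed neutral default
    speed: float = 1.0              # assumed neutral default
    emotion_intensity: float = 0.5  # assumed default, not from the source

    def clamped(self):
        """Return a copy with every field forced into its documented range."""
        return ToneSettings(
            pitch=_clamp(self.pitch, PITCH_RANGE),
            speed=_clamp(self.speed, SPEED_RANGE),
            emotion_intensity=_clamp(self.emotion_intensity, EMOTION_RANGE),
        )
```

Clamping silently corrects bad input; raising an error instead may be preferable when out-of-range values indicate a bug in the caller.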

4.2 Batch processing

# Batch text processing
texts = ["Text 1", "Text 2", "Text 3"]
agent.batch_process(texts, output_dir="outputs/")

4.3 API calls

# API call example
import requests
url = "https://speech.fish.audio/api/v1/generate"
payload = {
    "text": "text to convert",
    "reference_audio": "base64-encoded audio file",
}
response = requests.post(url, json=payload)
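
The reference_audio field in the request above is a placeholder. A minimal sketch of how the real value might be produced follows. It assumes the endpoint expects plain standard base64 of the file bytes with no data-URI prefix, which this article does not confirm, and build_payload is a hypothetical helper, not part of any Fish Audio client. No network request is made.

```python
import base64
import json


def build_payload(text, reference_audio_bytes):
    """Build the JSON body shown above from text and raw reference-audio bytes."""
    # Assumption: the API wants standard base64 as an ASCII string.
    encoded = base64.b64encode(reference_audio_bytes).decode("ascii")
    return {"text": text, "reference_audio": encoded}


if __name__ == "__main__":
    # Placeholder bytes stand in for the contents of a real WAV file,
    # which would normally be read with open(path, "rb").read().
    payload = build_payload("Text to convert", b"RIFF....WAVE")
    print(json.dumps(payload, indent=2))
```

The resulting dictionary can be passed as the json argument of requests.post in the same way as the payload in the example above.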

5. Precautions for use

Reference audio quality strongly affects cloning results; use clear recordings without background noise.
Keep the text for a single request within 200 words.
Real-time conversion gives better results with a good microphone.
Commercial use requires specific authorization.
Update the model regularly for best performance.
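
The 200-word recommendation above can be applied automatically by splitting long input before synthesis. The sketch below is one possible approach, not Fish Agent functionality: split_text is a hypothetical helper, it counts whitespace-separated words, and it therefore only suits languages written with spaces (Chinese text would need a different unit, such as characters). It also normalizes runs of whitespace to single spaces.

```python
import re

MAX_WORDS = 200  # limit recommended in the list above


def split_text(text, max_words=MAX_WORDS):
    """Split text into chunks of at most max_words whitespace-separated words,
    breaking at sentence boundaries where possible."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sentence in sentences:
        words = sentence.split()
        if not words:
            continue
        # A single over-long sentence is cut hard at the word limit.
        while len(words) > max_words:
            if current:
                chunks.append(" ".join(current))
                current = []
            chunks.append(" ".join(words[:max_words]))
            words = words[max_words:]
        # Start a new chunk if this sentence would overflow the current one.
        if len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each returned chunk can then be synthesized separately, for example by passing the list to a batch call.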

6. Troubleshooting common problems

Audio output issues
- Check the audio output device settings
- Verify the system volume configuration
- Confirm that the audio format is supported

Performance optimization
- Verify that the GPU is properly enabled
- Adjust the batch parameters
- Clear the cache regularly

Installation issues
- Verify Python version compatibility
- Confirm the CUDA environment configuration
- Consider using a conda environment

API usage
- Check the network connection
- Confirm the API permission configuration
- Verify the server response
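
When checking audio format support, it helps to look at what a file actually contains. The sketch below uses Python's standard wave module, which reads uncompressed PCM WAV only, to report the basic properties of a file. It assumes the reference audio is a WAV file, as in the examples above. The helper names describe_wav and make_silent_wav are made up for this illustration, and the article does not say which sample rates or channel layouts Fish Agent accepts.

```python
import io
import wave


def describe_wav(source):
    """Return basic format facts for a WAV file given a path string or binary file object."""
    with wave.open(source, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate if rate else 0.0,
        }


def make_silent_wav(seconds=1, rate=16000):
    """Create an in-memory mono 16-bit WAV of silence, used here only for the demo."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * rate * seconds)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(describe_wav(make_silent_wav()))
```

If wave.open raises wave.Error, the file is not plain PCM WAV and may need converting before use as reference audio.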