
Fish Agent: end-to-end AI voice cloning assistant, real-time voice conversation assistant, Fish Speech spin-off project

General Introduction

Fish Agent, a project derived from Fish Speech, is an end-to-end AI voice cloning system built on the V0.1 3B model architecture. Its defining feature is a semantic-token-free architecture: it does not rely on traditional semantic encoders/decoders such as Whisper and converts speech to speech directly. With latency as low as 150 ms, the system can capture and generate ambient audio information and achieve near-real-time voice cloning. Pre-trained models are available for download, and Fish Agent supports local deployment and training as well as cloud service calls, giving developers and users flexible ways to use it. With integrated speech recognition and speech synthesis, plus precise tone control, Fish Agent aims to deliver natural, fluid voice interaction.

Key characteristics: an end-to-end architecture, zero-shot voice cloning, a compact model with 3 billion parameters, multilingual support, and fast response. The training data includes 700,000 hours of multilingual audio, and the model was produced by continued pre-training of Qwen-2.5-3B-Instruct. The model, named Fish Agent V0.1 3B, integrates ASR and TTS internally, so no external models are needed. This true end-to-end processing distinguishes it from the traditional three-stage pipeline (ASR + LLM + TTS).


Try it online: https://huggingface.co/spaces/fishaudio/fish-agent

 

Feature List

  • Ultra-low-latency voice cloning: 150 ms response time, with support for real-time voice conversion
  • Semantic-token-free architecture: an end-to-end speech processing design with no separate semantic encoder/decoder
  • Precise tone control: fine-grained tone adjustment via reference audio
  • Ambient audio processing: high-fidelity reproduction of environmental sound information
  • Open pre-trained models: support for local deployment and training
  • Cloud service API: convenient access through cloud interface calls
  • Personalized training: support for training custom voice models

 

Usage Guide

1. System requirements

  • Python 3.8 or higher
  • NVIDIA GPU (recommended)
  • 8GB or more of system memory
  • CUDA support (recommended)
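
The requirements above can be checked with a short script before installing anything. The following is a minimal sketch using only the Python standard library; the function name `check_environment` is illustrative and not part of Fish Agent, and finding `nvidia-smi` on the PATH is only a hint that an NVIDIA driver is installed, not proof that CUDA works.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)  # minimum version taken from the requirements list above


def check_environment(version_info=sys.version_info, which=shutil.which):
    """Return a dict describing whether the basic requirements appear to be met."""
    python_ok = tuple(version_info[:2]) >= MIN_PYTHON
    # nvidia-smi ships with the NVIDIA driver; its presence on PATH suggests
    # a GPU driver is installed, but does not guarantee a working CUDA setup.
    gpu_driver_found = which("nvidia-smi") is not None
    return {"python_ok": python_ok, "gpu_driver_found": gpu_driver_found}


if __name__ == "__main__":
    print(check_environment())
```

Since the GPU is only recommended, a missing driver would mean slower inference rather than a failed installation.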

2. Installation steps

  1. Prepare the environment
# Create a virtual environment
python -m venv fish-agent-env
source fish-agent-env/bin/activate  # Linux/Mac
# or
fish-agent-env\Scripts\activate  # Windows
  2. Install Fish Agent
# Install directly
pip install fish-agent
# or install from source
git clone https://github.com/fishaudio/fish-agent
cd fish-agent
pip install -e .

3. Usage

3.1 Using the online service

You can now try the agent demo online for live English chat, or follow the documentation to run English and Chinese chat locally.

The demo is an early alpha version: inference speed still needs to be optimized, and there are many bugs to fix. If you find a bug or want to fix one, issues and pull requests are welcome.

https://fish.audio/zh-CN/demo/live/

 

3.2 Local deployment

  1. Start the service
from fish_agent import VoiceAgent
# Initialize Fish Agent
agent = VoiceAgent()
# Start the local service
agent.start_server(port=7860)
  2. Voice cloning example
# Load the reference audio
reference_audio = "path/to/reference.wav"
agent.load_reference(reference_audio)
# Generate the cloned speech
text = "This is a test utterance."
output_path = "output.wav"
agent.generate_speech(text, output_path)
  3. Real-time conversion settings
# Start real-time voice conversion
agent.start_realtime_conversion(
    input_device=0,   # input device ID
    output_device=1,  # output device ID
    reference_audio="path/to/reference.wav",
)

4. Advanced feature configuration

4.1 Tone Parameter Adjustment

  • Tone control parameters:
    • Pitch: -12 to 12
    • Speech speed: 0.5 to 2.0
    • Emotion intensity (emotion_intensity): 0 to 1.0
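
Values outside these ranges are easy to pass by accident, so it can help to validate them before calling the agent. The following is a minimal sketch: the class `ToneSettings`, its field names, and the default values are assumptions made for illustration and are not a confirmed part of the Fish Agent API; only the ranges come from the list above.

```python
from dataclasses import dataclass

# Ranges copied from the list above; the key names are illustrative only.
RANGES = {
    "pitch": (-12.0, 12.0),
    "speed": (0.5, 2.0),
    "emotion_intensity": (0.0, 1.0),
}


@dataclass
class ToneSettings:
    # Defaults are assumed neutral values, not documented defaults.
    pitch: float = 0.0
    speed: float = 1.0
    emotion_intensity: float = 0.5

    def __post_init__(self):
        # Reject any value that falls outside its documented range.
        for name, (low, high) in RANGES.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")
```

Raising an error rather than silently clamping makes a mistyped value visible immediately; clamping would be a reasonable alternative for interactive sliders.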

4.2 Batch processing

# Batch text processing
texts = ["Text 1", "Text 2", "Text 3"]
agent.batch_process(texts, output_dir="outputs/")

4.3 API calls

# API call example
import requests
url = "https://speech.fish.audio/api/v1/generate"
payload = {
    "text": "Text to convert",
    "reference_audio": "base64-encoded audio file",
}
response = requests.post(url, json=payload)
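
The `reference_audio` field in the example above is described as a base64-encoded audio file. A minimal sketch of building such a payload with the standard library follows; the helper name `build_payload` is hypothetical, and the assumption that the service accepts the raw file bytes encoded as base64 comes from the example rather than from official API documentation.

```python
import base64
from pathlib import Path


def build_payload(text, reference_path):
    """Build a JSON-serializable body with the reference audio as base64 text."""
    audio_bytes = Path(reference_path).read_bytes()
    # b64encode returns bytes; decode to str so the value can be sent as JSON.
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {"text": text, "reference_audio": encoded}
```

The returned dictionary could then be passed as `json=payload` in the `requests.post` call shown above.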

5. Usage notes

  • Reference audio quality strongly affects cloning results; use clear recordings without background noise
  • Limit the text to about 200 words per request
  • Real-time conversion works best with a good-quality microphone
  • Commercial use requires specific authorization
  • Update the model regularly for the best performance
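
To stay within the 200-word recommendation above, longer text can be split into chunks before processing. The following is a minimal sketch; `split_text` is an illustrative helper, not a Fish Agent function, and it counts whitespace-separated words, so it would not work as-is for languages written without spaces, such as Chinese.

```python
def split_text(text, max_words=200):
    """Split text into chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]
```

The resulting list could be passed to the batch processing call from section 4.2. Splitting on sentence boundaries instead of fixed word counts would likely sound more natural, at the cost of a little more code.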

6. Troubleshooting

  1. Audio output issues
    • Check the audio output device settings
    • Verify the system volume configuration
    • Confirm that the audio format is supported
  2. Performance optimization
    • Verify that the GPU is properly enabled
    • Adjust the batch parameters
    • Clear the cache regularly
  3. Installation issues
    • Verify Python version compatibility
    • Confirm the CUDA environment configuration
    • Consider using a conda environment
  4. API usage
    • Check the network connection
    • Confirm the API permission configuration
    • Verify the server response
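
For the API checks above, the HTTP status code of the response is usually the quickest clue. The following is a minimal sketch that maps standard HTTP status classes to a troubleshooting hint; `diagnose_status` is an illustrative helper, and the hints reflect general HTTP semantics rather than documented behavior of the Fish Audio service.

```python
def diagnose_status(status_code):
    """Map an HTTP status code to a rough troubleshooting hint."""
    if 200 <= status_code < 300:
        return "ok"
    if status_code in (401, 403):
        # Unauthorized / Forbidden: credentials or permissions are the likely cause.
        return "check API key and permissions"
    if status_code == 429:
        return "rate limited; retry later"
    if 500 <= status_code < 600:
        return "server-side error; retry or check service status"
    return "unexpected response; inspect the response body"
```

With the `requests` example from section 4.3, this could be called as `diagnose_status(response.status_code)`.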
May not be reproduced without permission: Chief AI Sharing Circle » Fish Agent: end-to-end AI voice cloning assistant, real-time voice conversation assistant, Fish Speech spin-off project