Fun-Audio-Chat-8B - Alibaba Tongyi's Open-Source End-to-End Large Speech Interaction Model

Latest AI Resources | Published 4 months ago by AI Sharing Circle

What is Fun-Audio-Chat-8B?

Fun-Audio-Chat-8B is an open-source, 8-billion-parameter end-to-end large speech model from Alibaba's Tongyi team. It takes speech in and produces speech out directly, with no need to chain ASR, LLM, and TTS components, and it offers fluent Chinese and English, low latency, and a natural timbre. A dual-resolution design pairs a shared LLM running at 5Hz with high-fidelity speech decoding at 25Hz, cutting GPU overhead by nearly half. Core-Cocktail two-stage training first injects speech capability and then merges in the parameters of the text model to suppress catastrophic forgetting, and multi-task preference alignment lets the model pick up on emotion and follow instructions. The model ranks first among models of the same size on more than ten authoritative benchmarks, including OpenAudioBench and VoiceBench. It can be deployed for voice chat, emotional companionship, smart devices, or customer service, and inference runs on 24GB of GPU memory. The code and weights are available on ModelScope, HuggingFace, and GitHub.

Features of Fun-Audio-Chat-8B

End-to-end S2S architecture: Generates speech output directly from speech input without chaining ASR, LLM, and TTS, for higher efficiency and lower latency.
Dual-resolution design: The shared LLM layers process speech efficiently at a 5Hz frame rate, while the SRH (Speech Refined Head) generates high-quality speech at a 25Hz frame rate, reducing GPU compute overhead by nearly 50%.
Core-Cocktail two-stage training strategy: Speech and multimodal capabilities are introduced in stages, and the result is then merged with the parameters of the original text model and fine-tuned, which mitigates catastrophic forgetting.
Multi-stage, multi-task preference alignment training: Enables the model to capture semantic and emotional cues more accurately in real spoken conversations and makes the dialog more natural.
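To make the two frame rates in the dual-resolution design concrete, here is a minimal sketch of the arithmetic only. It assumes that five consecutive 25Hz speech frames correspond to one 5Hz step of the shared LLM; the function names and the placeholder token ids are hypothetical and are not taken from the project's code. It illustrates the frame-count relationship, not the measured GPU saving.

```python
# Illustrative arithmetic for the 5Hz / 25Hz dual-resolution design.
# All names here are hypothetical; this is not the project's implementation.

LLM_FRAME_RATE_HZ = 5       # frame rate of the shared LLM (from the article)
SPEECH_FRAME_RATE_HZ = 25   # frame rate of the speech decoding head (from the article)


def frame_count(duration_s: int, rate_hz: int) -> int:
    """Number of frames needed to cover duration_s seconds at rate_hz."""
    return duration_s * rate_hz


def group_frames(tokens: list[int], group_size: int) -> list[list[int]]:
    """Split a token sequence into consecutive groups of group_size."""
    return [tokens[i:i + group_size] for i in range(0, len(tokens), group_size)]


duration_s = 10
# Placeholder ids standing in for 25Hz speech tokens.
speech_tokens = list(range(frame_count(duration_s, SPEECH_FRAME_RATE_HZ)))
# Assumption: one LLM step covers SPEECH_FRAME_RATE_HZ // LLM_FRAME_RATE_HZ frames.
groups = group_frames(speech_tokens, SPEECH_FRAME_RATE_HZ // LLM_FRAME_RATE_HZ)

print(len(speech_tokens))  # 250 frames at 25Hz
print(len(groups))         # 50 steps at 5Hz
```

Under this assumption the shared LLM sees one fifth as many positions as the 25Hz stream. The overall saving the article reports is smaller (nearly 50%), presumably because the 25Hz decoding stage and the rest of the pipeline still have to run.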

Core Advantages of Fun-Audio-Chat-8B

End-to-end S2S: Speech in, speech out, with no ASR+LLM+TTS chaining, which halves latency.
8 billion parameters, bilingual: Ranks first among same-size models on more than ten benchmarks, with accurate understanding, speaking, and emotion perception.
Dual-resolution architecture: A 5Hz shared LLM plus 25Hz high-fidelity decoding saves about half of the GPU compute.
Core-Cocktail training: Injects speech capability first, then merges in the text parameters to suppress catastrophic forgetting.
Multi-task preference alignment: Picks up on emotion, changes speaking style on command, and markedly improves the naturalness of conversations.
Fully open source: Code and weights are published on ModelScope, HuggingFace, and GitHub. Inference runs on 24GB of GPU memory, and voice chat, emotional companionship, smart device, customer service, and similar scenarios can be deployed in about ten minutes.
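The Core-Cocktail item above describes merging the speech-trained model with the original text model's parameters, but the article does not give the formula. The sketch below assumes a simple linear interpolation with a hypothetical weight `alpha`; the function name, the dictionary layout, and the interpolation itself are illustrative assumptions, not the team's published method.

```python
# Hypothetical parameter merge: linear interpolation between two checkpoints.
# The article only says the text model's parameters are fused back in;
# the formula and the alpha weight below are assumptions for illustration.

def merge_parameters(
    speech_params: dict[str, list[float]],
    text_params: dict[str, list[float]],
    alpha: float,
) -> dict[str, list[float]]:
    """Return alpha * speech + (1 - alpha) * text for every parameter."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if speech_params.keys() != text_params.keys():
        raise ValueError("both checkpoints must have the same parameter names")

    merged = {}
    for name, speech_values in speech_params.items():
        text_values = text_params[name]
        if len(speech_values) != len(text_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            alpha * s + (1.0 - alpha) * t
            for s, t in zip(speech_values, text_values)
        ]
    return merged


# Toy checkpoints with a single two-element parameter.
speech_ckpt = {"layer.weight": [1.0, 3.0]}
text_ckpt = {"layer.weight": [3.0, 1.0]}
print(merge_parameters(speech_ckpt, text_ckpt, alpha=0.5))
```

With `alpha=1.0` the result is the speech-trained checkpoint unchanged, and with `alpha=0.0` it is the original text model, so the weight controls how much text-only behavior is retained.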

What is the official website for Fun-Audio-Chat-8B?

Project website: https://funaudiollm.github.io/funaudiochat/
GitHub repository: https://github.com/FunAudioLLM/Fun-Audio-Chat
HuggingFace model: https://huggingface.co/FunAudioLLM/Fun-Audio-Chat-8B
Technical report: https://github.com/FunAudioLLM/Fun-Audio-Chat/blob/main/Fun-Audio-Chat-Technical-Report.pdf
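As a starting point for trying the model, the weights can be fetched from the HuggingFace repository listed above. This is a minimal sketch, assuming the `huggingface_hub` package is installed; the helper names and the local directory are hypothetical, and the project's own README remains the reference for the supported setup and inference steps.

```python
# Sketch: fetch the published weights with huggingface_hub (assumed installed).
# REPO_ID comes from the HuggingFace URL in the article; the helper names
# and the target directory are illustrative choices.

REPO_ID = "FunAudioLLM/Fun-Audio-Chat-8B"


def model_page_url(repo_id: str = REPO_ID) -> str:
    """Build the model page URL for a HuggingFace repository id."""
    return f"https://huggingface.co/{repo_id}"


def download_weights(local_dir: str) -> str:
    """Download every file in the repository and return the local path."""
    # Imported lazily so the rest of the module works without the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
```

Calling `download_weights("./Fun-Audio-Chat-8B")` downloads the full repository, so check free disk space first. Loading the model and running inference are not covered here.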

People for whom Fun-Audio-Chat-8B is intended

Smart hardware makers: Quickly add low-latency, intelligent voice dialog to speakers, headphones, in-car systems, and home appliances.
Social and emotional companionship startups: Build applications such as AI chat, virtual partners, and emotional support assistants with natural timbre and emotion perception.
Customer service and call centers: Replace the traditional ASR+TTS pipeline with end-to-end voice Q&A and reduce deployment and operations costs.
Education and language learning platforms: Offer real-time bilingual pronunciation assessment, conversation practice, and pronunciation correction for a more interactive experience.
Accessibility developers: Create fluent voice interaction tools for people with visual impairments or dyslexia to improve access to information.
Researchers and algorithm engineers: Explore the frontier of large speech models using the open-source weights and complete training code, with a low barrier to follow-on work.