SpeechGPT 2.0-preview: an end-to-end, human-like large spoken dialogue model for real-time interaction



General Introduction

SpeechGPT 2.0-preview is the first human-like real-time interaction system released by OpenMOSS, trained on millions of hours of speech data. It produces natural, colloquial speech with response latency in the hundred-millisecond range, and it supports smooth real-time interruption and turn-taking. The model aligns the speech and text modalities, which lets it precisely control and intelligently switch between multiple emotions, styles and timbres. It can imitate the tone of voice and emotional state of different characters, and it has a range of voice talents such as poetry recitation, storytelling and speaking in dialect. SpeechGPT 2.0-preview also supports tool calls, web search and external knowledge bases, so it combines expressive speech with text-model capabilities.

Demo address: https://sp2.open-moss.com/

Feature List

Human-like, colloquial spoken expression
Low-latency response in the hundred-millisecond range
Multi-emotion, multi-style, multi-timbre control
Role-playing ability
Voice talents such as poetry recitation, storytelling and speaking in dialect
Support for tool calls, web search and external knowledge bases
Efficient speech data crawling system
Versatile and efficient speech data cleaning pipeline
Comprehensive, multi-granularity speech data annotation system
Ultra-low-bitrate streaming speech codec with joint semantic-acoustic modeling

Usage Guide

Installation

Clone the repository:

   git clone https://github.com/OpenMOSS/SpeechGPT-2.0-preview.git
   cd SpeechGPT-2.0-preview

Download the model weights (requires git-lfs to be installed):

   git lfs install
   git clone https://huggingface.co/fnlp/SpeechGPT-2.0-preview-Codec
   git clone https://huggingface.co/fnlp/SpeechGPT-2.0-preview-7B
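
If git-lfs was not active when the repositories were cloned, large files arrive as small text pointer stubs instead of real weights. The following is a minimal sketch for spotting that situation; the helper names `is_lfs_pointer` and `find_pointer_stubs` are illustrative and are not part of the SpeechGPT repository.

```python
from pathlib import Path

# Git LFS pointer files are small text stubs whose first line is this spec URL.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file looks like a Git LFS pointer stub, not real data."""
    with open(path, "rb") as handle:
        head = handle.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


def find_pointer_stubs(directory):
    """List files under `directory` (skipping .git) that are still pointer stubs."""
    root = Path(directory)
    stubs = []
    for path in root.rglob("*"):
        relative = path.relative_to(root)
        if path.is_file() and ".git" not in relative.parts and is_lfs_pointer(path):
            stubs.append(str(relative))
    return sorted(stubs)
```

Calling `find_pointer_stubs("SpeechGPT-2.0-preview-7B")` from the directory that holds the clone should return an empty list once the weights have downloaded correctly; any file names it returns can usually be fetched by running `git lfs pull` inside that clone.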

Prepare the environment:

   pip3 install -r requirements.txt
   pip3 install flash-attn==2.7.3 --no-build-isolation

Launch the web demo:

   python3 demo_gradio.py --codec_ckpt_path SpeechGPT-2.0-preview-Codec/sg2_codec_ckpt.pkl --model_path SpeechGPT-2.0-preview-7B/
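
The launch command assumes it is run from the repository root with both model repositories cloned there. As a hedged sketch, the wrapper below checks that the expected paths exist before starting the demo. The script name, flags and paths are taken from the command above; the function names `missing_paths`, `build_demo_command` and `launch_demo` are illustrative and not part of the project.

```python
import subprocess
import sys
from pathlib import Path

# Paths as they appear in the launch command above.
CODEC_CKPT = "SpeechGPT-2.0-preview-Codec/sg2_codec_ckpt.pkl"
MODEL_DIR = "SpeechGPT-2.0-preview-7B/"


def missing_paths(paths, root="."):
    """Return the entries of `paths` that do not exist under `root`."""
    return [p for p in paths if not (Path(root) / p).exists()]


def build_demo_command(codec_ckpt=CODEC_CKPT, model_dir=MODEL_DIR, python="python3"):
    """Assemble the demo launch command as an argument list."""
    return [
        python, "demo_gradio.py",
        "--codec_ckpt_path", codec_ckpt,
        "--model_path", model_dir,
    ]


def launch_demo():
    """Check prerequisites, then start the Gradio demo (blocks until it exits)."""
    missing = missing_paths(["demo_gradio.py", CODEC_CKPT, MODEL_DIR])
    if missing:
        sys.exit("Missing required paths: " + ", ".join(missing))
    subprocess.run(build_demo_command(), check=True)
```

Calling `launch_demo()` from the repository root runs the same command as above, but stops with a readable message if a checkpoint or the demo script is not where the command expects it.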

Feature overview

Human-like, colloquial expression: SpeechGPT 2.0-preview imitates the way people actually speak, giving a natural and fluent conversation experience.
Low-latency response: The system responds to user input within the hundred-millisecond range, which makes real-time interaction possible.
Multi-emotion, multi-style, multi-timbre control: Users can set the emotion, style and timbre of the system's voice through spoken or typed instructions to suit different conversational scenarios.
Role-playing: The system can imitate the tone of voice and emotional state of different characters.
Voice talents: SpeechGPT 2.0-preview can recite poetry, tell stories and speak in dialect.
Tool calls and web search: The system can call external tools and search the web, extending what a conversation can do and what information it can reach.
External knowledge base: By connecting to an external knowledge base, the system can give more detailed and specialized answers.

Usage examples

Emotion control: Give the instruction "Tell a joke in a happy tone" and the system tells a joke in a happy tone.
Role-playing: Give the instruction "Explain quadratic functions in a teacher's tone of voice" and the system explains them the way a teacher would.
Voice talents: Give the instruction "Tell a story in dialect" and the system tells a story in the requested dialect.

With the installation steps and example instructions above, users can try out the main features of SpeechGPT 2.0-preview across a range of scenarios.