General Introduction
SpeechGPT 2.0-preview is the first human-like real-time interaction system released by OpenMOSS, trained on millions of hours of speech data. It combines human-like spoken expression with response latency on the order of 100 ms, which allows natural, fluid real-time interaction, including interruptions. The model aligns the speech and text modalities and can precisely control and intelligently switch among multiple emotions, styles, and timbres. It can imitate the tone and emotional state of various characters, and it has voice talents such as poetry recitation, storytelling, and speaking in dialects. SpeechGPT 2.0-preview also supports tool invocation, web search, and external (plug-in) knowledge bases, combining expressive speech with text capabilities.
Feature List
- Human-like, colloquial spoken expression
- Low-latency response on the order of 100 ms
- Multi-emotion, multi-style, and multi-timbre control
- Role-playing
- Voice talents such as poetry recitation, storytelling, and speaking in dialects
- Support for tool calls, web search, and plug-in knowledge bases
- Efficient speech data crawling system
- Versatile and efficient speech data cleaning pipeline
- Comprehensive, multi-granularity speech data annotation system
- Joint semantic-acoustic modeling for an ultra-low-bitrate streaming speech codec
Usage Guide
Installation process
- Clone the repository:
git clone https://github.com/OpenMOSS/SpeechGPT-2.0-preview.git
cd SpeechGPT-2.0-preview
- Download the model weights (requires git-lfs to be installed):
git lfs install
git clone https://huggingface.co/fnlp/SpeechGPT-2.0-preview-Codec
git clone https://huggingface.co/fnlp/SpeechGPT-2.0-preview-7B
- Prepare the environment:
pip3 install -r requirements.txt
pip3 install flash-attn==2.7.3 --no-build-isolation
- Launch the web demo:
python3 demo_gradio.py --codec_ckpt_path SpeechGPT-2.0-preview-Codec/sg2_codec_ckpt.pkl --model_path SpeechGPT-2.0-preview-7B/
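Before launching, it can help to confirm that both weight downloads landed where the launch command expects them. The following is a minimal sketch, not part of the repository: the helper names `missing_paths` and `launch_command` are hypothetical, and the only details taken from the steps above are the two paths and the two flags of `demo_gradio.py`.

```python
from pathlib import Path
import shlex

# Paths copied from the launch command above; adjust if you cloned elsewhere.
CODEC_CKPT = Path("SpeechGPT-2.0-preview-Codec/sg2_codec_ckpt.pkl")
MODEL_DIR = Path("SpeechGPT-2.0-preview-7B")


def missing_paths(codec_ckpt: Path = CODEC_CKPT, model_dir: Path = MODEL_DIR) -> list:
    """Return readable problems; an empty list means both paths exist."""
    problems = []
    if not codec_ckpt.is_file():
        problems.append(f"codec checkpoint not found: {codec_ckpt}")
    if not model_dir.is_dir():
        problems.append(f"model directory not found: {model_dir}")
    return problems


def launch_command(codec_ckpt: Path = CODEC_CKPT, model_dir: Path = MODEL_DIR) -> str:
    """Build the demo command using the same two flags as the step above."""
    args = [
        "python3", "demo_gradio.py",
        "--codec_ckpt_path", str(codec_ckpt),
        "--model_path", str(model_dir),
    ]
    return shlex.join(args)


if __name__ == "__main__":
    problems = missing_paths()
    # Print what is missing, or the command to run if everything is in place.
    print("\n".join(problems) if problems else launch_command())
```

Run it from inside the cloned `SpeechGPT-2.0-preview` directory. A missing checkpoint often means `git lfs install` was skipped before cloning the weights, so only small pointer files were fetched; note that this sketch checks for existence only, not file size.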
Feature walkthrough
- Human-like spoken expression: SpeechGPT 2.0-preview imitates the way people actually speak, giving a natural and fluid conversation experience.
- Low-latency response: The system responds to user input within roughly a hundred milliseconds, enabling real-time interaction.
- Multi-emotion, multi-style, multi-timbre control: Users can set the emotion, style, and timbre of the system's voice through spoken or written instructions, adapting it to different conversational scenarios.
- Role-playing: The system can imitate the tone of voice and emotional state of different characters, which suits a variety of application scenarios.
- Voice talents: SpeechGPT 2.0-preview enriches conversations with voice talents such as poetry recitation, storytelling, and speaking in dialects.
- Tool calls and web search: The system can call external tools and run web searches, extending what a conversation can do and what information it can reach.
- Plug-in knowledge base: By connecting to an external knowledge base, the system can give more detailed and specialized answers.
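The low-latency claim above can be checked on your own hardware by timing how long a streamed reply takes to produce its first audio chunk. The sketch below is generic and assumes only that the reply is available as some Python iterable of byte chunks; `first_chunk_latency_ms` and `fake_stream` are hypothetical names, and nothing here reflects the project's actual streaming interface.

```python
import time


def first_chunk_latency_ms(stream, clock=time.perf_counter):
    """Return (latency in ms, first chunk) for any iterable of audio chunks."""
    start = clock()
    iterator = iter(stream)
    first = next(iterator)  # blocks until the first chunk is produced
    return (clock() - start) * 1000.0, first


def fake_stream(delay_s=0.05, chunk_bytes=320):
    """Stand-in for a real reply stream: waits, then yields silent chunks."""
    time.sleep(delay_s)
    yield b"\x00" * chunk_bytes
    yield b"\x00" * chunk_bytes


if __name__ == "__main__":
    latency_ms, chunk = first_chunk_latency_ms(fake_stream())
    print(f"first chunk of {len(chunk)} bytes after {latency_ms:.1f} ms")
```

To measure a real system, replace `fake_stream()` with whatever iterable yields the reply audio, and start timing at the moment the user's speech ends so the number is comparable to the latency described above.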
Usage examples
- Emotion control: Give the instruction "Tell a joke in a happy tone" and the system tells the joke in a happy tone.
- Role-playing: Give the instruction "Explain quadratic functions in the tone of a teacher" and the system explains them in a teacher's voice.
- Voice talents: Give the instruction "Tell a story in dialect", naming the dialect you want, and the system tells a story in that dialect.
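The instructions in these examples are ordinary natural-language requests, so they can be composed programmatically when testing many style combinations. The helper below is a hypothetical sketch: `build_instruction` and its parameters are not part of SpeechGPT 2.0-preview, and the source does not say whether the model responds better to this phrasing than to any other.

```python
from typing import Optional


def build_instruction(task: str,
                      emotion: Optional[str] = None,
                      role: Optional[str] = None,
                      dialect: Optional[str] = None) -> str:
    """Compose a plain-English instruction from optional style controls."""
    parts = [task.strip().rstrip(".")]
    if role:
        parts.append(f"in the voice of {role}")
    if emotion:
        parts.append(f"in a {emotion} tone")
    if dialect:
        parts.append(f"in {dialect} dialect")
    return " ".join(parts) + "."


if __name__ == "__main__":
    print(build_instruction("Tell a joke", emotion="happy"))
    print(build_instruction("Explain quadratic functions", role="a teacher"))
```

The resulting string is meant to be typed or spoken into the demo as-is; it is only a convenience for generating consistent prompts.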
With the steps and examples above, users can try out the features of SpeechGPT 2.0-preview across a range of application scenarios.