Ultravox: an audio multimodal large model for real-time, end-to-end voice dialog, an open-source take on GPT-4o-style voice interaction

Latest AI Resources · released 8 months ago · AI Sharing Circle

General Introduction

Ultravox is a multimodal large language model (LLM) designed for real-time speech processing. Unlike traditional voice pipelines, it does not need a separate automatic speech recognition (ASR) stage: audio is converted directly into the high-dimensional embedding space that the LLM consumes. Skipping the intermediate transcription step gives Ultravox an advantage in responsiveness and processing efficiency. Versions have been trained on top of Llama 3, Mistral, and Gemma. Ultravox understands both text and human speech, and future versions are expected to understand timing and emotional cues in speech natively. The current version has a time to first token of about 150 milliseconds when processing audio and then generates about 60 tokens per second.


Feature List

Real-time speech processing: takes audio input and produces text output directly, without a separate ASR stage.
Multimodal support: understands both text and speech; future versions are expected to support emotional and timing cues.
Efficient response: time to first token is about 150 milliseconds, followed by about 60 tokens per second.
Compatible with a range of base models: versions are trained on top of models such as Llama 3, Mistral, and Gemma.
Open source project: code and model weights are available on GitHub and Hugging Face.
Demo and API: a Gradio demo and a hosted API are provided so users can get started quickly.

Usage Guide

Installation process

Environment setup:
- Mac users are advised to install the tooling with Homebrew. If Homebrew is not installed yet, run:
```
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```
- Update Homebrew and install the `just` command runner:
```
brew update
brew install just
```
Cloning the project:
- Clone the Ultravox repository and change into its directory:
```
git clone https://github.com/fixie-ai/ultravox.git
cd ultravox
```
Installing dependencies:
- From the repository root, install the project dependencies with `pip install -r requirements.txt`. If the repository layout has changed, follow the install steps in the project's README.
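
Before running the install steps above, it can help to confirm that the required command-line tools are actually on the `PATH`. The following is a minimal sketch; the helper name `check_tools` and the list of tools checked are assumptions for illustration and are not part of the Ultravox repository.

```shell
#!/usr/bin/env bash
# Illustrative helper (not part of Ultravox): report which tools are on PATH.

check_tools() {
  # Prints "ok: <tool>" or "missing: <tool>" for each argument.
  # The return status is the number of missing tools (0 means all found).
  local missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok: $tool"
    else
      echo "missing: $tool"
      missing=$((missing + 1))
    fi
  done
  return "$missing"
}

# Tools used by the steps in this guide (assumed list).
check_tools git just python3 || echo "Install the missing tools before continuing."
```

Using the return status as a count keeps the helper usable both interactively and inside a larger setup script.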

Usage Process

Running the demo:
- Ultravox provides a Gradio demo, which can be started locally with the following command:
```
gradio --voice_mode=True
```
- Open the local URL printed in the terminal to try Ultravox's real-time voice processing.
Using the API:
- Ultravox offers a hosted API. To get access:
  - Visit Ultravox's API page, register, and obtain an API key.
  - Call Ultravox's real-time voice processing service with that API key.
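
The two API steps above can be sketched roughly as follows. This only illustrates the general shape of a key-authenticated request: the endpoint URL, the header name, the JSON field, and the environment variable names are all placeholders (assumptions), not Ultravox's documented API, so check the official API page for the real values.

```shell
#!/usr/bin/env bash
# Sketch of calling a hosted API with a key. The URL, header name, and JSON
# field below are placeholders (assumptions), not Ultravox's documented API.

API_URL="${ULTRAVOX_API_URL:-https://api.example.com/v1/calls}"  # hypothetical endpoint
API_KEY_HEADER="X-API-Key"                                       # assumed header name

build_payload() {
  # $1: system prompt. Emits a minimal JSON body. No escaping is performed,
  # so keep the prompt free of quotes and backslashes in this sketch.
  printf '{"systemPrompt": "%s"}' "$1"
}

call_api() {
  # Fails early when the key obtained during registration is not exported.
  if [ -z "${ULTRAVOX_API_KEY:-}" ]; then
    echo "ULTRAVOX_API_KEY is not set" >&2
    return 1
  fi
  curl -sS -X POST "$API_URL" \
    -H "$API_KEY_HEADER: $ULTRAVOX_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$(build_payload "$1")"
}

# Example (performs a network request, so it is left commented out):
# call_api "You are a helpful voice assistant."
```

Reading the key from an environment variable keeps it out of shell history and out of version control.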
Training customized models:
- Users can train their own Ultravox models as needed. Detailed training steps and configuration files are described in the project's README.

Main feature workflows

Real-time speech processing:
- Record or upload an audio file, and Ultravox converts the audio to text automatically.
- Streaming is supported, so users can watch the output appear in real time.
Multimodal support:
- Enter text or speech; Ultravox understands and processes both forms of input.
- Future versions are expected to understand emotional and timing cues natively.
Efficient response:
- When processing audio, Ultravox has a time to first token of approximately 150 milliseconds and then generates approximately 60 tokens per second, which keeps real-time responses fast.
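
To get a feel for what these figures mean in practice, the time to produce a complete response can be estimated as the time to first token plus the number of tokens divided by the generation rate. The sketch below uses that simplified model. The function name `estimate_latency_ms` is hypothetical, and real latency also depends on hardware, network, and model size.

```shell
#!/usr/bin/env bash
# Rough latency model (an assumption): total_ms = ttft_ms + tokens / rate * 1000

estimate_latency_ms() {
  # $1: number of tokens in the response
  # $2: time to first token in ms (defaults to the quoted 150)
  # $3: generation rate in tokens per second (defaults to the quoted 60)
  awk -v n="$1" -v ttft="${2:-150}" -v rate="${3:-60}" \
    'BEGIN { printf "%.0f\n", ttft + (n / rate) * 1000 }'
}

estimate_latency_ms 30    # 150 + 30/60 * 1000  = 650
estimate_latency_ms 120   # 150 + 120/60 * 1000 = 2150
```

Under this model a short 30-token reply completes in well under a second, while the first text appears after roughly 150 milliseconds regardless of reply length.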