Hibiki: a real-time speech translation model, streaming translation that preserves the characteristics of the original voice

General Introduction

Hibiki is a high-fidelity real-time speech translation model developed by Kyutai Labs. Unlike traditional offline translators, which wait for the speaker to finish, Hibiki produces natural speech in the target language, together with a text translation, while the user is still speaking. The model uses a multi-stream architecture that processes the incoming source speech and generates the target speech at the same time, which keeps the translation coherent and accurate. Hibiki is trained with supervision on aligned source and target speech and text, and uses synthetic data generation to reach high translation quality despite limited real-world data.

Hibiki relies on supervised training with aligned source speech and target speech and text from the same speaker. Because little such data exists, the training data is generated synthetically. Source and target transcripts are matched at the word level with a weakly supervised contextual alignment method built on the off-the-shelf MADLAD machine translation system. The resulting alignment rule (a word should appear in the target only once it can be predicted from the source) is enforced either by inserting silences or by synthesizing the target speech with a voice-controlled, alignment-aware TTS.
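The alignment rule can be illustrated with a toy scheduler. This is only a sketch under stated assumptions, not Kyutai's data pipeline: the names TimedWord, schedule_target and depends_on are hypothetical, the word timings are made up, and the dependency indices are written by hand here, whereas the article says they are derived with MADLAD.

```python
from dataclasses import dataclass


@dataclass
class TimedWord:
    text: str
    start: float  # seconds
    end: float  # seconds


def schedule_target(source, target_words, depends_on, word_duration=0.25):
    """Place each target word no earlier than the end of the source word it depends on.

    depends_on[j] is the index of the last source word needed to predict
    target word j (hand-written here, an assumption of this sketch).
    """
    if len(target_words) != len(depends_on):
        raise ValueError("need one dependency index per target word")
    schedule = []
    cursor = 0.0  # end time of the previously scheduled target word
    for word, src_idx in zip(target_words, depends_on):
        earliest = source[src_idx].end
        start = max(cursor, earliest)  # any gap before `start` is silence
        schedule.append((word, start, start + word_duration))
        cursor = start + word_duration
    return schedule


# Made-up timings for "je mange une pomme rouge"
source = [
    TimedWord("je", 0.0, 0.25),
    TimedWord("mange", 0.25, 0.75),
    TimedWord("une", 0.75, 1.0),
    TimedWord("pomme", 1.0, 1.5),
    TimedWord("rouge", 1.5, 2.0),
]
target = ["I", "eat", "a", "red", "apple"]
depends_on = [0, 1, 2, 4, 4]  # "red" must wait for "rouge", the last source word

for word, start, end in schedule_target(source, target, depends_on):
    print(f"{word:>5}  {start:.2f}s to {end:.2f}s")
```

In this example "red" cannot be spoken before "rouge" has been heard, so the scheduler leaves a silence between "a" and "red". That corresponds to the silence-insertion option described above; the alignment-aware TTS option is not modeled here.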

Function List

Real-time speech translation: Generates natural speech in the target language while the user is still speaking.
Text translation: Provides a text translation synchronized with the speech.
Multi-stream architecture: Processes the input speech stream and generates the target speech simultaneously, keeping the translation coherent and accurate.
High fidelity: Achieves high translation quality through supervised training and synthetic data generation.
Voice transfer: Optional voice transfer for a more natural-sounding translated voice.

Usage Guide

Installation process

PyTorch

Install the moshi package:
```
pip install -U moshi
```

Download the example file:

wget https://github.com/kyutai-labs/moshi/raw/refs/heads/main/data/sample_fr_hibiki_crepes.mp3

Run the translation:
```
python -m moshi.run_inference sample_fr_hibiki_crepes.mp3 out_en.wav --hf-repo kyutai/hibiki-1b-pytorch-bf16
```
- Optional parameter --cfg-coef: the default value is 1. Higher values bring the generated voice closer to the original speaker's voice; a value of 3 is recommended.
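For scripted use, the same invocation can be assembled in Python. This is a minimal sketch: the helper name build_inference_command is hypothetical, and it only uses the module name and the flags shown in this article (--hf-repo and --cfg-coef). The module argument is an assumption of the sketch that allows passing a different entry point with the same arguments.

```python
import shlex
import sys


def build_inference_command(input_audio, output_audio, hf_repo,
                            cfg_coef=None, module="moshi.run_inference"):
    """Return the argv list for one translation run (hypothetical helper)."""
    cmd = [sys.executable, "-m", module, input_audio, output_audio,
           "--hf-repo", hf_repo]
    if cfg_coef is not None:
        # Only pass the flag when a value is given, so the tool's default applies otherwise.
        cmd += ["--cfg-coef", str(cfg_coef)]
    return cmd


cmd = build_inference_command(
    "sample_fr_hibiki_crepes.mp3",
    "out_en.wav",
    "kyutai/hibiki-1b-pytorch-bf16",
    cfg_coef=3,
)
print(shlex.join(cmd))
# To actually run the translation:
#   import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids quoting problems when file names contain spaces.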

MLX

Install the moshi_mlx package (version 0.2.1 or later is required):
```
pip install -U moshi_mlx
```

Download the example file:

wget https://github.com/kyutai-labs/moshi/raw/refs/heads/main/data/sample_fr_hibiki_crepes.mp3

Run the translation:
```
python -m moshi_mlx.run_inference sample_fr_hibiki_crepes.mp3 out_en.wav --hf-repo kyutai/hibiki-1b-mlx-bf16
```
- Optional parameter --cfg-coef: the default value is 1. Higher values bring the generated voice closer to the original speaker's voice; a value of 3 is recommended.

MLX-Swift

The kyutai-labs/moshi-swift repository contains an MLX-Swift implementation that runs on iPhone and has been tested on an iPhone 16 Pro. Note that this code is still experimental.

Rust

Change into the hibiki-rs directory:
```
cd hibiki-rs
```

Download the example file:

wget https://github.com/kyutai-labs/moshi/raw/refs/heads/main/data/sample_fr_hibiki_crepes.mp3

Run the translation:
```
cargo run --features metal -r -- gen sample_fr_hibiki_crepes.mp3 out_en.wav
```
- Use --features cuda to run on an NVIDIA GPU, or --features metal to run on a Mac.
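The choice between the two feature flags can be automated in a wrapper script. The sketch below is only an illustration: the helper name cargo_command is hypothetical, and it assumes that any machine that is not a Mac has an NVIDIA GPU, which will not hold everywhere.

```python
import platform


def cargo_command(input_audio, output_audio, system=None):
    """Build the cargo argv list, choosing the backend feature from the OS name."""
    system = system or platform.system()
    # Assumption of this sketch: macOS uses Metal, everything else uses CUDA.
    feature = "metal" if system == "Darwin" else "cuda"
    return ["cargo", "run", "--features", feature, "-r", "--",
            "gen", input_audio, output_audio]


print(" ".join(cargo_command("sample_fr_hibiki_crepes.mp3", "out_en.wav")))
```

The list can be passed to subprocess.run from inside the hibiki-rs directory.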

Models

Kyutai Labs has released two models for French-to-English translation:

Hibiki 2B: For PyTorch and MLX, with 16 RVQ levels per stream.
Hibiki 1B: For PyTorch and MLX, with 8 RVQ levels per stream, ideal for on-device inference.

Model List:

Hibiki 2B for PyTorch (bf16): kyutai/hibiki-2b-pytorch-bf16
Hibiki 1B for PyTorch (bf16): kyutai/hibiki-1b-pytorch-bf16
Hibiki 2B for MLX (bf16): kyutai/hibiki-2b-mlx-bf16
Hibiki 1B for MLX (bf16): kyutai/hibiki-1b-mlx-bf16
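When scripting, the model list can be kept as a small lookup table so that the value for --hf-repo is chosen by model size and backend. This is a sketch: the names HIBIKI_REPOS and hibiki_repo are hypothetical, and the repository names are copied from the list above.

```python
# Repository names as listed in this article, keyed by (size, backend).
HIBIKI_REPOS = {
    ("2b", "pytorch"): "kyutai/hibiki-2b-pytorch-bf16",
    ("1b", "pytorch"): "kyutai/hibiki-1b-pytorch-bf16",
    ("2b", "mlx"): "kyutai/hibiki-2b-mlx-bf16",
    ("1b", "mlx"): "kyutai/hibiki-1b-mlx-bf16",
}


def hibiki_repo(size, backend):
    """Return the repository name for a model size ("1b"/"2b") and backend."""
    key = (size.lower(), backend.lower())
    if key not in HIBIKI_REPOS:
        raise ValueError(f"no Hibiki model listed for size={size!r}, backend={backend!r}")
    return HIBIKI_REPOS[key]


print(hibiki_repo("1B", "MLX"))
```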

All models are released under a CC-BY 4.0 license.

Usage Process

Start the model: Follow the installation process to start the model.
Input speech: Speak in the source language into the microphone.
Real-time translation: Hibiki generates a speech translation in the target language in real time and displays the text translation alongside it.
Adjust settings: Adjust settings such as voice transfer as needed for a more natural translation.

Main Functions

Real-time speech translation: After launching the model, speak directly into the microphone and Hibiki translates automatically.
Text translation: Hibiki generates a text translation that is displayed in the interface at the same time as the speech translation.
Voice transfer: Enable voice transfer in the settings so that the translated voice better matches natural pronunciation in the target language.

Detailed Operation Procedure

Start the model: Start the model by following the installation process, making sure that all dependencies are installed correctly.
Input speech: Speak in the source language into the microphone; Hibiki starts translating automatically.
View translation results: View the speech and text translations generated in real time in the target language on the interface.
Adjust settings: Adjust features such as voice transfer in the settings as needed for the best translation results.