Kokoro WebGPU: A Text-to-Speech Service for Offline Operation in Browsers

AI Sharing Circle · Latest AI Resources · updated 6 months ago

General Introduction

Kokoro WebGPU is the WebGPU version of the Kokoro text-to-speech (TTS) model, provided by the WebML Community on the Hugging Face platform. The project uses WebGPU to let users run efficient text-to-speech conversion locally in their browsers. WebGPU is a modern graphics and compute API that enables high-performance computation in the browser. Kokoro WebGPU is designed to provide fast, reliable text-to-speech for a variety of applications, such as audiobooks, podcasts, and educational videos.

Kokoro is an open-source TTS model with 82 million parameters. Despite its lightweight architecture, it delivers speech synthesis quality comparable to larger models while being faster and cheaper to run. Kokoro is released under the Apache 2.0 license, so it can be deployed freely both in production environments and in personal projects.

Feature List

Efficient text-to-speech: High-performance text-to-speech conversion using WebGPU.
Runs in the browser: No additional software to install; it runs directly in your browser.
Multi-language support: Supports text-to-speech conversion in multiple languages.
Real-time response: Fast response times suitable for real-time applications.
Open-source community support: Backed by the WebML Community; users can join community discussions and development.

Usage Guide

Sample code to run in a browser

To use Kokoro in your browser, first install the kokoro-js library:

npm install kokoro-js

Speech can then be generated with the following code:

import { KokoroTTS } from "kokoro-js";

const model_id = "onnx-community/Kokoro-82M-v1.0-ONNX";
const tts = await KokoroTTS.from_pretrained(model_id, {
  dtype: "q8", // Options: "fp32", "fp16", "q8", "q4", "q4f16"
  device: "wasm", // Options: "wasm", "webgpu" (web) or "cpu" (node). If using "webgpu", dtype "fp32" is recommended.
});

const text = "Life is like a box of chocolates. You never know what you're gonna get.";
const audio = await tts.generate(text, {
  // Use `tts.list_voices()` to list all available voices
  voice: "af_heart",
});
audio.save("audio.wav");

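The code above runs in the browser, and also in Node.js with device: "cpu". As written, it uses the WebAssembly ("wasm") backend; set device: "webgpu" (ideally with dtype: "fp32", as the comment suggests) to run inference on the GPU through WebGPU.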
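Voice IDs such as `af_heart` appear to follow a naming convention in Kokoro's voice list: the first letter encodes the language or accent, the second letter the speaker's gender (`f` or `m`), and the remainder is the voice name. The helper below is a hypothetical sketch (the names `describe_voice` and `voice_matches_lang` are illustrative, not part of kokoro-js or kokoro) that decodes an ID under that assumption, which helps check that the input text matches the voice's language: `af_heart` is an American English voice, so it expects English text. The prefix table is an assumption based on the language codes the Python kokoro package documents; verify it against the output of `tts.list_voices()`. It is written in Python to match the next section, and the logic ports directly to JavaScript.

```python
# Hypothetical helper: decode a Kokoro-style voice ID such as "af_heart".
# ASSUMPTION: first letter = language/accent code, second letter = gender.
LANG_CODES = {
    "a": "American English",
    "b": "British English",
    "e": "Spanish",
    "f": "French",
    "h": "Hindi",
    "i": "Italian",
    "j": "Japanese",
    "p": "Brazilian Portuguese",
    "z": "Mandarin Chinese",
}
GENDERS = {"f": "female", "m": "male"}


def describe_voice(voice_id: str) -> tuple[str, str, str]:
    """Return (language, gender, name) for an ID like 'af_heart'."""
    prefix, sep, name = voice_id.partition("_")
    if not sep or len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice ID format: {voice_id!r}")
    lang, gender = prefix[0], prefix[1]
    if lang not in LANG_CODES or gender not in GENDERS:
        raise ValueError(f"unknown prefix {prefix!r} in {voice_id!r}")
    return LANG_CODES[lang], GENDERS[gender], name


def voice_matches_lang(voice_id: str, lang_code: str) -> bool:
    """True if the voice's first letter equals a KPipeline-style lang_code."""
    return voice_id[:1] == lang_code


print(describe_voice("af_heart"))           # ('American English', 'female', 'heart')
print(voice_matches_lang("af_heart", "a"))  # True
```

A check like `voice_matches_lang(voice, lang_code)` before synthesis catches mismatches such as pairing an English voice with Chinese text.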

Python Code

In a Python environment, you can use the kokoro library for speech synthesis.

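# Install kokoro and soundfile (the version specifier is quoted so the shell does not treat ">" as a redirect)
!pip install "kokoro>=0.7.11" soundfile
# Install espeak-ng, used for English OOD fallback and some non-English languages
!apt-get -qq -y install espeak-ng > /dev/null 2>&1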

from kokoro import KPipeline
from IPython.display import display, Audio
import soundfile as sf

# Initialize the pipeline
pipeline = KPipeline(lang_code='a')  # 'a' means American English

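text = '''
The sky above the port was the color of television, tuned to a dead channel.
"It's not like I'm using," Case heard someone say, as he shouldered his way through the crowd around the door of the Chat. "It's like my body's developed this massive drug deficiency."
It was a Sprawl voice and a Sprawl joke. The Chatsubo was a bar for professional expatriates; you could drink there for a week and never hear two words in Japanese.
'''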

# Generate and save audio
generator = pipeline(
    text, voice='af_heart',  # change the voice here
    speed=1, split_pattern=r'\n+'
)
for i, (gs, ps, audio) in enumerate(generator):
    print(i)   # chunk index
    print(gs)  # graphemes (input text)
    print(ps)  # phonemes
    display(Audio(data=audio, rate=24000, autoplay=i==0))
    sf.write(f'{i}.wav', audio, 24000)  # save each chunk as its own audio file

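The code above uses the kokoro library to convert text to speech in Python. The text is split into chunks on newlines (split_pattern=r'\n+'), and each chunk's audio is played inline in the notebook and saved as a separate 24 kHz WAV file.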
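If you want one file instead of one file per chunk, the chunks can be joined with a short pause between them. The sketch below assumes each chunk is mono float audio in [-1, 1] at 24 kHz that `np.asarray` can convert (a NumPy array or a CPU tensor); `join_chunks`, `write_wav_pcm16`, and the 0.25 s gap are illustrative choices, not part of the kokoro API. The writer uses the standard-library `wave` module with 16-bit PCM so it does not depend on soundfile.

```python
import wave

import numpy as np

SAMPLE_RATE = 24000  # output rate used in the example above


def join_chunks(chunks, rate=SAMPLE_RATE, gap_s=0.25):
    """Concatenate per-chunk audio, inserting gap_s seconds of silence between chunks."""
    gap = np.zeros(int(round(gap_s * rate)), dtype=np.float32)
    parts = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            parts.append(gap)
        # ASSUMPTION: mono float audio; flatten to 1-D just in case.
        parts.append(np.asarray(chunk, dtype=np.float32).reshape(-1))
    if not parts:
        return np.zeros(0, dtype=np.float32)
    return np.concatenate(parts)


def write_wav_pcm16(path, audio, rate=SAMPLE_RATE):
    """Write mono float audio as a 16-bit PCM WAV file via the wave module."""
    pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype("<i2")  # little-endian int16
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 2 bytes per sample = 16-bit
        f.setframerate(rate)
        f.writeframes(pcm.tobytes())
```

With a fresh generator from `pipeline(...)`, you could collect `chunks = [audio for _, _, audio in generator]` and then call `write_wav_pcm16('full.wav', join_chunks(chunks))`, or keep using `sf.write('full.wav', join_chunks(chunks), 24000)`.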

Try it directly in your browser

You can try Kokoro TTS directly in your browser without installing anything. Visit the link below:

https://huggingface.co/spaces/webml-community/kokoro-webgpu

Please note that the application downloads over 300 MB and must finish loading before you can use it. Because the model then runs entirely in your browser, the wait is worthwhile.
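The download size is consistent with simple arithmetic on the model's 82 million parameters. The sketch below estimates weight size as parameters × bytes per weight for the dtype options listed in the JavaScript example. It is an assumption-laden approximation: it ignores quantization scales and zero-points, layers kept at higher precision, and voice files, and it omits "q4f16" because its precision mix is not specified here. Which dtype the demo actually loads is not stated; an fp32 model (about 328 MB by this estimate, and the dtype the code comment recommends for WebGPU) would fit the "over 300 MB" figure.

```python
# Rough weight-size estimate for an 82M-parameter model at each dtype option.
# ASSUMPTION: size ~= parameters x bytes per weight; real files include extra
# metadata, quantization parameters, and possibly unquantized layers.
PARAMS = 82_000_000

BYTES_PER_WEIGHT = {
    "fp32": 4.0,
    "fp16": 2.0,
    "q8": 1.0,
    "q4": 0.5,
}


def estimated_mb(dtype: str, params: int = PARAMS) -> float:
    """Approximate weight size in megabytes (1 MB = 10**6 bytes)."""
    return params * BYTES_PER_WEIGHT[dtype] / 1e6


for dtype in BYTES_PER_WEIGHT:
    print(f"{dtype:>5}: ~{estimated_mb(dtype):.0f} MB")
# fp32: ~328 MB, fp16: ~164 MB, q8: ~82 MB, q4: ~41 MB
```

This is also why the smaller quantized options (such as "q8", the default in the example) are attractive when download time matters.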

Main Workflow

Text-to-Speech

Input text: Enter the text to convert to speech in the input box.
Select language: Choose the language of the speech, e.g., English, French, or Japanese.
Run the model: Click the Run button to perform the text-to-speech conversion.
View results: The generated speech appears on the page, where you can play it and download the audio file.

Key Features

Real-time conversion: Kokoro WebGPU uses WebGPU acceleration to convert text to speech quickly enough for applications that need fast responses.
Multi-language support: Supports text-to-speech in multiple languages, so users can choose the language that fits their needs.
Community support: Backed by the WebML Community, where users can join discussions and get technical support and updates.
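One way to make the "real-time" claim measurable is the real-time factor (RTF): synthesis time divided by the duration of the audio produced, where values below 1.0 mean audio is generated faster than it plays back. The helper below is a minimal sketch (the name `real_time_factor` is hypothetical) that times any iterable of audio chunks, assuming 1-D mono chunks at 24 kHz as in the Python example. It measures the Python pipeline on your own hardware; it is not a benchmark of the WebGPU demo.

```python
import time

SAMPLE_RATE = 24000  # output rate used in the Python example


def real_time_factor(chunks, rate=SAMPLE_RATE):
    """Time an iterable of audio chunks; return (elapsed_s, audio_s, rtf).

    Consuming a generator here is what triggers synthesis, so the measured
    time covers generation of every chunk.
    """
    start = time.perf_counter()
    n_samples = 0
    for chunk in chunks:
        n_samples += len(chunk)  # ASSUMPTION: 1-D mono chunk
    elapsed = time.perf_counter() - start
    audio_s = n_samples / rate
    rtf = elapsed / audio_s if audio_s > 0 else float("inf")
    return elapsed, audio_s, rtf
```

For example, `real_time_factor(audio for _, _, audio in pipeline(text, voice='af_heart'))` reports how long synthesis of the sample text took relative to its playback length.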