General Introduction
realtime-transcription-fastrtc is an open-source project for converting speech to text in real time. It uses FastRTC to handle low-latency audio streaming, combined with a locally run Whisper model for efficient speech recognition. The project is maintained by developer sofi444 and hosted on GitHub; the code is fully open and can be modified freely. It can be used through the browser or deployed locally, and the interface supports both Gradio and FastAPI modes, making it easy to operate. It suits scenarios such as meeting transcripts and live captions, serving the needs of both individuals and developers. The project emphasizes being lightweight, multilingual, stable in operation, and easy to extend.
Function List
- Real-time voice transcription: converts speech from the microphone to text instantly, with millisecond-level latency.
- Voice Activity Detection (VAD): automatically identifies the beginning and end of speech to optimize the transcription process.
- Multi-language support: English, Chinese, and other languages, based on the Whisper model.
- Dual interface options: an intuitive Gradio interface and a customizable FastAPI interface.
- Local model execution: runs Whisper models locally and supports offline transcription without a constant internet connection.
- Parameter tuning: supports configuration of audio streams, VAD thresholds, and model batch sizes.
- Flexible deployment: can be run locally or deployed through platforms such as Hugging Face Spaces.
- Error feedback: provides clear indications of connection failures or configuration errors for easy debugging.
Usage Guide
Installation Process
To use realtime-transcription-fastrtc, you need to prepare your Python environment and related dependencies. Below are the detailed steps to ensure that users can install and run it without any problems.
- Check system requirements
  - Python version: >= 3.10.
  - Install ffmpeg, used for audio processing.
  - Recommended hardware: a GPU (e.g. MPS or CUDA) to accelerate model inference; a CPU also works, but more slowly.
- Clone the repository
Run the following commands in the terminal to get the project code:
```bash
git clone https://github.com/sofi444/realtime-transcription-fastrtc
cd realtime-transcription-fastrtc
```
- Set up a virtual environment
To avoid dependency conflicts, create a Python virtual environment. There are two officially recommended ways to do this:
Option 1: uv (recommended)
First install uv (see https://docs.astral.sh/uv/), then run:
```bash
uv venv --python 3.11
source .venv/bin/activate  # Windows users: .venv\Scripts\activate
uv pip install -r requirements.txt
```
Option 2: pip
```bash
python -m venv .venv
source .venv/bin/activate  # Windows users: .venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt
```
- Install ffmpeg
Install ffmpeg according to your operating system:
macOS:
```bash
brew install ffmpeg
```
Linux (Ubuntu/Debian):
```bash
sudo apt update
sudo apt install ffmpeg
```
Windows:
  - Download the ffmpeg executable from https://ffmpeg.org/download.html.
  - Add it to the system PATH environment variable, or place it in the project root directory.
- Configure environment variables
In the project root directory, create a .env file and add the following (a sketch of how these values are read follows below):
```
UI_MODE=fastapi
APP_MODE=local
SERVER_NAME=localhost
PORT=7860
MODEL_ID=openai/whisper-large-v3-turbo
```
  - UI_MODE: set to gradio for the Gradio interface, or fastapi for the custom HTML interface (default).
  - APP_MODE: set to local for local runs, or deployed for cloud deployment.
  - MODEL_ID: specifies the Whisper model; defaults to openai/whisper-large-v3-turbo.
  - SERVER_NAME: the server address; defaults to localhost.
  - PORT: the port number; defaults to 7860.
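How main.py is likely to consume these values, as a minimal sketch assuming the python-dotenv package; the variable names match the .env above, everything else is illustrative:
```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root

UI_MODE = os.getenv("UI_MODE", "fastapi")
APP_MODE = os.getenv("APP_MODE", "local")
SERVER_NAME = os.getenv("SERVER_NAME", "localhost")
PORT = int(os.getenv("PORT", "7860"))
MODEL_ID = os.getenv("MODEL_ID", "openai/whisper-large-v3-turbo")
```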
- Run the project
Run the main program:
```bash
python main.py
```
The terminal displays a URL (e.g. http://localhost:7860). The port may differ in Gradio mode, so pay attention to the terminal output.
Main Functions
Real-time voice transcription
- Start transcription: open the interface, click the "Start Recording" button, and authorize the browser to access the microphone. The system automatically detects speech and displays the text.
- View results: transcribed text is displayed in real time in the interface text box, automatically scrolling to the latest content.
- Stop transcription: click the "Stop" button to pause voice input.
- Note: to ensure low latency, the project defaults to a batch size of 1, i.e. each audio clip received is transcribed immediately.
Voice Activity Detection (VAD)
- VAD automatically distinguishes speech from silence to improve transcription efficiency. Adjustable parameters (see the FastRTC documentation at https://fastrtc.org) include the following; a configuration sketch follows this list.
  - audio_chunk_duration: length of each audio chunk; default 0.6 seconds.
  - started_talking_threshold: threshold for detecting the start of speech; default 0.2 seconds.
  - speech_pad_ms: silence padding; default 400 milliseconds.
- To modify them, edit main.py or pass the values in via environment variables.
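A minimal sketch of wiring these parameters into an audio stream, assuming FastRTC's documented AlgoOptions / SileroVadOptions API; the transcribe handler is a placeholder:
```python
from fastrtc import AlgoOptions, ReplyOnPause, SileroVadOptions, Stream

def transcribe(audio):
    sample_rate, array = audio  # FastRTC delivers (sample_rate, numpy array)
    ...  # placeholder: run Whisper on `array` here

stream = Stream(
    handler=ReplyOnPause(
        transcribe,
        algo_options=AlgoOptions(
            audio_chunk_duration=0.6,       # seconds of audio per analysis window
            started_talking_threshold=0.2,  # seconds of speech that count as "started"
        ),
        model_options=SileroVadOptions(
            speech_pad_ms=400,              # padding added around detected speech
        ),
    ),
    modality="audio",
    mode="send-receive",
)
```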
Interface switching
- Gradio interface: ideal for quick tests; it contains a record button and a text display area. Set UI_MODE=gradio, run the project, and open the address shown in the terminal.
- FastAPI interface: supports customization and suits developers; modify index.html to adjust styles or features. Set UI_MODE=fastapi, run the project, and open http://localhost:8000. A sketch of this mode follows.
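A hedged sketch of how the FastAPI mode could fit together, reusing the `stream` object from the VAD sketch above and assuming FastRTC's `Stream.mount()` API; the project's actual main.py may wire this up differently:
```python
from fastapi import FastAPI
from fastapi.responses import HTMLResponse

app = FastAPI()
stream.mount(app)  # registers the WebRTC endpoints on the FastAPI app

@app.get("/")
async def index():
    # Serve the customizable page; index.html is the file mentioned above.
    return HTMLResponse(open("index.html").read())
```
An app built this way would typically be served with uvicorn, for example `uvicorn main:app --port 8000`.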
Feature Highlights
Local Whisper Models
- Default model: openai/whisper-large-v3-turbo; lightweight, multilingual, and high-performing.
- Changing models: set MODEL_ID, e.g. to openai/whisper-small for low-spec devices. Other Hugging Face ASR models are also supported (https://huggingface.co/models?pipeline_tag=automatic-speech-recognition).
- Language settings: transcription defaults to English; to transcribe other languages, set the language parameter in the code (e.g. language=zh for Chinese).
- Run optimization: the first run warms up the model to reduce latency; GPU acceleration is recommended. See the loading sketch after this list.
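Loading the model itself looks like any Hugging Face ASR pipeline. A minimal sketch assuming the standard transformers API (the project's actual loading code may differ); the warm-up step mirrors the behavior described above:
```python
import numpy as np
import torch
from transformers import pipeline

# Pick the fastest available device: CUDA, then Apple MPS, then CPU.
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",  # substitute MODEL_ID from .env
    device=device,
)

# Warm-up: push one second of silence through the model so the first real
# request does not pay the initialization cost.
silence = np.zeros(16000, dtype=np.float32)  # 1 s of audio at 16 kHz
asr({"raw": silence, "sampling_rate": 16000})
```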
Multi-language support
- English, Chinese, Spanish, and other languages are supported, depending on the model.
- Configuration: in main.py, set up the transcribe task and specify the target language.
- Example: to transcribe Chinese speech, set language=zh and make sure the microphone input is clear, as in the snippet below.
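Continuing the pipeline sketch above, the language can be forced per call via generate_kwargs (a hedged example; the project may set this elsewhere):
```python
# "zh" selects Chinese; "en", "es", etc. work the same way.
result = asr(
    {"raw": silence, "sampling_rate": 16000},  # replace with real speech audio
    generate_kwargs={"language": "zh", "task": "transcribe"},
)
print(result["text"])
```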
Cloud Deployment
- Hugging Face Spaces: set APP_MODE=deployed and configure a TURN server (see https://fastrtc.org/deployment/). Upload the code and run it as prompted by the platform.
- Other platforms: manually configure WebRTC and the server environment, and make sure the required ports are open.
Error Handling
- Common errors:
  - "Failed to connect": check the network or the WebRTC configuration.
  - "Model not found": confirm that MODEL_ID is correct and that the model has been downloaded.
  - "ffmpeg not found": ensure that ffmpeg is installed and on the system PATH.
- Debugging: check the terminal logs at runtime; they record the audio sample rate, model loading status, and more. A preflight sketch for the common errors follows.
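A hypothetical preflight helper (not part of the project) that catches two of the common errors above before the server starts; snapshot_download is the standard huggingface_hub call:
```python
import shutil

from huggingface_hub import snapshot_download

# "ffmpeg not found": fail fast if the binary is missing from PATH.
if shutil.which("ffmpeg") is None:
    raise SystemExit("ffmpeg not found: install it and add it to PATH")

# "Model not found": fails with a clear error if the model id is wrong
# or the model cannot be downloaded.
snapshot_download("openai/whisper-large-v3-turbo")
```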
Caveats
- Hardware: a GPU is recommended for real-time inference; MPS supports whisper-large-v3-turbo.
- Browser: Chrome or Firefox is recommended to ensure WebRTC works properly.
- Speech accuracy: depends on microphone quality and the environment; use in a quiet environment is recommended.
Application Scenarios
- Meeting minutes
  Transcribe discussions in real time during remote or on-site meetings to generate transcripts. Teams can export and organize them directly, eliminating manual note-taking.
- Live captioning
  Add real-time captions to live broadcasts to improve content accessibility. Hosts can operate quickly through the Gradio interface while viewers see the text instantly.
- Language learning
  Transcribe pronunciation into text so that students practicing a foreign language can check their accuracy. Multiple languages are supported, suiting English, Chinese, and other learning scenarios.
- Developer integration
  Developers can integrate the project into other applications to test WebRTC or ASR functionality. The open code supports secondary development.
FAQ
- Do I need an internet connection?
  Not for local operation: once the model is downloaded, it can be used offline. Cloud deployment requires a network for WebRTC.
- What languages are supported?
  English is supported by default. Setting the language parameter enables Chinese, Spanish, and others, depending on the Whisper model.
- How can I improve transcription accuracy?
  Use a high-quality microphone, maintain a quiet environment, and choose a larger model (such as whisper-large-v3-turbo).
- Can I customize the interface?
  Yes. In FastAPI mode you can edit index.html to adjust styles or add features.
- Why is transcription delayed?
  It may be due to insufficient hardware performance or network issues. A GPU is recommended; also check the WebRTC configuration.