
Open source tool for real-time speech to text

General Introduction

realtime-transcription-fastrtc is an open source project focused on converting speech to text in real time. It uses FastRTC to handle low-latency audio streams and a locally run Whisper model for efficient speech recognition. The project is maintained by developer sofi444 and hosted on GitHub; the code is fully open and can be modified freely. It can be used in the browser or deployed locally, and the interface supports both Gradio and FastAPI modes, making it easy to operate. It suits scenarios such as meeting notes and live captions, serving both individual users and developers. The project emphasizes being lightweight, supporting multiple languages, running stably, and being easy to extend.


Feature List

  • Real-time voice transcription: converts microphone speech to text instantly, with latency as low as milliseconds.
  • Voice Activity Detection (VAD): automatically identifies the beginning and end of speech to optimize the transcription process.
  • Multi-language support: English, Chinese and other languages, based on the Whisper model.
  • Dual interface options: an intuitive Gradio interface and a customizable FastAPI interface are available.
  • Local model execution: runs Whisper models locally and supports offline transcription without a constant internet connection.
  • Parameter tuning: supports configuration of audio streams, VAD thresholds and model batch sizes.
  • Flexible deployment: can be run locally or deployed through platforms such as Hugging Face Spaces.
  • Error feedback: provides clear messages for connection failures or configuration errors to ease debugging.

 

Usage Guide

Installation process

To use realtime-transcription-fastrtc, you need to prepare a Python environment and the related dependencies. The detailed steps below ensure that users can install and run it without problems.

  1. Check system requirements
    • Python version: >= 3.10.
    • Install ffmpeg, which is used for audio processing.
    • Recommended hardware: a GPU (e.g. MPS or CUDA) to accelerate model inference; a CPU also works but is slower.
  2. Clone the repository
    Run the following commands in a terminal to get the project code:

    git clone https://github.com/sofi444/realtime-transcription-fastrtc
    cd realtime-transcription-fastrtc
    
  3. Set up a virtual environment
    To avoid dependency conflicts, create a Python virtual environment. There are two officially recommended ways to do this:
    Option 1: uv (recommended)
    First install uv (see https://docs.astral.sh/uv/), then run:

    uv venv --python 3.11
    source .venv/bin/activate  # Windows users run .venv\Scripts\activate
    uv pip install -r requirements.txt
    

    Option 2: pip

    python -m venv .venv
    source .venv/bin/activate  # Windows users run .venv\Scripts\activate
    pip install --upgrade pip
    pip install -r requirements.txt
    
  4. Install ffmpeg
    Install ffmpeg according to your operating system:
    macOS:

    brew install ffmpeg
    

    Linux (Ubuntu/Debian):

    sudo apt update
    sudo apt install ffmpeg
    

    Windows:

    • Download the ffmpeg executable from https://ffmpeg.org/download.html.
    • Add it to the system PATH or place it in the project root directory.
  5. Configure environment variables
    In the project root directory, create a .env file and add the following:

    UI_MODE=fastapi
    APP_MODE=local
    SERVER_NAME=localhost
    PORT=7860
    MODEL_ID=openai/whisper-large-v3-turbo
    
    • UI_MODE: set to gradio to use the Gradio interface, or fastapi (default) to use the custom HTML interface.
    • APP_MODE: set to local for local runs, or deployed for cloud deployment.
    • MODEL_ID: specifies the Whisper model; defaults to openai/whisper-large-v3-turbo.
    • SERVER_NAME: server address; defaults to localhost.
    • PORT: port number; defaults to 7860.
    A sketch of how these settings might be read at startup follows the installation steps.
  6. Run the project
    Run the main program:

    python main.py
    

    The terminal displays a URL (e.g. http://localhost:7860). The port may differ in Gradio mode, so watch the terminal prompts.
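
For reference, here is a minimal sketch of how main.py might read the .env settings from step 5. It assumes the python-dotenv package; the project's actual loading code may differ.

    # Illustrative only - the project's main.py may load its settings differently.
    import os

    from dotenv import load_dotenv  # assumes python-dotenv is installed

    load_dotenv()  # read key=value pairs from .env in the project root

    UI_MODE = os.getenv("UI_MODE", "fastapi")      # "gradio" or "fastapi"
    APP_MODE = os.getenv("APP_MODE", "local")      # "local" or "deployed"
    SERVER_NAME = os.getenv("SERVER_NAME", "localhost")
    PORT = int(os.getenv("PORT", "7860"))
    MODEL_ID = os.getenv("MODEL_ID", "openai/whisper-large-v3-turbo")

    print(f"Starting in {UI_MODE} mode on {SERVER_NAME}:{PORT} with model {MODEL_ID}")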

Main Functions

Real-time voice transcription

  • Start transcription: open the interface and click the "Start Recording" button, then authorize the browser to access the microphone. The system automatically detects speech and displays the text.
  • View results: transcribed text is displayed in real time in the interface text box, which automatically scrolls to the latest content.
  • Pause transcription: click the "Stop" button to pause voice input.
  • Note: to keep latency low, the project defaults to a batch size of 1, i.e. every audio clip received is transcribed immediately (see the sketch below).
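
Below is a minimal sketch of what a per-chunk transcription callback could look like, assuming audio arrives as a (sample_rate, samples) tuple and a Hugging Face Transformers ASR pipeline. The names and wiring are illustrative, not the project's exact code.

    import numpy as np
    from transformers import pipeline

    # Illustrative only: load a Whisper ASR pipeline once at startup.
    asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3-turbo")

    def transcribe_chunk(audio):
        """Transcribe one audio chunk; audio is a (sample_rate, samples) tuple."""
        sample_rate, samples = audio
        # Whisper expects float32 mono audio; convert 16-bit PCM if necessary.
        if samples.dtype == np.int16:
            samples = samples.astype(np.float32) / 32768.0
        result = asr({"sampling_rate": sample_rate, "raw": samples})
        return result["text"]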

Voice Activity Detection (VAD)

  • VAD automatically distinguishes speech from silence to improve transcription efficiency. Adjustable parameters (see the FastRTC documentation, https://fastrtc.org):
    • audio_chunk_duration: Length of the audio clip, default 0.6 seconds.
    • started_talking_threshold: Speech start threshold, default 0.2 seconds.
    • speech_pad_ms: Silent fill, default 400 milliseconds.
  • Modification: edit main.py or pass the parameters in via environment variables; a hedged sketch of the relevant options follows this list.
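
The parameters above correspond to FastRTC's pause-detection options. One way they might be passed is sketched below; the class and field names (AlgoOptions, SileroVadOptions, model_options) are assumptions to verify against the FastRTC documentation.

    from fastrtc import AlgoOptions, ReplyOnPause, SileroVadOptions

    # Illustrative only - verify class and field names against https://fastrtc.org.
    algo_options = AlgoOptions(
        audio_chunk_duration=0.6,       # seconds of audio analysed per chunk
        started_talking_threshold=0.2,  # seconds of speech before "talking" is assumed
    )
    vad_options = SileroVadOptions(
        speech_pad_ms=400,              # milliseconds of padding kept around detected speech
    )

    handler = ReplyOnPause(
        transcribe_chunk,               # the per-chunk callback sketched earlier
        algo_options=algo_options,
        model_options=vad_options,
    )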

Interface switching

  • Gradio interface: ideal for quick tests; it contains a record button and a text display area. Set UI_MODE=gradio, run the project, and open the address shown in the terminal.
  • FastAPI interface: supports customization and suits developers; edit index.html to adjust styles or features. Set UI_MODE=fastapi, run the project, and open http://localhost:8000.
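
For illustration, the mode switch could be wired roughly as follows. This is an assumed structure, not the project's actual code: in gradio mode a Gradio UI is launched, while in fastapi mode a FastAPI app serves index.html.

    import os

    import uvicorn
    from fastapi import FastAPI
    from fastapi.responses import HTMLResponse

    UI_MODE = os.getenv("UI_MODE", "fastapi")
    PORT = int(os.getenv("PORT", "7860"))

    def transcribe_chunk(audio):
        # Placeholder; see the transcription sketch earlier in this guide.
        return "transcribed text"

    if UI_MODE == "gradio":
        import gradio as gr

        # Quick-test UI: microphone in, transcribed text out.
        demo = gr.Interface(fn=transcribe_chunk,
                            inputs=gr.Audio(sources=["microphone"]),
                            outputs="text")
        demo.launch(server_port=PORT)
    else:
        app = FastAPI()

        @app.get("/")
        def index() -> HTMLResponse:
            # Serve the customizable HTML front end.
            with open("index.html") as f:
                return HTMLResponse(f.read())

        uvicorn.run(app, host="localhost", port=PORT)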

Featured Functions

Local Whisper Models

  • Default model: openai/whisper-large-v3-turbo, which is lightweight, multilingual, and performs well.
  • Changing models: set MODEL_ID, e.g. openai/whisper-small for low-spec devices. Other Hugging Face ASR models are also supported (https://huggingface.co/models?pipeline_tag=automatic-speech-recognition).
  • Language settings: transcription defaults to English; to transcribe other languages, set the language parameter in the code (e.g. language=zh for Chinese).
  • Run optimization: the model is warmed up on the first run to reduce latency. GPU acceleration is recommended (see the sketch below).
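
A hedged sketch of model loading with device selection and a warm-up pass is shown below; it assumes torch and transformers are installed and may differ from the project's actual code.

    import numpy as np
    import torch
    from transformers import pipeline

    # Pick the best available device: CUDA GPU, Apple MPS, or CPU fallback.
    if torch.cuda.is_available():
        device = "cuda"
    elif torch.backends.mps.is_available():
        device = "mps"
    else:
        device = "cpu"

    asr = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-large-v3-turbo",  # e.g. "openai/whisper-small" on low-spec devices
        device=device,
    )

    # Warm-up: run one second of silence through the model so the first real
    # request does not pay the initialization cost.
    asr({"sampling_rate": 16000, "raw": np.zeros(16000, dtype=np.float32)})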

Multi-language support

  • Supports English, Chinese, Spanish and other languages, depending on the model.
  • Configuration: set up the transcribe task in main.py and specify the target language.
  • Example: to transcribe Chinese speech, set language=zh and make sure the microphone input is clear (see the sketch below).
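
As an example, transcribing Chinese with the Transformers pipeline might look like this; the file name is hypothetical, and the language value is passed through Whisper's generate_kwargs (the project may wire this differently).

    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3-turbo")

    # Ask Whisper to transcribe (not translate) in Chinese.
    result = asr(
        "sample_zh.wav",  # hypothetical audio file path
        generate_kwargs={"language": "zh", "task": "transcribe"},
    )
    print(result["text"])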

Cloud Deployment

  • Hugging Face Spaces: set APP_MODE=deployed and configure a TURN server (see https://fastrtc.org/deployment/). Upload the code and run it as prompted by the platform.
  • Other platforms: WebRTC and the server environment must be configured manually; make sure the required ports are open.

Error Handling

  • Common errors:
    • "Failed to connect": check the network or the WebRTC configuration.
    • "Model not found": confirm that MODEL_ID is correct and the model has been downloaded.
    • "ffmpeg not found": make sure ffmpeg is installed and on the system PATH.
  • Debugging: check the terminal logs at runtime; they record the audio sample rate, model loading status, and more. A quick environment check is sketched below.
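
When debugging, a quick environment check along these lines can rule out the two most common problems (missing ffmpeg and no accelerator). This is an illustrative helper, not part of the project.

    import shutil

    import torch

    # Is ffmpeg on the PATH? It is required for audio decoding.
    print("ffmpeg:", shutil.which("ffmpeg") or "NOT FOUND - install it or add it to PATH")

    # Which device will inference run on?
    if torch.cuda.is_available():
        print("device: cuda")
    elif torch.backends.mps.is_available():
        print("device: mps")
    else:
        print("device: cpu (transcription will be slower)")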

Notes

  • Hardware: a GPU is recommended for real-time inference; MPS supports whisper-large-v3-turbo.
  • Browser: Chrome or Firefox is recommended to ensure WebRTC works properly.
  • Transcription accuracy: depends on microphone quality and surroundings; a quiet environment is recommended.

 

Application Scenarios

  1. Meeting minutes
    Transcribe discussions in real time during remote or on-site meetings to generate transcripts. Teams can export and organize them directly, eliminating manual note-taking.
  2. Live captioning
    Add real-time captions to live streams to improve content accessibility. Hosts can operate it quickly through the Gradio interface, and viewers see the text instantly.
  3. Language learning
    Students practicing a foreign language can transcribe their pronunciation to text to check its accuracy. Multiple languages are supported, suiting English, Chinese and other learning scenarios.
  4. Development integration
    Developers can integrate the project into other applications to test WebRTC or ASR functionality. The open code supports secondary development.

 

QA

  1. Do I need an internet connection?
    No internet connection is required for local operation; once the model is downloaded it can be used offline. Cloud deployment requires a network for WebRTC.
  2. What languages are supported?
    English is supported by default. Setting the language parameter enables Chinese, Spanish and others, depending on the Whisper model.
  3. How can I improve transcription accuracy?
    Use a high-quality microphone, keep the environment quiet, and choose a large model (such as whisper-large-v3-turbo).
  4. Can I customize the interface?
    Yes. In FastAPI mode you can edit index.html to adjust styles or add features.
  5. Why is transcription delayed?
    It may be due to insufficient hardware performance or network issues. A GPU is recommended; also check the WebRTC configuration.