
BetterWhisperX: Automatic speech recognition with speaker diarization, providing highly accurate word-level timestamps

General Introduction

BetterWhisperX is an optimized fork of the WhisperX project focused on providing efficient and accurate Automatic Speech Recognition (ASR). The project is maintained by Federico Torrielli, who keeps it continuously updated and works on improving its performance. BetterWhisperX integrates a number of advanced techniques, including phoneme-level forced alignment, batch processing driven by voice activity detection, and speaker diarization. The tool not only supports high-speed transcription (up to 70x real-time speed with the large-v2 model), but also provides accurate word-level timestamps and multi-speaker recognition. It uses faster-whisper as the backend, which keeps GPU memory requirements low even for large models and delivers a very high performance-to-efficiency ratio.


Function List

  • Fast speech-to-text: supports up to 70x real-time transcription with the large-v2 model.
  • Word-level timestamps: provides accurate word-level timestamps via wav2vec2 forced alignment.
  • Multi-speaker recognition: uses pyannote-audio for speaker diarization and labeling.
  • Voice activity detection: reduces misrecognition and enables batching with no significant increase in error rate.
  • Batch inference: supports batched processing for higher throughput.
  • Compatibility: works with PyTorch 2.0 and Python 3.10 across a wide range of environments.

 

Usage Help

Detailed Operation Procedure

  1. Prepare the audio file: Make sure the audio is in WAV or MP3 format and the sound quality is clear.
  2. Load the model: Select an appropriate model (e.g. large-v2) for your requirements and load it into memory.
  3. Run transcription: Call the transcribe function to perform speech-to-text processing and obtain the preliminary transcript.
  4. Align timestamps: Apply the align function to the transcription results to obtain accurate word-level timestamps.
  5. Separate speakers: Run the diarization pipeline for multi-speaker recognition to get a label and the corresponding speech segments for each speaker.
  6. Output results: Save the final result as a text file or JSON for subsequent processing and analysis (see the sketch below).
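A minimal sketch of step 6, assuming result is the dict produced by the Python example under Basic Usage below; the output file names are arbitrary choices for illustration:

import json
# Dump the aligned (and optionally diarized) segments to JSON.
# `result` is the dict returned by whisperx.align() / assign_word_speakers().
with open("transcript.json", "w", encoding="utf-8") as f:
    json.dump(result["segments"], f, ensure_ascii=False, indent=2)
# Or write plain text, one segment per line with its time range.
with open("transcript.txt", "w", encoding="utf-8") as f:
    for segment in result["segments"]:
        f.write(f"[{segment['start']:.2f}-{segment['end']:.2f}] {segment['text']}\n")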

1. Environmental preparation

  1. System Requirements:
    • Python 3.10 environment (mamba or conda is recommended to create a virtual environment)
    • CUDA and cuDNN support (required for GPU acceleration)
    • FFmpeg Toolkit
  2. Installation Steps (a quick sanity check follows the commands below):
# Create a Python environment
mamba create -n whisperx python=3.10
mamba activate whisperx
# Install CUDA and cuDNN
mamba install cuda cudnn
# Install BetterWhisperX
pip install git+https://github.com/federicotorrielli/BetterWhisperX.git
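With the environment set up, a quick sanity check (a minimal sketch, not part of the official installation instructions) can confirm that the package imports and that PyTorch can see a CUDA device:

# Optional post-install check: verify the import works and a GPU is visible.
import torch
import whisperx  # fails here if the pip install did not succeed
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))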

2. Basic Usage

  1. Command Line Usage:
# Basic Transcription (English)
whisperx audio.wav
# Use a larger model and a higher-accuracy alignment model
whisperx audio.wav --model large-v2 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --batch_size 4
# Enable speaker separation
whisperx audio.wav --model large-v2 --diarize --highlight_words True
# CPU mode (for Mac OS X)
whisperx audio.wav --compute_type int8
  2. Python API Usage:
import whisperx
import gc
device = "cuda"
audio_file = "audio.mp3"
batch_size = 16 # reduce if low on GPU memory
compute_type = "float16" # change to "int8" if low on GPU memory (may reduce accuracy)
# 1. Load the model and transcribe the audio
model = whisperx.load_model("large-v2", device, compute_type=compute_type)
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
# 2. Phoneme alignment (word-level timestamps)
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device)
# 3. speaker separation (requires Hugging Face token)
diarize_model = whisperx.DiarizationPipeline(use_auth_token=YOUR_HF_TOKEN, device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)
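The import gc at the top is there for freeing GPU memory between stages when resources are tight; a hedged sketch of that optional cleanup, mirroring the pattern shown in the upstream WhisperX example, looks like this:

# Optional: release GPU memory after a stage is finished (e.g. after step 1).
import torch
del model                 # drop the transcription model once it is no longer needed
gc.collect()              # uses the gc module imported above
torch.cuda.empty_cache()  # return cached GPU memory to the driver
# The same pattern can be repeated after alignment (del model_a)
# and after diarization (del diarize_model).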

3. Performance optimization recommendations

  1. GPU memory optimization:
    • Reduce the batch size (batch_size)
    • Use a smaller model (e.g. base instead of large)
    • Select a lightweight compute type (int8)
  2. Multi-language support:
    • Languages supported by default: English, French, German, Spanish, Italian, Japanese, Chinese, Dutch, Ukrainian, Portuguese
    • Specify the language explicitly with --language de (German, for example); a configuration sketch combining these recommendations follows this list.
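A minimal sketch of a low-memory configuration via the Python API; the model name, batch size, and language are illustrative values, and the parameter names follow the upstream WhisperX API, so verify them against your installed version:

import whisperx
# Low-memory configuration (illustrative values): smaller model, int8 compute,
# reduced batch size, and an explicit language to skip language detection.
model = whisperx.load_model("base", "cuda", compute_type="int8", language="de")
audio = whisperx.load_audio("audio.mp3")
result = model.transcribe(audio, batch_size=4)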

4. Cautions

  • Timestamps may not be accurate for special characters (e.g., numbers, currency symbols)
  • Overlapping speech, where multiple people talk at the same time, may not be handled well
  • Speaker separation is still being optimized
  • A Hugging Face access token is required to use the speaker separation feature (see the sketch below)
  • Ensure your GPU driver and CUDA versions are compatible
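A minimal sketch of passing that token to the diarization pipeline, assuming it is stored in an environment variable; the variable name HF_TOKEN is an arbitrary choice for this example:

import os
import whisperx
# Read the Hugging Face access token from the environment (HF_TOKEN is an
# arbitrary name chosen for this sketch) and hand it to the diarization pipeline.
hf_token = os.environ.get("HF_TOKEN")
diarize_model = whisperx.DiarizationPipeline(use_auth_token=hf_token, device="cuda")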