General Introduction
CrisperWhisper is an advanced speech recognition tool based on OpenAI Whisper that focuses on fast, accurate, verbatim speech transcription. It delivers accurate word-level timestamps, even in the presence of filler words and pauses. CrisperWhisper improves timestamp accuracy by adjusting the tokenizer and customizing the attention loss, and reduces transcription hallucinations so that every spoken word is accurately recorded.
Paper Summary
CrisperWhisper is an improved version of the Whisper speech recognition model. By adjusting the tokenizer and using the Dynamic Time Warping (DTW) algorithm, it produces more accurate word-level timestamps, delivers more fine-grained verbatim transcription, improves the detection of pause and filler events in speech, and reduces the generation of hallucinations.
Summary
CrisperWhisper is an enhancement of the Whisper model designed to provide more accurate word-level timestamps and more granular speech transcription. The model improves timestamp accuracy by adjusting Whisper's tokenizer so that the DTW algorithm can align audio segments with words more precisely. This makes it possible to capture every spoken word in the transcription, which matters for clinical speech assessment, analysis of the language planning process, and identification of indicators of cognitive load. CrisperWhisper also improves robustness by including noisy and multi-speaker audio during training and by training specifically against hallucinations, and it has been evaluated on multiple benchmark datasets, demonstrating strong performance in speech recognition, segmentation, filler event detection, and hallucination reduction. In addition, the code for the model and a synthetic dataset with accurate word-level timestamps have been made available.
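For readers who want to try this programmatically, here is a minimal sketch of loading the released checkpoint through the Hugging Face transformers ASR pipeline with word-level timestamps; the model id nyrahealth/CrisperWhisper and the example file name audio.wav are assumptions, not details stated in this summary.

```python
# Minimal sketch: word-level timestamps with a Whisper-style checkpoint via the
# Hugging Face `transformers` ASR pipeline. The model id below is an assumption;
# substitute the official CrisperWhisper checkpoint name if it differs.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="nyrahealth/CrisperWhisper",  # assumed Hub id
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device="cuda:0" if torch.cuda.is_available() else "cpu",
)

# return_timestamps="word" asks the pipeline for per-word start/end times.
result = asr("audio.wav", return_timestamps="word")

print(result["text"])
for chunk in result["chunks"]:
    start, end = chunk["timestamp"]
    print(f"[{start:.2f}s - {end:.2f}s] {chunk['text'].strip()}")
```

The chunks field of the pipeline output carries one entry per word with its start and end time, which is the information the examples later on this page format in the [0:00:01] word style.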
Key Points
- Improved tokenizer: CrisperWhisper improves timestamp accuracy by removing redundant spaces in the tokenizer and re-tokenizing specific words such as "uh" and "um", so that the DTW algorithm can align audio segments with words more precisely (see the sketch after this list).
- Noise robustness: The model improves its tolerance to noise by including data from noisy and multi-speaker sources during training, and reduces the generation of hallucinated content by adding training samples that contain no speech.
- Strong performance: CrisperWhisper has been evaluated on several benchmark datasets, including the AMI Meeting Corpus, TED-LIUM, and LibriSpeech, and demonstrates excellent word-level timestamping and speech recognition performance on them.
- Open source code and datasets: The code for the model and a synthetic speech dataset have been released as open source, which helps researchers and developers further study and improve speech recognition techniques.
- Reduced hallucinations: CrisperWhisper effectively reduces the generation of hallucinated content through precise timestamping and dedicated handling of hallucinations, which is particularly important for improving the reliability of speech recognition systems.
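The tokenizer point above can be made concrete with a small, purely illustrative sketch (the retokenize helper and the token strings are hypothetical, not code from the CrisperWhisper repository): Whisper's BPE tokens glue the leading space onto the following word, so silence before a word tends to be attributed to the word itself; giving the space its own slot lets DTW assign the pause to the space instead.

```python
# Illustrative only: show how moving the leading-space marker out of the word
# tokens changes which token "owns" the silence before a word. The helper name
# and token strings are hypothetical, not CrisperWhisper's actual code.

whisper_style_tokens = ["Hello", " world", " um", " yes"]  # space glued to the word

def retokenize(tokens):
    """Split each leading space into its own pseudo-token so DTW can assign
    inter-word pauses to the space rather than to the following word."""
    out = []
    for tok in tokens:
        if tok.startswith(" "):
            out.append(" ")        # the pause/space gets its own alignment slot
            out.append(tok[1:])    # the word itself starts at speech onset
        else:
            out.append(tok)
    return out

print(retokenize(whisper_style_tokens))
# ['Hello', ' ', 'world', ' ', 'um', ' ', 'yes']
```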
Feature List
- Accurate word-level timestamps: Provides precise timestamps even in the presence of filler words and pauses.
- Verbatim transcription: Records every spoken word verbatim, including fillers such as "um" and "ah".
- Filler word detection: Detects and accurately transcribes filler words.
- Reduced hallucinations: Reduces transcription hallucinations and improves accuracy.
- Open source: The code is publicly available for inspection and use.
Usage Guide
Installation
- Environment setup:
- Ensure that Python 3.7 or later is installed.
- Install the necessary dependency libraries:
pip install -r requirements.txt
- Download the code:
- Clone the GitHub repository:
git clone https://github.com/nyrahealth/CrisperWhisper.git
- Run the application:
- Go to the project directory:
cd CrisperWhisper
- Run the application:
python app.py
Usage Instructions
- Basic usage:
- After opening the app, upload the audio file to be transcribed.
- Select the transcription mode (verbatim or standard transcription).
- Click the "Start Transcription" button and wait for the transcription to complete.
- Advanced features:
- Timestamp adjustment: The precision of the timestamps can be adjusted in the settings.
- Filler word detection: Enable or disable filler word detection.
- Export results: Once the transcription is complete, the results can be exported to a text file or another format (see the sketch after this list).
- Common problems:
- Inaccurate transcription: Ensure good audio quality and avoid background noise.
- Imprecise timestamps: Try adjusting the timestamp settings or using a higher-quality audio file.
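For the "Export results" feature, a minimal sketch of writing word-level output to a plain text file is shown below; the result layout mirrors the pipeline output used earlier, and the sample values, file name, and tab-separated format are illustrative choices rather than a format the app guarantees.

```python
# Minimal sketch: dump word-level timestamps to a tab-separated text file.
# The `result` layout mirrors the transformers ASR pipeline output used above;
# the sample values here are made up for illustration.

result = {
    "text": "Well, I find this project very interesting.",
    "chunks": [
        {"text": " Well,", "timestamp": (1.0, 1.4)},
        {"text": " I", "timestamp": (2.0, 2.1)},
        {"text": " find", "timestamp": (3.0, 3.3)},
    ],
}

def export_transcript(result, path="transcript.txt"):
    """Write the full text followed by one start/end/word line per chunk."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(result["text"].strip() + "\n\n")
        for chunk in result["chunks"]:
            start, end = chunk["timestamp"]
            f.write(f"{start:.2f}\t{end:.2f}\t{chunk['text'].strip()}\n")

export_transcript(result)
```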
Examples
- Verbatim transcription example:
Original audio: Well, I find this project very interesting.
Transcription result: Well, I find this project very interesting.
Timestamps: [0:00:01] Well, [0:00:02] I, [0:00:03] find, [0:00:04] this, [0:00:05] project, [0:00:06] very, [0:00:07] interesting.
- Filler word detection example:
Original audio: Well, I find this project very interesting.
Transcription result: Well, I find this project very interesting.
Filler word: [0:00:01] hmm
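A hedged sketch of how word-level pipeline output could be formatted into the [0:00:01] word style used in these examples, with common fillers listed separately; the filler vocabulary and the helper names are illustrative assumptions, not behaviour documented by CrisperWhisper.

```python
# Illustrative formatting of word-level timestamps into the "[H:MM:SS] word"
# style used in the examples above, with fillers split out separately.
FILLERS = {"um", "uh", "hmm", "ah"}  # assumed filler vocabulary

def fmt(seconds):
    """Render a time in seconds as H:MM:SS."""
    s = int(seconds)
    return f"{s // 3600}:{(s % 3600) // 60:02d}:{s % 60:02d}"

def format_words(chunks):
    """Split pipeline chunks into timestamped regular words and fillers."""
    words, fillers = [], []
    for chunk in chunks:
        word = chunk["text"].strip().strip(",.!?").lower()
        entry = f"[{fmt(chunk['timestamp'][0])}] {chunk['text'].strip()}"
        (fillers if word in FILLERS else words).append(entry)
    return words, fillers

# Sample chunks in the shape returned by the transformers ASR pipeline.
chunks = [
    {"text": " Well,", "timestamp": (1.0, 1.4)},
    {"text": " um", "timestamp": (2.0, 2.2)},
    {"text": " interesting.", "timestamp": (7.0, 7.6)},
]
words, fillers = format_words(chunks)
print("Words:  ", ", ".join(words))
print("Fillers:", ", ".join(fillers))
```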