Whisper

Automate and integrate Whisper speech recognition into your audio workflows

Whisper is a community skill for using OpenAI's Whisper automatic speech recognition model, covering audio transcription, language detection, translation, timestamp generation, and integration patterns for speech-to-text processing pipelines.

What Is This?

Overview

Whisper provides guidance on using OpenAI's Whisper model for automatic speech recognition and audio transcription tasks. It covers audio transcription, which converts speech to text across multiple languages and supports common audio formats including WAV, MP3, FLAC, and M4A; language detection, which automatically identifies the spoken language from audio input; translation, which transcribes non-English audio directly into English text in a single pass; timestamp generation, which produces word-level and segment-level timing for subtitle creation; and model selection, which offers sizes from tiny to large with different accuracy and speed tradeoffs. The skill helps developers integrate speech recognition into their applications.

Who Should Use This

This skill serves developers building speech-to-text features in applications, researchers processing audio datasets for analysis, and content creators generating subtitles and transcripts from recorded media.

Why Use It?

Problems It Solves

Commercial speech recognition APIs charge per minute of audio and require network connectivity for processing. Many transcription tools support only English or a limited set of languages with poor accuracy on accented speech. Generating accurate timestamps for subtitle creation requires specialized alignment tools separate from the transcription engine. Processing large audio archives through cloud APIs is expensive and slow compared to local batch processing.

Core Highlights

Transcription engine converts speech to text across 99 languages with high accuracy. Language detector automatically identifies spoken language from audio input. Translation pipeline transcribes non-English audio directly into English text. Timestamp generator produces word-level timing for subtitle and alignment workflows.
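A minimal sketch of the translation path described above, assuming a hypothetical non-English recording named interview_fr.mp3; task='translate' asks Whisper to emit English text directly.

import whisper

model = whisper.load_model('medium')
# task='translate' produces English output regardless of the source language.
# The file name is illustrative, not a required value.
result = model.transcribe('interview_fr.mp3', task='translate')
print(result['text'])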

How to Use It?

Basic Usage

import whisper

# Load a model checkpoint; 'base' trades some accuracy for speed.
model = whisper.load_model('base')

# Transcribe an audio file and print the recognized text.
result = model.transcribe('audio.mp3')
print(result['text'])

# Language detection: build a 30-second log-Mel spectrogram and
# ask the model which language is most probable.
audio = whisper.load_audio('audio.mp3')
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)
_, probs = model.detect_language(mel)
lang = max(probs, key=probs.get)
print(f'Language: {lang}')

Real-World Examples

import whisper

# A larger model gives better word-level timing for subtitle work.
model = whisper.load_model('medium')
result = model.transcribe('video.mp4', word_timestamps=True)

def format_ts(seconds):
    # Convert seconds to the SRT timestamp format HH:MM:SS,mmm.
    h = int(seconds // 3600)
    m = int(seconds % 3600 // 60)
    s = int(seconds % 60)
    ms = int((seconds % 1) * 1000)
    return f'{h:02d}:{m:02d}:{s:02d},{ms:03d}'

def to_srt(segments):
    # Turn Whisper segments into numbered SRT subtitle blocks.
    srt = []
    for i, seg in enumerate(segments, 1):
        start = format_ts(seg['start'])
        end = format_ts(seg['end'])
        text = seg['text'].strip()
        srt.append(f'{i}\n{start} --> {end}\n{text}\n')
    return '\n'.join(srt)

srt_text = to_srt(result['segments'])
with open('subs.srt', 'w') as f:
    f.write(srt_text)

Advanced Tips

Use the medium or large model for production accuracy and fall back to the base or small model for faster processing when latency matters more than precision. Pass the initial_prompt parameter to provide context about domain-specific vocabulary and improve recognition of technical terms. Use VAD-based preprocessing to split long audio files at silence boundaries before transcription for better segment quality.
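As a sketch of the initial_prompt tip, assuming a hypothetical recording and a made-up vocabulary string; the prompt simply biases decoding toward the listed terms.

import whisper

model = whisper.load_model('medium')
# The file name and vocabulary below are placeholders for illustration only.
result = model.transcribe(
    'engineering_standup.mp3',
    initial_prompt='Kubernetes, Terraform, CI/CD, Prometheus, Grafana')
print(result['text'])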

When to Use It?

Use Cases

Generate accurate subtitles for video content with word-level timestamp alignment. Transcribe recorded meetings and lectures into searchable text documents for archival and reference. Build a multilingual transcription pipeline that processes audio in any language and outputs English text.
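One possible shape for such a multilingual pipeline, sketched under the assumption that source files live in an audio/ directory and English transcripts go to transcripts/; both paths are illustrative.

import pathlib
import whisper

model = whisper.load_model('large')

audio_dir = pathlib.Path('audio')        # hypothetical input directory
out_dir = pathlib.Path('transcripts')    # hypothetical output directory
out_dir.mkdir(exist_ok=True)

for path in sorted(audio_dir.glob('*.mp3')):
    # task='translate' yields English text regardless of the spoken language.
    result = model.transcribe(str(path), task='translate')
    (out_dir / f'{path.stem}.txt').write_text(result['text'].strip())
    print(f'{path.name}: detected language {result["language"]}')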

Related Topics

Speech recognition, audio transcription, subtitles, language detection, natural language processing, PyTorch, and audio processing.

Important Notes

Requirements

Python with the openai-whisper package and PyTorch installed for running the speech recognition model locally. Sufficient GPU memory for larger model sizes since the large model requires approximately 10GB of VRAM for inference. FFmpeg installed on the system for audio format conversion and preprocessing of input files.
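A small sketch for picking a model size from available GPU memory before loading; the 10GB threshold mirrors the requirement above, and the fallback choices are just one possible policy.

import torch
import whisper

def pick_model_name():
    # Without a GPU, stay with a small checkpoint that runs acceptably on CPU.
    if not torch.cuda.is_available():
        return 'base'
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    # Roughly 10GB of VRAM is needed for the large model, per the note above.
    return 'large' if vram_gb >= 10 else 'medium'

model = whisper.load_model(pick_model_name())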

Usage Recommendations

Do: select the appropriate model size based on your accuracy needs and available compute resources. Preprocess audio to remove background noise and normalize volume levels before transcription for better results. Use the language parameter when the spoken language is known to skip automatic detection and improve accuracy.
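For the known-language recommendation, a one-line sketch (the language code and file name are illustrative):

import whisper

model = whisper.load_model('base')
# language='de' skips auto-detection when the audio is known to be German.
result = model.transcribe('meeting_de.mp3', language='de')
print(result['text'])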

Don't: use the large model for real-time applications where latency is critical since inference time scales with model size. Feed extremely long audio files without splitting since memory usage grows with input duration. Expect perfect accuracy on heavily accented speech or audio with significant background noise without preprocessing.
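One way to split a very long recording before transcription is to slice the 16 kHz waveform into fixed-length chunks, sketched below; the 10-minute chunk size and file name are arbitrary assumptions.

import whisper

SAMPLE_RATE = 16000            # whisper.load_audio resamples to 16 kHz
CHUNK_SECONDS = 10 * 60        # arbitrary 10-minute chunks

model = whisper.load_model('base')
audio = whisper.load_audio('long_lecture.mp3')

texts = []
step = SAMPLE_RATE * CHUNK_SECONDS
for start in range(0, len(audio), step):
    chunk = audio[start:start + step]
    # transcribe() accepts a NumPy waveform as well as a file path.
    texts.append(model.transcribe(chunk)['text'].strip())

full_text = ' '.join(texts)

Splitting at fixed intervals can cut words mid-sentence; the VAD-based splitting at silence boundaries mentioned under Advanced Tips avoids that.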

Limitations

Larger models provide better accuracy but require significantly more GPU memory and processing time for each audio segment. Real-time transcription is not practical with larger model sizes on consumer hardware without streaming optimizations. Recognition accuracy degrades noticeably on audio with overlapping speakers, heavy background noise, or domain-specific jargon not represented in the training data.