OpenAI Whisper

Speech-to-text transcription with the OpenAI Whisper API

OpenAI Whisper is a community skill for speech-to-text transcription, covering audio file processing, multi-language recognition, timestamp generation, and subtitle file creation for automated speech workflows.

What Is This?

Overview

OpenAI Whisper provides AI agents with audio transcription capabilities using the Whisper speech recognition model. Audio file processing transcribes recorded audio in formats including MP3, WAV, M4A, and FLAC with high accuracy. Multi-language recognition automatically detects and transcribes speech in over 90 languages without manual language specification. Timestamp generation produces word-level and segment-level timing data for precise audio alignment. Subtitle file creation outputs transcriptions in SRT and VTT formats ready for video players, and translation mode transcribes foreign-language audio directly into English text. The skill enables agents to process audio content for search, analysis, and accessibility workflows.

Who Should Use This

This skill serves content creators adding subtitles to video productions, developers building voice-enabled applications, and teams automating meeting transcription and audio content indexing. It is also well suited for researchers processing interview recordings and organizations building multilingual accessibility pipelines at scale.

Why Use It?

Problems It Solves

Manual audio transcription is extremely time-consuming and expensive for long recordings, often costing several dollars per minute of audio. Existing speech-to-text services often struggle significantly with accented speech, technical vocabulary, overlapping dialogue, and multi-speaker conversations. Creating properly timed and synchronized subtitle files requires specialized software and extensive manual timing adjustments for each segment. Audio content remains unsearchable, inaccessible to hearing-impaired users, and unable to be indexed without text transcription.

Core Highlights

Transcription engine processes audio files in multiple formats with high accuracy. Language detector automatically identifies the spoken language from over 90 supported languages. Timestamp generator produces word- and segment-level timing data. Subtitle exporter creates ready-to-use SRT and VTT files for video integration.

How to Use It?

Basic Usage

import openai

client = openai.OpenAI()

# Transcribe a local audio file with the whisper-1 model
with open('recording.mp3', 'rb') as audio_file:
    transcript = client.audio.transcriptions.create(
        model='whisper-1',
        file=audio_file,
    )

print(transcript.text)

Real-World Examples

# Word- and segment-level timestamps via verbose_json
with open('recording.mp3', 'rb') as audio_file:
    result = client.audio.transcriptions.create(
        model='whisper-1',
        file=audio_file,
        response_format='verbose_json',
        timestamp_granularities=['word', 'segment'],
    )

# Segments expose start, end, and text as attributes
for seg in result.segments:
    print(f'[{seg.start:.1f}-{seg.end:.1f}] {seg.text}')

# Ready-to-use SRT subtitles for video players
with open('recording.mp3', 'rb') as audio_file:
    srt_result = client.audio.transcriptions.create(
        model='whisper-1',
        file=audio_file,
        response_format='srt',
    )

with open('subs.srt', 'w') as f:
    f.write(srt_result)
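Translation mode, mentioned in the overview, goes through the separate translations endpoint. A minimal sketch, assuming the same openai client as above and a hypothetical French-language recording interview_fr.mp3:

# Translate foreign-language speech directly into English text
with open('interview_fr.mp3', 'rb') as audio_file:
    translation = client.audio.translations.create(
        model='whisper-1',
        file=audio_file,
    )

print(translation.text)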

Advanced Tips

Use the verbose_json response format to get both word-level and segment-level timestamps for precise subtitle timing. Provide a language hint parameter for audio with heavy accents to improve recognition accuracy. Split very long audio files into smaller chunks before processing to stay within API file size limits. When chunking audio, include a small overlap between segments of two to three seconds to avoid cutting words at boundaries and losing context.
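As one possible way to chunk long recordings, the sketch below slices audio with the pydub library (an assumption, not part of this skill) into illustrative five-minute chunks that overlap by three seconds:

from pydub import AudioSegment  # assumption: pydub installed for audio slicing

CHUNK_MS = 5 * 60 * 1000   # illustrative chunk length: 5 minutes
OVERLAP_MS = 3 * 1000      # 3-second overlap so words are not cut at boundaries

audio = AudioSegment.from_file('long_recording.mp3')

chunk_paths = []
start = 0
index = 0
while start < len(audio):  # len() of an AudioSegment is its duration in milliseconds
    chunk = audio[start:start + CHUNK_MS]
    path = f'chunk_{index:03d}.mp3'
    chunk.export(path, format='mp3')
    chunk_paths.append(path)
    start += CHUNK_MS - OVERLAP_MS
    index += 1

# Each chunk can then be sent to the transcriptions endpoint as in Basic Usage.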

When to Use It?

Use Cases

Generate subtitles for video content automatically from the audio track. Transcribe meeting recordings into searchable text documents with segment timestamps for easy navigation and reference. Build a voice note application that converts spoken memos into organized, categorized text notes for later review and search. Whisper also works well for transcribing podcast episodes, enabling full-text search across large content libraries without manual effort.

Related Topics

Speech recognition, audio processing, subtitle generation, natural language processing, accessibility tools, and voice transcription.

Important Notes

Requirements

A valid OpenAI API key with access to the Whisper model for authentication. Audio files in supported formats (MP3, WAV, M4A, FLAC) within the API size limit of 25MB per file. Python with the openai library installed for making API calls to the transcription endpoint.
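A quick pre-flight check against the size limit might look like the sketch below; the path and threshold handling are illustrative:

import os

MAX_BYTES = 25 * 1000 * 1000  # conservative reading of the 25MB per-file limit

def within_size_limit(path: str) -> bool:
    # Return True if the audio file is small enough to upload directly
    return os.path.getsize(path) <= MAX_BYTES

if not within_size_limit('recording.mp3'):
    print('File exceeds 25MB; split it into smaller chunks first.')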

Usage Recommendations

Do: use the verbose_json format when you need timestamp data for subtitle generation or audio alignment. Pre-process noisy audio with noise reduction and normalization tools before transcription to significantly improve accuracy. Specify the expected language when known to reduce detection errors and improve transcription quality.
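For instance, a language hint can be passed directly on the transcriptions request; the ISO-639-1 code 'de' and the file name below are illustrative:

# Hint the expected language to reduce detection errors
with open('interview_de.mp3', 'rb') as audio_file:
    transcript = client.audio.transcriptions.create(
        model='whisper-1',
        file=audio_file,
        language='de',
    )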

Don't: send audio files larger than 25MB without splitting them into smaller segments first. Rely on transcription alone for critical content like legal or medical recordings without human review. Assume the model handles all dialects and domain-specific terminology perfectly without post-processing.

Limitations

Audio files are limited to 25MB per request, which may require splitting longer recordings. Transcription accuracy varies with audio quality, background noise levels, and speaker accent or dialect. The model processes audio in a single batch pass and does not support real-time streaming transcription through the standard API. Speaker diarization (identifying who said what) is not built in and requires separate processing.