Speech To Text

Automate and integrate speech-to-text conversion for accurate and fast audio transcription

Speech To Text is a community skill for converting spoken audio into written text, covering audio preprocessing, transcription model integration, speaker diarization, timestamp alignment, and batch processing workflows for accurate speech recognition.

What Is This?

Overview

Speech To Text provides patterns for building transcription pipelines that convert audio recordings into structured text output. It covers audio format conversion and noise reduction preprocessing, transcription API integration with configurable language and model settings, speaker diarization that identifies who spoke each segment, word-level timestamp alignment for subtitle generation, and batch processing for handling large audio collections. The skill enables developers to build reliable speech recognition features for applications ranging from meeting notes to media captioning.

Who Should Use This

This skill serves developers building transcription features into communication tools, media companies producing captions and subtitles for video content, and teams creating searchable archives of recorded meetings and calls.

Why Use It?

Problems It Solves

Manual transcription is slow, expensive, and does not scale to large audio volumes. Raw transcription without speaker labels makes multi-person conversations difficult to follow. Audio with background noise or varying recording quality produces unreliable transcriptions without preprocessing. Timestamps are needed for subtitle generation, but not all APIs provide word-level timing.

Core Highlights

Audio preprocessing normalizes format, sample rate, and volume levels before transcription. Model selection routes audio to appropriate engines based on language and quality requirements. Speaker diarization segments output by speaker identity for readable multi-person transcripts. Timestamp alignment provides word and sentence timing for subtitle and caption generation.
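As a concrete sketch of the model-selection highlight, the routine below routes a request to an engine based on language and accuracy needs. The routing rules and the non-default model names are illustrative assumptions, not part of the skill:

def select_model(language: str, high_accuracy: bool) -> str:
    # Hypothetical routing table; a real deployment would
    # consult its provider's model catalog instead.
    if high_accuracy:
        return "whisper-large-v3"
    if language == "en":
        return "whisper-base.en"  # assumed English-only model id
    return "whisper-base"  # assumed multilingual default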

How to Use It?

Basic Usage

from dataclasses import dataclass
from pathlib import Path

@dataclass
class TranscriptionConfig:
    language: str = "en"
    model: str = "whisper-large-v3"
    enable_timestamps: bool = True
    enable_diarization: bool = False

@dataclass
class TranscriptSegment:
    text: str
    start_time: float = 0.0
    end_time: float = 0.0
    speaker: str = ""

class AudioPreprocessor:
    def __init__(self, target_sr: int = 16000):
        # 16 kHz is the sample rate most ASR models expect.
        self.target_sr = target_sr

    def validate(self, path: str) -> dict:
        # Confirm the file exists and report size and format
        # before spending time or API credits on it.
        p = Path(path)
        if not p.exists():
            return {"valid": False, "error": "Not found"}
        size_mb = p.stat().st_size / (1024 * 1024)
        return {"valid": True, "size_mb": round(size_mb, 2),
                "format": p.suffix}

    def prepare(self, path: str, output_dir: str) -> str:
        # Derive the target WAV path; actual resampling and
        # noise reduction (e.g. via ffmpeg) would happen here.
        out_path = str(Path(output_dir) / f"{Path(path).stem}.wav")
        return out_path
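A minimal usage sketch of the pieces above; meeting.mp3 and /tmp/audio are illustrative paths:

preprocessor = AudioPreprocessor(target_sr=16000)
info = preprocessor.validate("meeting.mp3")  # illustrative path
if info["valid"]:
    wav_path = preprocessor.prepare("meeting.mp3", "/tmp/audio")
    config = TranscriptionConfig(language="en",
                                 enable_timestamps=True)
    print(f"Prepared {wav_path} ({info['size_mb']} MB)")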

Real-World Examples

# Builds on TranscriptionConfig, TranscriptSegment, and
# AudioPreprocessor defined in Basic Usage above.

class TranscriptionPipeline:
    def __init__(self, config: TranscriptionConfig,
                 transcribe_fn=None):
        # transcribe_fn lets callers plug in any backend that
        # maps an audio path to a list of TranscriptSegment.
        self.config = config
        self.transcribe_fn = transcribe_fn
        self.preprocessor = AudioPreprocessor()

    def transcribe(self, audio_path: str
                   ) -> list[TranscriptSegment]:
        info = self.preprocessor.validate(audio_path)
        if not info["valid"]:
            return []
        if self.transcribe_fn:
            return self.transcribe_fn(audio_path)
        # Stub result so the pipeline runs without a backend.
        return [TranscriptSegment(text="Transcription",
                                  start_time=0.0,
                                  end_time=1.0)]

    def to_srt(self, segments: list[TranscriptSegment]
               ) -> str:
        # Emit numbered SRT blocks: index, time range, text,
        # then a blank separator line.
        lines = []
        for i, seg in enumerate(segments, 1):
            start = self._format_time(seg.start_time)
            end = self._format_time(seg.end_time)
            lines.append(f"{i}")
            lines.append(f"{start} --> {end}")
            lines.append(seg.text)
            lines.append("")
        return "\n".join(lines)

    def _format_time(self, seconds: float) -> str:
        # SRT timestamps use HH:MM:SS,mmm with a comma
        # before the milliseconds.
        h = int(seconds // 3600)
        m = int((seconds % 3600) // 60)
        s = int(seconds % 60)
        ms = int((seconds % 1) * 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    def batch_transcribe(self, paths: list[str]
                         ) -> list[dict]:
        # Process each file independently so one bad input
        # cannot abort the whole batch.
        results = []
        for path in paths:
            segments = self.transcribe(path)
            full_text = " ".join(s.text for s in segments)
            results.append({"file": path,
                            "text": full_text,
                            "segments": len(segments)})
        return results
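To see the pipeline end to end, wire it to a stand-in backend; fake_transcribe below is a placeholder for a real transcription call:

def fake_transcribe(path: str) -> list[TranscriptSegment]:
    # Canned segments standing in for a real backend.
    return [TranscriptSegment("Hello everyone.", 0.0, 1.5, "S1"),
            TranscriptSegment("Let's begin.", 1.8, 3.0, "S2")]

pipeline = TranscriptionPipeline(TranscriptionConfig(),
                                 transcribe_fn=fake_transcribe)
print(pipeline.to_srt(pipeline.transcribe("meeting.wav")))
# 1
# 00:00:00,000 --> 00:00:01,500
# Hello everyone.
# ...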

Advanced Tips

Split long audio files into chunks at silence boundaries to improve transcription accuracy and enable parallel processing. Use language detection on the first segment to automatically select the correct transcription model for multilingual content. Post-process transcriptions with a language model to correct common recognition errors and add punctuation.
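For the first tip, one way to find silence boundaries is pydub's silence detection. This sketch assumes pydub is installed (and ffmpeg for non-WAV input); the thresholds are starting points to tune per recording:

from pydub import AudioSegment
from pydub.silence import detect_nonsilent

def chunk_at_silence(path: str, min_silence_ms: int = 700,
                     silence_thresh_db: int = -40):
    # Locate spans of speech, then slice around them with a
    # little padding so words at the edges are not clipped.
    audio = AudioSegment.from_file(path)
    spans = detect_nonsilent(audio,
                             min_silence_len=min_silence_ms,
                             silence_thresh=silence_thresh_db)
    pad = 200  # ms of context on each side
    return [audio[max(0, start - pad):end + pad]
            for start, end in spans]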

When to Use It?

Use Cases

Build an automated meeting notes system that transcribes recordings with speaker labels and timestamps. Create a media captioning pipeline that generates SRT subtitle files from video audio tracks. Implement a voice search feature that transcribes spoken queries for text-based retrieval.

Related Topics

Automatic speech recognition, speaker diarization, subtitle generation, audio processing pipelines, and Whisper model integration.

Important Notes

Requirements

A speech recognition model or API such as Whisper or cloud transcription services. Audio processing tools for format conversion and noise reduction. Storage for transcription output files and intermediate audio.

Usage Recommendations

Do: preprocess audio to normalize volume and reduce noise before transcription. Validate audio file format and duration before submitting to the transcription API (a duration-check sketch follows below). Include timestamps in output for downstream applications like subtitle generation.

Don't: transcribe audio without checking file integrity, which causes silent failures. Send extremely long audio files as single requests without chunking. Assume perfect accuracy from any model without reviewing output.
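The duration check recommended above can use the standard-library wave module for PCM WAV files; other formats need a probe such as ffprobe. The request cap below is an assumed value to tune per API:

import wave

def wav_duration_seconds(path: str) -> float:
    # PCM WAV only; compressed formats need ffprobe instead.
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()

MAX_DURATION_S = 4 * 3600  # assumed per-request cap
if wav_duration_seconds("meeting.wav") > MAX_DURATION_S:
    raise ValueError("Chunk the file before transcription")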

Limitations

Transcription accuracy varies with audio quality, accents, and background noise levels. Speaker diarization is less reliable with more than five speakers or when speakers have similar voices. Real-time transcription introduces latency that may not suit interactive applications.