Speech To Text
Automate and integrate speech-to-text conversion for accurate and fast audio transcription
Speech To Text is a community skill for converting spoken audio into written text, covering audio preprocessing, transcription model integration, speaker diarization, timestamp alignment, and batch processing workflows for accurate speech recognition.
What Is This?
Overview
Speech To Text provides patterns for building transcription pipelines that convert audio recordings into structured text output. It covers audio format conversion and noise reduction preprocessing, transcription API integration with configurable language and model settings, speaker diarization that identifies who spoke each segment, word-level timestamp alignment for subtitle generation, and batch processing for handling large audio collections. The skill enables developers to build reliable speech recognition features for applications ranging from meeting notes to media captioning.
Who Should Use This
This skill serves developers building transcription features into communication tools, media companies producing captions and subtitles for video content, and teams creating searchable archives of recorded meetings and calls.
Why Use It?
Problems It Solves
Manual transcription is slow, expensive, and does not scale for large audio volumes. Raw transcription without speaker labels makes multi-person conversations difficult to follow. Audio with background noise or varying recording quality produces unreliable transcriptions without preprocessing. Subtitle generation requires word-level timing, which not all transcription APIs provide.
Core Highlights
Audio preprocessing normalizes format, sample rate, and volume levels before transcription. Model selection routes audio to appropriate engines based on language and quality requirements. Speaker diarization segments output by speaker identity for readable multi-person transcripts. Timestamp alignment provides word and sentence timing for subtitle and caption generation.
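The preprocessing step described above often shells out to ffmpeg for format and sample-rate conversion. A minimal sketch, assuming the ffmpeg CLI is installed and on PATH; the helper names (`ffmpeg_command`, `normalize_audio`) are illustrative, not part of the skill:

```python
import subprocess
from pathlib import Path

def ffmpeg_command(src: str, dst: str, sample_rate: int = 16000) -> list[str]:
    """Build the ffmpeg invocation that converts any input to a
    mono WAV at the target sample rate, which most speech models expect."""
    return ["ffmpeg", "-y",           # overwrite existing output
            "-i", src,                # input file in any container/format
            "-ar", str(sample_rate),  # resample to the target rate
            "-ac", "1",               # mix down to a single channel
            dst]

def normalize_audio(src: str, dst_dir: str, sample_rate: int = 16000) -> str:
    """Run the conversion and return the output path (requires ffmpeg)."""
    dst = str(Path(dst_dir) / f"{Path(src).stem}.wav")
    subprocess.run(ffmpeg_command(src, dst, sample_rate), check=True)
    return dst
```

Separating command construction from execution keeps the flag logic testable without invoking ffmpeg itself.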
How to Use It?
Basic Usage
```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TranscriptionConfig:
    language: str = "en"
    model: str = "whisper-large-v3"
    enable_timestamps: bool = True
    enable_diarization: bool = False


@dataclass
class TranscriptSegment:
    text: str
    start_time: float = 0.0
    end_time: float = 0.0
    speaker: str = ""


class AudioPreprocessor:
    def __init__(self, target_sr: int = 16000):
        self.target_sr = target_sr

    def validate(self, path: str) -> dict:
        p = Path(path)
        if not p.exists():
            return {"valid": False, "error": "Not found"}
        size_mb = p.stat().st_size / (1024 * 1024)
        return {"valid": True, "size_mb": round(size_mb, 2),
                "format": p.suffix}

    def prepare(self, path: str, output_dir: str) -> str:
        # Placeholder: real conversion (resampling, mono mixdown,
        # noise reduction) would happen here; this only derives
        # the output path for the converted WAV file.
        out_path = str(Path(output_dir) / f"{Path(path).stem}.wav")
        return out_path
```

Real-World Examples
```python
class TranscriptionPipeline:
    def __init__(self, config: TranscriptionConfig,
                 transcribe_fn=None):
        self.config = config
        self.transcribe_fn = transcribe_fn
        self.preprocessor = AudioPreprocessor()

    def transcribe(self, audio_path: str) -> list[TranscriptSegment]:
        info = self.preprocessor.validate(audio_path)
        if not info["valid"]:
            return []
        if self.transcribe_fn:
            return self.transcribe_fn(audio_path)
        # Fallback stub when no transcription engine is injected.
        return [TranscriptSegment(text="Transcription",
                                  start_time=0.0,
                                  end_time=1.0)]

    def to_srt(self, segments: list[TranscriptSegment]) -> str:
        lines = []
        for i, seg in enumerate(segments, 1):
            start = self._format_time(seg.start_time)
            end = self._format_time(seg.end_time)
            lines.append(f"{i}")
            lines.append(f"{start} --> {end}")
            lines.append(seg.text)
            lines.append("")
        return "\n".join(lines)

    def _format_time(self, seconds: float) -> str:
        h = int(seconds // 3600)
        m = int((seconds % 3600) // 60)
        s = int(seconds % 60)
        ms = int((seconds % 1) * 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    def batch_transcribe(self, paths: list[str]) -> list[dict]:
        results = []
        for path in paths:
            segments = self.transcribe(path)
            full_text = " ".join(s.text for s in segments)
            results.append({"file": path,
                            "text": full_text,
                            "segments": len(segments)})
        return results
```

Advanced Tips
Split long audio files into chunks at silence boundaries to improve transcription accuracy and enable parallel processing. Use language detection on the first segment to automatically select the correct transcription model for multilingual content. Post-process transcriptions with a language model to correct common recognition errors and add punctuation.
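The silence-boundary chunking tip can be sketched with a naive amplitude threshold over raw samples. This is a simplification for illustration; production code would compute energy over fixed-size frames of decoded audio rather than individual samples:

```python
def split_at_silence(samples: list[float], threshold: float = 0.02,
                     min_silence: int = 3) -> list[list[float]]:
    """Split a sequence of amplitude samples into chunks wherever a
    run of near-silent samples appears, so each chunk can be
    transcribed independently (and in parallel)."""
    chunks, current, quiet_run = [], [], 0
    for s in samples:
        quiet_run = quiet_run + 1 if abs(s) < threshold else 0
        current.append(s)
        # Close the chunk once the silence run is long enough,
        # dropping the trailing silence from the emitted chunk.
        if quiet_run >= min_silence and len(current) > quiet_run:
            chunks.append(current[:-quiet_run])
            current, quiet_run = [], 0
    if current:
        chunks.append(current)
    return chunks
```

Cutting at silence rather than at fixed intervals avoids splitting words across chunk boundaries, which would otherwise degrade accuracy at every seam.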
When to Use It?
Use Cases
Build an automated meeting notes system that transcribes recordings with speaker labels and timestamps. Create a media captioning pipeline that generates SRT subtitle files from video audio tracks. Implement a voice search feature that transcribes spoken queries for text-based retrieval.
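The captioning use case maps directly onto SRT's plain-text format. A self-contained sketch, taking (start, end, text) tuples as an assumed input shape rather than the skill's `TranscriptSegment` type:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm string SRT requires."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def make_srt(segments: list[tuple[float, float, str]]) -> str:
    """Render (start, end, text) segments as numbered SRT cues."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(
            f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(blocks)
```

Rounding to whole milliseconds before splitting into fields avoids the floating-point drift that per-field `int()` truncation can introduce.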
Related Topics
Automatic speech recognition, speaker diarization, subtitle generation, audio processing pipelines, and Whisper model integration.
Important Notes
Requirements
A speech recognition model or API such as Whisper or cloud transcription services. Audio processing tools for format conversion and noise reduction. Storage for transcription output files and intermediate audio.
Usage Recommendations
Do: preprocess audio to normalize volume and reduce noise before transcription. Validate audio file format and duration before submitting to the transcription API. Include timestamps in output for downstream applications like subtitle generation.
Don't: transcribe audio without checking file integrity, which causes silent failures. Send extremely long audio files as single requests without chunking. Assume perfect accuracy from any model without reviewing output.
Limitations
Transcription accuracy varies with audio quality, accents, and background noise levels. Speaker diarization is less reliable with more than five speakers or when speakers have similar voices. Real-time transcription introduces latency that may not suit interactive applications.