Text To Speech

Convert text to natural-sounding speech with seamless automation and integration

Text To Speech is a community skill for converting written text into spoken audio, covering voice selection, prosody control, SSML markup, audio format configuration, and batch synthesis workflows for producing natural-sounding speech output.

What Is This?

Overview

Text To Speech provides patterns for building speech synthesis pipelines that convert text into audio output. It covers voice selection from provider catalogs with preview and comparison tools, prosody control for adjusting rate, pitch, and emphasis, SSML markup for fine-grained pronunciation and pause control, audio format configuration for output quality and file size, and batch processing for generating multiple audio files from text inputs. The skill enables developers to add natural-sounding speech output to applications for accessibility, content delivery, and user interaction.

Who Should Use This

This skill serves developers adding voice output to applications and chatbots, content teams producing audio versions of written material, and accessibility engineers building screen readers and voice interfaces.

Why Use It?

Problems It Solves

Recording human voice for every content update is costly and time-consuming. Default TTS voices sound robotic without prosody tuning for natural speech patterns. Pronunciation of technical terms, abbreviations, and proper nouns needs explicit guidance. Generating audio at scale for content libraries requires automated pipelines rather than manual recording.

Core Highlights

Voice catalog management lists and compares available voices with sample audio. Prosody configuration adjusts speaking rate, pitch, and volume for natural delivery. SSML support provides markup for pauses, emphasis, and pronunciation overrides. Batch synthesis processes text collections into organized audio file sets.

How to Use It?

Basic Usage

from dataclasses import dataclass

@dataclass
class VoiceConfig:
    voice_id: str
    language: str = "en-US"
    speaking_rate: float = 1.0
    pitch: float = 0.0
    volume_gain_db: float = 0.0
    output_format: str = "mp3"

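class TTSClient:
    def __init__(self, config: VoiceConfig,
                 api_fn=None):
        self.config = config
        # api_fn(text, config) -> bytes wraps the provider SDK call.
        self.api_fn = api_fn

    def synthesize(self, text: str) -> bytes:
        if self.api_fn is None:
            # Fail loudly instead of writing an empty audio file.
            raise RuntimeError("TTSClient needs an api_fn backend")
        return self.api_fn(text, self.config)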

    def synthesize_to_file(self, text: str,
                           output_path: str) -> dict:
        audio = self.synthesize(text)
        with open(output_path, "wb") as f:
            f.write(audio)
        return {"path": output_path,
                "size_bytes": len(audio),
                "voice": self.config.voice_id,
                "format": self.config.output_format}

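The client above only needs a callable with the shape api_fn(text, config) -> bytes. For exercising a pipeline without network access or API cost, a stand-in backend can be handy. The sketch below is one possible version, not part of the skill itself: silent_wav_backend is a hypothetical name, the 60 ms per character figure is an arbitrary placeholder, and a SimpleNamespace stands in for VoiceConfig so the block runs on its own.

```python
import io
import wave
from types import SimpleNamespace

def silent_wav_backend(text: str, config) -> bytes:
    """Hypothetical offline backend: silent WAV audio sized to the text.

    Assumes roughly 60 ms of audio per character at speaking_rate 1.0.
    That figure is a placeholder, not a measured speech rate.
    """
    sample_rate = 16000
    seconds = len(text) * 0.06 / max(config.speaking_rate, 0.1)
    n_frames = int(sample_rate * seconds)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)           # mono
        wav.setsampwidth(2)           # 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * n_frames)
    return buf.getvalue()

# Stand-in for VoiceConfig so the sketch is self-contained.
config = SimpleNamespace(voice_id="test-voice", speaking_rate=1.0)
audio = silent_wav_backend("Hello, world.", config)
```

Passing this function as api_fn should let synthesize_to_file produce playable (silent) files. Since the output is WAV, output_format would need to be set to "wav" for the file extension to match.
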
Real-World Examples

from dataclasses import dataclass, field
from pathlib import Path
from xml.sax.saxutils import escape

@dataclass
class SSMLBuilder:
    parts: list[str] = field(default_factory=list)

    def add_text(self, text: str):
        # Escape &, <, and > so plain text cannot break the SSML document.
        self.parts.append(escape(text))

    def add_pause(self, ms: int):
        self.parts.append(f'<break time="{int(ms)}ms"/>')

    def add_emphasis(self, text: str,
                     level: str = "moderate"):
        self.parts.append(
            f'<emphasis level="{level}">{escape(text)}</emphasis>')

    def build(self) -> str:
        inner = "".join(self.parts)
        return f"<speak>{inner}</speak>"

class BatchTTSProducer:
    def __init__(self, client: TTSClient):
        self.client = client
        self.completed: list[dict] = []

    def process(self, items: list[dict],
               output_dir: str) -> list[dict]:
        out_dir = Path(output_dir)
        out_dir.mkdir(parents=True, exist_ok=True)
        ext = self.client.config.output_format
        for i, item in enumerate(items):
            name = item.get("name", f"audio_{i:04d}")
            path = str(out_dir / f"{name}.{ext}")
            result = self.client.synthesize_to_file(
                item["text"], path)
            result["name"] = name
            self.completed.append(result)
        return self.completed

    def summary(self) -> dict:
        total = sum(r["size_bytes"] for r in self.completed)
        return {"files": len(self.completed),
                "total_bytes": total}

Advanced Tips

Use SSML markup for technical content where default pronunciation of abbreviations and symbols is incorrect. Split long text into paragraphs and synthesize each separately, then concatenate audio for better prosody at section boundaries. Cache generated audio keyed by text hash and voice configuration to avoid resynthesizing identical content.
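
The caching tip can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed design: cache_key and synthesize_cached are hypothetical helper names, voice settings are passed as a plain dict, and synth_fn is any callable that returns audio bytes. The key hashes both the text and the settings, so changing the voice or speaking rate produces a new cache entry.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def cache_key(text: str, voice_settings: dict) -> str:
    """Stable hash of the text plus every voice setting."""
    payload = json.dumps({"text": text, "voice": voice_settings},
                         sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def synthesize_cached(text: str, voice_settings: dict,
                      synth_fn, cache_dir: str) -> bytes:
    """Return cached audio when present, otherwise synthesize and store it."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    path = cache / f"{cache_key(text, voice_settings)}.bin"
    if path.exists():
        return path.read_bytes()
    audio = synth_fn(text, voice_settings)
    path.write_bytes(audio)
    return audio

# Demo with a fake synthesizer that records how often it is called.
calls = []

def fake_synth(text, settings):
    calls.append(text)
    return text.encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    settings = {"voice_id": "voice-a", "speaking_rate": 1.0}
    first = synthesize_cached("Hello", settings, fake_synth, tmp)
    second = synthesize_cached("Hello", settings, fake_synth, tmp)
```

With a dataclass such as VoiceConfig, dataclasses.asdict(config) would be one way to obtain the settings dict.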

When to Use It?

Use Cases

Add voice output to a chatbot that reads responses aloud for hands-free interaction. Generate audio versions of blog posts and documentation for users who prefer listening. Build an automated announcement system that converts text alerts into spoken messages delivered over audio channels.

Important Notes

Requirements

You need access to a text-to-speech API such as Google Cloud Text-to-Speech, Amazon Polly, or ElevenLabs, storage for the generated audio files, and a working understanding of SSML markup for controlling pronunciation and prosody.

Usage Recommendations

Do: preview voice selections with representative content before committing to a voice for production use. Use SSML for content with technical terms, numbers, or abbreviations that need pronunciation guidance. Set appropriate audio format and bitrate for the delivery platform.
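
As an illustration of SSML pronunciation guidance, the sketch below wraps terms in the sub and say-as elements from the W3C SSML specification. The helper names are hypothetical, and the interpret-as value "characters" is an assumption: supported values differ between providers, so check the documentation for the service in use.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

def sub(text: str, alias: str) -> str:
    """Speak `alias` in place of the written `text`."""
    return f"<sub alias={quoteattr(alias)}>{escape(text)}</sub>"

def say_as(text: str, interpret_as: str) -> str:
    """Hint how `text` should be read, e.g. spelled out letter by letter."""
    return (f"<say-as interpret-as={quoteattr(interpret_as)}>"
            f"{escape(text)}</say-as>")

ssml = (
    "<speak>"
    + escape("Deploy with ")
    + sub("k8s", "Kubernetes")
    + escape(" & call the ")
    + say_as("API", "characters")
    + "</speak>"
)

# Sanity check: raises ParseError if the markup is not well-formed XML.
ET.fromstring(ssml)
```

Escaping the plain-text segments matters because a stray & or < in the source content would otherwise make the whole request invalid.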

Don't: synthesize an entire document in a single API call, which can exceed the provider's character limit. Don't use the maximum speaking rate for content that users need to comprehend carefully. Don't ignore audio file sizes when generating large batches, since storage costs accumulate.
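
One way to stay under a per-request limit is to split on paragraph boundaries first and fall back to sentence boundaries only for oversized paragraphs. The sketch below assumes a limit of 3000 characters purely as a placeholder. Real limits vary by provider, and some count bytes rather than characters. The function names are hypothetical.

```python
import re

def _pack(pieces, max_chars, sep):
    """Greedily join pieces with sep without exceeding max_chars."""
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

def split_for_synthesis(text: str, max_chars: int = 3000) -> list[str]:
    """Split text into chunks of at most max_chars characters."""
    pieces = []
    for para in re.split(r"\n\s*\n", text):
        para = para.strip()
        if not para:
            continue
        if len(para) <= max_chars:
            pieces.append(para)
            continue
        # Oversized paragraph: fall back to sentence boundaries.
        sentences = re.split(r"(?<=[.!?])\s+", para)
        for group in _pack(sentences, max_chars, " "):
            # Last resort: hard-slice a single sentence that is still too long.
            pieces.extend(group[i:i + max_chars]
                          for i in range(0, len(group), max_chars))
    # Merge short neighbouring paragraphs to reduce the number of requests.
    return _pack(pieces, max_chars, "\n\n")

chunks = split_for_synthesis("One. Two.\n\nThree.", max_chars=10)
```

Each chunk can then be synthesized separately and the resulting audio concatenated in order.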

Limitations

Synthesized speech quality varies significantly between voices and providers. Emotional expression in generated speech remains limited compared to human narration. Long-form synthesis can produce monotonous output without careful prosody configuration.
