Runbook Generator

Automate and integrate Runbook Generator to streamline operational procedures

Runbook Generator is a community skill for creating operational runbooks, covering incident response procedures, troubleshooting guides, escalation workflows, automated diagnostics, and documentation templates for operations teams.

What Is This?

Overview

Runbook Generator provides tools for creating structured operational runbooks that guide teams through incident response and maintenance tasks. It covers five areas: incident response procedures that document step-by-step actions for common failure scenarios, with diagnostic commands and remediation steps; troubleshooting guides that build decision trees for diagnosing root causes from observed symptoms; escalation workflows that define when and how to escalate issues to specialized teams; automated diagnostics that embed executable commands for gathering system state during incidents; and documentation templates that standardize runbook format across teams. Together, these help operations teams respond to incidents consistently.

Who Should Use This

This skill serves site reliability engineers creating incident response documentation, operations teams standardizing troubleshooting procedures, and engineering managers building on-call reference materials.

Why Use It?

Problems It Solves

On-call engineers encounter unfamiliar systems during incidents and lack documented steps for diagnosis and remediation. Tribal knowledge about system behavior lives in individual memories rather than accessible documentation. Incident response quality varies depending on which engineer is on call. Escalation decisions are delayed when criteria and contact paths are not clearly defined.

Core Highlights

The procedure builder creates step-by-step incident response guides. The decision tree generator builds symptom-based diagnostic flows. The escalation mapper defines criteria and paths for issue handoff. The command embedder adds executable diagnostics to runbook steps.

How to Use It?

Basic Usage

from dataclasses import dataclass, field


@dataclass
class Step:
    action: str
    command: str = ''
    expected: str = ''
    on_failure: str = ''


@dataclass
class Runbook:
    title: str
    service: str
    severity: str
    steps: list = field(default_factory=list)
    escalation: str = ''

    def add_step(self, step: Step):
        self.steps.append(step)

    def to_markdown(self):
        # Header with the runbook title, affected service, and severity.
        md = (f'# {self.title}\n\n'
              f'**Service:** {self.service}\n'
              f'**Severity:** {self.severity}\n\n')
        for i, s in enumerate(self.steps, 1):
            md += f'## Step {i}: {s.action}\n\n'
            if s.command:
                md += f'```bash\n{s.command}\n```\n\n'
            if s.expected:
                md += f'Expected: {s.expected}\n\n'
            # Render the failure hint and escalation note only when they are set.
            if s.on_failure:
                md += f'On failure: {s.on_failure}\n\n'
        if self.escalation:
            md += f'## Escalation\n\n{self.escalation}\n'
        return md


rb = Runbook('DB Connection Fix', 'api-server', 'P1')
rb.add_step(Step(
    'Check DB status',
    'pg_isready -h db01',
    'accepting connections'))
# The query is wrapped in psql so the rendered bash block is runnable as a shell command.
rb.add_step(Step(
    'Check connections',
    'psql -h db01 -c "SELECT count(*) FROM pg_stat_activity;"',
    'Under max_connections'))
print(rb.to_markdown())

Real-World Examples

from dataclasses import dataclass


@dataclass
class DiagNode:
    question: str
    command: str = ''
    yes_action: str = ''
    no_action: str = ''
    yes_next: str = ''
    no_next: str = ''


class DiagTree:
    def __init__(self, name: str):
        self.name = name
        self.nodes = {}

    def add_node(self, node_id: str, node: DiagNode):
        self.nodes[node_id] = node

    def render(self) -> str:
        lines = [f'# {self.name}', '']
        for nid, node in self.nodes.items():
            lines.append(f'## {nid}: {node.question}')
            if node.command:
                lines.append(f'Run: `{node.command}`')
            lines.append(f'- Yes: {node.yes_action}')
            lines.append(f'- No: {node.no_action}')
            lines.append('')
        return '\n'.join(lines)


tree = DiagTree('High Latency')
tree.add_node('A', DiagNode(
    'Is CPU above 80%?',
    'top -bn1',
    'Scale horizontally',
    'Check step B'))
tree.add_node('B', DiagNode(
    'Is memory full?',
    'free -h',
    'Restart service',
    'Check network'))
print(tree.render())
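
The DiagNode fields yes_next and no_next are declared but not used by render(). One plausible use, sketched here as an assumption rather than the skill's actual behavior, is to follow them as links between nodes. The Node class and the walk() helper below are hypothetical names for illustration; an empty link is assumed to end the walk.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical, trimmed-down stand-in for DiagNode.
    question: str
    yes_action: str = ''
    no_action: str = ''
    yes_next: str = ''  # ID of the node to visit after a "yes"; '' ends the walk
    no_next: str = ''   # ID of the node to visit after a "no"; '' ends the walk


def walk(nodes, start, answers):
    """Follow yes/no answers from `start` and return the actions encountered."""
    actions = []
    seen = set()
    current = start
    while current:
        if current in seen:
            raise ValueError(f'cycle detected at node {current!r}')
        seen.add(current)
        node = nodes[current]
        yes = answers[current]
        actions.append(node.yes_action if yes else node.no_action)
        current = node.yes_next if yes else node.no_next
    return actions


nodes = {
    'A': Node('Is CPU above 80%?', 'Scale horizontally', 'Check step B', no_next='B'),
    'B': Node('Is memory full?', 'Restart service', 'Check network'),
}
actions = walk(nodes, 'A', {'A': False, 'B': True})
print(actions)
```

The cycle check guards against two nodes pointing at each other, which would otherwise loop forever.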

Advanced Tips

Include specific diagnostic commands in each step so engineers can copy and run them directly during incidents. Add expected output descriptions so responders can verify each step succeeded. Link related runbooks to create a network of procedures that cover complex multi-service incidents.
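
To illustrate how an expected-output description could be checked mechanically, here is a minimal sketch using Python's standard subprocess module. The run_check() helper is a hypothetical name, and treating "expected" as a plain substring of the command output is an assumption. The echo command stands in for a real diagnostic such as pg_isready.

```python
import subprocess


def run_check(command, expected, timeout=10):
    """Run a shell command and report whether `expected` appears in its output."""
    # shell=True executes the string as-is, so only pass commands from trusted runbooks.
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout)
    output = (result.stdout + result.stderr).strip()
    return expected in output, output


# `echo` is a placeholder for a real diagnostic command.
ok, output = run_check('echo accepting connections', 'accepting connections')
print('PASS' if ok else 'FAIL', '-', output)
```

A substring match is deliberately loose, since command output often varies between environments.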

When to Use It?

Use Cases

Create an incident response runbook for database connection failures with diagnostic commands. Build a decision tree for diagnosing high latency issues across application tiers. Generate escalation procedures with clear criteria and contact paths.
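
As a rough sketch of what "clear criteria and contact paths" might look like in code, the example below models escalation as time-based rules per severity. EscalationRule, contacts_due(), the thresholds, and the contact names are all hypothetical and are not part of the skill.

```python
from dataclasses import dataclass


@dataclass
class EscalationRule:
    severity: str       # e.g. 'P1'
    after_minutes: int  # minutes since the incident started
    contact: str        # who to bring in once the threshold is reached


def contacts_due(rules, severity, minutes_elapsed):
    """Return the contacts that should already be engaged, earliest first."""
    ordered = sorted(rules, key=lambda r: r.after_minutes)
    return [r.contact for r in ordered
            if r.severity == severity and minutes_elapsed >= r.after_minutes]


# Illustrative values only; real thresholds depend on the team's policy.
rules = [
    EscalationRule('P1', 0, 'primary on-call'),
    EscalationRule('P1', 15, 'database team'),
    EscalationRule('P1', 30, 'engineering manager'),
    EscalationRule('P2', 60, 'database team'),
]
due = contacts_due(rules, 'P1', 20)
print(due)
```

Keeping the rules as data makes the criteria easy to review and to render into a runbook's escalation section.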

Related Topics

Runbooks, incident response, SRE, operations, troubleshooting, on-call procedures, and documentation.

Important Notes

Requirements

Accurate runbook content requires knowledge of the system architecture and its common failure modes. Authors need access to the diagnostic commands and monitoring tools referenced in runbook steps. A review process with subject matter experts should validate runbook accuracy.

Usage Recommendations

Do: test runbook steps in a non-production environment to verify commands work as documented. Update runbooks after each incident to incorporate lessons learned. Include rollback steps for any remediation action that modifies system state.

Don't: write runbooks that assume specific expertise, since on-call engineers may be unfamiliar with the service. Include credentials or secrets directly in runbook text. Let runbooks become stale by skipping reviews after infrastructure changes.
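
The rule against embedding credentials can be partly enforced with a simple scan before a runbook is published. The sketch below is an assumption about how such a check might look. The pattern and the find_secrets() name are hypothetical, and a regex like this only catches obvious key=value assignments.

```python
import re

# Hypothetical pattern: flags assignments such as PASSWORD=..., token: ..., api_key=...
SECRET_PATTERN = re.compile(
    r'(password|passwd|secret|token|api[_-]?key)\s*[=:]\s*\S+',
    re.IGNORECASE)


def find_secrets(text):
    """Return (line_number, line) pairs that look like hard-coded secrets."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), 1):
        if SECRET_PATTERN.search(line):
            hits.append((line_no, line.strip()))
    return hits


# Made-up draft runbook text for illustration.
draft = (
    'pg_isready -h db01\n'
    'export PGPASSWORD=hunter2\n'
    'psql -h db01 -U app  # password is read from the secrets manager\n'
)
hits = find_secrets(draft)
for line_no, line in hits:
    print(f'line {line_no}: {line}')
```

A check like this complements human review but does not replace it, because secrets that do not follow a key=value shape will pass unnoticed.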

Limitations

Runbooks cover known failure scenarios and cannot anticipate every possible incident. Complex incidents may require deviating from documented steps based on the specific circumstances. Command output may differ across environments, so engineers may need to adapt documented steps accordingly.