Incident Commander
Streamline emergency response and IT operations with automated incident management and communication
Incident Commander is a community skill for managing production incidents with structured response processes, covering incident declaration, severity classification, communication management, timeline tracking, and post-incident review for engineering operations teams.
What Is This?
Overview
Incident Commander provides frameworks for coordinating production incident response from detection through resolution. It covers five areas. Incident declaration creates a structured incident record with a description, affected services, and initial responders. Severity classification assigns an incident level based on user impact, extent of service degradation, and revenue effect. Communication management generates status updates for stakeholders, customers, and engineering teams at appropriate intervals. Timeline tracking records key events, actions taken, and decisions made during the incident for later analysis. Post-incident review produces a structured retrospective document with root cause analysis and action items. Together these let operations teams respond to incidents systematically under pressure.
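The severity classification described above can be sketched as a small rule function. This is only an illustration: the `Impact` fields, their units, and every threshold below are assumptions made for the example, not values defined by the skill.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = 'critical'
    SEV2 = 'major'
    SEV3 = 'minor'


@dataclass
class Impact:
    """Hypothetical impact snapshot; field names and units are assumptions."""
    users_affected_pct: float      # share of users seeing errors, 0-100
    core_service_down: bool        # True if a core service is fully unavailable
    revenue_loss_per_hour: float   # estimated loss in the team's own currency


def classify(impact: Impact) -> Severity:
    # Illustrative thresholds only; tune them to your own service-level objectives.
    if (impact.core_service_down
            or impact.users_affected_pct >= 50
            or impact.revenue_loss_per_hour >= 10_000):
        return Severity.SEV1
    if impact.users_affected_pct >= 5 or impact.revenue_loss_per_hour >= 1_000:
        return Severity.SEV2
    return Severity.SEV3
```

Writing the rules down as code, even roughly, gives responders one shared definition to apply, which is one way to reduce inconsistent classification between engineers.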
Who Should Use This
This skill serves on-call engineers responding to production incidents, engineering managers coordinating incident response, and SRE teams building incident management processes.
Why Use It?
Problems It Solves
Incident response without a structured process leads to chaotic coordination where multiple engineers duplicate investigation efforts. Severity classification inconsistencies cause either over-escalation of minor issues or under-response to critical outages. Communication gaps leave stakeholders uninformed and generate escalation pressure that distracts responders. Post-incident reviews lack actionable detail when the timeline was not recorded during the event.
Core Highlights
The incident creator declares incidents with a severity, affected services, and assigned roles. The status updater generates stakeholder communications at configured intervals. The timeline recorder captures events, decisions, and actions during the response. The retrospective generator produces structured post-incident review documents.
How to Use It?
Basic Usage
from datetime import datetime
from enum import Enum


class Severity(Enum):
    SEV1 = 'critical'
    SEV2 = 'major'
    SEV3 = 'minor'


class Incident:
    def __init__(self, title: str, severity: Severity, services: list[str]):
        self.title = title
        self.severity = severity
        self.services = services
        self.created = datetime.now()
        self.timeline = []          # chronological list of event dicts
        self.status = 'active'
        self.commander = None       # set once an incident commander is assigned

    def add_event(self, description: str, author: str):
        # Record events as they happen so the review has a real timeline.
        self.timeline.append({
            'time': datetime.now().isoformat(),
            'event': description,
            'author': author})

    def resolve(self, resolution: str):
        self.status = 'resolved'
        # Fall back to a placeholder author if no commander was assigned.
        self.add_event(
            f'Resolved: {resolution}',
            self.commander or 'unassigned')

Real-World Examples
class StatusUpdater:
    # One message template per incident phase; generate() fills the placeholders.
    TEMPLATES = {
        'investigating': (
            'We are investigating {title}. Affected services: {svcs}.'
            ' Severity: {sev}. Next update in {interval} min.'),
        'identified': (
            'Root cause identified for {title}. Working on fix. ETA: {eta}.'),
        'resolved': (
            '{title} has been resolved. Resolution: {resolution}.'
            ' Duration: {duration}.')}

    def generate(self, incident: Incident, phase: str, **kwargs) -> str:
        # kwargs supplies the phase-specific fields:
        # interval, eta, resolution, or duration.
        template = self.TEMPLATES[phase]
        return template.format(
            title=incident.title,
            svcs=', '.join(incident.services),
            sev=incident.severity.value,
            **kwargs)

    def retrospective(self, incident: Incident) -> dict:
        # Duration runs from incident creation to the moment this is called.
        return {
            'title': incident.title,
            'severity': incident.severity.value,
            'duration': str(datetime.now() - incident.created),
            'timeline': incident.timeline,
            'services': incident.services}

Advanced Tips
Define escalation thresholds that automatically increase severity when resolution time exceeds defined limits per severity level. Pre-configure communication channels for each severity tier so status updates reach the right audiences without manual routing. Practice incident response regularly with game day exercises to build team familiarity with the process.
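The first two tips can be sketched with plain lookup tables. Everything here is hypothetical: the time limits, the channel names, and the helper names `escalated_severity` and `channels_for` are invented for illustration and are not part of the skill. The sketch assumes the caller tracks when the incident entered its current severity level, so that one overdue incident escalates one step at a time.

```python
from datetime import datetime, timedelta
from enum import Enum


class Severity(Enum):
    SEV1 = 'critical'
    SEV2 = 'major'
    SEV3 = 'minor'


# Hypothetical limits: how long an incident may sit at a level before escalating.
ESCALATE_AFTER = {
    Severity.SEV3: timedelta(hours=4),
    Severity.SEV2: timedelta(hours=1),
}

# Next level up; SEV1 is already the highest, so it has no entry.
NEXT_LEVEL = {Severity.SEV3: Severity.SEV2, Severity.SEV2: Severity.SEV1}

# Hypothetical channel names per severity tier.
CHANNELS = {
    Severity.SEV1: ['#incident-war-room', 'exec-email', 'status-page'],
    Severity.SEV2: ['#incidents', 'status-page'],
    Severity.SEV3: ['#incidents'],
}


def escalated_severity(severity: Severity,
                       entered_level_at: datetime,
                       now: datetime) -> Severity:
    """Apply at most one escalation step if the level's time limit has passed."""
    limit = ESCALATE_AFTER.get(severity)
    if limit is not None and now - entered_level_at > limit:
        return NEXT_LEVEL[severity]
    return severity


def channels_for(severity: Severity) -> list[str]:
    """Look up where status updates for this severity should be sent."""
    return CHANNELS[severity]
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the function, keeps the escalation rule easy to test with fixed timestamps.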
When to Use It?
Use Cases
Coordinate response to a production outage with structured roles and communication cadence. Generate stakeholder status updates during an active incident at regular intervals. Produce a post-incident retrospective document with timeline and action items.
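For the third use case, one possible way to turn a retrospective dictionary into a readable document is sketched below. It assumes a dictionary with the same keys as the `retrospective()` output shown earlier; the `render_retrospective` function, the `action_items` argument, and the sample values are all made up for this example.

```python
def render_retrospective(retro: dict, action_items: list[str]) -> str:
    """Render a retrospective dict as a plain-text review document."""
    lines = [
        f"Post-Incident Review: {retro['title']}",
        f"Severity: {retro['severity']}",
        f"Duration: {retro['duration']}",
        f"Affected services: {', '.join(retro['services'])}",
        '',
        'Timeline',
    ]
    for entry in retro['timeline']:
        lines.append(f"  {entry['time']}  {entry['author']}: {entry['event']}")
    lines.append('')
    lines.append('Action items')
    for number, item in enumerate(action_items, start=1):
        lines.append(f'  {number}. {item}')
    return '\n'.join(lines)


# Example input with made-up values, shaped like the retrospective() output.
example = {
    'title': 'Checkout API errors',
    'severity': 'major',
    'duration': '0:42:10',
    'services': ['checkout', 'payments'],
    'timeline': [
        {'time': '2024-01-01T12:00:00', 'author': 'alice',
         'event': 'Incident declared'},
        {'time': '2024-01-01T12:42:10', 'author': 'alice',
         'event': 'Resolved: rolled back deploy'},
    ],
}
print(render_retrospective(example, ['Add canary stage to deploy pipeline']))
```

Action items are passed in separately because they come out of the review discussion rather than from the recorded timeline.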
Related Topics
Incident management, site reliability engineering, on-call response, post-mortems, communication plans, severity classification, and operations.
Important Notes
Requirements
Communication channels configured for incident notifications. On-call rotation with defined escalation paths. Monitoring and alerting system for incident detection and severity assessment.
Usage Recommendations
Do: assign a single incident commander who owns coordination and communication for each incident. Record timeline entries in real time rather than reconstructing after resolution. Conduct blameless retrospectives focused on process improvement.
Don't: change incident severity without documenting the reason and notifying stakeholders. Skip post-incident reviews for lower-severity incidents that still have learning value. Allow multiple engineers to communicate externally during an incident, which leads to conflicting messages.
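The rule about documenting severity changes can be enforced in code rather than left to memory. The following is a minimal sketch under stated assumptions: `SeverityChangeLog` and its `pending_notifications` queue are hypothetical names, severities are plain strings for brevity, and actually delivering the notifications is left to the caller.

```python
from datetime import datetime


class SeverityChangeLog:
    """Hypothetical helper that refuses undocumented severity changes."""

    def __init__(self, severity: str):
        self.severity = severity
        self.timeline = []               # same event shape as the incident timeline
        self.pending_notifications = []  # messages still to be sent to stakeholders

    def change_severity(self, new_severity: str, reason: str, author: str) -> None:
        # Reject empty or whitespace-only reasons before changing anything.
        if not reason.strip():
            raise ValueError('a reason is required to change severity')
        old = self.severity
        self.severity = new_severity
        message = f'Severity changed from {old} to {new_severity}: {reason}'
        self.timeline.append({
            'time': datetime.now().isoformat(),
            'event': message,
            'author': author})
        # Queue the message; sending it is the caller's responsibility.
        self.pending_notifications.append(message)
```

Because the reason check runs before any state is modified, a rejected change leaves both the severity and the timeline untouched.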
Limitations
Structured incident processes add coordination overhead that may slow the response to very simple issues. Severity classification depends on accurate impact assessment, which may not be available in the early minutes of an incident. Post-incident reviews require dedicated time from participants and can be deprioritized in favor of feature work.