Postmortem Writing
Comprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent incident recurrence
Category: development Source: wshobson/agents
What Is This
Postmortem Writing is a structured skill that helps teams write clear, effective, blameless postmortems after incidents. It turns operational failures, outages, and near-misses into actionable insights that drive continuous improvement. Its systematic approach keeps incident reviews comprehensive, focused on root causes, and oriented toward organizational learning rather than individual blame.
This skill guides you through every step of the postmortem process, from gathering data and constructing timelines to identifying contributing factors and assigning actionable follow-up tasks. It emphasizes a blameless culture, where the focus is on system improvements rather than individual mistakes.
Why Use It
Writing high-quality postmortems is essential for any organization that values reliability, learning, and transparency. The Postmortem Writing skill supports this in several ways:
- Promotes psychological safety: By focusing on systemic issues instead of blame, team members are more likely to share information openly, leading to better incident understanding.
- Uncovers root causes: The skill provides frameworks for root cause analysis, helping teams look beyond surface-level symptoms.
- Prevents incident recurrence: Actionable follow-up items ensure that lessons learned are translated into concrete improvements.
- Fosters organizational learning: Sharing comprehensive postmortems helps the entire organization learn from each incident, not just the directly involved team.
- Meets compliance and audit requirements: Well-documented postmortems demonstrate due diligence and can satisfy regulatory needs.
How to Use It
To use the Postmortem Writing skill effectively, follow this structured approach:
1. Initiate the Postmortem
Begin when an incident meets the established triggers (e.g., SEV1/SEV2 incident, customer outage, data loss, or near-miss). Gather all available incident data: logs, metrics, timelines, and communications.
2. Build the Timeline
Construct a chronological timeline of the incident. This should include key events, detections, escalations, mitigations, and resolutions. Use clear, objective language.
Example Timeline (in Markdown):
| Time | Event |
|-------------|----------------------------------------|
| 10:05 UTC | Customer reports service outage |
| 10:07 UTC | Monitoring detects database errors |
| 10:10 UTC | On-call engineer paged |
| 10:25 UTC | Database restarted |
| 10:45 UTC | Service restored |
| 11:05 UTC | Root cause identified |
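A timeline like the one above can also yield simple response metrics for the impact section. The following is a minimal sketch, assuming all events fall on the same UTC day and are stored as (time, event) pairs; `TIMELINE` and `minutes_between` are illustrative names, not part of the skill.

```python
from datetime import datetime

# Events copied from the example timeline; all times are UTC on the same day.
TIMELINE = [
    ("10:05", "Customer reports service outage"),
    ("10:07", "Monitoring detects database errors"),
    ("10:10", "On-call engineer paged"),
    ("10:25", "Database restarted"),
    ("10:45", "Service restored"),
    ("11:05", "Root cause identified"),
]


def minutes_between(timeline, start_event, end_event):
    """Return whole minutes elapsed between two named events in the timeline."""
    times = {event: datetime.strptime(clock, "%H:%M") for clock, event in timeline}
    delta = times[end_event] - times[start_event]
    return int(delta.total_seconds() // 60)


time_to_page = minutes_between(
    TIMELINE, "Customer reports service outage", "On-call engineer paged"
)
time_to_restore = minutes_between(
    TIMELINE, "Customer reports service outage", "Service restored"
)
print(f"Time to page: {time_to_page} min, time to restore: {time_to_restore} min")
```

Deriving these numbers from the recorded timeline, rather than from memory, keeps the impact assessment consistent with the rest of the document.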
3. Conduct Root Cause Analysis
Go beyond surface symptoms. Use tools like the "Five Whys" or fishbone diagrams to trace the chain of events that allowed the incident to occur.
Example Root Cause Analysis (in pseudocode):
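def five_whys(issue):
    """ask_why() and is_root_cause() stand in for the team's discussion; they are placeholders."""
    for _ in range(5):
        reason = ask_why(issue)        # "Why did this happen?"
        if is_root_cause(reason):      # stop once no deeper cause remains
            return reason
        issue = reason                 # otherwise ask "why?" about the answer
    return issue                       # depth limit reached; analysis may need to continue

root_cause = five_whys("Database outage")
print(f"Root cause: {root_cause}")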
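The pseudocode leaves `ask_why` and `is_root_cause` undefined. One way to make the idea runnable is to record each answer in a lookup table and follow the chain. This is a sketch under stated assumptions: the `CAUSES` entries are invented for illustration, and in practice each answer comes from the review discussion rather than a dictionary.

```python
# Hypothetical causal chain for illustration: each symptom maps to the answer to "why?".
CAUSES = {
    "Database outage": "Connection pool exhausted",
    "Connection pool exhausted": "Retry storm from the API tier",
    "Retry storm from the API tier": "Client retries have no backoff",
}


def five_whys(issue, causes, max_depth=5):
    """Follow the "why?" chain for up to max_depth steps; return every step visited."""
    chain = [issue]
    for _ in range(max_depth):
        reason = causes.get(chain[-1])
        if reason is None:  # no deeper answer recorded: treat the last step as the root cause
            break
        chain.append(reason)
    return chain


chain = five_whys("Database outage", CAUSES)
print(" <- ".join(chain))
print(f"Root cause: {chain[-1]}")
```

Returning the whole chain rather than only the last step preserves the intermediate causes, which often become contributing factors in the written postmortem.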
4. Write the Postmortem Document
Structure your postmortem to include:
- Summary: What happened, when, and what was the impact.
- Impact Assessment: Who/what was affected, for how long.
- Timeline: Chronological breakdown of events.
- Root Cause: Detailed analysis of contributing factors.
- Resolution: How the issue was resolved.
- Action Items: Concrete improvements, with owners and deadlines.
Postmortem Template (Markdown):
## Summary
Brief description of the incident.
## Impact
Describe affected services/users.
## Timeline
| Time | Event |
|------|-------|
| ... | ... |
## Root Cause
Explain the underlying causes.
## Resolution
Describe how the issue was fixed.
## Action Items
- [ ] Improve monitoring for database errors (Owner: Alice, Due: 2024-07-01)
- [ ] Add runbook for database failover (Owner: Bob, Due: 2024-07-05)
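Action items in the template carry an owner and a due date, and that requirement can be checked mechanically before the document is published. The sketch below assumes a hypothetical `ActionItem` class; the field and method names are illustrative and not defined by the skill.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None

    def problems(self):
        """List what is missing for this item to be actionable."""
        missing = []
        if not self.owner:
            missing.append("owner")
        if self.due is None:
            missing.append("due date")
        return missing

    def to_markdown(self):
        """Render the item as a checklist line in the template's format."""
        return f"- [ ] {self.description} (Owner: {self.owner}, Due: {self.due})"


items = [
    ActionItem("Improve monitoring for database errors", "Alice", date(2024, 7, 1)),
    ActionItem("Add runbook for database failover", "Bob"),  # due date deliberately missing
]
for item in items:
    missing = item.problems()
    if missing:
        print(f"Incomplete: {item.description} (missing {', '.join(missing)})")
    else:
        print(item.to_markdown())
```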
5. Facilitate a Blameless Review Meeting
Present the postmortem in a meeting focused on system improvements. Encourage open discussion and avoid blaming individuals.
When to Use It
- After SEV1 or SEV2 incidents
- Following customer-facing outages longer than 15 minutes
- In response to data loss or security breaches
- For near-misses that could have escalated
- When encountering novel or unexpected failure modes
- When incidents require unusual or manual intervention
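The triggers above can be encoded as an explicit checklist so the decision to write a postmortem does not depend on memory. This is a minimal sketch: the `Incident` fields and the `postmortem_triggers` function are hypothetical names, and the thresholds simply mirror the list.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    severity: str = "SEV3"  # e.g. "SEV1", "SEV2", "SEV3"
    customer_outage_minutes: int = 0
    data_loss: bool = False
    security_breach: bool = False
    near_miss: bool = False
    novel_failure_mode: bool = False
    manual_intervention: bool = False


def postmortem_triggers(incident):
    """Return the reasons (possibly none) that this incident warrants a postmortem."""
    checks = [
        (incident.severity in {"SEV1", "SEV2"}, "SEV1/SEV2 incident"),
        (incident.customer_outage_minutes > 15, "customer-facing outage over 15 minutes"),
        (incident.data_loss or incident.security_breach, "data loss or security breach"),
        (incident.near_miss, "near-miss"),
        (incident.novel_failure_mode, "novel failure mode"),
        (incident.manual_intervention, "unusual or manual intervention"),
    ]
    return [reason for triggered, reason in checks if triggered]


reasons = postmortem_triggers(Incident(severity="SEV3", customer_outage_minutes=40))
print(reasons)
```

Returning the list of matched reasons, instead of a single boolean, gives the postmortem author a ready-made statement of why the review was opened.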
Important Notes
- Blamelessness is critical: Focus analysis on system and process failures, not individual errors. This increases psychological safety and learning.
- Action items must be specific and assigned: Every postmortem should result in a set of actionable improvements, each with a clear owner and deadline.
- Timeliness matters: Postmortems should be drafted within 1-2 days of the incident and finalized within a week to ensure accuracy and relevance.
- Share widely: Make postmortems accessible to all relevant stakeholders to maximize organizational learning.
- Continuously improve: Regularly review and refine your postmortem process based on feedback and outcomes.
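The timeliness guideline above translates directly into due dates. The sketch below assumes calendar days counted from the incident date and uses the upper bound of two days for the draft; `postmortem_deadlines` and its parameters are illustrative, and teams may prefer business days.

```python
from datetime import date, timedelta


def postmortem_deadlines(incident_date, draft_days=2, final_days=7):
    """Return draft and final due dates, counted in calendar days from the incident."""
    return {
        "draft_due": incident_date + timedelta(days=draft_days),
        "final_due": incident_date + timedelta(days=final_days),
    }


deadlines = postmortem_deadlines(date(2024, 6, 24))
print(f"Draft due {deadlines['draft_due']}, final due {deadlines['final_due']}")
```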
The Postmortem Writing skill provides the structure and guidance necessary for high-quality, blameless incident reviews that help your organization learn, improve, and build reliable systems.