SRE Engineer
Site Reliability Engineering automation and integration to enhance system stability and operational efficiency
SRE Engineer is a community skill for site reliability engineering practices, covering incident management, service level objectives, monitoring and alerting, capacity planning, and toil reduction for maintaining reliable production systems.
What Is This?
Overview
SRE Engineer provides guidance on implementing site reliability engineering practices that balance reliability with feature velocity. It covers incident management that structures response with escalation paths and postmortems, service level objectives that define reliability targets using SLIs and error budgets, monitoring that detects anomalies through actionable alerts, capacity planning that forecasts resource needs based on growth trends, and toil reduction that automates repetitive tasks. The skill helps teams maintain reliable production systems.
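The capacity-planning idea mentioned above can be illustrated with a simple trend line. The following is only a sketch, not part of the skill: the function name `days_until_capacity`, the one-sample-per-day input, and the assumption that growth is linear are all hypothetical choices made for the example.

```python
from typing import Optional


def days_until_capacity(usage: list[float], capacity: float) -> Optional[float]:
    """Estimate days until `usage` reaches `capacity` under a linear trend.

    `usage` holds one sample per day, oldest first (an assumption of this
    sketch). Returns None when the trend is flat or shrinking, and 0.0 when
    the latest sample already meets or exceeds capacity.
    """
    n = len(usage)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    if usage[-1] >= capacity:
        return 0.0

    # Ordinary least-squares slope over x = 0, 1, ..., n-1 (days).
    mean_x = (n - 1) / 2
    mean_y = sum(usage) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(usage))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var
    if slope <= 0:
        return None

    intercept = mean_y - slope * mean_x
    # Day index where the fitted line crosses capacity, relative to the last sample.
    crossing_day = (capacity - intercept) / slope
    return max(crossing_day - (n - 1), 0.0)


# Hypothetical disk usage in GB, one sample per day.
print(days_until_capacity([500, 510, 520, 530, 540], capacity=1000))
```

Real growth is rarely linear, so a forecast like this is best treated as a first estimate to be rechecked as new samples arrive.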
Who Should Use This
This skill serves SRE teams managing production infrastructure, DevOps engineers implementing observability, and engineering managers defining reliability targets and error budgets.
Why Use It?
Problems It Solves
Services without SLOs lack reliability targets, making it impossible to balance reliability against feature development. Noisy alerts cause fatigue and missed incidents. Unstructured incident response leads to longer outages and repeated failures. Manual operational tasks consume time that could go toward improving systems.
Core Highlights
The SLO manager defines and tracks reliability objectives with error budgets. The incident responder structures escalation and postmortem workflows. The alert tuner reduces noise while maintaining detection coverage. The toil automator identifies and eliminates repetitive manual operations.
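To make an error budget concrete, it can help to translate a target into the downtime it tolerates. The sketch below assumes a time-based availability SLO over a 30-day window; `allowed_downtime` is a hypothetical helper written for this illustration, not something the skill provides.

```python
from datetime import timedelta


def allowed_downtime(target: float, window_days: int = 30) -> timedelta:
    """Return the downtime a time-based SLO tolerates over the window."""
    if not 0 < target <= 1:
        raise ValueError("target must be in (0, 1]")
    # The error budget is the fraction of the window that may be "bad".
    return timedelta(days=window_days) * (1 - target)


for target in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime(target).total_seconds() / 60
    print(f"{target:.2%} over 30 days allows about {minutes:.1f} minutes")
```

Each extra nine cuts the allowance by a factor of ten, which is why the choice of target has such a large effect on how much room is left for releases.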
How to Use It?
Basic Usage
    from dataclasses import dataclass
    from datetime import timedelta


    @dataclass
    class SLI:
        """One service level indicator sample: good events out of total events."""
        name: str
        good_events: int
        total_events: int

        @property
        def ratio(self) -> float:
            # No traffic means nothing failed, so treat it as fully compliant.
            if self.total_events == 0:
                return 1.0
            return self.good_events / self.total_events


    class SLOTracker:
        def __init__(self, target: float, window_days: int = 30):
            self.target = target
            # Stored for reference; records are not filtered by window here.
            self.window = timedelta(days=window_days)
            self.slis: list[SLI] = []

        def record(self, name: str, good: int, total: int) -> None:
            self.slis.append(SLI(name, good, total))

        def error_budget(self) -> dict:
            total_good = sum(s.good_events for s in self.slis)
            total_all = sum(s.total_events for s in self.slis)
            # Match SLI.ratio: with no recorded events the SLI counts as 1.0.
            current = total_good / total_all if total_all else 1.0
            # Positive means headroom above the target; negative means overspent.
            budget = current - self.target
            budget_pct = (
                round(budget / (1 - self.target) * 100, 1)
                if self.target < 1 else 0)
            return {
                'current_sli': round(current, 4),
                'target': self.target,
                'budget_remaining': round(budget, 4),
                'budget_pct': budget_pct,
            }


    tracker = SLOTracker(target=0.999)
    # 150 bad events against an allowance of 100 (0.1% of 100,000),
    # so this sample reports a negative percentage: the budget is overspent.
    tracker.record('api_latency', good=99850, total=100000)
    budget = tracker.error_budget()
    print(f'Budget: {budget["budget_pct"]}% remaining')

Real-World Examples
    from dataclasses import dataclass, field
    from datetime import datetime
    from enum import Enum
    from typing import Optional


    class Severity(Enum):
        P1 = 'critical'
        P2 = 'major'
        P3 = 'minor'


    @dataclass
    class Incident:
        id: str
        title: str
        severity: Severity
        start: datetime
        end: Optional[datetime] = None  # None while the incident is still open
        actions: list = field(default_factory=list)

        @property
        def duration_min(self) -> float:
            # Open incidents are measured up to the current time.
            end = self.end or datetime.now()
            return (end - self.start).total_seconds() / 60


    class IncidentManager:
        def __init__(self):
            self.incidents: list[Incident] = []

        def open_incident(self, id: str, title: str, severity: Severity) -> Incident:
            inc = Incident(id, title, severity, datetime.now())
            self.incidents.append(inc)
            return inc

        def resolve(self, id: str) -> Optional[Incident]:
            for inc in self.incidents:
                if inc.id == id:
                    inc.end = datetime.now()
                    return inc
            return None  # no incident with that id

        def mttr(self) -> float:
            """Mean time to resolution in minutes, over resolved incidents only."""
            resolved = [i for i in self.incidents if i.end]
            if not resolved:
                return 0.0
            return sum(i.duration_min for i in resolved) / len(resolved)


    mgr = IncidentManager()
    inc = mgr.open_incident('INC-001', 'API timeout spike', Severity.P2)
    print(f'Incident {inc.id}: {inc.severity.value}')

Advanced Tips
Set SLOs based on user-facing impact rather than infrastructure metrics. Use error budget policies that freeze deployments when budgets are exhausted. Write runbooks for common incidents to reduce mean time to resolution.
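An error budget policy of the kind described above might be sketched as a small decision function. Everything here is an assumption for illustration: the `BudgetPolicy` class, its three outcomes, and the 25 percent caution threshold are hypothetical, and the input is a "percent of budget remaining" figure like the `budget_pct` value computed in Basic Usage.

```python
from dataclasses import dataclass


@dataclass
class BudgetPolicy:
    """Hypothetical policy; thresholds are percentages of budget remaining."""
    freeze_below_pct: float = 0.0
    caution_below_pct: float = 25.0

    def decide(self, budget_pct: float) -> str:
        if budget_pct <= self.freeze_below_pct:
            return "freeze"   # budget exhausted: only reliability fixes ship
        if budget_pct <= self.caution_below_pct:
            return "caution"  # e.g. require extra review or slower rollouts
        return "normal"


policy = BudgetPolicy()
for pct in (80.0, 20.0, -50.0):
    print(f"{pct:>6.1f}% remaining -> {policy.decide(pct)}")
```

Keeping the thresholds as data rather than hard-coding them makes it easier for engineering and product teams to agree on the policy and revise it later.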
When to Use It?
Use Cases
Define SLOs for a web service and track error budget consumption over rolling windows. Build an incident management workflow with severity classification and postmortem templates. Identify toil in operational tasks and automate the most time-consuming ones.
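For the toil use case, one possible starting point is a small inventory of manual tasks ranked by how quickly automation would pay for itself. This is a sketch under stated assumptions: `ToilTask`, its fields, the sample tasks, and the payback heuristic are all hypothetical. Sorting by `monthly_hours` instead would follow the "most time-consuming" wording more literally.

```python
from dataclasses import dataclass


@dataclass
class ToilTask:
    name: str
    minutes_per_run: float
    runs_per_month: int
    automation_hours: float  # estimated one-off cost to automate

    @property
    def monthly_hours(self) -> float:
        return self.minutes_per_run * self.runs_per_month / 60

    @property
    def payback_months(self) -> float:
        """Months until the automation effort is repaid by time saved."""
        if self.monthly_hours == 0:
            return float("inf")
        return self.automation_hours / self.monthly_hours


def rank_by_payback(tasks: list[ToilTask]) -> list[ToilTask]:
    # Shortest payback first: the cheapest wins relative to time saved.
    return sorted(tasks, key=lambda t: t.payback_months)


# Made-up sample data for illustration only.
tasks = [
    ToilTask("rotate certificates", 45, 4, 16),
    ToilTask("restart stuck workers", 10, 60, 8),
    ToilTask("resize volumes", 30, 2, 12),
]
for task in rank_by_payback(tasks):
    print(f"{task.name}: {task.monthly_hours:.1f} h/month, "
          f"payback in {task.payback_months:.1f} months")
```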
Related Topics
Site reliability engineering, SLOs, SLIs, error budgets, incident management, monitoring, alerting, and capacity planning.
Important Notes
Requirements
Monitoring infrastructure that collects metrics for SLI computation. Alerting system with configurable thresholds and notification routing. Incident tracking system for logging events and postmortem documentation.
Usage Recommendations
Do: define SLOs collaboratively between engineering and product teams. Review error budgets regularly to make informed decisions about reliability versus feature work. Conduct blameless postmortems to improve resilience after incidents.
Don't: set SLOs at 100 percent, since this leaves no error budget for releases and experiments. Don't alert on every metric deviation, since this creates noise that masks real problems. Don't skip postmortems for minor incidents, since recurring small issues often signal systemic problems.
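One common way to keep alerts actionable is to page on error-budget burn rate rather than on raw metric deviations, and to require both a long and a short window to agree. The sketch below assumes that approach. The function names and the (bad, total) tuple inputs are hypothetical. The default threshold of 14.4 corresponds to spending 2 percent of a 30-day budget in one hour, and should be treated as a starting point rather than a fixed rule.

```python
def burn_rate(bad: int, total: int, target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total == 0:
        return 0.0
    allowed = 1 - target
    if allowed <= 0:
        raise ValueError("target must be below 1 to leave an error budget")
    return (bad / total) / allowed


def should_page(long_window: tuple[int, int], short_window: tuple[int, int],
                target: float, threshold: float = 14.4) -> bool:
    """Page only when both windows burn faster than `threshold`.

    Each window is a (bad_events, total_events) pair. The long window shows
    the problem is significant; the short window shows it is still happening.
    """
    return (burn_rate(*long_window, target) > threshold
            and burn_rate(*short_window, target) > threshold)


# Sustained problem: both windows are burning fast, so this pages.
print(should_page((2000, 100000), (200, 8000), target=0.999))
# Brief blip: only the short window is hot, so this stays quiet.
print(should_page((300, 100000), (200, 8000), target=0.999))
```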
Limitations
SLO compliance depends on accurate measurement infrastructure that may have its own gaps. Error budgets assume stationary traffic and may need adjustment during growth. Toil measurement is subjective and requires team consensus on what qualifies as automatable work.