Chaos Engineer
Automate fault injection and resilience testing to ensure system stability under unpredictable conditions
Chaos Engineer is a community skill for implementing chaos engineering experiments on distributed systems, covering failure injection, steady-state hypothesis definition, blast radius control, experiment automation, and resilience validation for production and staging environments.
What Is This?
Overview
Chaos Engineer provides patterns for systematically testing system resilience through controlled failure injection. Failure injection introduces latency, errors, and resource exhaustion into system components. Steady-state hypotheses define expected system behavior metrics before and during experiments. Blast radius control limits experiment scope to prevent widespread outages. Experiment automation schedules and runs chaos tests with automatic rollback on safety violations. Resilience validation verifies that systems recover correctly after failure conditions are removed. The skill enables teams to discover weaknesses before real failures expose them in production.
Who Should Use This
This skill serves SRE teams validating system resilience against failure scenarios, platform engineers building fault-tolerant distributed systems, and DevOps teams preparing for production incident readiness.
Why Use It?
Problems It Solves
Distributed systems fail in unexpected ways that unit and integration tests cannot predict. Retry and fallback logic is rarely tested against real failure conditions. Capacity limits and timeout configurations are often set based on estimates rather than measured behavior. Incident response processes are untested until a real outage occurs.
Core Highlights
Failure injector introduces network latency, process kills, and disk pressure on target components. Hypothesis checker monitors metrics to validate steady-state conditions during experiments. Blast limiter constrains experiments to specific services and traffic percentages. Auto-rollback restores normal conditions if safety thresholds are breached.
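The blast limiter described above can be sketched as a simple pre-flight guard. This is a minimal illustration; BlastRadius, permits, and the 5% traffic cap are hypothetical names and values chosen for the example, not part of a published API:

```python
from dataclasses import dataclass, field


@dataclass
class BlastRadius:
    # Services the team has agreed may be targeted.
    allowed_targets: set = field(default_factory=set)
    # Maximum share of traffic an experiment may touch.
    max_traffic_pct: float = 5.0

    def permits(self, target: str, traffic_pct: float) -> bool:
        # Reject experiments aimed at out-of-scope services or
        # affecting more traffic than the agreed limit.
        return (target in self.allowed_targets
                and traffic_pct <= self.max_traffic_pct)


radius = BlastRadius(allowed_targets={'api-gateway'}, max_traffic_pct=5.0)
print(radius.permits('api-gateway', 2.0))       # in scope, under the cap
print(radius.permits('postgres-primary', 2.0))  # out of scope: rejected
```

A runner would call permits before injecting anything and refuse to start experiments that fall outside the agreed scope.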
How to Use It?
Basic Usage
from dataclasses import dataclass
import subprocess
import time

import requests


@dataclass
class Experiment:
    name: str
    target: str
    action: str
    duration: int
    metric: str
    threshold: float


class ChaosRunner:
    def __init__(self, monitor_url: str):
        self.monitor_url = monitor_url

    def run(self, exp: Experiment) -> dict:
        # Record the steady-state baseline before injecting any failure.
        baseline = self._check_metric(exp.metric)
        self._inject(exp.target, exp.action)
        try:
            time.sleep(exp.duration)
            during = self._check_metric(exp.metric)
        finally:
            # Always restore normal conditions, even if the metric
            # check fails mid-experiment.
            self._rollback(exp.target, exp.action)
        return {
            'name': exp.name,
            'baseline': baseline,
            'during': during,
            'passed': during <= exp.threshold,
        }

    def _inject(self, target: str, action: str):
        # Example injection: add 200ms of network latency via tc/netem.
        subprocess.run(
            ['tc', 'qdisc', 'add', 'dev', 'eth0', 'root',
             'netem', 'delay', '200ms'],
            check=True)

    def _rollback(self, target: str, action: str):
        # Remove the netem qdisc to restore normal network behavior.
        subprocess.run(
            ['tc', 'qdisc', 'del', 'dev', 'eth0', 'root'],
            check=True)

    def _check_metric(self, metric: str) -> float:
        # Query the monitoring endpoint (Prometheus-style API)
        # for the current value of the metric.
        resp = requests.get(
            f'{self.monitor_url}/query',
            params={'query': metric})
        return float(
            resp.json()['data']['result'][0]['value'][1])
Real-World Examples
experiments = [
    Experiment(
        name='api-latency',
        target='api-gateway',
        action='latency-200ms',
        duration=120,
        metric='http_p99_ms',
        threshold=500),
    Experiment(
        name='db-failover',
        target='postgres-primary',
        action='kill-process',
        duration=60,
        metric='error_rate',
        threshold=0.01),
]

runner = ChaosRunner('http://prometheus:9090')
results = []
for exp in experiments:
    result = runner.run(exp)
    results.append(result)
    print(f'{result["name"]}: '
          f'{"PASS" if result["passed"] else "FAIL"}')
Advanced Tips
Start chaos experiments in staging environments and graduate to production only after building confidence. Monitor multiple system health metrics simultaneously during experiments to catch cascading failures. Automate experiment rollback with circuit breakers that trigger when error rates exceed safe thresholds.
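The circuit-breaker rollback mentioned above can be sketched as a polling loop that watches a safety metric and aborts the experiment early. run_with_abort and the simulated readings below are hypothetical; a real implementation would query the monitoring system rather than a local callable:

```python
import time


def run_with_abort(duration: float, interval: float,
                   check_metric, abort_threshold: float) -> bool:
    """Poll the safety metric for the experiment's duration.

    Returns False (aborted) as soon as the metric breaches the
    threshold, so the caller can roll back immediately; returns
    True if the experiment runs its full duration safely.
    """
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        if check_metric() > abort_threshold:
            return False  # trip the breaker
        time.sleep(interval)
    return True


# Simulated error-rate readings that spike on the third sample.
readings = iter([0.001, 0.002, 0.08])
aborted_early = run_with_abort(
    duration=1.0, interval=0.05,
    check_metric=lambda: next(readings),
    abort_threshold=0.01)
print(aborted_early)  # False: the error rate breached the abort threshold
```

The abort path would call the same rollback routine as normal completion, which is why rollback logic should live in one place and be idempotent.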
When to Use It?
Use Cases
Validate that a microservice handles upstream dependency failures with proper fallback responses. Test database failover by killing the primary instance and measuring recovery time. Run periodic resilience tests in staging as part of pre-release validation.
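Measuring recovery time after a failover, as in the second use case, can be sketched as a polling helper. measure_recovery and the simulated health probe are illustrative assumptions, not part of the skill itself:

```python
import time


def measure_recovery(check_healthy, timeout: float,
                     interval: float = 0.05) -> float:
    """Return seconds until check_healthy() first reports True.

    Raises TimeoutError if the system does not recover in time,
    which itself is a failed resilience validation.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if check_healthy():
            return time.monotonic() - start
        time.sleep(interval)
    raise TimeoutError('system did not recover within timeout')


# Simulated health check: unhealthy for the first two probes,
# standing in for a replica being promoted to primary.
states = iter([False, False, True])
rto = measure_recovery(lambda: next(states), timeout=2.0)
print(f'recovered in {rto:.2f}s')
```

In a real database failover test, check_healthy would attempt a write against the cluster endpoint, and the measured time becomes the observed recovery objective to compare against the target.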
Related Topics
Chaos engineering, site reliability, fault injection, resilience testing, and distributed systems.
Important Notes
Requirements
Monitoring infrastructure for metric collection during experiments. Administrative access to target systems for failure injection. Rollback mechanisms for all failure types being injected. Team agreement on experiment scope and acceptable risk levels before running tests.
Usage Recommendations
Do: define clear abort conditions before starting any experiment. Communicate experiment schedules to on-call teams to avoid false incident escalation. Document findings from each experiment for knowledge sharing.
Don't: run chaos experiments in production without management approval and team awareness. Inject failures without automatic rollback capability in place. Target critical data stores without verified backup and recovery procedures.
Limitations
Chaos experiments test known failure modes but cannot predict novel failure combinations. Production experiments carry inherent risk of user-visible impact despite blast radius controls. Network-level injection requires privileged access that may not be available in managed cloud environments. Experiment results from staging may not reflect production behavior due to differences in traffic volume and data patterns.