Chaos Engineer

Automate fault injection and resilience testing to ensure system stability under unpredictable conditions

Chaos Engineer is a community skill for implementing chaos engineering experiments on distributed systems, covering failure injection, steady-state hypothesis definition, blast radius control, experiment automation, and resilience validation for production and staging environments.

What Is This?

Overview

Chaos Engineer provides patterns for systematically testing system resilience through controlled failure injection. Failure injection introduces latency, errors, and resource exhaustion into system components. Steady-state hypotheses define the metrics a healthy system is expected to maintain before and during an experiment. Blast radius control limits experiment scope to prevent widespread outages. Experiment automation schedules and runs chaos tests with automatic rollback on safety violations, and resilience validation verifies that systems recover correctly once failure conditions are removed. The skill enables teams to discover weaknesses before real failures expose them in production.
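A steady-state hypothesis can be captured as a small data structure pairing a metric with its acceptable bound. The sketch below is illustrative only; the class name and fields are assumptions, not part of the skill's API.

```python
from dataclasses import dataclass


@dataclass
class SteadyStateHypothesis:
    metric: str       # monitoring query, e.g. 'http_p99_ms'
    tolerance: float  # maximum acceptable value for the metric

    def holds(self, observed: float) -> bool:
        # The hypothesis holds while the observed value stays within tolerance.
        return observed <= self.tolerance
```

Checking the hypothesis before, during, and after injection is what separates a chaos experiment from simply breaking things.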

Who Should Use This

This skill serves SRE teams validating system resilience against failure scenarios, platform engineers building fault-tolerant distributed systems, and DevOps teams preparing for production incident readiness.

Why Use It?

Problems It Solves

Distributed systems fail in unexpected ways that unit and integration tests cannot predict. Retry and fallback logic is rarely tested against real failure conditions. Capacity limits and timeout configurations are often set based on estimates rather than measured behavior. Incident response processes are untested until a real outage occurs.

Core Highlights

Failure injector introduces network latency, process kills, and disk pressure on target components. Hypothesis checker monitors metrics to validate steady-state conditions during experiments. Blast limiter constrains experiments to specific services and traffic percentages. Auto-rollback restores normal conditions if safety thresholds are breached.
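Blast radius limiting can be as simple as an allow-list of services plus a traffic-percentage sample. This is a minimal sketch; the function name and parameters are hypothetical, not part of the skill's API.

```python
import random


def in_blast_radius(service: str, allowed: set[str], traffic_pct: float) -> bool:
    """Decide whether this service instance should receive the injected fault.

    Only allow-listed services are eligible, and within those only a
    randomly sampled percentage of traffic is affected.
    """
    return service in allowed and random.random() < traffic_pct / 100.0
```

Constraining both the service set and the traffic share keeps an experiment's worst case bounded even if a hypothesis turns out to be wrong.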

How to Use It?

Basic Usage

from dataclasses import dataclass
import subprocess
import time

import requests


@dataclass
class Experiment:
    name: str         # human-readable experiment identifier
    target: str       # service or host to inject the fault into
    action: str       # fault type, e.g. 'latency-200ms'
    duration: int     # seconds to hold the fault
    metric: str       # monitoring query for the steady-state check
    threshold: float  # maximum acceptable value for the metric


class ChaosRunner:
    def __init__(self, monitor_url: str):
        self.monitor_url = monitor_url

    def run(self, exp: Experiment) -> dict:
        baseline = self._check_metric(exp.metric)
        self._inject(exp.target, exp.action)
        try:
            time.sleep(exp.duration)
            during = self._check_metric(exp.metric)
        finally:
            # Always restore normal conditions, even if the check fails.
            self._rollback(exp.target, exp.action)
        return {
            'name': exp.name,
            'baseline': baseline,
            'during': during,
            'passed': during <= exp.threshold,
        }

    def _inject(self, target: str, action: str):
        # Demo only implements network latency via tc/netem; a real runner
        # would dispatch on `action` and address `target` explicitly.
        subprocess.run(
            ['tc', 'qdisc', 'add', 'dev', 'eth0', 'root',
             'netem', 'delay', '200ms'],
            check=True)

    def _rollback(self, target: str, action: str):
        subprocess.run(
            ['tc', 'qdisc', 'del', 'dev', 'eth0', 'root'],
            check=True)

    def _check_metric(self, metric: str) -> float:
        # Instant query against the Prometheus HTTP API.
        resp = requests.get(
            f'{self.monitor_url}/api/v1/query',
            params={'query': metric})
        return float(resp.json()['data']['result'][0]['value'][1])

Real-World Examples

experiments = [
    Experiment(
        name='api-latency',
        target='api-gateway',
        action='latency-200ms',
        duration=120,
        metric='http_p99_ms',
        threshold=500),
    Experiment(
        name='db-failover',
        target='postgres-primary',
        action='kill-process',
        duration=60,
        metric='error_rate',
        threshold=0.01),
]

runner = ChaosRunner('http://prometheus:9090')
results = []
for exp in experiments:
    result = runner.run(exp)
    results.append(result)
    status = 'PASS' if result['passed'] else 'FAIL'
    print(f'{result["name"]}: {status}')

Advanced Tips

Start chaos experiments in staging environments and graduate to production only after building confidence. Monitor multiple system health metrics simultaneously during experiments to catch cascading failures. Automate experiment rollback with circuit breakers that trigger when error rates exceed safe thresholds.
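The circuit-breaker rollback described above can be sketched as a polling loop that aborts the experiment the moment a safety metric is breached. This is a minimal sketch under assumed callables (`inject`, `rollback`, `check_error_rate`), not the skill's actual implementation.

```python
import time


def run_with_abort(inject, rollback, check_error_rate,
                   duration_s: float, max_error_rate: float,
                   poll_s: float = 5.0) -> bool:
    """Hold an injected fault for duration_s while polling a health metric.

    Rolls back immediately if the error rate breaches the safety threshold.
    Returns True if the experiment ran to completion, False if aborted.
    """
    inject()
    deadline = time.monotonic() + duration_s
    try:
        while time.monotonic() < deadline:
            if check_error_rate() > max_error_rate:
                return False  # abort: safety threshold breached
            time.sleep(poll_s)
        return True
    finally:
        rollback()  # always restore normal conditions
```

The `finally` block guarantees rollback runs on completion, abort, and unexpected exceptions alike, which is the property the tip above asks for.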

When to Use It?

Use Cases

Validate that a microservice handles upstream dependency failures with proper fallback responses. Test database failover by killing the primary instance and measuring recovery time. Run periodic resilience tests in staging as part of pre-release validation.
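The failover use case above amounts to killing the primary and timing how long the system takes to report healthy again. A minimal sketch, assuming hypothetical `kill_primary` and `is_healthy` callables supplied by the caller:

```python
import time


def measure_recovery(kill_primary, is_healthy,
                     timeout_s: float = 300.0, poll_s: float = 1.0) -> float:
    """Kill the primary, then return seconds elapsed until healthy again.

    Raises TimeoutError if the system does not recover within timeout_s.
    """
    kill_primary()
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        if is_healthy():
            return time.monotonic() - start
        time.sleep(poll_s)
    raise TimeoutError('system did not recover within timeout')
```

Recording this number across runs turns failover testing into a trend you can alert on, rather than a one-off observation.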

Related Topics

Chaos engineering, site reliability, fault injection, resilience testing, and distributed systems.

Important Notes

Requirements

Monitoring infrastructure for metric collection during experiments. Administrative access to target systems for failure injection. Rollback mechanisms for all failure types being injected. Team agreement on experiment scope and acceptable risk levels before running tests.

Usage Recommendations

Do: define clear abort conditions before starting any experiment. Communicate experiment schedules to on-call teams to avoid false incident escalation. Document findings from each experiment for knowledge sharing.

Don't: run chaos experiments in production without management approval and team awareness. Inject failures without automatic rollback capability in place. Target critical data stores without verified backup and recovery procedures.

Limitations

Chaos experiments test known failure modes but cannot predict novel failure combinations. Production experiments carry inherent risk of user-visible impact despite blast radius controls. Network-level injection requires privileged access that may not be available in managed cloud environments. Experiment results from staging may not reflect production behavior due to differences in traffic volume and data patterns.