Monitoring Expert

Implement advanced system monitoring and automated observability integration

Monitoring Expert is a community skill for designing and implementing comprehensive system monitoring solutions, covering metric collection, alerting strategies, dashboard design, log aggregation, and incident detection for infrastructure and application observability.

What Is This?

Overview

Monitoring Expert provides tools for building observability systems across infrastructure and applications. It covers five areas: metric collection, which instruments services to emit measurements for CPU usage, memory consumption, request latency, and error rates; alerting strategies, which define threshold-based and anomaly-driven rules to notify teams when systems deviate from expected behavior; dashboard design, which combines multiple metric sources into actionable operational views; log aggregation, which centralizes structured log data from distributed services for correlation and search; and incident detection, which identifies degradation patterns before they escalate into full outages. The skill enables engineers to maintain reliable systems through proactive observability.

Who Should Use This

This skill serves site reliability engineers building monitoring infrastructure, DevOps teams establishing alerting and dashboard standards, and platform engineers designing observability pipelines for distributed systems.

Why Use It?

Problems It Solves

Production systems without proper monitoring suffer silent failures that go undetected until users report problems. Alert configurations that trigger too frequently cause fatigue while too-conservative thresholds miss genuine incidents. Scattered logs across multiple services make root-cause analysis slow and difficult during active outages. Dashboards without clear hierarchy overwhelm operators with data instead of surfacing actionable insights.

Core Highlights

Metric instrumentor configures collection pipelines for infrastructure and application telemetry data. Alert designer builds multi-level notification rules with escalation policies and quiet periods. Dashboard builder creates structured operational views with service health indicators. Log pipeline constructs centralized log aggregation with filtering and correlation capabilities.
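The multi-level notification rules with escalation and quiet periods that the alert designer produces can be sketched as an Alertmanager routing tree. The receiver names and the overnight window below are illustrative assumptions, not part of the skill:

```yaml
route:
  receiver: team-slack            # default destination
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall
      repeat_interval: 30m        # re-notify quickly until acknowledged
    - matchers:
        - severity="warning"
      receiver: team-slack
      mute_time_intervals:
        - overnight               # quiet period for low-urgency alerts

time_intervals:
  - name: overnight
    time_intervals:
      - times:
          - start_time: '22:00'
            end_time: '06:00'
```

Critical alerts escalate to paging with a short repeat interval, while warnings go to chat and are muted off-hours.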

How to Use It?

Basic Usage

A minimal prometheus.yml that scrapes an application and a node exporter and points at Alertmanager:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - 'alerts/*.yml'

scrape_configs:
  - job_name: 'api-server'
    metrics_path: /metrics
    static_configs:
      - targets: ['api:8080']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance

  - job_name: 'node'
    static_configs:
      - targets: ['node-exp:9100']

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmgr:9093']
The alert rules referenced by rule_files live in a separate file, for example alerts/api-alerts.yml:

groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
            / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate exceeds 5 percent for 5 minutes"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(http_duration_seconds_bucket[5m])) > 1.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency high on {{ $labels.job }}"

Real-World Examples

A WSGI middleware that instruments request count, latency, and concurrency, exposing metrics for Prometheus to scrape:

from wsgiref.simple_server import make_server
import time

from prometheus_client import (
    Counter, Histogram, Gauge, start_http_server)

REQUEST_COUNT = Counter(
    'app_requests_total',
    'Total requests',
    ['method', 'endpoint', 'status'])

REQUEST_LATENCY = Histogram(
    'app_request_duration_seconds',
    'Request duration in seconds',
    ['endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0])

ACTIVE_CONNS = Gauge(
    'app_active_connections',
    'Active connections')

class MetricsMiddleware:
    """WSGI middleware recording request count, latency, and concurrency."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        method = environ['REQUEST_METHOD']
        path = environ['PATH_INFO']
        status_holder = {'status': '500'}

        def recording_start_response(status, headers, exc_info=None):
            # Capture the real status code instead of hardcoding '200'.
            status_holder['status'] = status.split(' ', 1)[0]
            return start_response(status, headers, exc_info)

        ACTIVE_CONNS.inc()
        start = time.time()
        try:
            response = self.app(environ, recording_start_response)
            REQUEST_COUNT.labels(
                method=method,
                endpoint=path,
                status=status_holder['status']).inc()
            return response
        finally:
            REQUEST_LATENCY.labels(endpoint=path).observe(
                time.time() - start)
            ACTIVE_CONNS.dec()

def app(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'ok']

if __name__ == '__main__':
    start_http_server(8000)  # serve /metrics for Prometheus scrapes
    make_server('', 8080, MetricsMiddleware(app)).serve_forever()

Advanced Tips

Use recording rules to precompute expensive PromQL queries that dashboards reference frequently to reduce query load on the monitoring backend. Implement multi-window burn rate alerting for SLO-based alerts that balance detection speed with false positive reduction. Label metrics with consistent service and environment tags to enable cross-service correlation in dashboard queries.
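The recording-rule tip can be sketched in Prometheus rule-file syntax. The rule name below is an illustrative assumption following the common level:metric:operation naming convention:

```yaml
groups:
  - name: api-recording
    interval: 30s
    rules:
      # Precompute the error ratio so dashboards query one cheap series
      # per job instead of re-evaluating the rate division on every load.
      - record: job:http_requests:error_ratio_5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m]))
```

Dashboards and burn-rate alerts can then reference job:http_requests:error_ratio_5m directly.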

When to Use It?

Use Cases

Set up Prometheus metric collection and Grafana dashboards for a microservices deployment. Design alerting rules with severity-based escalation for critical production services. Build a centralized logging pipeline that correlates request traces across distributed service boundaries.

Related Topics

Prometheus, Grafana, observability, alerting, dashboards, log aggregation, SRE, incident management, and infrastructure monitoring.

Important Notes

Requirements

Prometheus or compatible metrics backend for time-series data storage and querying. Alertmanager or equivalent notification routing service for alert delivery. Grafana or dashboard platform for metric visualization and operational views.

Usage Recommendations

Do: define service level objectives before configuring alerts to ensure thresholds reflect actual user impact. Use structured logging with consistent field names across services for effective log correlation. Design dashboards in layers from high-level service health down to detailed component metrics.
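As a sketch of structured logging with consistent field names, a minimal JSON formatter for Python's standard logging might look like this; the service and trace_id field names are illustrative conventions, not required by any library:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line with consistent field names."""

    def format(self, record):
        # 'service' and 'trace_id' are conventions this sketch assumes are
        # attached via the logging 'extra' mechanism on every call site.
        entry = {
            'ts': self.formatTime(record),
            'level': record.levelname,
            'service': getattr(record, 'service', 'unknown'),
            'trace_id': getattr(record, 'trace_id', None),
            'msg': record.getMessage(),
        }
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger('api')
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info('request handled', extra={'service': 'api', 'trace_id': 'abc123'})
```

Because every service emits the same field names, a log pipeline can filter and join on service and trace_id without per-service parsing rules.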

Don't: alert on every individual metric deviation, since transient spikes are normal in distributed systems. Create dashboards with dozens of unrelated panels that lack a clear operational narrative. Set data retention policies that exceed actual analysis needs, which adds unnecessary storage cost.

Limitations

Prometheus pull-based collection may not suit environments with short-lived containers that terminate before scrape intervals. High-cardinality labels on metrics create significant storage and query performance problems that compound over time. Alert rules based on static thresholds require tuning as system load patterns and traffic volumes evolve with growth.
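One common mitigation for the high-cardinality problem is normalizing raw request paths into route templates before using them as label values, so the endpoint label takes a small, bounded set of values. A minimal sketch; the regex patterns are illustrative assumptions about the URL scheme:

```python
import re

# Illustrative normalizers: collapse identifiers embedded in paths into
# fixed placeholders. UUIDs must be matched before bare numeric IDs so
# the digit pattern does not split a UUID apart.
_NORMALIZERS = [
    (re.compile(r'/[0-9a-f]{8}-[0-9a-f-]{27}'), '/{uuid}'),
    (re.compile(r'/\d+'), '/{id}'),
]

def normalize_endpoint(path: str) -> str:
    """Map a concrete URL path to a route template for metric labels."""
    for pattern, placeholder in _NORMALIZERS:
        path = pattern.sub(placeholder, path)
    return path
```

Instrumentation would then call normalize_endpoint(path) before passing the value to a metric's labels, keeping series counts proportional to the number of routes rather than the number of distinct URLs.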