Benchmark

Detects performance regressions by measuring page load times and Core Web Vitals

Category: development
Source: garrytan/gstack

What Is This?

Overview

The Benchmark skill provides automated performance regression detection for web applications by leveraging the browse daemon to measure, record, and compare key performance indicators across code changes. It establishes reliable baselines for page load times, Core Web Vitals, and resource sizes, then runs comparisons on every pull request to catch regressions before they reach production. This systematic approach transforms performance monitoring from a manual, ad-hoc process into a consistent part of the development workflow.

At its core, Benchmark integrates with Lighthouse and related tooling to capture metrics such as First Contentful Paint, Largest Contentful Paint, Cumulative Layout Shift, and Total Blocking Time. These measurements are stored as baselines and compared against new builds, producing clear before-and-after reports that developers can review alongside code changes. The skill also tracks performance trends over time, giving teams visibility into gradual degradation that might otherwise go unnoticed.
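
The source does not show the skill's internal measurement code. As a rough sketch of how the same metrics can be captured with Lighthouse's Node API (the skill may drive Lighthouse differently, for example through the browse daemon):

import {launch} from 'chrome-launcher';
import lighthouse from 'lighthouse';

// Launch a headless Chrome instance for Lighthouse to drive.
const chrome = await launch({chromeFlags: ['--headless']});
const result = await lighthouse('http://localhost:3000', {
  port: chrome.port,
  output: 'json',
  onlyCategories: ['performance'],
});

if (result) {
  const audits = result.lhr.audits;
  // numericValue is in milliseconds for the timing audits; CLS is unitless.
  console.log({
    firstContentfulPaint: audits['first-contentful-paint'].numericValue,
    largestContentfulPaint: audits['largest-contentful-paint'].numericValue,
    cumulativeLayoutShift: audits['cumulative-layout-shift'].numericValue,
    totalBlockingTime: audits['total-blocking-time'].numericValue,
  });
}
await chrome.kill();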

This skill is part of the garrytan/gstack toolchain and operates through a set of allowed tools including Bash, Read, Write, Glob, and AskUserQuestion. It is designed to run in CI pipelines as well as local development environments, making it flexible enough to fit into most modern workflows.

Who Should Use This

  • Frontend developers who need to verify that new features or refactors do not introduce page speed regressions
  • DevOps and platform engineers responsible for maintaining CI/CD pipelines and enforcing performance budgets
  • Engineering leads and tech leads who want objective, data-driven performance tracking across a team
  • Full-stack developers working on applications where bundle size and load time directly affect user experience
  • QA engineers who include performance validation as part of their testing criteria
  • Product teams that track Core Web Vitals as part of SEO and user experience goals

Why Use It?

Problems It Solves

  • Manual performance testing is inconsistent and easy to skip under deadline pressure, leading to regressions that accumulate undetected over time
  • Without baseline comparisons, developers cannot tell whether a code change improved or degraded performance
  • Bundle size growth often happens gradually across many PRs, making it difficult to identify which change caused a significant increase
  • Lighthouse audits run in isolation provide a snapshot but no historical context, limiting their usefulness for trend analysis
  • Teams lack a shared, automated source of truth for performance metrics, causing disagreements about whether a build is acceptable

Core Highlights

  • Automated baseline establishment for page load times and Core Web Vitals
  • Per-PR before-and-after performance comparisons integrated into the review process
  • Bundle size tracking to catch asset bloat early
  • Lighthouse score monitoring across multiple performance categories
  • Historical trend tracking to identify gradual degradation
  • Browse daemon integration for consistent, reproducible measurement conditions
  • Support for both CI pipeline execution and local development runs
  • Configurable performance budgets with pass-or-fail thresholds (one possible budget shape is sketched after this list)
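
The source does not specify the budget file format. As a hypothetical sketch of what a budget could express, using the commonly cited "good" Core Web Vitals thresholds as example values:

// Hypothetical budget shape -- the skill's real config format is not documented here.
interface PerformanceBudget {
  minLighthouseScore: number;       // minimum acceptable performance score (0-100)
  maxRegressionPct: number;         // allowed per-metric slowdown versus baseline
  maxLargestContentfulPaintMs: number;
  maxTotalBlockingTimeMs: number;
  maxCumulativeLayoutShift: number; // unitless
}

// Example values drawn from the common "good" thresholds (LCP 2.5s, TBT 200ms, CLS 0.1).
const budget: PerformanceBudget = {
  minLighthouseScore: 90,
  maxRegressionPct: 10,
  maxLargestContentfulPaintMs: 2500,
  maxTotalBlockingTimeMs: 200,
  maxCumulativeLayoutShift: 0.1,
};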

How to Use It?

Basic Usage

To run a benchmark against a local development server, use the following command pattern:

benchmark run --url http://localhost:3000 --output ./reports/baseline.json

To compare a new build against an existing baseline:

benchmark compare --baseline ./reports/baseline.json --url http://localhost:3000 --output ./reports/comparison.json
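
The comparison's internals are not shown in the source. As a rough TypeScript sketch of the delta computation, assuming the report is a flat metric-to-value map and the current run's metrics have been written to a hypothetical ./reports/current.json:

import {readFileSync} from 'node:fs';

// Assumed report shape: metric name -> measured value (ms, except CLS).
type Report = Record<string, number>;

const baseline: Report = JSON.parse(readFileSync('./reports/baseline.json', 'utf8'));
const current: Report = JSON.parse(readFileSync('./reports/current.json', 'utf8'));

// Compute the percent change for every metric present in both reports.
for (const [metric, before] of Object.entries(baseline)) {
  const after = current[metric];
  if (after === undefined || before === 0) continue;
  const deltaPct = ((after - before) / before) * 100;
  console.log(`${metric}: ${before} -> ${after} (${deltaPct >= 0 ? '+' : ''}${deltaPct.toFixed(1)}%)`);
}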

Specific Scenarios

Scenario 1: PR Performance Gate

Configure the benchmark to run automatically on pull requests by adding it to your CI workflow. The comparison step reads the stored baseline and fails the build if any Core Web Vital exceeds the defined threshold.

benchmark compare --baseline ./perf/baseline.json --url $PREVIEW_URL --fail-on-regression --threshold 10
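
The gate logic behind a flag pair like --fail-on-regression --threshold 10 might reduce to something like the following sketch; the comparison-report shape and its path (reused from the earlier example) are assumptions:

import {readFileSync} from 'node:fs';

// Assumed shape of the comparison report written by the compare step.
type Comparison = Record<string, {baseline: number; current: number; deltaPct: number}>;

const report: Comparison = JSON.parse(readFileSync('./reports/comparison.json', 'utf8'));
const THRESHOLD_PCT = 10; // mirrors --threshold 10

// Collect every metric whose regression exceeds the allowed threshold.
const regressions = Object.entries(report).filter(([, r]) => r.deltaPct > THRESHOLD_PCT);

for (const [metric, r] of regressions) {
  console.error(`${metric} regressed ${r.deltaPct.toFixed(1)}% (${r.baseline} -> ${r.current})`);
}

// A nonzero exit code fails the CI job, blocking the pull request.
process.exit(regressions.length > 0 ? 1 : 0);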

Scenario 2: Bundle Size Audit

Track JavaScript and CSS bundle sizes separately from runtime metrics to catch build-time regressions.

benchmark bundle --dist ./dist --baseline ./perf/bundle-baseline.json --max-increase 5
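
As a hypothetical sketch of the same check in TypeScript, summing JS and CSS bytes under ./dist and failing on more than a 5% increase over a stored baseline (the baseline file's shape is assumed):

import {readdirSync, readFileSync, statSync} from 'node:fs';
import {extname, join} from 'node:path';

// Recursively sum the size of JS and CSS assets under a directory.
function totalBytes(dir: string): number {
  let total = 0;
  for (const entry of readdirSync(dir, {withFileTypes: true})) {
    const path = join(dir, entry.name);
    if (entry.isDirectory()) total += totalBytes(path);
    else if (['.js', '.css'].includes(extname(entry.name))) total += statSync(path).size;
  }
  return total;
}

// Assumed baseline shape: a single total-bytes figure for the bundle.
const baseline: {totalBytes: number} = JSON.parse(readFileSync('./perf/bundle-baseline.json', 'utf8'));
const current = totalBytes('./dist');
const increasePct = ((current - baseline.totalBytes) / baseline.totalBytes) * 100;

// Mirror --max-increase 5: fail when the bundle grew by more than 5%.
if (increasePct > 5) {
  console.error(`Bundle grew ${increasePct.toFixed(1)}% (limit 5%)`);
  process.exit(1);
}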

Real-World Examples

A team shipping a new image carousel component runs the benchmark before and after the change. The comparison report shows a 200ms increase in Largest Contentful Paint, prompting the developer to optimize image loading before merging.

A platform team sets a performance budget of 90 for the Lighthouse performance score. Any PR that drops the score below this threshold is automatically flagged, keeping the team accountable without requiring manual review.

Important Notes

Requirements

  • Node.js environment with access to the browse daemon
  • A running application server or accessible preview URL for measurement
  • Baseline report files stored in a location accessible to both local and CI environments