Test Flakiness Detection

Test Flakiness Detection

allowed-tools: Read, Glob, Grep, Write, Edit, Bash

Category: development Source: Donchitos/Claude-Code-Game-Studios

What Is Test Flakiness Detection?

Test Flakiness Detection is an automated skill designed to identify non-deterministic, or “flaky,” tests within a codebase by analyzing continuous integration (CI) run logs or historical test results. A flaky test is a test that alternates between passing and failing without any changes to the underlying code, making it unreliable. This skill systematically aggregates pass rates per test, identifies patterns of intermittent failure, and recommends whether each flaky test should be quarantined or fixed. It also maintains a registry of known flaky tests to support ongoing quality assurance efforts.

By integrating this skill into your quality workflow, you can proactively address one of the most common sources of CI unreliability. The primary outputs are updates to the quarantine section in tests/regression-suite.md and optional generation of detailed flakiness reports in the production/qa/ directory.

Why Use Test Flakiness Detection?

Flaky tests undermine the reliability of automated testing and CI pipelines. Teams often begin to ignore failing tests, assuming they are false alarms, which can result in actual defects being overlooked. Over time, this erodes trust in the test suite and slows down development velocity.

This skill provides several key benefits:

  • Systematic Detection: Aggregates and analyzes test run data to highlight tests with inconsistent results.
  • Actionable Insights: Offers recommendations on whether to quarantine or address flaky tests.
  • Registry Maintenance: Keeps a persistent record of flaky tests for historical tracking and easier remediation.
  • CI Health Improvement: Reduces noise in CI results, making it easier to spot genuine regressions.

By making the process of identifying and managing flaky tests reproducible and data-driven, you improve both team productivity and product quality.

How to Use Test Flakiness Detection

This skill can be invoked in several modes, each tailored for different use cases. The command structure is as follows:

/test-flakiness [ci-log-path | scan | registry]

1. Analyze a Specific CI Log File

To analyze a particular CI run log for flakiness:

/test-flakiness .github/workflows/ci-run-2024-06-01.log
  • The skill will parse the specified log file, extract test names and outcomes, and calculate flakiness metrics for each test.

2. Scan All Available CI Logs

To scan all logs in standard directories, such as .github/ or other CI output folders:

/test-flakiness scan
  • The skill will use file globbing to locate all relevant log files, aggregate results across multiple runs, and generate a comprehensive flakiness report.

3. Review the Flaky Test Registry

To review and get remediation advice for already-known flaky tests:

/test-flakiness registry
  • This reads the quarantine section in tests/regression-suite.md and provides targeted recommendations for each listed test.

Output

  • The quarantine section in tests/regression-suite.md is updated automatically to reflect the latest findings.
  • Optionally, a detailed report is created in production/qa/flakiness-report-[date].md summarizing flaky tests, their pass rates, and suggested actions.

Example of Flakiness Detection Output

## Quarantined Tests

| Test Name                  | Pass Rate | Last 10 Runs | Recommendation |
|----------------------------|----------|--------------|----------------|
| test_login_flow            | 70%      | P F P P F P P P F P | Quarantine   |
| test_payment_processing    | 90%      | P P F P P P P P P P | Fix          |

When to Use Test Flakiness Detection

  • During the Polish Phase: When the test suite has accumulated sufficient run history, allowing for statistically meaningful flakiness detection.
  • On Suspicion of Flakiness: If developers or reviewers notice inconsistent CI results and begin to doubt test reliability.
  • After Regression Suite Analysis: When /regression-suite identifies tests that are already quarantined and require further diagnosis.

Running flakiness detection too early, before tests have run multiple times, may result in false positives or insufficient data. For best results, ensure a substantial sample size from your CI logs.

Important Notes

  • Allowed Tools: This skill leverages standard tools such as Read, Glob, Grep, Write, Edit, and Bash for efficient log processing and registry maintenance. Ensure these tools are available in your environment.
  • Data Sensitivity: The skill relies on consistent test naming and log formatting. Custom or irregular test runners may require adaptation.
  • Statistical Thresholds: Flakiness is typically flagged when a test’s pass rate drops below a configurable threshold (such as 90 percent), but this can be adjusted based on project tolerance.
  • Registry Maintenance: Always review and update the quarantine section in tests/regression-suite.md to keep the flaky test list current.
  • Remediation Guidance: The skill not only identifies flaky tests but also categorizes them based on severity and frequency, offering actionable next steps: quarantine for highly unstable tests, or fix for those with minor flakiness.

By consistently applying Test Flakiness Detection, you protect your CI pipelines from the hidden costs of unreliable tests and foster a culture of quality and accountability.