Debugging and Error Recovery

What Is Debugging and Error Recovery?

Debugging and Error Recovery is a systematic approach to identifying, analyzing, and resolving unexpected behaviors, errors, or failures in software systems. The skill focuses on preserving critical evidence, following structured investigation steps, and preventing recurrence of the same issue. Rather than relying on guesswork or ad-hoc fixes, this method emphasizes a disciplined, evidence-driven process that enables engineers to find and fix the true root cause of a problem. This skill is essential for maintaining software quality, minimizing downtime, and ensuring reliable releases.

Why Use Debugging and Error Recovery?

When software fails, whether through a broken build, a test failure, or unexpected runtime behavior, continuing development without first addressing the issue is risky. Overlooked errors can introduce further defects, waste engineering time, and erode user trust. Debugging and Error Recovery provides a well-defined process that:

Stops the spread and compounding of errors.
Preserves crucial information (such as logs and error outputs) before they are lost or overwritten.
Reduces time spent on guesswork by guiding engineers directly toward the root cause.
Prevents the same errors from recurring by implementing guards and verifications.
Establishes a repeatable protocol for responding to any type of software failure.

By applying these principles, teams can maintain momentum without sacrificing stability or increasing technical debt.

How to Use Debugging and Error Recovery

The core of this skill is the Stop-the-Line Rule and a triage checklist that structures the debugging process. This section outlines the recommended procedure.

The Stop-the-Line Rule

Whenever you encounter unexpected behavior or a failure, follow these steps in strict order:

1. STOP all feature work and unrelated changes.
2. PRESERVE evidence: Immediately capture error outputs, logs, screenshots, and steps to reproduce the issue. Do not restart or clear logs until you have done this.
3. DIAGNOSE: Use the triage checklist to methodically identify the problem's cause.
4. FIX the root cause, not just the symptoms.
5. GUARD against recurrence by adding tests or monitoring.
6. RESUME development only after verifying the fix.

The Triage Checklist

Work through each of these steps without skipping:

1. Reproduce

Ensure the failure can be triggered reliably. A failure that cannot be reproduced cannot be reliably diagnosed, and a fix for it cannot be verified.

Example:

# Run the failing test several times to confirm it fails reliably
for i in {1..5}; do npm test -- test/failing-test.js; done
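When a failure is intermittent, it helps to know how often it occurs, not just that it can occur. The following is a minimal sketch, assuming Node.js is available. The helper name `measureFailureRate` is hypothetical. It runs a command a fixed number of times and counts the runs that exit with a non-zero status.

```javascript
const { spawnSync } = require('node:child_process');

/**
 * Runs `command` with `args` a total of `runs` times and counts failures.
 * A run counts as a failure if the process could not be started, was
 * killed by a signal, or exited with a non-zero status.
 */
function measureFailureRate(command, args, runs) {
  let failures = 0;
  for (let i = 0; i < runs; i += 1) {
    const result = spawnSync(command, args, { encoding: 'utf8' });
    if (result.error || result.status !== 0) {
      failures += 1;
    }
  }
  return { runs, failures, rate: failures / runs };
}

// Example usage, mirroring the shell loop above:
// console.log(measureFailureRate('npm', ['test', '--', 'test/failing-test.js'], 5));
```

A rate of 1 indicates a deterministic failure. Anything lower suggests flakiness, such as a timing or ordering dependency, and usually calls for more runs before drawing conclusions.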

2. Preserve Evidence

Before making any changes, capture all available information:

Error messages and stack traces
Log files
System state (e.g., configuration, environment variables)
Steps to reproduce the issue
Screenshots or video captures if applicable

Example:

# Copy logs to a safe location before restarting or changing anything
mkdir -p ~/debug-snapshots
cp /var/log/app/error.log ~/debug-snapshots/
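Alongside copying log files, the error itself and the surrounding system state can be written to a timestamped file. The following is a minimal sketch, assuming Node.js. The function names `buildSnapshot` and `saveSnapshot`, the snapshot fields, and the output directory are all illustrative assumptions, not part of any established tool.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

/**
 * Collects the error, basic system state, and repro steps into one object.
 * Only the environment variables named in `envKeys` are recorded, so that
 * secrets are not dumped into the snapshot by accident.
 */
function buildSnapshot(error, reproSteps, envKeys) {
  const env = {};
  for (const key of envKeys) {
    if (process.env[key] !== undefined) {
      env[key] = process.env[key];
    }
  }
  return {
    capturedAt: new Date().toISOString(),
    message: error.message,
    stack: error.stack,
    nodeVersion: process.version,
    platform: `${os.platform()} ${os.release()}`,
    env,
    reproSteps,
  };
}

/** Writes the snapshot as JSON into `directory` and returns the file path. */
function saveSnapshot(snapshot, directory) {
  fs.mkdirSync(directory, { recursive: true });
  const stamp = snapshot.capturedAt.replace(/[:.]/g, '-');
  const filePath = path.join(directory, `snapshot-${stamp}.json`);
  // The 'wx' flag makes the write fail rather than overwrite earlier evidence.
  fs.writeFileSync(filePath, JSON.stringify(snapshot, null, 2), { flag: 'wx' });
  return filePath;
}

// Example usage (hypothetical values):
// const snap = buildSnapshot(err, ['Open /login', 'Submit empty form'], ['NODE_ENV']);
// saveSnapshot(snap, path.join(os.homedir(), 'debug-snapshots'));
```

Recording only an explicit list of environment variables is a deliberate choice here: snapshots tend to be shared in tickets and chat, where a full environment dump could leak credentials.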

3. Isolate the Failure

Narrow down the context in which the error occurs. Determine if the problem is specific to a subsystem, input, environment, or recent change.

Comment out or disable unrelated modules.
Test in a clean environment.
Use git bisect to identify problematic commits.

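Example:

# Use git bisect to find the commit that introduced a bug
git bisect start
git bisect bad HEAD        # the current commit shows the failure
git bisect good v1.2.0     # last version known to work
# Mark each commit that git checks out as good or bad until it reports the culprit
git bisect reset           # return to the original branch when finished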
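Marking commits by hand can be automated with `git bisect run`, which executes a command at each step and reads its exit code: 0 means good, 125 means "skip this commit", and other codes from 1 to 127 mean bad. The following is a sketch of such a check script, assuming a Node.js project. The file name `bisect-check.js`, the helper names, and the use of `npm ci` as the build step are assumptions for illustration.

```javascript
const { spawnSync } = require('node:child_process');

const GOOD = 0;   // commit does not show the failure
const BAD = 1;    // commit shows the failure
const SKIP = 125; // commit cannot be tested (for example, it does not build)

/** Maps build and test exit statuses to the code git bisect run expects. */
function bisectExitCode(buildStatus, testStatus) {
  if (buildStatus !== 0) {
    return SKIP;
  }
  return testStatus === 0 ? GOOD : BAD;
}

/** Runs a command and returns its exit status; a failed spawn counts as 1. */
function runStep(command, args) {
  const result = spawnSync(command, args, { stdio: 'inherit' });
  return result.status === null ? 1 : result.status;
}

function main() {
  const buildStatus = runStep('npm', ['ci']);
  // Only run the test when the build succeeded.
  const testStatus =
    buildStatus === 0 ? runStep('npm', ['test', '--', 'test/failing-test.js']) : 1;
  process.exit(bisectExitCode(buildStatus, testStatus));
}

// Uncomment when saving this as a standalone script such as bisect-check.js:
// main();
```

With the script saved as an untracked file, so that it survives the checkouts bisect performs, the search could then be started with `git bisect run node bisect-check.js` after marking the initial good and bad commits.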

4. Diagnose the Root Cause

Analyze the preserved evidence and isolated context to form hypotheses. Use debugging tools, code inspection, or additional logging to test these hypotheses. Avoid making assumptions without supporting data.

// Example: Adding temporary logging
console.log('User object at login:', user);
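Temporary logging of whole objects, such as the user object above, can write credentials into logs that are later shared as evidence. The following is a minimal sketch of a safer temporary logger, assuming Node.js. The names `redact` and `debugLog`, the `DEBUG_TRIAGE` environment variable, and the list of sensitive keys are all hypothetical choices.

```javascript
// Keys whose values should never appear in diagnostic output (illustrative list).
const SENSITIVE_KEYS = new Set(['password', 'token', 'secret']);

/**
 * Returns a deep copy of `value` with sensitive fields replaced.
 * Limitation: does not handle circular references.
 */
function redact(value) {
  if (Array.isArray(value)) {
    return value.map(redact);
  }
  if (value !== null && typeof value === 'object') {
    const copy = {};
    for (const [key, inner] of Object.entries(value)) {
      copy[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? '[redacted]' : redact(inner);
    }
    return copy;
  }
  return value;
}

/** Logs to stderr only when DEBUG_TRIAGE=1, so stray calls stay silent. */
function debugLog(label, value) {
  if (process.env.DEBUG_TRIAGE !== '1') {
    return;
  }
  console.error(`[triage] ${label}`, JSON.stringify(redact(value)));
}

// Example usage:
// debugLog('User object at login:', user);
```

Gating the output behind an environment variable means a forgotten call does not add noise to normal runs, although temporary logging should still be removed once the root cause is confirmed.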

5. Fix and Verify

Once you identify the root cause, implement a targeted fix. Do not settle for workarounds that only mask the symptoms. Verify the fix by re-running the original reproduction steps and confirming that the failure no longer occurs.
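The difference between masking a symptom and fixing a root cause can be shown with a small sketch. It assumes a hypothetical `login` function that crashes when given a null user. The function bodies and return shapes below are invented for illustration and do not describe any real API.

```javascript
// Symptom mask: the crash disappears, but a missing user is reported as a
// successful login, so the bad state travels further into the system.
function loginMasked(user) {
  try {
    return { ok: true, name: user.name };
  } catch (err) {
    return { ok: true, name: 'unknown' };
  }
}

// Root-cause fix: the invalid input is detected at the boundary and
// reported explicitly, so callers can handle it.
function login(user) {
  if (user === null || user === undefined) {
    return { ok: false, reason: 'missing user' };
  }
  return { ok: true, name: user.name };
}

// Verification: repeat the original reproduction (a null user) and confirm
// that it no longer throws and no longer reports success.
// console.log(login(null));
```

Both versions stop the exception, which is why "the error went away" is not sufficient verification on its own. The check also has to confirm that the resulting behavior is correct.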

6. Guard Against Recurrence

Add automated tests, monitoring, or linting rules to catch similar issues in the future. Document lessons learned if the issue was non-obvious.

// Example: Adding a regression test
it('should not throw when user is null', () => {
  expect(() => login(null)).not.toThrow();
});

7. Resume Development

Only after all verifications pass and guards are in place should regular development resume.

When to Use This Skill

Debugging and Error Recovery is essential in the following scenarios:

A test starts failing after a code change.
The build process fails or produces unexpected artifacts.
Application behavior does not match requirements or expectations.
A user or stakeholder reports a bug.
New errors appear in logs or system monitoring tools.
A feature or workflow that previously worked now fails.

In all these cases, applying this skill prevents deeper issues and wasted effort.

Important Notes

Never skip the evidence preservation step. Overwriting logs or hastily restarting systems can destroy valuable troubleshooting data.
Avoid guessing or speculating without evidence. Every step should be justified by the data collected.
Use automation for repetitive checks, such as rerunning tests or collecting diagnostics.
Communicate findings and resolutions with your team to build collective knowledge and avoid repeated mistakes.
Always confirm that the fix addresses the root cause and not just the visible symptom.

By consistently applying Debugging and Error Recovery, teams can sustain high software quality and resilience even as complexity grows.
