Self Eval

Honestly evaluate AI work quality using a two-axis scoring system. Use after completing a task, code review, or work session to get an unbiased assess

What Is Self Eval?

Self Eval is a Claude Code skill for transparent, honest assessment of AI-assisted software development work. Its structured two-axis evaluation pushes past the default AI tendency to rate every output as "good enough" (typically 4 out of 5) and instead gives nuanced, actionable feedback on both the ambition of a task and the quality of its execution. The skill is entirely prompt-based, requires no external dependencies, and is aimed at developers and teams who want an unbiased, record-backed way to evaluate individual tasks, code reviews, or entire work sessions.

Why Use Self Eval?

Traditional AI self-assessment often falls prey to score inflation, where the majority of work is rated at an above-average level, regardless of actual quality or task complexity. This undermines the value of automated code reviews and retrospectives, making it difficult to identify areas for improvement or to track progress over time. Self Eval addresses this by:

  • Separating ambition from execution: A simple one-dimensional score conflates how challenging a task was with how well it was completed. Self Eval’s two-axis system ensures that a technically simple but flawlessly executed task is not rated the same as a complex but poorly executed one.
  • Enforcing critical reasoning: By requiring the AI to argue both for and against a score (the "devil’s advocate" approach), Self Eval surfaces potential blind spots and encourages more rigorous analysis.
  • Detecting inflation: By persisting and analyzing scores across sessions, Self Eval can flag trends of unwarranted optimism, prompting users to recalibrate their standards.

Ultimately, Self Eval is designed to drive continuous improvement and honest reflection, critical components for high-performing engineering teams.

How to Get Started

Self Eval is lightweight and straightforward to integrate into any Claude Code workflow. Here’s how to begin:

  1. Download or reference the skill from its repository: Self Eval on GitHub.
  2. Add the prompt to your preferred Claude interface or code assistant environment.
  3. Trigger the skill after completing a development task, code review, or work session by providing a summary of the work performed.
  4. Review the output: Self Eval will prompt for task ambition and execution quality assessments, perform mandatory devil’s advocate reasoning, and generate a final combined score.

Example Usage: Using Self Eval after a code review session

Prompt to Claude:
"""
Apply Self Eval to this completed code review:
- Refactored data processing pipeline for clarity and speed.
- Improved test coverage from 70% to 95%.
- Fixed two major performance bottlenecks.

Please provide an honest assessment using the two-axis system.
"""

Self Eval will then prompt for task ambition (Low/Medium/High) and execution quality (Poor/Adequate/Strong), perform devil's advocate analysis, persist the score, and flag any inflation patterns.

Key Features

Self Eval offers several unique features to ensure robust, unbiased evaluation:

  • Two-Axis Scoring:

    • Ambition: Assesses the complexity or challenge of the task (Low, Medium, High).
    • Execution: Rates the quality of the outcome (Poor, Adequate, Strong).
    • Matrix Combination: The two axes are independent and are combined through a fixed lookup matrix. The matrix cannot be overridden, so the same pair of ratings always yields the same combined score.
  • Mandatory Devil’s Advocate Reasoning:
    Before submitting a final score, Self Eval forces a critical review by requiring arguments for both a higher and a lower score. This exposes potential over- or under-estimation and leads to a more balanced assessment.

  • Score Persistence:
    Evaluations are appended to a .self-eval-scores.jsonl file in the current working directory, providing a historical record of performance and enabling trend analysis.

  • Anti-Inflation Detection:
    By reviewing previous scores, Self Eval can identify and flag patterns of inflated ratings, urging recalibration when necessary.
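The persistence step can be pictured as appending one JSON object per line to the scores file. The sketch below is only an illustration in Python: the file name comes from the description above, but the record fields (timestamp, ambition, execution, combined) and the append_score helper are assumptions, not the skill's documented schema.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# File name taken from the skill description; everything else is illustrative.
SCORES_FILE = Path(".self-eval-scores.jsonl")


def append_score(ambition: str, execution: str, combined: int,
                 path: Path = SCORES_FILE) -> dict:
    """Append one evaluation as a single JSON line and return the record."""
    record = {
        # Hypothetical field names: the real schema is not specified here.
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "ambition": ambition,
        "execution": execution,
        "combined": combined,
    }
    # Mode "a" creates the file if needed and never rewrites earlier lines.
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Appending one line per evaluation keeps earlier entries untouched, which is what makes the file usable as a history for later trend checks.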

Example Matrix (Simplified):

Ambition | Execution | Combined Score
Low      | Strong    | 3
Medium   | Adequate  | 3
High     | Strong    | 5
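A fixed lookup matrix of this kind maps naturally onto a dictionary keyed by (ambition, execution). The Python sketch below is hypothetical and includes only the three cells shown in the simplified matrix above. The other six cells are not given in this document, so they are left out rather than guessed, and combined_score is an illustrative name.

```python
from typing import Dict, Tuple

# Only the three documented cells are filled in; the rest are unknown here.
SCORE_MATRIX: Dict[Tuple[str, str], int] = {
    ("Low", "Strong"): 3,
    ("Medium", "Adequate"): 3,
    ("High", "Strong"): 5,
}


def combined_score(ambition: str, execution: str) -> int:
    """Look up the combined score; raise if the cell is not documented here."""
    try:
        return SCORE_MATRIX[(ambition, execution)]
    except KeyError:
        raise KeyError(
            f"No documented score for ambition={ambition!r}, "
            f"execution={execution!r}"
        ) from None


print(combined_score("High", "Strong"))  # 5
```

Because the result depends only on the two ratings, the combined score cannot be nudged directly. The only way to change it is to change one of the axis ratings, which the devil's advocate step is meant to scrutinize.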

Best Practices

To maximize the value of Self Eval:

  • Use after meaningful units of work: Apply Self Eval after each significant task, code review, or work session to maintain consistent, high-resolution feedback.
  • Encourage transparency: Share evaluation results with your team to foster a culture of honest reflection and improvement.
  • Regularly review score trends: Use the persisted .self-eval-scores.jsonl file to identify areas of growth or recurring issues.
  • Embrace the devil’s advocate step: Take the challenge seriously; honest arguments on both sides yield more accurate self-assessment.
  • Avoid score gaming: Trust the matrix and resist the urge to manipulate ambition or execution ratings for a desired combined score.
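Reviewing score trends can be as simple as checking how many recent evaluations landed at the top of the scale. The sketch below is one possible heuristic, not the skill's actual detection logic. The "combined" field name, the window of 10, and the 70% threshold are all assumptions chosen for illustration.

```python
import json
from pathlib import Path
from typing import List


def load_scores(path: Path) -> List[int]:
    """Read combined scores from a JSONL file, skipping unusable lines."""
    scores: List[int] = []
    if not path.exists():
        return scores
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a corrupted line instead of failing
        if not isinstance(record, dict):
            continue
        value = record.get("combined")  # hypothetical field name
        if isinstance(value, int):
            scores.append(value)
    return scores


def looks_inflated(scores: List[int], window: int = 10,
                   high: int = 4, max_share: float = 0.7) -> bool:
    """Flag when more than max_share of the last `window` scores are >= high."""
    recent = scores[-window:]
    if len(recent) < window:
        return False  # too little history to judge
    share = sum(1 for s in recent if s >= high) / len(recent)
    return share > max_share


print(looks_inflated([4, 5, 4, 4, 5, 4, 4, 4, 5, 4]))  # True
```

A check like this only says that high scores are frequent, not that they are undeserved, so a flag is a prompt to re-read the underlying evaluations rather than a verdict.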

Important Notes

  • Prompt-Only: Self Eval is implemented entirely via prompts—no API calls, plugins, or external libraries are required.
  • Score Matrix Is Fixed: The combination logic for ambition and execution is hardcoded to prevent circumvention or bias.
  • Historical Data: Score history is essential for inflation detection. Ensure the .self-eval-scores.jsonl file persists across sessions.
  • Human Oversight Recommended: While Self Eval enhances objectivity, critical human review remains necessary for truly high-stakes evaluations.
  • MIT Licensed: The skill is open-source and free to use or adapt within your projects.

By embedding Self Eval into your development process, you gain a calibrated, honest, and continuously improving view of your engineering work quality.