Eval

Evaluate and rank agent results by metric or LLM judge for an AgentHub session

What Is Eval?

The Eval skill for Claude Code, available via the AgentHub project, evaluates and ranks results in multi-agent development workflows. It supports two modes of assessment: automated metric-based scoring and LLM-based judging. This skill is particularly useful in collaborative, competitive, or benchmarking scenarios where agents produce solutions to a shared task and you need to determine which performed best according to objective or subjective criteria.

Eval operates within AgentHub sessions, where multiple agents contribute results on a common problem. It can evaluate these results using a user-defined metric (such as execution time, accuracy, or custom scripts) or by leveraging a language model (LLM) to judge more qualitative attributes like correctness, simplicity, and code quality. This dual-mode approach makes Eval a powerful addition to any agent-driven development pipeline.

Why Use Eval?

In modern AI-driven development environments, multiple agents or models often contribute to solving the same task. Comparing, ranking, and selecting the best outcome becomes essential, especially when automating code generation, bug fixing, or implementing features via agent teams. Manual inspection is both time-consuming and error-prone, while simple metrics might not fully capture the nuance of a good solution.

Eval addresses these challenges by:

  • Automating Evaluation: Scores agent results quickly using repeatable metrics or LLM-based judgment.
  • Enforcing Fairness: Evaluates all agent outputs under the same conditions, making competition and benchmarking meaningful.
  • Supporting Hybrid Assessment: Combines strict numerical evaluation with subjective, context-aware judging for a fuller picture of solution quality.
  • Streamlining Collaboration: Supports transparent, data-driven decisions when selecting agent outputs for integration or deployment.

These capabilities are especially valuable for teams working on agentic workflows, continuous integration pipelines, or competitions such as hackathons and model benchmarks.

How to Get Started

Setting up Eval in your AgentHub environment is straightforward. The skill is invoked via the /hub:eval command, which supports several usage patterns:

  • Evaluate the Latest Session:

    /hub:eval

    This will evaluate the most recent AgentHub session using the evaluation method configured in your project.

  • Evaluate a Specific Session:

    /hub:eval 20260317-143022

    Replace 20260317-143022 with your session ID to target a specific set of agent results.

  • Force LLM Judge Mode:

    /hub:eval --judge

    This command skips any metric configuration and uses the LLM to rank agent outputs according to qualitative criteria.

Example: Metric-Based Evaluation

Suppose you want to evaluate results by execution time. You would configure your evaluation command and metric direction (e.g., lower is better):

python agenthub/skills/eval/scripts/result_ranker.py \
  --session 20260317-143022 \
  --eval-cmd "python run.py --test" \
  --metric "runtime_ms" --direction "asc"

The results might look like:

RANK  AGENT       METRIC      DELTA      FILES
1     agent-2     142ms       -38ms      2
2     agent-1     165ms       -15ms      3
3     agent-3     190ms       +10ms      1

Winner: agent-2 (142ms)
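
The ranking behind a table like the one above can be sketched in a few lines of Python. This is an illustration only, not the actual result_ranker.py: the function name rank_results, the Ranked record, the reading of "asc" as "lower is better", and the 180 ms baseline (inferred from the deltas shown) are all assumptions.

```python
from typing import NamedTuple


class Ranked(NamedTuple):
    rank: int
    agent: str
    metric: float
    delta: float  # metric minus baseline; negative means below the baseline


def rank_results(results: dict[str, float], baseline: float,
                 direction: str = "asc") -> list[Ranked]:
    """Sort agents by metric; 'asc' is assumed to mean lower is better."""
    if direction not in ("asc", "desc"):
        raise ValueError(f"unknown direction: {direction!r}")
    ordered = sorted(results.items(), key=lambda kv: kv[1],
                     reverse=(direction == "desc"))
    return [Ranked(i, agent, value, value - baseline)
            for i, (agent, value) in enumerate(ordered, start=1)]


if __name__ == "__main__":
    # Values taken from the example table; the baseline is an inferred assumption.
    runtimes_ms = {"agent-1": 165, "agent-2": 142, "agent-3": 190}
    for row in rank_results(runtimes_ms, baseline=180):
        print(f"{row.rank:<5} {row.agent:<10} {row.metric:>5.0f}ms {row.delta:>+5.0f}ms")
```

Keeping the direction explicit matters for metrics such as accuracy, where a higher value should rank first.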

Example: LLM Judge Mode

If no metric is configured, or you use --judge, Eval will:

  1. Compute diffs between each agent branch and the base branch.
  2. Read each agent’s result description.
  3. Present all changes to the LLM for ranking based on correctness, simplicity, and quality.
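
The three steps above could be approximated as follows. This is a rough sketch rather than the skill's implementation: branch_diff and build_judge_prompt are hypothetical helper names, the prompt wording is invented for illustration, and the call to the LLM itself is left out. The only external command used is git diff with the three-dot form, which compares a branch against its merge base with the base branch.

```python
import subprocess


def branch_diff(base: str, branch: str, repo: str = ".") -> str:
    """Return the diff of `branch` relative to its merge base with `base`."""
    result = subprocess.run(
        ["git", "diff", f"{base}...{branch}"],
        cwd=repo, capture_output=True, text=True, check=True,
    )
    return result.stdout


def build_judge_prompt(task: str, entries: dict[str, tuple[str, str]]) -> str:
    """Assemble one prompt; entries maps agent name -> (description, diff)."""
    parts = [
        f"Task: {task}",
        "Rank the following agent submissions based on correctness, "
        "simplicity, and quality.",
    ]
    for agent, (description, diff) in entries.items():
        parts.append(f"--- {agent} ---\nDescription: {description}\nDiff:\n{diff}")
    return "\n\n".join(parts)
```

Presenting every submission in a single prompt lets the judge compare them side by side instead of scoring each in isolation.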

Key Features

  • Dual Evaluation Modes: Choose between metric-based (scripted) and LLM-based (qualitative) ranking.
  • Session Scoping: Evaluate all results from a specific AgentHub session for reproducibility.
  • Custom Metrics: Support for arbitrary shell commands as evaluation scripts.
  • Clear Output: Tabulated rankings showing agent, metric, delta from baseline, and files changed.
  • LLM Criteria: When using LLM judge mode, rankings are based on:
    • Correctness: Does it solve the task?
    • Simplicity: Fewer lines changed preferred, when correctness is equal.
    • Quality: Code structure, clarity, no regressions.

Best Practices

  • Define Objective Metrics Where Possible: For reproducible results, use explicit metrics such as runtime, accuracy, or pass/fail status. Configure your evaluation command to enforce these.
  • Use LLM Judging for Subjective Tasks: Where qualitative factors matter (e.g., code readability, design choices), LLM judging offers an informed assessment that applies the same criteria to every submission, though its rankings still merit human review.
  • Combine Approaches for Robustness: Use metric-based pre-filtering followed by LLM ranking for top entries to balance objectivity and nuance.
  • Document Evaluation Criteria: Make sure all agents and team members understand how evaluation will be performed to avoid disputes.
  • Review Diffs Carefully: In LLM mode, ensure that diffs are meaningful and do not include extraneous changes for a fair comparison.
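
The combined approach (metric pre-filtering followed by LLM ranking of the top entries) might look like the sketch below. The function hybrid_rank and its parameters are hypothetical, and the judge is passed in as a plain callable so the sketch makes no claim about how Eval actually invokes the LLM.

```python
from typing import Callable


def hybrid_rank(metrics: dict[str, float],
                judge: Callable[[list[str]], list[str]],
                k: int = 2,
                lower_is_better: bool = True) -> list[str]:
    """Shortlist the top-k agents by metric, then let `judge` reorder them."""
    by_metric = sorted(metrics, key=lambda a: metrics[a],
                       reverse=not lower_is_better)
    shortlist, rest = by_metric[:k], by_metric[k:]
    judged = judge(shortlist)
    # Guard against a judge that drops or invents agents.
    if sorted(judged) != sorted(shortlist):
        raise ValueError("judge must return a permutation of the shortlist")
    # Agents outside the shortlist keep their metric order.
    return judged + rest
```

Only the shortlist reaches the judge, which keeps the number of diffs the LLM must read small while the metric still decides who is eligible.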

Important Notes

  • Consistency Matters: Always use the same metric or judge configuration across runs when comparing agent performance over time.
  • LLM Limitations: While LLMs can provide insightful rankings, they may occasionally misinterpret diff context or project requirements. Always review results in high-stakes scenarios.
  • Session IDs: Ensure you are referencing the correct session ID when evaluating historical results.
  • Security: Only execute trusted evaluation commands in metric mode, as these run arbitrary scripts in agent worktrees.
  • Extensibility: Eval can be extended with custom scripts or integrated into CI/CD systems for fully automated agent competitions.
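
As an illustration of the security note, a trusted evaluation command could be run inside a single worktree with a timeout and without a shell. This is a sketch under stated assumptions, not how Eval itself executes commands: run_eval, its parameters, and the 60-second default are hypothetical.

```python
import shlex
import subprocess
import sys
import tempfile


def run_eval(eval_cmd: str, worktree: str, timeout_s: float = 60.0) -> tuple[int, str]:
    """Run a trusted evaluation command inside one agent worktree."""
    completed = subprocess.run(
        shlex.split(eval_cmd),  # no shell=True, so no shell interpolation
        cwd=worktree,
        capture_output=True,
        text=True,
        timeout=timeout_s,      # raises subprocess.TimeoutExpired on overrun
    )
    return completed.returncode, completed.stdout


if __name__ == "__main__":
    # Stand-in command and directory; a real run would point at an agent worktree.
    cmd = f"{shlex.quote(sys.executable)} -c 'print(142)'"
    code, out = run_eval(cmd, tempfile.gettempdir())
    print(code, out.strip())
```

Avoiding the shell means pipes and redirects in the command string will not work; that trade-off limits what an untrusted string can do, but it does not make an untrusted script safe to run.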

Eval bridges the gap between automated, quantitative assessment and qualitative, context-aware judgment, making it a practical tool for agent-driven development workflows in Claude Code and beyond.