Evaluation Methodology

This document is the authoritative reference for how PluginEval measures plugin and skill quality.

What Is This

The "Evaluation Methodology" skill is a systematic framework for measuring and interpreting the quality of plugins and skills in the PluginEval environment on the Happycapy Skills platform. Its purpose is to ensure that plugins and skills meet high standards for performance, reliability, and usability. It defines a three-layer evaluation process, breaks quality down into ten scoring dimensions, and supplies the rubrics, statistical methods, and composite scoring formulas used to score them. It also covers anti-pattern detection, rank ordering with Elo, and actionable guidance for improvement. It is intended for anyone who needs a transparent, data-driven approach to plugin evaluation: skill developers, platform maintainers, and marketplace curators.

Why Use It

Using a robust evaluation methodology is crucial for several reasons:

  • Consistency: By applying standardized criteria and scoring rubrics, PluginEval ensures all plugins and skills are measured fairly and comparably.
  • Actionable Feedback: The system highlights specific areas for improvement, such as triggering accuracy or orchestration robustness, enabling continuous quality enhancement.
  • Marketplace Readiness: Thresholds and badges derived from this methodology help signal quality to users and partners, building trust and facilitating marketplace decisions.
  • Interpretable Results: Detailed breakdowns and statistical methods provide clarity when interpreting low scores or understanding the rationale behind composite ratings.
  • Automation-Friendly: The multi-layered approach, especially static analysis, enables rapid, automated assessments that scale across large plugin ecosystems.

How to Use It

The evaluation methodology operates across three primary layers, each contributing to the overall quality score for a plugin or skill. Here’s how the process works in practice:

Layer 1: Static Analysis

Static analysis is fast (typically under 2 seconds), deterministic, and does not require LLM calls. It inspects the parsed SKILL.md file, running several sub-checks. For example, it evaluates the presence and quality of frontmatter, documentation of orchestration wiring, and the use of code blocks.

Example: Static Analysis Check

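def check_frontmatter_quality(skill_md):
    """Score frontmatter from 0.0 to 1.0 given a mapping of parsed fields."""
    name = (skill_md.get('name') or '').strip()
    description = (skill_md.get('description') or '').strip()
    # Both fields are required; whitespace-only values count as missing
    if not name or not description:
        return 0.0
    # Very short descriptions earn partial credit only
    if len(description) < 30:
        return 0.5
    return 1.0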
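
The check above takes a parsed mapping rather than raw text. As a minimal sketch of how a SKILL.md file might be turned into that mapping, the following assumes frontmatter delimited by `---` lines and made of flat `key: value` pairs. The helper name `parse_frontmatter` and the sample file contents are hypothetical and are not taken from PluginEval.

```python
def parse_frontmatter(text):
    """Return the key/value pairs found between the first two '---' lines.

    Assumption: flat `key: value` pairs only; nested structures are not handled.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no frontmatter block at the top of the file
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields  # closing delimiter reached
        key, sep, value = line.partition(":")
        if sep and key.strip():
            # Drop surrounding whitespace and optional quotes around the value
            fields[key.strip()] = value.strip().strip("\"'")
    return {}  # opening delimiter without a closing one: treat as malformed


# Hypothetical SKILL.md contents used only for illustration
sample = """---
name: evaluation-methodology
description: Explains how plugin and skill quality is measured and scored.
---
Body text.
"""
skill_md = parse_frontmatter(sample)
```

The resulting `skill_md` dictionary can be passed directly to `check_frontmatter_quality`. A production parser would more likely rely on a full YAML library; the hand-rolled version here only keeps the sketch dependency-free.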

Layer 2: LLM-Based Evaluation

This layer employs large language models to perform nuanced analysis of skill behavior and documentation. It assesses dimensions like intent clarity, language specificity, and anti-pattern detection. The results from this layer can override or blend with static scores based on per-dimension weights.

Example: LLM Evaluation Prompt

# `llm` stands in for an LLM client; `evaluate` is assumed to return a score in [0.0, 1.0]
prompt = f"Evaluate the intent clarity of the following skill documentation:\n\n{skill_md['description']}"
llm_score = llm.evaluate(prompt)
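
In practice a model may reply with free text rather than a bare number. The sketch below shows one way such a reply could be reduced to a score, assuming the prompt asked for a value between 0 and 1. The function name `parse_llm_score` is hypothetical, and how PluginEval actually extracts scores is not specified here.

```python
import re


def parse_llm_score(reply):
    """Extract the first number in `reply` and clamp it to [0.0, 1.0].

    Returns None when the reply contains no number, so the caller can
    decide how to handle a missing LLM score for that dimension.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    return min(1.0, max(0.0, float(match.group())))


score = parse_llm_score("Score: 0.72 (the intent is stated clearly)")
```

Clamping guards against out-of-range replies, but it cannot detect a reply on a different scale (for example "8/10"), so the prompt should state the expected range explicitly.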

Layer 3: Empirical Testing

Empirical testing runs the skill in real or simulated environments, measuring actual triggering accuracy, output correctness, and other real-world performance indicators. This layer provides the most direct evidence of quality and can supersede prior scores.

Example: Triggering Accuracy Test

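def empirical_trigger_test(skill, test_cases):
    # Fraction of cases that trigger the skill; assumes every case is one that should trigger it
    if not test_cases:
        return 0.0  # avoid dividing by zero on an empty test set
    correct = sum(1 for case in test_cases if skill.trigger(case))
    return correct / len(test_cases)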
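
The function above counts every case that triggers the skill, so it measures accuracy only when all test cases are ones that should trigger it. A fuller test set would also include prompts that should not trigger the skill. The sketch below scores such labeled cases and attaches a Wilson score interval to show how uncertain a small sample is. The choice of the Wilson interval, the function names, the keyword-based trigger, and the sample prompts are all illustrative assumptions, not documented PluginEval behavior.

```python
import math


def trigger_accuracy(trigger, labeled_cases):
    """Fraction of cases where trigger(prompt) matches the expected label."""
    if not labeled_cases:
        raise ValueError("labeled_cases must not be empty")
    hits = sum(1 for prompt, expected in labeled_cases
               if bool(trigger(prompt)) == expected)
    return hits / len(labeled_cases)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical stand-in for a skill's trigger logic
def keyword_trigger(prompt):
    return "score" in prompt.lower()


cases = [
    ("Why did my plugin get a low score?", True),
    ("Explain the badge score thresholds", True),
    ("Write a haiku about autumn", False),
    ("Score this essay for grammar", False),  # the keyword trigger fires here by mistake
]
accuracy = trigger_accuracy(keyword_trigger, cases)
low, high = wilson_interval(round(accuracy * len(cases)), len(cases))
```

With only four cases the interval is wide, which is the point of reporting it: a single accuracy figure from a small test set should not be read as precise.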

Composite Scoring and Badges

Each dimension’s score is blended across layers using defined weights, producing a final score per dimension (ranging from 0.0 to 1.0). The overall quality score is a composite of these dimension scores, which then determines badge eligibility (such as “Gold” or “Neon-Ready”).

Example: Composite Score Calculation

def blend_scores(static, llm, empirical, weights=(0.3, 0.3, 0.4)):
    # Weighted sum of the layer scores; weights are (static, llm, empirical) and should sum to 1.0
    return static * weights[0] + llm * weights[1] + empirical * weights[2]
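
The blend above assumes all three layers produced a score. The following sketch extends the idea to the full pipeline: blending only the layers that ran, combining dimension scores into an overall score, and mapping that score to a badge. Renormalizing weights over the available layers is an assumption, as are the dimension weights, the threshold values, and the "Silver" badge name; only "Gold" is named in this document.

```python
def blend_available(scores, weights):
    """Weighted mean over the layers that actually produced a score.

    `scores` maps layer name -> score or None. Weights are renormalised over
    the layers present (an assumption, not confirmed PluginEval behaviour).
    """
    present = {k: v for k, v in scores.items() if v is not None}
    total = sum(weights[k] for k in present)
    if total == 0:
        raise ValueError("no weighted layer produced a score")
    return sum(v * weights[k] for k, v in present.items()) / total


def composite(dimension_scores, dimension_weights):
    """Weighted mean of per-dimension scores, each in [0.0, 1.0]."""
    total = sum(dimension_weights[d] for d in dimension_scores)
    return sum(s * dimension_weights[d] for d, s in dimension_scores.items()) / total


def badge_for(score, thresholds):
    """Return the first badge whose minimum is met; thresholds sorted high to low."""
    for name, minimum in thresholds:
        if score >= minimum:
            return name
    return None


layer_weights = {"static": 0.3, "llm": 0.3, "empirical": 0.4}
# Empirical layer has not run yet for this dimension
intent = blend_available({"static": 0.8, "llm": 0.6, "empirical": None}, layer_weights)
overall = composite({"intent_clarity": intent, "triggering_accuracy": 0.9},
                    {"intent_clarity": 1.0, "triggering_accuracy": 1.0})
# Illustrative thresholds only
badge = badge_for(overall, [("Gold", 0.85), ("Silver", 0.70)])
```

Here the static and LLM scores blend to about 0.7 for intent clarity, the two dimensions average to about 0.8, and the illustrative thresholds place that result in the second tier.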

When to Use It

The evaluation methodology skill should be used in several situations:

  • Interpreting PluginEval Results: When reviewing why a plugin or skill received a particular score or badge, especially on a specific dimension.
  • Improving Skills: When using dimension-specific feedback to improve triggering accuracy, orchestration, or documentation.
  • Calibrating Marketplace Standards: When setting or adjusting quality thresholds for marketplace listings.
  • Explaining Quality to Partners: When communicating the meaning of scores and badges to external parties, such as integration partners like Neon.
  • Preventing Anti-Patterns: When identifying and resolving common design or implementation mistakes flagged by the system.

Important Notes

  • Dimension-Specific Scoring: Each of the ten dimensions (such as intent clarity, orchestration fitness, and anti-pattern resistance) has its own rubric and weight, making it essential to address deficiencies on a per-dimension basis.
  • Layer Blending: Later evaluation layers can override earlier ones, especially if empirical results contradict static or LLM-based assessments.
  • Transparency: All scoring logic, including rubrics and formulas, is open and documented for auditability.
  • Actionable Guidance: The methodology not only scores but also provides tips for improvement, supporting skill authors in raising their quality.
  • Elo Ranking: Beyond absolute scores, skills are also ranked using an Elo system, providing a dynamic quality leaderboard.
  • Anti-Pattern Flags: Skills exhibiting known anti-patterns can be flagged, lowering their overall rating and prompting targeted remediation.
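
To make the Elo ranking note concrete, here is a minimal sketch of a standard Elo update after one pairwise comparison between two skills. The K-factor of 32 and the 400-point scale are the conventional defaults from chess rating, used here as assumptions; how PluginEval pairs skills, judges a winner, or sets these constants is not specified in this document.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a, rating_b, outcome_a, k=32):
    """Return the updated (rating_a, rating_b) after one comparison.

    outcome_a: 1.0 if A was judged better, 0.0 if B was, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (outcome_a - exp_a)
    # Points gained by one side are lost by the other
    return rating_a + delta, rating_b - delta


# Two equally rated skills; A is judged better in this comparison
new_a, new_b = update_elo(1500, 1500, 1.0)
```

Because the update is zero-sum, the leaderboard's total rating stays constant, and an upset against a higher-rated skill moves ratings more than an expected win does.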

For full rubric anchors, refer to the rubric reference documentation. For source code and additional resources, see the PluginEval repository.