BigCode Evaluation Harness
BigCode Evaluation Harness automation and integration
BigCode Evaluation Harness is a community skill for evaluating code generation models using standardized benchmarks, covering test execution, metric computation, multi-language support, and comparative analysis across code LLMs.
What Is This?
Overview
BigCode Evaluation Harness provides patterns for running code generation benchmarks that assess model performance on programming tasks. It covers benchmark setup for HumanEval, MBPP, and MultiPL-E datasets, sandboxed code execution for safety, pass@k metric computation, multi-language evaluation across Python, JavaScript, and other targets, and result comparison between models. The skill enables researchers to measure code generation quality using the community-standard evaluation framework.
Who Should Use This
This skill serves researchers benchmarking new code generation models against established baselines, teams evaluating code LLMs for production use in development tools, and engineers building evaluation pipelines that track model quality over training checkpoints.
Why Use It?
Problems It Solves
Comparing code generation models requires running the same benchmarks with identical configurations, which is error-prone when done manually. Executing model-generated code without sandboxing creates security risks from untrusted outputs. Computing pass@k metrics correctly requires multiple samples per problem with specific statistical formulas. Testing across multiple programming languages multiplies the evaluation complexity for each model.
Core Highlights
Standardized benchmark loading provides consistent problem sets with test cases for fair comparison. Sandboxed execution runs generated code in isolated environments to prevent harmful side effects. Statistical metric computation calculates pass@k with correct unbiased estimators. Multi-language evaluation tests the same problems translated into different target programming languages.
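To make the estimator concrete, here is a minimal sketch (not part of the harness API) of the closed form it uses: for a task with n generated samples of which c pass the tests, pass@k is the probability that at least one of k samples drawn without replacement is correct.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # pass@k = 1 - C(n - c, k) / C(n, k): the chance that a draw of k samples
    # out of the n generated ones contains at least one that passed.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples with 5 passing gives pass@1 = 0.25 and pass@10 ≈ 0.984.
print(pass_at_k(20, 5, 1), pass_at_k(20, 5, 10))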
How to Use It?
Basic Usage
from dataclasses import dataclass, field
import subprocess
import tempfile
from pathlib import Path


@dataclass
class EvalProblem:
    task_id: str
    prompt: str
    test_code: str
    entry_point: str
    language: str = "python"


@dataclass
class EvalResult:
    task_id: str
    passed: bool
    output: str = ""
    error: str = ""


class CodeEvaluator:
    def __init__(self, timeout: int = 10):
        self.timeout = timeout

    def execute_python(self, code: str, test: str) -> EvalResult:
        # Append the benchmark's test code to the generated completion and
        # run the combined file in a subprocess; exit code 0 means all
        # assertions passed.
        full_code = code + "\n" + test
        with tempfile.NamedTemporaryFile(mode="w", suffix=".py",
                                         delete=False) as f:
            f.write(full_code)
            f.flush()
        try:
            result = subprocess.run(
                ["python3", f.name],
                capture_output=True, text=True,
                timeout=self.timeout,
            )
            passed = result.returncode == 0
            return EvalResult(
                task_id="", passed=passed,
                output=result.stdout, error=result.stderr)
        except subprocess.TimeoutExpired:
            return EvalResult(
                task_id="", passed=False, error="Timeout")
        finally:
            # Remove the temporary script once the run is finished.
            Path(f.name).unlink(missing_ok=True)
Real-World Examples
from dataclasses import dataclass, field
import math


@dataclass
class BenchmarkRunner:
    problems: list[EvalProblem] = field(default_factory=list)
    results: list[EvalResult] = field(default_factory=list)

    def run_all(self, generate_fn, evaluator: CodeEvaluator,
                samples_per_task: int = 1) -> dict:
        # Generate and score several samples per task, collecting the
        # per-sample pass/fail outcomes needed for pass@k.
        task_results: dict[str, list[bool]] = {}
        for problem in self.problems:
            outcomes = []
            for _ in range(samples_per_task):
                completion = generate_fn(problem.prompt)
                result = evaluator.execute_python(
                    completion, problem.test_code)
                result.task_id = problem.task_id
                outcomes.append(result.passed)
                self.results.append(result)
            task_results[problem.task_id] = outcomes
        return task_results


def compute_pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator: 1 - C(n - c, k) / C(n, k), where n samples were
    # drawn for the task and c of them passed.
    if n - c < k:
        return 1.0
    return 1.0 - math.prod(
        (n - c - i) / (n - i) for i in range(k))


def aggregate_metrics(task_results: dict[str, list[bool]],
                      k_values: list[int]) -> dict:
    # Average the per-task pass@k estimates and report them as percentages.
    metrics = {}
    for k in k_values:
        scores = []
        for outcomes in task_results.values():
            n = len(outcomes)
            c = sum(outcomes)
            scores.append(compute_pass_at_k(n, c, k))
        metrics[f"pass@{k}"] = round(
            sum(scores) / len(scores) * 100, 2)
    return metrics
Advanced Tips
Use Docker containers for code execution sandboxing to isolate generated code from the host system. Generate multiple samples per problem at different temperatures to compute pass@k with statistical significance. Cache generated completions to enable re-evaluation against updated test suites without regenerating samples.
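As an illustration of the first tip, a Docker-based executor could look like the sketch below; the image name, mount path, and resource limits are assumptions chosen for the example rather than settings prescribed by the harness.

import subprocess
import tempfile
from pathlib import Path

def execute_in_docker(code: str, test: str, timeout: int = 10) -> bool:
    # Write the completion plus its tests into a temp directory that is
    # mounted read-only into a disposable container with no network access.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "candidate.py").write_text(code + "\n" + test)
        try:
            result = subprocess.run(
                ["docker", "run", "--rm", "--network", "none",
                 "--memory", "256m", "--cpus", "1",
                 "-v", f"{tmp}:/work:ro",
                 "python:3.11-slim", "python3", "/work/candidate.py"],
                capture_output=True, text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # The client timed out; a stricter setup would also stop the
            # container itself (for example via a named container and docker stop).
            return False
        return result.returncode == 0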
When to Use It?
Use Cases
Benchmark a fine-tuned code model against base models on HumanEval and MBPP to measure improvement. Evaluate multi-language code generation ability using MultiPL-E translated benchmarks. Track code generation quality across training checkpoints to identify optimal stopping points.
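For the first use case, a HumanEval run could be wired up roughly as follows using the classes defined above. This is a sketch that assumes the Hugging Face datasets library, the field names of the openai_humaneval dataset, and a user-supplied generate_fn (hypothetical here) that returns the full function text, prompt plus completion.

from datasets import load_dataset

rows = load_dataset("openai_humaneval", split="test")
problems = [
    EvalProblem(
        task_id=row["task_id"],
        prompt=row["prompt"],
        # HumanEval's test field defines check(); invoke it on the entry point.
        test_code=row["test"] + f"\ncheck({row['entry_point']})",
        entry_point=row["entry_point"],
    )
    for row in rows
]

runner = BenchmarkRunner(problems=problems)
# generate_fn is your model wrapper (hypothetical): prompt -> full function text.
task_results = runner.run_all(generate_fn, CodeEvaluator(timeout=10),
                              samples_per_task=20)
print(aggregate_metrics(task_results, k_values=[1, 10]))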
Related Topics
Code generation benchmarks, model evaluation pipelines, sandboxed code execution, statistical testing methods, and LLM benchmark standardization.
Important Notes
Requirements
Benchmark datasets such as HumanEval or MBPP in the expected format. A sandboxed execution environment for running generated code safely. Language runtimes installed for each target programming language under evaluation.
Usage Recommendations
Do: use the unbiased pass@k estimator with sufficient samples for statistically meaningful results. Run evaluations in isolated containers to prevent generated code from affecting the host system. Report generation parameters including temperature and sampling method alongside benchmark scores.
Don't: compare pass@k scores computed with different numbers of samples per task, as the estimates are not directly comparable. Execute generated code outside a sandbox even for seemingly safe benchmarks. Report only the best score from multiple evaluation runs without mentioning variance.
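One lightweight way to follow the parameter-reporting recommendation is to persist the sampling configuration next to the scores; the file name and fields below are illustrative, not a format defined by the harness, and task_results comes from the run in the example above.

import json

report = {
    "model": "my-org/my-code-model",  # illustrative identifier
    "benchmark": "humaneval",
    "generation": {"temperature": 0.8, "top_p": 0.95,
                   "samples_per_task": 20, "max_new_tokens": 512},
    "metrics": aggregate_metrics(task_results, k_values=[1, 10]),
}
with open("eval_report.json", "w") as fh:
    json.dump(report, fh, indent=2)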
Limitations
Benchmark performance does not fully predict real-world code generation utility for diverse programming tasks. Test cases in benchmarks may not cover edge cases, allowing incorrect solutions to pass. Sandboxed execution adds overhead that limits evaluation throughput for large-scale benchmark runs.