LM Evaluation Harness

Automate language model evaluation and integrate standardized benchmarking workflows

LM Evaluation Harness is a community skill for benchmarking language models using standardized evaluation suites, covering task configuration, few-shot evaluation, metric computation, and result comparison across model checkpoints.

What Is This?

Overview

LM Evaluation Harness provides patterns for running comprehensive language model benchmarks using the EleutherAI evaluation framework. It covers task selection from hundreds of built-in benchmarks, few-shot prompt configuration, batch evaluation across multiple tasks, metric aggregation and reporting, and custom task definition for domain-specific evaluation. The skill enables researchers to measure model capabilities systematically across standardized tasks.

Who Should Use This

This skill serves researchers comparing model performance across standard NLP benchmarks, teams evaluating fine-tuned models against base model baselines, and engineers building automated evaluation pipelines that track model quality over training iterations.

Why Use It?

Problems It Solves

Implementing individual benchmark evaluations from scratch is time-consuming and error-prone. Different evaluation implementations of the same benchmark can produce inconsistent scores due to subtle differences in prompt formatting. Comparing published results from different papers is unreliable when evaluation setups differ. Running evaluations across many tasks manually requires coordinating dataset loading, prompting, and metric computation.

Core Highlights

Standardized task implementations ensure consistent evaluation across different models and research groups. Few-shot configuration controls the number of examples included in prompts for each benchmark task. Batch evaluation runs multiple benchmarks in a single command with aggregated result reporting. Custom task registration enables adding domain-specific evaluations alongside standard benchmarks.

How to Use It?

Basic Usage

from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvalTask:
    name: str
    num_fewshot: int = 0
    metric: str = "accuracy"
    dataset: list[dict] = field(default_factory=list)

@dataclass
class EvalConfig:
    model_name: str
    tasks: list[EvalTask] = field(default_factory=list)
    batch_size: int = 8
    output_dir: str = "./eval_results"

class EvaluationHarness:
    def __init__(self, config: EvalConfig):
        self.config = config
        self.results: dict[str, dict] = {}

    def build_prompt(self, task: EvalTask,
                     examples: list[dict],
                     query: dict) -> str:
        # Prepend up to num_fewshot demonstrations before the query.
        parts = []
        for ex in examples[:task.num_fewshot]:
            parts.append(f"Q: {ex['question']}\nA: {ex['answer']}")
        parts.append(f"Q: {query['question']}\nA:")
        return "\n\n".join(parts)

    def evaluate_task(self, task: EvalTask,
                      predict_fn: Callable[[str], str]) -> dict:
        correct = 0
        total = len(task.dataset)
        for item in task.dataset:
            # Exclude the query item from its own few-shot examples so
            # early dataset entries are not leaked into their prompts.
            fewshot = [ex for ex in task.dataset
                       if ex is not item][:task.num_fewshot]
            prompt = self.build_prompt(task, fewshot, item)
            prediction = predict_fn(prompt)
            if prediction.strip() == item.get("answer", "").strip():
                correct += 1
        score = correct / max(total, 1)  # guard against empty datasets
        self.results[task.name] = {
            task.metric: round(score, 4), "total": total}
        return self.results[task.name]
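
To see the harness in action, here is a minimal smoke test with a toy two-item dataset and a stub predict_fn; both are illustrative placeholders, not real benchmark data:

demo = EvalTask(
    name="toy_qa", num_fewshot=1,
    dataset=[
        {"question": "2 + 2?", "answer": "4"},
        {"question": "Capital of France?", "answer": "Paris"},
    ])
harness = EvaluationHarness(
    EvalConfig(model_name="demo-model", tasks=[demo]))

# Stub model that always answers "4"; a real predict_fn would call the model.
print(harness.evaluate_task(demo, lambda prompt: "4"))
# -> {'accuracy': 0.5, 'total': 2}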

Real-World Examples

from dataclasses import dataclass, field
import json
from pathlib import Path

@dataclass
class BenchmarkSuite:
    """Groups related tasks so a suite can be run and reported as a unit."""
    name: str
    tasks: list[EvalTask]
    results: dict[str, dict] = field(default_factory=dict)

class ModelComparator:
    def __init__(self):
        self.model_results: dict[str, dict[str, float]] = {}

    def add_result(self, model_name: str,
                   task_name: str, score: float):
        self.model_results.setdefault(model_name, {})
        self.model_results[model_name][task_name] = score

    def compare(self) -> list[dict]:
        # Rank models by their unweighted average score across tasks.
        rows = []
        for model, scores in self.model_results.items():
            avg = sum(scores.values()) / max(len(scores), 1)
            rows.append({"model": model, "scores": scores,
                         "average": round(avg, 4)})
        return sorted(rows, key=lambda x: x["average"],
                      reverse=True)

    def save_report(self, path: str):
        report = self.compare()
        Path(path).write_text(json.dumps(report, indent=2))

comparator = ModelComparator()
comparator.add_result("model-base", "hellaswag", 0.72)
comparator.add_result("model-base", "arc_easy", 0.68)
comparator.add_result("model-tuned", "hellaswag", 0.78)
comparator.add_result("model-tuned", "arc_easy", 0.74)
for row in comparator.compare():
    print(f"{row['model']}: avg={row['average']}")

Advanced Tips

Run evaluations at multiple few-shot counts to understand how model performance scales with in-context examples (see the sweep sketch below). Use model parallelism for large models that do not fit in a single GPU's memory during evaluation. Register custom tasks by defining prompt templates and answer extraction functions for domain-specific benchmarks.
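
A few-shot sweep over the toy harness above might look like the following sketch; the 0/1/5-shot grid is just a common convention, not a requirement:

def fewshot_sweep(task: EvalTask, predict_fn,
                  shots=(0, 1, 5)) -> dict[int, float]:
    # Re-run the same task at several few-shot settings and collect
    # the metric so scaling with in-context examples is visible.
    scores = {}
    for n in shots:
        task.num_fewshot = n
        harness = EvaluationHarness(
            EvalConfig(model_name="sweep", tasks=[task]))
        scores[n] = harness.evaluate_task(task, predict_fn)[task.metric]
    return scores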

When to Use It?

Use Cases

Benchmark a fine-tuned model against its base version across standard NLP tasks. Compare multiple model candidates on reasoning, knowledge, and language understanding benchmarks before selecting one for production. Track evaluation scores across training checkpoints to identify the optimal stopping point.
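
The checkpoint-tracking use case can reuse ModelComparator from the earlier example. This sketch assumes a hypothetical evaluate_checkpoint helper that wraps whatever evaluation backend is in use and returns per-task scores:

def track_checkpoints(checkpoints: list[str],
                      evaluate_checkpoint) -> str:
    # evaluate_checkpoint(name) -> {task_name: score} is an assumed
    # helper, not part of the harness above.
    comparator = ModelComparator()
    for ckpt in checkpoints:
        for task_name, score in evaluate_checkpoint(ckpt).items():
            comparator.add_result(ckpt, task_name, score)
    # The checkpoint with the best average is a candidate stopping point.
    return comparator.compare()[0]["model"]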

Related Topics

Language model benchmarking, few-shot evaluation methods, NLP task datasets, model comparison frameworks, and evaluation metric standardization.

Important Notes

Requirements

The lm-eval Python package with its task dependencies installed. A model accessible through a supported backend, such as Hugging Face Transformers or a local GGUF file. Sufficient GPU memory for loading the model and running inference on benchmark datasets.
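
With the package installed, a run typically goes through lm_eval.simple_evaluate. The sketch below assumes lm-eval v0.4+; the model id EleutherAI/pythia-160m is just a small example checkpoint:

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])  # per-task metric dictionaries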

Usage Recommendations

Do: report the exact evaluation configuration including few-shot count and prompt format alongside benchmark scores. Use consistent configurations when comparing models to ensure fair comparisons. Run evaluations on the complete benchmark datasets rather than subsets for publishable results.
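
One lightweight way to follow the first recommendation with the toy harness is to serialize the full EvalConfig next to the scores; a sketch using dataclasses.asdict (note it also embeds the datasets, which may be large):

import json
from dataclasses import asdict
from pathlib import Path

def save_run(harness: EvaluationHarness, path: str):
    # Store scores together with the exact configuration that
    # produced them, so runs stay reproducible and comparable.
    record = {"config": asdict(harness.config),
              "results": harness.results}
    Path(path).write_text(json.dumps(record, indent=2))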

Don't: compare scores from different evaluation frameworks without verifying that implementations match. Cherry-pick benchmarks that favor a particular model while omitting tasks where it underperforms. Ignore confidence intervals when reporting results from small evaluation datasets.
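
For the last point, a normal-approximation interval for accuracy is a reasonable first check; this is a sketch, and Wilson or bootstrap intervals are safer for very small datasets:

import math

def accuracy_ci(p: float, n: int,
                z: float = 1.96) -> tuple[float, float]:
    # 95% normal-approximation interval: p +/- z * sqrt(p(1-p)/n).
    half = z * math.sqrt(p * (1 - p) / max(n, 1))
    return (max(0.0, p - half), min(1.0, p + half))

print(accuracy_ci(0.72, 100))  # roughly (0.63, 0.81)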

Limitations

Benchmark scores do not capture all aspects of model quality relevant to production use cases. Evaluation runtime scales with model size and the number of tasks, making full benchmark suites time-consuming for large models. Some benchmarks may be partially included in model training data, inflating scores without reflecting true generalization.