LM Evaluation Harness
Automate language model evaluation and integrate standardized benchmarking workflows
LM Evaluation Harness is a community skill for benchmarking language models using standardized evaluation suites, covering task configuration, few-shot evaluation, metric computation, and result comparison across model checkpoints.
What Is This?
Overview
LM Evaluation Harness provides patterns for running comprehensive language model benchmarks using the EleutherAI evaluation framework. It covers task selection from hundreds of built-in benchmarks, few-shot prompt configuration, batch evaluation across multiple tasks, metric aggregation and reporting, and custom task definition for domain-specific evaluation. The skill enables researchers to measure model capabilities systematically across standardized tasks.
Who Should Use This
This skill serves researchers comparing model performance across standard NLP benchmarks, teams evaluating fine-tuned models against base model baselines, and engineers building automated evaluation pipelines that track model quality over training iterations.
Why Use It?
Problems It Solves
Implementing individual benchmark evaluations from scratch is time-consuming and error-prone. Different evaluation implementations of the same benchmark can produce inconsistent scores due to subtle differences in prompt formatting. Comparing published results from different papers is unreliable when evaluation setups differ. Running evaluations across many tasks manually requires coordinating dataset loading, prompting, and metric computation.
Core Highlights
Standardized task implementations ensure consistent evaluation across different models and research groups. Few-shot configuration controls the number of examples included in prompts for each benchmark task. Batch evaluation runs multiple benchmarks in a single command with aggregated result reporting. Custom task registration enables adding domain-specific evaluations alongside standard benchmarks.
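In practice, batch evaluation with the upstream harness is a single CLI invocation. As a hedged sketch, the command can be assembled programmatically; the flag names follow the lm-evaluation-harness CLI but should be verified against `lm_eval --help` for your installed version, and the model name is purely illustrative:

```python
# Sketch: assemble (but do not run) an lm-eval CLI invocation.
# Flag names are assumptions based on the lm-evaluation-harness CLI;
# check `lm_eval --help` for your installed version.

def build_lm_eval_command(model_args: str, tasks: list[str],
                          num_fewshot: int = 0, batch_size: int = 8,
                          output_path: str = "./eval_results") -> list[str]:
    """Build an argv list suitable for subprocess.run."""
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", model_args,
        "--tasks", ",".join(tasks),
        "--num_fewshot", str(num_fewshot),
        "--batch_size", str(batch_size),
        "--output_path", output_path,
    ]

cmd = build_lm_eval_command(
    "pretrained=EleutherAI/pythia-160m",
    ["hellaswag", "arc_easy"], num_fewshot=5)
print(" ".join(cmd))
```

Building the argv list separately makes the configuration easy to log alongside results, which matters later when comparing runs.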
How to Use It?
Basic Usage
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvalTask:
    name: str
    num_fewshot: int = 0
    metric: str = "accuracy"
    dataset: list[dict] = field(default_factory=list)

@dataclass
class EvalConfig:
    model_name: str
    tasks: list[EvalTask] = field(default_factory=list)
    batch_size: int = 8
    output_dir: str = "./eval_results"

class EvaluationHarness:
    def __init__(self, config: EvalConfig):
        self.config = config
        self.results: dict[str, dict] = {}

    def build_prompt(self, task: EvalTask,
                     examples: list[dict],
                     query: dict) -> str:
        parts = []
        for ex in examples[:task.num_fewshot]:
            parts.append(f"Q: {ex['question']}\nA: {ex['answer']}")
        parts.append(f"Q: {query['question']}\nA:")
        return "\n\n".join(parts)

    def evaluate_task(self, task: EvalTask,
                      predict_fn: Callable[[str], str]) -> dict:
        # Reserve the first num_fewshot items as in-context examples
        # and score only the remainder, so prompt examples are never
        # counted toward accuracy.
        fewshot = task.dataset[:task.num_fewshot]
        eval_items = task.dataset[task.num_fewshot:]
        correct = 0
        for item in eval_items:
            prompt = self.build_prompt(task, fewshot, item)
            prediction = predict_fn(prompt)
            if prediction.strip() == item.get("answer", "").strip():
                correct += 1
        total = len(eval_items)
        score = correct / max(total, 1)
        self.results[task.name] = {
            task.metric: round(score, 4), "total": total}
        return self.results[task.name]
Real-World Examples
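Before the comparison example below, here is a minimal end-to-end run of the EvaluationHarness above with a stub model. The class definitions are repeated in condensed form so the sketch is self-contained, and the lookup-table predict_fn is purely illustrative:

```python
# Self-contained sketch: condensed copies of EvalTask/EvaluationHarness
# from above, driven by an illustrative lookup-table predict_fn.
from dataclasses import dataclass, field

@dataclass
class EvalTask:
    name: str
    num_fewshot: int = 0
    metric: str = "accuracy"
    dataset: list[dict] = field(default_factory=list)

class EvaluationHarness:
    def __init__(self):
        self.results: dict[str, dict] = {}

    def build_prompt(self, task, examples, query):
        parts = [f"Q: {ex['question']}\nA: {ex['answer']}"
                 for ex in examples[:task.num_fewshot]]
        parts.append(f"Q: {query['question']}\nA:")
        return "\n\n".join(parts)

    def evaluate_task(self, task, predict_fn):
        # Score only items held out from the few-shot prompt.
        fewshot = task.dataset[:task.num_fewshot]
        eval_items = task.dataset[task.num_fewshot:]
        correct = sum(
            predict_fn(self.build_prompt(task, fewshot, it)).strip()
            == it.get("answer", "").strip()
            for it in eval_items)
        total = len(eval_items)
        self.results[task.name] = {
            task.metric: round(correct / max(total, 1), 4),
            "total": total}
        return self.results[task.name]

task = EvalTask(name="toy_arithmetic", num_fewshot=1, dataset=[
    {"question": "1+1?", "answer": "2"},  # becomes the few-shot example
    {"question": "2+2?", "answer": "4"},
    {"question": "3+3?", "answer": "6"},
])

# Stub model: answers the final question in the prompt via lookup,
# deliberately wrong on one item so the score is non-trivial.
answers = {"2+2?": "4", "3+3?": "7"}
def predict_fn(prompt: str) -> str:
    question = prompt.rsplit("Q: ", 1)[1].split("\n")[0]
    return answers.get(question, "")

print(EvaluationHarness().evaluate_task(task, predict_fn))
# → {'accuracy': 0.5, 'total': 2}
```

The same predict_fn slot is where a real model call (a Hugging Face pipeline, an API client) would plug in.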
from dataclasses import dataclass, field
import json
from pathlib import Path

@dataclass
class BenchmarkSuite:
    name: str
    tasks: list[EvalTask]
    results: dict[str, dict] = field(default_factory=dict)

class ModelComparator:
    def __init__(self):
        self.model_results: dict[str, dict[str, float]] = {}

    def add_result(self, model_name: str,
                   task_name: str, score: float):
        self.model_results.setdefault(model_name, {})
        self.model_results[model_name][task_name] = score

    def compare(self) -> list[dict]:
        rows = []
        for model, scores in self.model_results.items():
            avg = sum(scores.values()) / max(len(scores), 1)
            rows.append({"model": model, "scores": scores,
                         "average": round(avg, 4)})
        return sorted(rows, key=lambda x: x["average"],
                      reverse=True)

    def save_report(self, path: str):
        report = self.compare()
        Path(path).write_text(json.dumps(report, indent=2))

comparator = ModelComparator()
comparator.add_result("model-base", "hellaswag", 0.72)
comparator.add_result("model-base", "arc_easy", 0.68)
comparator.add_result("model-tuned", "hellaswag", 0.78)
comparator.add_result("model-tuned", "arc_easy", 0.74)
for row in comparator.compare():
    print(f"{row['model']}: avg={row['average']}")
Advanced Tips
Run evaluations at several few-shot counts to understand how model performance scales with the number of in-context examples. Use model parallelism during evaluation for models too large to fit in a single GPU's memory. Register custom tasks by defining prompt templates and answer-extraction functions for domain-specific benchmarks.
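Custom task registration can be as simple as mapping a task name to a prompt template and an answer-extraction function. A hedged sketch of that idea, where the registry structure and function names are illustrative rather than the upstream registration API:

```python
# Illustrative custom-task registry: the structure here is an
# assumption for this sketch, not the lm-eval registration API.
from typing import Callable

TASK_REGISTRY: dict[str, dict] = {}

def register_task(name: str,
                  prompt_template: Callable[[dict], str],
                  extract_answer: Callable[[str], str]):
    """Register a domain-specific task by its two key pieces."""
    TASK_REGISTRY[name] = {
        "prompt": prompt_template,
        "extract": extract_answer,
    }

# A hypothetical domain task: abbreviation expansion.
register_task(
    "abbrev_expansion",
    prompt_template=lambda ex: (
        f"Expand the abbreviation: {ex['abbrev']}\nExpansion:"),
    extract_answer=lambda raw: raw.strip().split("\n")[0].lower(),
)

task = TASK_REGISTRY["abbrev_expansion"]
print(task["prompt"]({"abbrev": "BP"}))
print(task["extract"]("  Blood Pressure\nextra generated text"))
```

Keeping prompting and answer extraction as separate functions lets the same evaluation loop serve every registered task.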
When to Use It?
Use Cases
Benchmark a fine-tuned model against its base version across standard NLP tasks. Compare multiple model candidates on reasoning, knowledge, and language understanding benchmarks before selecting one for production. Track evaluation scores across training checkpoints to identify the optimal stopping point.
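For the checkpoint-tracking use case, the selection step reduces to picking the checkpoint with the best aggregate score. A minimal sketch, assuming per-task scores have already been collected (the checkpoint names and numbers are illustrative):

```python
# Sketch: pick the best training checkpoint from per-task eval scores.
# Checkpoint names and scores are illustrative.

def best_checkpoint(history: dict[str, dict[str, float]]) -> str:
    """Return the checkpoint whose mean task score is highest."""
    return max(
        history,
        key=lambda ckpt: sum(history[ckpt].values()) / len(history[ckpt]),
    )

history = {
    "step-1000": {"hellaswag": 0.61, "arc_easy": 0.58},
    "step-2000": {"hellaswag": 0.70, "arc_easy": 0.66},
    "step-3000": {"hellaswag": 0.69, "arc_easy": 0.65},  # plateaued
}
print(best_checkpoint(history))  # → step-2000
```

In a real pipeline the `history` dict would be filled by running the same task set after each checkpoint save, with an identical configuration each time.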
Related Topics
Language model benchmarking, few-shot evaluation methods, NLP task datasets, model comparison frameworks, and evaluation metric standardization.
Important Notes
Requirements
The lm-eval Python package with task dependencies installed. Model accessible through a supported backend such as Hugging Face or a local GGUF file. Sufficient GPU memory for loading the model and running inference on benchmark datasets.
Usage Recommendations
Do: report the exact evaluation configuration including few-shot count and prompt format alongside benchmark scores. Use consistent configurations when comparing models to ensure fair comparisons. Run evaluations on the complete benchmark datasets rather than subsets for publishable results.
Don't: compare scores from different evaluation frameworks without verifying that implementations match. Cherry-pick benchmarks that favor a particular model while omitting tasks where it underperforms. Ignore confidence intervals when reporting results from small evaluation datasets.
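On the confidence-interval point: a Wilson score interval on accuracy is a simple way to show how much uncertainty a small eval set carries. A sketch using the standard formula (z = 1.96 for a 95% interval):

```python
# Wilson score interval for a proportion (e.g. accuracy on n items).
import math

def wilson_interval(correct: int, n: int,
                    z: float = 1.96) -> tuple[float, float]:
    """95% Wilson confidence interval for the proportion correct/n."""
    if n == 0:
        return (0.0, 1.0)
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half, center + half)

# The same 72% accuracy is far less certain on 50 items than on 5000:
lo, hi = wilson_interval(36, 50)
print(f"n=50:   {lo:.3f}-{hi:.3f}")
lo, hi = wilson_interval(3600, 5000)
print(f"n=5000: {lo:.3f}-{hi:.3f}")
```

Reporting the interval alongside the point score makes it obvious when two models' results on a small benchmark are statistically indistinguishable.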
Limitations
Benchmark scores do not capture all aspects of model quality relevant to production use cases. Evaluation runtime scales with model size and the number of tasks, making full benchmark suites time-consuming for large models. Some benchmarks may be partially included in model training data, inflating scores without reflecting true generalization.