lm-evaluation-harness

EleutherAI's de facto academic harness for raw model benchmarks; backend for the HuggingFace Open LLM Leaderboard.

Positioning

The lm-evaluation-harness, often shortened to lm_eval after its command-line entry point, is EleutherAI's open-source benchmark runner. It is the backend behind the numbers on the HuggingFace Open LLM Leaderboard and in dozens of academic papers. The harness covers MMLU, BBH, ARC, HellaSwag, TruthfulQA, GSM8K, HumanEval, GPQA, IFEval, MMMLU, and hundreds of other tasks via YAML configs. A 2025 refactor cleaned up the CLI with run, ls, and validate subcommands and made task definitions YAML-first.

The harness is the right tool when you want to compute a model's score on a published benchmark with the same prompt template, few-shot examples, and scoring rules used by the academic community. For MMLU in particular, the harness implementation is a commonly cited reference, although other implementations (the original benchmark code and HELM) can produce different scores for the same model ¹ ².
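To make "same prompt template and few-shot examples" concrete, here is a minimal sketch of how an MMLU-style few-shot prompt can be assembled. The layout follows the format described by Hendrycks et al.; the names MCQuestion, format_question, and build_prompt are hypothetical and are not part of the harness, whose real templates live in its task YAML files.

```python
from dataclasses import dataclass

CHOICE_LETTERS = "ABCD"


@dataclass
class MCQuestion:
    """Hypothetical container for one multiple-choice item."""
    question: str
    choices: list
    answer_index: int  # index into `choices` of the correct option


def format_question(q, include_answer):
    """Render one item; few-shot examples include the gold letter."""
    lines = [q.question.strip()]
    lines += [f"{letter}. {choice}" for letter, choice in zip(CHOICE_LETTERS, q.choices)]
    answer = f" {CHOICE_LETTERS[q.answer_index]}" if include_answer else ""
    lines.append(f"Answer:{answer}")
    return "\n".join(lines)


def build_prompt(subject, shots, target):
    """Join a subject header, the solved few-shot items, and the unsolved target."""
    header = f"The following are multiple choice questions (with answers) about {subject}."
    blocks = [header]
    blocks += [format_question(s, include_answer=True) for s in shots]
    blocks.append(format_question(target, include_answer=False))
    return "\n\n".join(blocks)


# Illustrative data only; real runs draw shots from the benchmark's dev split.
shot = MCQuestion("What is 2 + 2?", ["3", "4", "5", "6"], 1)
target = MCQuestion("What is 3 + 3?", ["5", "6", "7", "8"], 1)
prompt = build_prompt("elementary mathematics", [shot], target)
print(prompt)
```

Every detail in this template (the header sentence, the "A." letter style, the spacing after "Answer:") is part of what a benchmark score depends on, which is why reusing the harness's task definition matters when comparing against published numbers.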

Strengths

De facto academic standard. When a paper reports MMLU or HellaSwag, it very often used this harness. Reproducing or comparing against a published number is straightforward when the paper states the harness version and task configuration.
YAML-first tasks. Adding a new benchmark or variant is a YAML file, not a code change. The 2025 refactor made this even cleaner.
Wide model backend support. HuggingFace Transformers, vLLM, OpenAI-compatible APIs, Anthropic, local quantized models, and others all plug in via simple model adapters.

Limitations

Not application-level. The harness scores models on closed-form benchmarks. It does not help you evaluate a RAG pipeline, an agent, or a prompt against your own custom rubric.
Format sensitivity. Small changes to prompt templates, normalization, and choice-letter formatting can move benchmark scores by several points. Independent analyses have flagged this as a real reproducibility issue ².
No observability or dashboards. Results are JSON files. Pair the harness with an experiment tracker or results hub, or load the JSON into a notebook for visualization.
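The format-sensitivity point above can be illustrated with a toy example. The sketch below scores the same five model outputs with two answer-matching rules; the scorers strict_match and lenient_match and the sample data are invented for illustration and are not the harness's actual filters.

```python
import re


def strict_match(prediction, gold):
    """Count a hit only when the raw output equals the gold letter exactly."""
    return prediction == gold


def lenient_match(prediction, gold):
    """Upper-case the output and take the first standalone letter A-D.

    Toy rule: it would also pick up a stray article "a" in free text.
    """
    match = re.search(r"\b([A-D])\b", prediction.upper())
    return match is not None and match.group(1) == gold


def accuracy(predictions, golds, scorer):
    """Fraction of predictions the given scorer accepts."""
    hits = sum(scorer(p, g) for p, g in zip(predictions, golds))
    return hits / len(golds)


# Made-up outputs that are all "right" but formatted differently.
predictions = ["B", " b", "(C)", "Answer: A", "D."]
golds = ["B", "B", "C", "A", "D"]

strict_acc = accuracy(predictions, golds, strict_match)
lenient_acc = accuracy(predictions, golds, lenient_match)
print(f"strict={strict_acc:.2f} lenient={lenient_acc:.2f}")
```

The gap here is exaggerated by design, but the mechanism is the same one behind real discrepancies: two evaluations that disagree on normalization are not measuring the same thing, so record the task config alongside every score.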

Best fit

Pure model benchmarking when you swap base models, contamination-style ablations, language-coverage audits, and reproducing published numbers. Strongest fit for research engineers and platform teams that need to make data-driven model selection decisions.

Getting started

pip install lm-eval

# MMLU on a local HuggingFace model.
lm_eval \
  --model hf \
  --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct \
  --tasks mmlu \
  --batch_size auto \
  --output_path results/llama_mmlu.json

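# Or on an OpenAI-compatible completions endpoint, such as a local vLLM server.
# arc_challenge and hellaswag are scored from token log-probabilities, so the
# endpoint must return logprobs; chat-only models such as gpt-4o-mini will not work.
# The base_url below is a placeholder for your own server.
lm_eval \
  --model local-completions \
  --model_args model=meta-llama/Llama-3.1-8B-Instruct,base_url=http://localhost:8000/v1/completions \
  --tasks arc_challenge,hellaswag \
  --num_fewshot 5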

The JSON output reports per-task accuracy, per-category breakdowns for grouped tasks such as MMLU, and standard-error estimates. Subsequent runs can be diffed with a small script or in a notebook.
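As a rough sketch of diffing two runs, the snippet below compares per-task accuracy and flags deltas that exceed the combined standard error. It assumes each results file has a top-level "results" object keyed by task name, with metric keys of the form "acc,none" and "acc_stderr,none"; check these against your own output, since key names vary by task and harness version. The helpers load_results and compare_runs are hypothetical.

```python
import json
import math


def load_results(path):
    """Read one results file and return its per-task metrics (assumed layout)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)["results"]


def _as_float(value):
    """Standard errors may be missing or non-numeric; treat those as unknown."""
    return float(value) if isinstance(value, (int, float)) else None


def compare_runs(baseline, candidate, metric="acc,none", stderr_key="acc_stderr,none"):
    """Return one row per shared task with the score delta and a noise check."""
    rows = []
    for task in sorted(set(baseline) & set(candidate)):
        a, b = baseline[task], candidate[task]
        if metric not in a or metric not in b:
            continue
        delta = b[metric] - a[metric]
        se_a, se_b = _as_float(a.get(stderr_key)), _as_float(b.get(stderr_key))
        outside_noise = None
        if se_a is not None and se_b is not None:
            # Rough 95% check that treats the two runs as independent samples.
            combined = math.sqrt(se_a ** 2 + se_b ** 2)
            outside_noise = abs(delta) > 1.96 * combined
        rows.append({"task": task, "delta": delta, "outside_noise": outside_noise})
    return rows


# In-memory stand-ins for two parsed result files (numbers are invented).
baseline = {
    "hellaswag": {"acc,none": 0.50, "acc_stderr,none": 0.01},
    "arc_challenge": {"acc,none": 0.40, "acc_stderr,none": 0.02},
}
candidate = {
    "hellaswag": {"acc,none": 0.56, "acc_stderr,none": 0.01},
    "arc_challenge": {"acc,none": 0.41, "acc_stderr,none": 0.02},
}
rows = compare_runs(baseline, candidate)
for row in rows:
    print(f"{row['task']:<15} delta={row['delta']:+.3f} outside_noise={row['outside_noise']}")
```

With real files, replace the in-memory dictionaries with load_results(path) calls. Because both runs score the same benchmark items, a paired test would be tighter than this independent-sample check; the sketch only helps avoid reading too much into small deltas.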

Pricing notes

lm-evaluation-harness is free under the MIT license. Compute costs (HuggingFace inference, vLLM hosting, OpenAI or Anthropic API calls) are paid to the underlying providers. Running the full Open LLM Leaderboard task suite on a frontier model through an API is the most expensive way to use the harness; local vLLM on a multi-GPU node is the usual cost-controlled path for repeated runs.

Alternatives

Consider HELM for a more structured scenario-and-metric matrix with maintained leaderboards. Consider Inspect AI for agentic and sandboxed evals that go beyond closed-form benchmarks. Consider OpenAI Evals for OpenAI-centric model regression. For application-level evals (RAG, agents), this harness is the wrong tool; pick DeepEval, Ragas, or Promptfoo.

Citations and last verified

Verified 2026-05-29 against the lm-evaluation-harness repository.

¹ Hendrycks et al., "Measuring Massive Multitask Language Understanding," arXiv:2009.03300.
² HuggingFace, "Evaluating MMLU."