HELM

Stanford CRFM's Holistic Evaluation of Language Models: a maintained scenario-and-metric matrix with public leaderboards.

Positioning

HELM is the eval framework that came out of the Stanford CRFM paper of the same name. The thesis is that one number per benchmark is not enough; a useful evaluation measures multiple metrics (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency) across a wide set of scenarios, and reports the full matrix ¹. The OSS implementation has continued to evolve through 2026 (v0.5.16 shipped April 30, 2026) and maintains public leaderboards across four tracks: Capabilities, Safety, VHELM (vision-language), and HEIM (image generation).

The framework is heavier than lm-evaluation-harness and aimed at producing reports rather than single-number scores. For practitioners, it is most useful when you need a structured comparison across many scenarios and dimensions, not when you want a quick MMLU number.

Strengths

Structured scenario-and-metric matrix. The HELM design forces multi-dimensional reporting (accuracy plus calibration plus robustness plus fairness) rather than collapsing everything to one accuracy column.
Maintained leaderboards. Stanford CRFM keeps public Capabilities, Safety, VHELM, and HEIM leaderboards updated as new models ship. The public scores are useful as a sanity check against your own benchmarking.
Apache 2.0 and reproducible. The implementation is permissively licensed, the scenario definitions are versioned, and the reported numbers are reproducible from the same harness.
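
To make the case for multi-dimensional reporting concrete, here is a minimal sketch of why accuracy alone can hide differences between models. It computes a simple binned expected calibration error (ECE) for two hypothetical models that answer the same four questions with identical accuracy but different confidence. The function name, the equal-width binning, and the toy data are assumptions made for illustration; this is not HELM's own metric code.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: weighted mean of |avg confidence - accuracy| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    # Assign each prediction to an equal-width confidence bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Toy data (hypothetical): both models get 3 of 4 answers right.
outcomes = [True, True, True, False]
model_a_conf = [0.75, 0.75, 0.75, 0.75]  # confidence matches accuracy
model_b_conf = [0.95, 0.95, 0.95, 0.95]  # overconfident

if __name__ == "__main__":
    print("model A ECE:", round(expected_calibration_error(model_a_conf, outcomes), 3))
    print("model B ECE:", round(expected_calibration_error(model_b_conf, outcomes), 3))
```

Both toy models would share one cell in an accuracy-only table; the calibration column is what separates them, with model B off by roughly 0.20.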

Limitations

Heavier than lm-evaluation-harness. Setup, scenario configuration, and metric stacking are more involved. For a quick MMLU ² comparison, lm-evaluation-harness is simpler.
Academic posture. The framework is designed for research reports, not CI. Most practitioners use it for one-off benchmarking, not as a continuous regression harness.
No application-level coverage. HELM scores models on closed-form scenarios. RAG, agents, and prompt-level evals are out of scope; use a different tool for those.

Best fit

Multi-dimensional model selection, language-coverage and fairness audits, reproducible academic-style reports, and cross-model comparisons where calibration and robustness matter as much as accuracy. Strongest fit for research engineers and platform teams making infrequent but high-stakes model decisions.

Getting started

pip install crfm-helm

# Run one MMLU subject (college_chemistry), capped at 100 instances.
# The model name must match an entry in your HELM model registry; adjust it for your setup.
helm-run \
  --run-entries mmlu:subject=college_chemistry,model=huggingface/llama-3.1-8b-instruct \
  --suite my_audit_v1 \
  --max-eval-instances 100

# Summarize results into a report.
helm-summarize --suite my_audit_v1

# Optionally serve a local leaderboard UI.
helm-server --suite my_audit_v1

The local UI mirrors the layout of the public HELM leaderboards, with per-scenario and per-metric breakdowns.

Pricing notes

HELM is free under the Apache 2.0 license. Compute costs (model inference) are paid to the underlying providers. Running the full HELM scenario matrix against a frontier model carries a substantial inference cost; most practitioners pick a subset of scenarios and a --max-eval-instances cap large enough to give statistically meaningful numbers without paying for every instance.
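
As a rough guide to choosing that cap, the sketch below estimates how wide the uncertainty on an accuracy score is for a given number of instances. It uses the standard normal approximation to a binomial proportion and treats instances as independent draws. The function names are illustrative and are not part of HELM.

```python
import math


def accuracy_margin(n_instances: int, accuracy: float = 0.5, z: float = 1.96) -> float:
    """Half-width of the normal-approximation interval around an accuracy score.

    accuracy=0.5 is the worst case (widest interval); z=1.96 gives ~95% coverage.
    """
    if n_instances <= 0:
        raise ValueError("n_instances must be positive")
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_instances)


def instances_for_margin(margin: float, accuracy: float = 0.5, z: float = 1.96) -> int:
    """Smallest instance count whose interval half-width is at most `margin`."""
    if margin <= 0:
        raise ValueError("margin must be positive")
    return math.ceil(z ** 2 * accuracy * (1.0 - accuracy) / margin ** 2)


if __name__ == "__main__":
    for n in (100, 400, 1000):
        print(f"n={n}: +/- {accuracy_margin(n) * 100:.1f} accuracy points")
    print("instances needed for +/- 5 points:", instances_for_margin(0.05))
```

Under these assumptions, a cap of 100 instances leaves roughly a 10-point worst-case margin, so two models a few points apart cannot be separated at that size; shrinking the margin to about 5 points takes close to 400 instances per scenario.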

Alternatives

Consider lm-evaluation-harness for a lighter harness when you want a single benchmark number rather than the full matrix. Consider Inspect AI for agentic and sandboxed evals beyond closed-form scenarios. For application-level evals (RAG, agents), HELM is not the right tool; pick DeepEval, Ragas, or Promptfoo.

Citations and last verified

Verified 2026-05-29 against the HELM repository and the CRFM public leaderboards.

¹ Liang et al., "Holistic Evaluation of Language Models," arXiv:2211.09110.
² Hendrycks et al., "Measuring Massive Multitask Language Understanding," arXiv:2009.03300.