OpenAI Evals

The canonical OpenAI evaluation framework: model regression for OpenAI models, now also wired into the OpenAI Dashboard.

Positioning

OpenAI Evals is the original OSS evaluation framework from OpenAI, released in 2023 and still maintained. The codebase is a registry of YAML- and JSON-defined evals plus Python templates for common patterns (basic, model-graded, rag, multiple-choice). The same registry now backs the eval workflows in the OpenAI Dashboard, so the framework doubles as the API for the hosted UI ¹.

The strongest fit is teams running OpenAI models who want a canonical regression suite that the model owner recognizes. For application-level evals (RAG, agents, custom rubrics), more recent frameworks like DeepEval, Promptfoo, and Inspect AI tend to be more ergonomic.

Strengths

Canonical for OpenAI model regression. When OpenAI publishes model card numbers, the eval definitions are often available in this registry. Reproducing or comparing against an OpenAI-reported number is direct.
Dashboard integration. Running the same eval from the OpenAI Dashboard or the CLI yields the same result and shows up in the same project, which is useful for teams that prefer a UI for trace inspection.
Wide contributor base. The repository has 18.6k stars and a long tail of community-contributed evals across reasoning, factuality, code, and persona benchmarks.

Limitations

Less ergonomic for app-level evals. The framework was designed for closed-form model evals first. RAG, agents, and custom-rubric workflows can be expressed but more recent frameworks are easier.
OpenAI-centric. Non-OpenAI models can be wired in but the registry, templates, and Dashboard integration assume OpenAI is the primary target.
Vendor self-interest. OpenAI runs and maintains this framework; vendor-published benchmarks on OpenAI models are vendor-published.

Best fit

OpenAI model selection and regression, contamination ablations, reproducing OpenAI-published numbers, and teams whose application is built on the OpenAI SDK and who want a canonical eval registry recognized by the model owner.

Getting started

git clone https://github.com/openai/evals.git
cd evals
pip install -e .

# Run a registered eval against a model.
oaieval gpt-4o-mini test-match \
  --record_path results/test_match.jsonl

The OpenAI Cookbook has an end-to-end walkthrough including dataset construction, evaluator selection, and result analysis ². For a custom eval, write a YAML registry entry that points at a JSONL dataset and a Python class that extends one of the templates.

Pricing notes

The framework is free under the MIT license. Compute is billed at OpenAI API rates per token. If you use the OpenAI Dashboard to run evals, the same per-token billing applies. Model-graded evals (the bulk of the more interesting templates) call a judge model per row, which is the dominant cost; cache aggressively and use a cheaper judge model when calibration shows it agrees with the gold judge within the tolerance you care about. The CLI also supports caching and result-file replay, which is the usual cost-control story for iterative prompt work.

Alternatives

Consider lm-evaluation-harness for academic benchmarks across many model providers. Consider Inspect AI for sandboxed agentic evals. Consider DeepEval or Promptfoo for application-level CI regression. Consider HELM for a structured scenario-and-metric matrix.

Citations and last verified

Verified 2026-05-29 against the OpenAI Evals repository and the OpenAI Cookbook.

OpenAI Evals, https://github.com/openai/evals. ↩
OpenAI Cookbook, "Getting Started with OpenAI Evals." ↩