Positioning
Inspect AI is the open-source frontier-model eval framework from the UK AI Security Institute (AISI, formerly the AI Safety Institute). The library treats an eval as a Python program: a dataset, a solver chain (prompts, tools, generation), and a scorer, all composed from typed primitives. Sandboxing is a first-class concern; evals can run inside Docker, Kubernetes, Modal, or Proxmox VMs without ad-hoc plumbing. A VS Code extension gives the eval program a debugger-style UI for inspecting solver traces [1].
The companion inspect_evals repository ships 200+ prebuilt evals across reasoning, agentic, safety, multimodal, and tool-use categories, maintained jointly by AISI and an active community [2].
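To make the "eval as a Python program" idea concrete, here is a toy sketch in plain Python of how a dataset, a solver chain, and a scorer compose. This is not Inspect's API: ToySample, add_instruction, fake_generate, exact_match, and run_toy_eval are hypothetical names invented for illustration, and the "model" is a lookup table so the sketch runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class ToySample:
    """Hypothetical stand-in for one dataset row: a prompt and its expected answer."""
    input: str
    target: str


# In this toy model a "solver" is any function that transforms the working text.
ToySolver = Callable[[str], str]


def add_instruction(text: str) -> str:
    """Prompt step: prepend an instruction line to the question."""
    return "Answer with a single word or number.\n" + text


def fake_generate(text: str) -> str:
    """Stand-in for a model call: look the question up in a fixed table."""
    canned = {"What is 2 + 2?": "4", "Capital of France?": "Paris"}
    question = text.splitlines()[-1]  # the question is the last line of the prompt
    return canned.get(question, "unknown")


def exact_match(output: str, target: str) -> bool:
    """Scorer: correct only if the output equals the target after trimming."""
    return output.strip() == target.strip()


def run_toy_eval(
    dataset: Sequence[ToySample],
    solvers: Sequence[ToySolver],
    scorer: Callable[[str, str], bool],
) -> float:
    """Run every sample through the solver chain, score it, and return accuracy."""
    correct = 0
    for sample in dataset:
        state = sample.input
        for solve in solvers:
            state = solve(state)
        correct += scorer(state, sample.target)
    return correct / len(dataset)


dataset = [
    ToySample("What is 2 + 2?", "4"),
    ToySample("Capital of France?", "Paris"),
    ToySample("Largest planet?", "Jupiter"),  # not in the canned table, so it is scored wrong
]
accuracy = run_toy_eval(dataset, [add_instruction, fake_generate], exact_match)
print(f"accuracy = {accuracy:.2f}")
```

The point of the sketch is only the shape: each stage is an ordinary value that can be swapped independently, which is the property the typed primitives in the real library are meant to provide.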
Strengths
- Sandboxing as a primitive. Docker, Kubernetes, Modal, and Proxmox backends mean code-execution and agentic evals do not require bespoke harnesses. This is the rare framework where running 200 SWE-bench problems against a sandboxed Python container is a few lines of code.
- 200+ prebuilt evals. Reasoning (GPQA, MATH), coding (HumanEval, MBPP), agentic (SWE-bench, GAIA, AgentBench, OSWorld), safety (HarmBench, AILuminate), and multimodal evals live in inspect_evals with versioned implementations.
- High velocity for a government-backed project. The repository shows 5,820 commits and continuous releases, which is an unusual cadence for a government-maintained tool [3].
Limitations
- Steeper learning curve than pytest-style frameworks. Solvers, scorers, and the task abstraction add concepts that DeepEval and Promptfoo avoid. Worth it for frontier-style evals, overkill for a quick CI smoke test.
- Python-only. No first-class TypeScript or other-language SDK. Teams whose application is TypeScript run Inspect as a sidecar.
- No hosted product. Inspect is OSS-only. Dashboards, dataset versioning, and collaborative review are not in scope; pair with a hub like Langfuse or Phoenix if you need that.
Best fit
Frontier-model evaluation, safety and red-team programs that need sandboxing, agentic evals that run SWE-bench, GAIA, or OSWorld at scale, and academic or policy-grade reports that need reproducible eval programs. Strongest fit for research engineers and safety teams.
Getting started
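from inspect_ai import Task, task, eval as run_eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate


# The @task decorator registers the function so the inspect CLI can discover it.
@task
def my_task() -> Task:
    return Task(
        dataset=[
            Sample(input="What is 2 + 2?", target="4"),
            Sample(input="Capital of France?", target="Paris"),
        ],
        solver=[generate()],  # single model call, no extra prompting
        scorer=match(),       # compare the model output against the target
    )


if __name__ == "__main__":
    # Run from Python; requires credentials for the chosen provider.
    run_eval(my_task(), model="anthropic/claude-3-5-sonnet-20241022")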
The CLI command inspect eval my_task.py --model openai/gpt-4o-mini runs the same program, and the VS Code extension renders the solver trace per sample.
Pricing notes
Inspect AI is free under the MIT license. No paid SKU. Compute costs (model calls, sandboxing) are paid to the underlying providers. Modal and Kubernetes sandboxing add infrastructure cost that scales with the number of concurrent agent rollouts; budget separately for that line item when running SWE-bench-style suites at scale.
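A back-of-the-envelope way to budget that sandbox line item is sketched below. The function estimate_sandbox_budget is hypothetical, and every number in the example call (rollout count, minutes per rollout, hourly rate, concurrency) is a placeholder rather than a real price from any provider. It also assumes rollouts run in full waves of equal duration, which real agent runs will not do exactly.

```python
import math


def estimate_sandbox_budget(
    n_rollouts: int,
    avg_minutes_per_rollout: float,
    hourly_rate_usd: float,
    max_concurrent: int,
) -> dict[str, float]:
    """Rough sandbox-only estimate; model API spend is a separate line item."""
    # Total billed compute: every rollout occupies one sandbox for its duration.
    sandbox_hours = n_rollouts * avg_minutes_per_rollout / 60
    # Rollouts run in waves of at most max_concurrent sandboxes.
    waves = math.ceil(n_rollouts / max_concurrent)
    return {
        "sandbox_hours": sandbox_hours,
        "total_cost_usd": sandbox_hours * hourly_rate_usd,
        "wall_clock_hours": waves * avg_minutes_per_rollout / 60,
        # Spend rate while a full wave of sandboxes is running.
        "peak_hourly_spend_usd": max_concurrent * hourly_rate_usd,
    }


# Placeholder inputs: 200 rollouts, 15 minutes each, $0.10 per sandbox-hour, 20 at a time.
budget = estimate_sandbox_budget(200, 15, 0.10, 20)
for name, value in budget.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions, raising concurrency shortens wall-clock time and raises the peak spend rate, while total sandbox hours stay the same.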
Alternatives
Consider DeepEval or Promptfoo for pytest-style or CLI-first OSS regression. Consider lm-evaluation-harness for raw model benchmarks without sandboxing. Consider HELM for the academic scenario matrix. For agentic eval specifically, Inspect is usually the right answer; alternatives include OpenAI Evals for OpenAI-centric workflows.
Citations and last verified
Verified 2026-05-29 against the Inspect AI documentation and the GitHub repositories.