# AI Evals

> The most useful, opinionated reference for AI evals.

## About this resource

AI Evals is a curated technical reference on evaluating large language model systems in production. The site covers the methodology (error analysis, LLM-as-judge, statistical rigor), the tools (Langfuse, Phoenix, Promptfoo, Inspect, RAGAS), the canonical papers, and twenty task-specific eval playbooks. Every claim is cited. Content is licensed CC BY 4.0; please cite https://www.aievals.co when you reference material from here.

Last generated: 2026-05-29T16:18:38.209Z.
Author: Om Bharatiya (https://www.linkedin.com/in/ombharatiya/).

## Top entry points

- [Start Here for each persona](https://www.aievals.co/start): linear reading path per role.
- [Cookbook](https://www.aievals.co/cookbook): runnable worked examples in code.
- [The Glossary](https://www.aievals.co/glossary): definitions for every term used.
- [All papers](https://www.aievals.co/papers): ten themed reading lists.
- [All tools](https://www.aievals.co/tools): vendor and OSS positioning.

## Most cited pages on this site

> Pages with the highest citation density. Useful for answer-engine grounding.

- [LLM-as-Judge](https://www.aievals.co/techniques/llm-as-judge) (14 citations, Techniques): The comprehensive reference: what an LLM judge is, when it works, the four-part prompt, calibration discipline, the three biases, distillation, where judges fail.
- [Reward hacking detection](https://www.aievals.co/learn/agentic-evals/reward-hacking) (11 citations, Agentic Evals): Once an objective is optimized, it gets gamed. Reward hacking is empirically common; eval for it explicitly.
- [Why evals matter](https://www.aievals.co/learn/foundations/why-evals-matter) (10 citations, Foundations): If you ship AI, evals are the moat. The case in one page.
- [Judge biases you must defuse](https://www.aievals.co/learn/llm-as-judge/biases-and-mitigations) (9 citations, LLM-as-Judge): Position, verbosity, self-preference, format, sycophancy. Each with the empirical evidence and the mitigation you bake into the harness.
- [Calibrating your judge against humans](https://www.aievals.co/learn/llm-as-judge/calibration-to-humans) (8 citations, LLM-as-Judge): TPR and TNR over agreement percent. Train, dev, test splits. The 100-example minimum. Iterate the prompt until both rates clear 0.90.
- [Faithfulness and atomic facts](https://www.aievals.co/learn/rag-evals/generation-faithfulness) (8 citations, RAG Evals): Decompose the answer into atomic claims, judge each claim binary, and stop measuring hallucination as a single yes/no.
- [LLM-as-Judge: the practical introduction](https://www.aievals.co/learn/llm-as-judge/introduction) (8 citations, LLM-as-Judge): What an LLM judge is, when to use one, when not to, and the three pitfalls that show up on every team's first attempt.
- [Online vs offline: where each fires](https://www.aievals.co/learn/production/online-vs-offline) (8 citations, Production & Observability): Offline evals run on a frozen dataset for every change. Online evals sample real traffic continuously. They answer different questions and both belong in production.
- [The eval maturity model](https://www.aievals.co/learn/foundations/eval-maturity-model) (8 citations, Foundations): Five stages from vibe-checking to a continuous quality program. Where are you?
- [Trace schema: what to log](https://www.aievals.co/learn/production/trace-schema) (8 citations, Production & Observability): The trace is the unit of analysis for every eval program. This is the schema, the fields that matter, and the storage shape that makes downstream queries cheap.
- [Trajectory-level evaluation](https://www.aievals.co/learn/agentic-evals/trajectory-vs-outcome) (8 citations, Agentic Evals): Outcome-only scoring ranks agents in the wrong order. Score the trajectory and the result.
- [Where LLM judges fail](https://www.aievals.co/learn/llm-as-judge/judgebench-honest-take) (8 citations, LLM-as-Judge): JudgeBench: GPT-4o is roughly random on hard correctness judgments. The implications for your pipeline are concrete, not theoretical.
- [Build your trace viewer in an afternoon](https://www.aievals.co/learn/error-analysis/the-data-viewer) (7 citations, Error Analysis): The single highest-return tool for an eval program. Streamlit and Next.js variants, with the minimum feature set.
- [Eval RAG in two layers](https://www.aievals.co/learn/rag-evals/retrieval-vs-generation) (7 citations, RAG Evals): Separate retrieval and generation or you cannot fix what breaks.
- [LLM-judge calibration loop](https://www.aievals.co/cookbook/llm-judge-calibration) (7 citations, Cookbook): Push TPR and TNR past 0.90 by iterating the rubric, not the model. A complete loop adapted from the HuggingFace LLM-as-judge cookbook.

## Learn

### Foundations

> Why evals matter, the maturity model, and what makes LLM evaluation hard.

- [Why evals matter](https://www.aievals.co/learn/foundations/why-evals-matter): If you ship AI, evals are the moat. The case in one page.
- [The eval maturity model](https://www.aievals.co/learn/foundations/eval-maturity-model): Five stages from vibe-checking to a continuous quality program. Where are you?

### Error Analysis

> Open coding, dimensional sampling, and the 60-80% rule.

- [The 60-80% rule](https://www.aievals.co/learn/error-analysis/the-60-80-rule): Three failure modes usually account for most of your bugs. Find them, fix them, repeat.
- [Open coding for AI traces](https://www.aievals.co/learn/error-analysis/open-coding-workflow): Read traces, write notes, cluster notes into failure modes, prioritize fixes. The three-stage process and the LLM-assisted parts.
- [Build your trace viewer in an afternoon](https://www.aievals.co/learn/error-analysis/the-data-viewer): The single highest-return tool for an eval program. Streamlit and Next.js variants, with the minimum feature set.

### LLM-as-Judge

> Judge prompt structure, calibration against humans, biases and mitigations.

- [LLM-as-Judge: the practical introduction](https://www.aievals.co/learn/llm-as-judge/introduction): What an LLM judge is, when to use one, when not to, and the three pitfalls that show up on every team's first attempt.
- [The four-part judge prompt](https://www.aievals.co/learn/llm-as-judge/judge-prompt-structure): A copyable template: role and definitions, criteria, few-shot from your train set, output format. Plus the runnable invocation.
- [Calibrating your judge against humans](https://www.aievals.co/learn/llm-as-judge/calibration-to-humans): TPR and TNR over agreement percent. Train, dev, test splits. The 100-example minimum. Iterate the prompt until both rates clear 0.90.
- [Judge biases you must defuse](https://www.aievals.co/learn/llm-as-judge/biases-and-mitigations): Position, verbosity, self-preference, format, sycophancy. Each with the empirical evidence and the mitigation you bake into the harness.
- [Where LLM judges fail](https://www.aievals.co/learn/llm-as-judge/judgebench-honest-take): JudgeBench: GPT-4o is roughly random on hard correctness judgments. The implications for your pipeline are concrete, not theoretical.
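
The calibration entry above prefers TPR and TNR to raw agreement percent. A minimal sketch of how the two rates could be computed from paired labels, assuming binary pass/fail verdicts where `True` means pass. The function name `judge_rates` and the sample labels are hypothetical and are not taken from the linked pages:

```python
def judge_rates(human_labels, judge_labels):
    """Return (TPR, TNR) of judge verdicts measured against human labels.

    Both inputs are equal-length sequences of booleans (True = pass).
    """
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be the same length")
    pairs = list(zip(human_labels, judge_labels))
    tp = sum(1 for h, j in pairs if h and j)          # human pass, judge pass
    fn = sum(1 for h, j in pairs if h and not j)      # human pass, judge fail
    tn = sum(1 for h, j in pairs if not h and not j)  # human fail, judge fail
    fp = sum(1 for h, j in pairs if not h and j)      # human fail, judge pass
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one human-labeled pass and one fail")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical labels for six traces.
human = [True, True, True, True, False, False]
judge = [True, True, True, False, False, True]
tpr, tnr = judge_rates(human, judge)
print(f"TPR={tpr:.2f} TNR={tnr:.2f}")
```

Raw agreement on this sample is 4 of 6, which hides the fact that the judge catches only half of the human-labeled failures.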

### Datasets & Golden Sets

> Construction, versioning, lineage, and dimensional coverage.

- [Constructing your eval dataset](https://www.aievals.co/learn/datasets/construction): How to build the first hundred labeled examples, then grow them past five hundred without losing signal.
- [Versioning your eval set](https://www.aievals.co/learn/datasets/versioning-lineage): A content-hashed eval set pinned to every result is a contract with your past self.
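
The versioning entry above describes a content-hashed eval set pinned to every result. One way this could look, as a sketch only: it assumes examples are JSON-serializable dicts and that row order should not affect the hash. The name `dataset_hash` and the example fields are hypothetical:

```python
import hashlib
import json


def dataset_hash(examples):
    """SHA-256 digest over the canonical JSON form of each example.

    Rows are serialized with sorted keys and then sorted, so reordering
    rows or keys leaves the digest unchanged (an assumption of this sketch).
    """
    canonical_rows = sorted(
        json.dumps(example, sort_keys=True, ensure_ascii=False)
        for example in examples
    )
    digest = hashlib.sha256()
    for row in canonical_rows:
        digest.update(row.encode("utf-8"))
        digest.update(b"\n")  # separator so row boundaries are unambiguous
    return digest.hexdigest()


# Hypothetical two-row eval set.
eval_set = [
    {"id": 1, "input": "refund policy?", "label": "pass"},
    {"id": 2, "input": "cancel my order", "label": "fail"},
]
print(dataset_hash(eval_set))
```

Storing this digest next to each eval run makes it possible to tell whether two scores were computed on the same data.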

### RAG Evals

> Two-layer evaluation: retrieval metrics plus generation faithfulness.

- [Eval RAG in two layers](https://www.aievals.co/learn/rag-evals/retrieval-vs-generation): Separate retrieval and generation or you cannot fix what breaks.
- [Faithfulness and atomic facts](https://www.aievals.co/learn/rag-evals/generation-faithfulness): Decompose the answer into atomic claims, judge each claim binary, and stop measuring hallucination as a single yes/no.
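
The faithfulness entry above scores an answer claim by claim rather than with a single yes/no. A minimal sketch of the aggregation step only, assuming the claim decomposition and the per-claim binary judgments have already been produced upstream (for example by an LLM judge). `ClaimVerdict` and `faithfulness` are hypothetical names:

```python
from dataclasses import dataclass


@dataclass
class ClaimVerdict:
    claim: str       # one atomic claim extracted from the answer
    supported: bool  # binary judgment: is it supported by the retrieved context?


def faithfulness(verdicts):
    """Fraction of atomic claims judged supported, or None if there are no claims."""
    if not verdicts:
        return None
    return sum(v.supported for v in verdicts) / len(verdicts)


# Hypothetical verdicts for one answer.
verdicts = [
    ClaimVerdict("The plan costs $20 per month.", True),
    ClaimVerdict("The plan includes phone support.", False),
    ClaimVerdict("The plan can be cancelled at any time.", True),
]
score = faithfulness(verdicts)
unsupported = [v.claim for v in verdicts if not v.supported]
print(score, unsupported)
```

Keeping the list of unsupported claims alongside the score shows which statement failed, not just that something did.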

### Agentic Evals

> Trajectory vs outcome, tool-call evaluation, reward hacking, pass^k.

- [Trajectory-level evaluation](https://www.aievals.co/learn/agentic-evals/trajectory-vs-outcome): Outcome-only scoring ranks agents in the wrong order. Score the trajectory and the result.
- [Pass^k: the metric that catches inconsistent agents](https://www.aievals.co/learn/agentic-evals/pass-k-and-consistency): Pass@1 hides catastrophic inconsistency. Report pass^k, or you will not know how often the agent succeeds on every one of k repeated attempts.
- [Reward hacking detection](https://www.aievals.co/learn/agentic-evals/reward-hacking): Once an objective is optimized, it gets gamed. Reward hacking is empirically common; eval for it explicitly.
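
The pass^k entry above contrasts "at least one of k attempts succeeds" with "all k attempts succeed". A sketch of both per-task estimators, assuming `n` independent trials per task with `c` successes and using the standard combinatorial forms. The function names are hypothetical, and a benchmark-level number would average these over tasks:

```python
from math import comb


def _check(n, c, k):
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n, c, k):
    """Estimated probability that at least one of k sampled trials succeeds."""
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n, c, k):
    """Estimated probability that all k sampled trials succeed (pass^k)."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


# Hypothetical task: 8 trials, 6 successes.
print(pass_at_k(8, 6, 1))   # 0.75
print(pass_hat_k(8, 6, 4))  # 15/70, about 0.214
```

An agent that looks solid at pass@1 = 0.75 completes four attempts in a row only about 21% of the time in this example.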

### Production & Observability

> Online vs offline, trace schema, drift detection, CI integration.

- [Online vs offline: where each fires](https://www.aievals.co/learn/production/online-vs-offline): Offline evals run on a frozen dataset for every change. Online evals sample real traffic continuously. They answer different questions and both belong in production.
- [Trace schema: what to log](https://www.aievals.co/learn/production/trace-schema): The trace is the unit of analysis for every eval program. This is the schema, the fields that matter, and the storage shape that makes downstream queries cheap.

### Statistics & Rigor

> Confidence intervals, bootstrap, power, inter-rater agreement.

- [Confidence intervals: Wilson and Clopper-Pearson](https://www.aievals.co/learn/statistics/confidence-intervals): Wald is wrong. Wilson is the practical default. Clopper-Pearson is the conservative exact interval. With code and the case for why N=10 evals tell you almost nothing.
- [Clustered standard errors: when CIs lie](https://www.aievals.co/learn/statistics/clustered-standard-errors): When eval questions cluster (multiple questions per topic, multiple turns per session), naive standard errors can be more than three times too small. The cluster-aware variance formula and code, with the...
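
To make the N=10 point in the confidence-interval entry concrete, here is a minimal sketch of the Wilson score interval for a pass rate. It assumes independent binary outcomes and a two-sided 95% level (`z = 1.96`). The function name `wilson_interval` is hypothetical and is not the code from the linked page:

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    Returns (lower, upper). z=1.96 corresponds to roughly 95% coverage.
    """
    if n <= 0 or not (0 <= successes <= n):
        raise ValueError("require n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(8, 10)
print(f"8/10 passes: [{low:.2f}, {high:.2f}]")
```

With 8 passes out of 10 the interval spans roughly 0.49 to 0.94, so a ten-example eval cannot distinguish a coin flip from a strong system.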

### Safety & Red-Team

> OWASP LLM Top 10, HarmBench, AILuminate, jailbreaks and defenses.


### Governance, Risk, Compliance

> EU AI Act, NIST AI RMF, RSPs, risk registers, trust artifacts.


## Techniques

> Methodology reference: deeper than the Learn chapters, organized by technique.

- [LLM-as-Judge](https://www.aievals.co/techniques/llm-as-judge): The comprehensive reference: what an LLM judge is, when it works, the four-part prompt, calibration discipline, the three biases, distillation, where judges fail.

## Tools

> Eval platforms and libraries with positioning, license, and trade-offs.


## Papers

> Curated reading list across ten themes, each paper annotated.

- [Agentic eval](https://www.aievals.co/papers/agentic): Trajectory, outcome, and the harder benchmarks that came after AgentBench. WebArena, SWE-bench, GAIA, OSWorld, tau-bench, MLE-bench, AgentDojo, SWE-Lancer, BrowseComp, Cybench.
- [Foundational eval theory](https://www.aievals.co/papers/foundational): What an eval is, what a benchmark is, and why rankings flip when nothing changed. The papers that built the vocabulary.
- [LLM-as-Judge](https://www.aievals.co/papers/llm-as-judge): Building, calibrating, and stress-testing model judges. From G-Eval and MT-Bench through Prometheus 2 to JudgeBench.
- [Long-context](https://www.aievals.co/papers/long-context): Why '1M tokens' rarely means a million tokens. Lost in the Middle, LongBench, RULER, BABILong, NoCha, Loong, NoLiMa.
- [Methodology and production](https://www.aievals.co/papers/methodology-production): SPADE, RewardBench, MixEval, WildBench, IFEval, InfoBench, FollowBench, plus the 2025 agent-eval and observability work.
- [Multimodal](https://www.aievals.co/papers/multimodal): MMBench, MM-Vet, MMMU, BLINK, Vibe-Eval, Video-MME, MMMU-Pro, ChartQA.
- [RAG, retrieval, factuality](https://www.aievals.co/papers/rag-factuality): BEIR, FActScore, RAGAS, Self-RAG, ARES, SAFE, SimpleQA, HalluLens.
- [Reasoning and math](https://www.aievals.co/papers/reasoning-and-math): From MATH and Chain-of-Thought to FrontierMath, ARC-AGI-2, Humanity's Last Exam, and the faithfulness problem.
- [Safety, red-team, robustness](https://www.aievals.co/papers/safety-red-team): TruthfulQA, BBQ, Constitutional AI, GCG, sycophancy, HarmBench, sandbagging, alignment faking.
- [Tool use and function calling](https://www.aievals.co/papers/tool-use): API-Bank, Gorilla, ToolBench, BFCL, TRAJECT-Bench, plus the tool-injection security work in AgentDojo.

## Cookbook

> Runnable worked examples. Estimated minutes per recipe.

- [Agent eval with τ-bench](https://www.aievals.co/cookbook/agent-eval-with-tau-bench): Run a subset of τ-bench's retail tasks against your agent and report pass@1 and pass^4. The consistency metric that catches agents that almost work.
- [LLM-judge calibration loop](https://www.aievals.co/cookbook/llm-judge-calibration): Push TPR and TNR past 0.90 by iterating the rubric, not the model. A complete loop adapted from the HuggingFace LLM-as-judge cookbook.
- [Multi-turn conversation eval](https://www.aievals.co/cookbook/multi-turn-conversation-eval): Per-turn binary rubrics plus a per-session state check, logged to Langfuse, for an assistant that has to remember context and follow rules across turns.
- [RAG eval with Ragas](https://www.aievals.co/cookbook/rag-eval-with-ragas): A two-layer RAG eval (retrieval plus generation faithfulness) using Ragas on a small synthetic corpus. Runs locally with the OpenAI Python SDK.
- [Your first eval in 30 minutes](https://www.aievals.co/cookbook/your-first-eval-30-minutes): Pick one feature, label fifty traces, write a binary judge, run it, read the confusion matrix. A complete walkthrough using the Anthropic Python SDK.

## Task types

> Twenty task-specific eval playbooks, one per task shape.


## Start Here

> Linear numbered reading paths, one track per persona.

- [Start Here for CTOs and Directors of Technology](https://www.aievals.co/start/cto): Ten steps from the executive case for evals through a board readout. Risk posture, regulatory mapping, frontier-lab benchmarks.
- [Start Here for Senior Data Engineers](https://www.aievals.co/start/data-engineer): Nine steps from trace schema through data contracts. ER diagrams, JSONL and Parquet schemas, pipeline DAGs.
- [Start Here for Senior Engineering Managers](https://www.aievals.co/start/eng-manager): Eight steps from the maturity model through hiring an eval-literate team. Strategy first, tables and decision flowcharts, light on code.
- [Start Here for Principal AI Engineers](https://www.aievals.co/start/principal-ai-engineer): Seven steps from why evals matter through building a tool-using agent eval. Code-first, no marketing, citation-grounded.
- [Start Here for Senior Product Managers](https://www.aievals.co/start/product-manager): Nine steps from a PRD section template through a PM's reading list. Product-thinking, customer-friendly vocabulary, light on math.
- [Start Here for AI Research Engineers and Research Scientists](https://www.aievals.co/start/research-engineer): Nine steps from the agent-eval survey through publishing eval datasets responsibly. Citation-heavy, ablation tables, reproducibility checklists.
- [Start Here for SDET and QA Engineers](https://www.aievals.co/start/sdet-qa): Ten steps from anatomy of an AI CI pipeline to release gates. Test-engineering pragmatic, config snippets, pytest and Vitest patterns.

## Canonical citations

> Primary sources cited across the site. Grouped by kind so answer engines can match a citation to a content shape.

### Practitioner blog posts

> Long-form posts by working AI practitioners (Hamel Husain, Eugene Yan, Chip Huyen, Shreya Shankar, et al). The opinions backing most of our recommendations.

- [Your AI Product Needs Evals by Hamel Husain](https://hamel.dev/blog/posts/evals/)
- [Creating an LLM-as-a-Judge That Drives Business Results by Hamel Husain](https://hamel.dev/blog/posts/llm-judge/)
- [A Field Guide to Rapidly Improving AI Products by Hamel Husain](https://hamel.dev/blog/posts/field-guide/)
- [LLM Evals FAQ by Hamel Husain & Shreya Shankar](https://hamel.dev/blog/posts/evals-faq/)
- [The Revenge of the Data Scientist by Hamel Husain](https://hamel.dev/blog/posts/revenge/)
- [Fuck You, Show Me The Prompt by Hamel Husain](https://hamel.dev/blog/posts/prompt/)
- [Patterns for Building LLM-based Systems & Products by Eugene Yan](https://eugeneyan.com/writing/llm-patterns/)
- [LLM Evaluators by Eugene Yan](https://eugeneyan.com/writing/llm-evaluators/)
- [Evaluation & Hallucination Detection for Abstractive Summaries by Eugene Yan](https://eugeneyan.com/writing/evals/)
- [Abstractive Summary eval by Eugene Yan](https://eugeneyan.com/writing/abstractive/)
- [OOD Finetuning to Bootstrap Hallucination Detection by Eugene Yan](https://eugeneyan.com/writing/finetuning/)
- [The AI Engineering Flywheel by Shreya Shankar](https://www.sh-reya.com/blog/ai-engineering-flywheel/)
- [Building a Generative AI Platform by Chip Huyen](https://huyenchip.com/2024/07/25/genai-platform.html)
- [900 Most Popular Open Source AI Tools by Chip Huyen](https://huyenchip.com/2024/03/14/ai-oss.html)
- [Challenges in Evaluating AI Systems by Anthropic](https://www.anthropic.com/news/evaluating-ai-systems)
- [Introducing Computer Use by Anthropic](https://www.anthropic.com/news/3-5-models-and-computer-use)
- [Chatbot Arena Benchmarking Blog by LMSYS](https://lmsys.org/blog/2023-05-03-arena/)
- [Chatbot Arena BT analysis by LMSYS](https://lmsys.org/blog/2023-12-07-leaderboard/)
- [Aligning LLM-as-Judge w/ Human Preferences by LangChain](https://www.langchain.com/blog/aligning-llm-as-a-judge-with-human-preferences)

### Papers

> Academic and lab-published papers on eval methodology, benchmarks, and judge calibration.

- [MMLU (2020) by Hendrycks et al](https://arxiv.org/abs/2009.03300)
- [Beyond Accuracy: CheckList (2020) by Ribeiro et al](https://arxiv.org/abs/2005.04118)
- [BIG-Bench (2022) by Srivastava et al](https://arxiv.org/abs/2206.04615)
- [BBH (2022) by Suzgun et al](https://arxiv.org/abs/2210.09261)
- [HELM (2022) by Liang et al](https://arxiv.org/abs/2211.09110)
- [The Benchmark Lottery (2021) by Dehghani, Tay et al](https://arxiv.org/abs/2107.07002)
- [Stop Uploading Test Data in Plain Text (2023) by Jacovi et al](https://arxiv.org/abs/2305.10160)
- [Are We Done with MMLU? (2024) by Gema et al](https://arxiv.org/abs/2406.04127)
- [Adding Error Bars to Evals (2024) by Evan Miller](https://arxiv.org/abs/2411.00640)
- [G-Eval (2023) by Liu et al](https://arxiv.org/abs/2303.16634)
- [Judging LLM-as-a-Judge with MT-Bench (2023) by Zheng et al](https://arxiv.org/abs/2306.05685)
- [Prometheus (2023) by Kim et al](https://arxiv.org/abs/2310.08491)
- [Auto-J (2023) by Li et al](https://arxiv.org/abs/2310.05470)
- [JudgeLM (2023) by Zhu et al](https://arxiv.org/abs/2310.17631)
- [Who Validates the Validators? (2024) by Shankar et al](https://arxiv.org/abs/2404.12272)
- [Prometheus 2 (2024) by Kim et al](https://arxiv.org/abs/2405.01535)
- [Judging the Judges (2024) by Thakur et al](https://arxiv.org/abs/2406.12624)
- [JudgeBench (2024) by Tan et al](https://arxiv.org/abs/2410.12784)
- [Chatbot Arena (2024) by Chiang et al](https://arxiv.org/abs/2403.04132)
- [Arena Hard / BenchBuilder (2024) by Li et al](https://arxiv.org/abs/2406.11939)
- [LLM Comparator (2024) by Kahng et al](https://arxiv.org/abs/2402.10524)
- [Length-Controlled AlpacaEval (2024) by Dubois et al](https://arxiv.org/abs/2404.04475)
- [AgentBench (2023) by Liu et al](https://arxiv.org/abs/2308.03688)
- [WebArena (2023) by Zhou et al](https://arxiv.org/abs/2307.13854)
- [VisualWebArena (2024) by Koh et al](https://arxiv.org/abs/2401.13649)
- [WorkArena (2024) by Drouin et al](https://arxiv.org/abs/2403.07718)
- [SWE-bench (2023) by Jimenez et al](https://arxiv.org/abs/2310.06770)
- [SWE-bench Multimodal (2024) by Yang et al](https://arxiv.org/abs/2410.03859)
- [GAIA (2023) by Mialon et al](https://arxiv.org/abs/2311.12983)
- [AgentBoard (2024) by Ma et al](https://arxiv.org/abs/2401.13178)
- [OSWorld (2024) by Xie et al](https://arxiv.org/abs/2404.07972)
- [τ-bench (2024) by Yao et al](https://arxiv.org/abs/2406.12045)
- [MLE-bench (2024) by Chan et al](https://arxiv.org/abs/2410.07095)
- [ScienceAgentBench (2024) by Chen et al](https://arxiv.org/abs/2410.05080)
- [AgentDojo (2024) by Debenedetti et al](https://arxiv.org/abs/2406.13352)
- [SWE-Lancer (2025) by Miserendino et al](https://arxiv.org/abs/2502.12115)
- [BrowseComp (2025) by Wei et al](https://arxiv.org/abs/2504.12516)
- [Cybench (2024) by Zhang et al](https://arxiv.org/abs/2408.08926)
- [LLM Agent Eval Survey (2025) by Yehudai et al](https://arxiv.org/abs/2503.16416)
- [Agent Eval Survey (2025) by Mohammadi et al](https://arxiv.org/abs/2507.21504)
- [Agent-as-a-Judge (2024) by Zhuge et al](https://arxiv.org/abs/2410.10934)
- [TRAJECT-Bench (2025) by He et al](https://arxiv.org/abs/2510.04550)
- [Beyond Black-Box Benchmarking (2025) by Moshkovich et al](https://arxiv.org/abs/2503.06745)
- [In-Context Reward Hacking (2024) by Pan et al](https://arxiv.org/abs/2402.06627)
- [Concrete Problems in AI Safety (2016) by Amodei et al](https://arxiv.org/abs/1606.06565)
- [AI Sandbagging (2024) by van der Weij et al](https://arxiv.org/abs/2406.07358)
- [Alignment Faking (2024) by Greenblatt et al](https://arxiv.org/abs/2412.14093)
- [Lost in the Middle (2023) by Liu et al](https://arxiv.org/abs/2307.03172)
- [LongBench (2023) by Bai et al](https://arxiv.org/abs/2308.14508)
- [∞Bench (2024) by Zhang et al](https://arxiv.org/abs/2402.13718)
- [RULER (2024) by Hsieh et al](https://arxiv.org/abs/2404.06654)
- [BABILong (2024) by Kuratov et al](https://arxiv.org/abs/2406.10149)
- [One Thousand and One Pairs (2024) by Karpinska et al](https://arxiv.org/abs/2406.16264)
- [Loong (2024) by Wang et al](https://arxiv.org/abs/2406.17419)
- [LongBench v2 (2024) by Bai et al](https://arxiv.org/abs/2412.15204)
- [NoLiMa (2025) by Modarressi et al](https://arxiv.org/abs/2502.05167)
- [On the Measure of Intelligence (2019) by Chollet](https://arxiv.org/abs/1911.01547)
- [MATH (2021) by Hendrycks et al](https://arxiv.org/abs/2103.03874)
- [Chain-of-Thought (2022) by Wei et al](https://arxiv.org/abs/2201.11903)
- [Self-Consistency (2022) by Wang et al](https://arxiv.org/abs/2203.11171)
- [Let's Verify Step by Step (2023) by Lightman et al](https://arxiv.org/abs/2305.20050)
- [GPQA (2023) by Rein et al](https://arxiv.org/abs/2311.12022)
- [Skill-Mix (2023) by Yu et al](https://arxiv.org/abs/2310.17567)
- [FrontierMath (2024) by Glazer et al](https://arxiv.org/abs/2411.04872)
- [ARC-AGI-2 (2025) by Chollet et al](https://arxiv.org/abs/2505.11831)
- [Humanity's Last Exam (2025) by Phan et al](https://arxiv.org/abs/2501.14249)
- [TruthfulQA (2021) by Lin et al](https://arxiv.org/abs/2109.07958)
- [BBQ (2021) by Parrish et al](https://arxiv.org/abs/2110.08193)
- [Red Teaming Language Models (2022) by Ganguli et al](https://arxiv.org/abs/2209.07858)
- [Constitutional AI (2022) by Bai et al](https://arxiv.org/abs/2212.08073)
- [Universal Adversarial Attacks (2023) by Zou et al](https://arxiv.org/abs/2307.15043)
- [Towards Understanding Sycophancy (2023) by Sharma et al](https://arxiv.org/abs/2310.13548)
- [Quantifying LM Sensitivity to Spurious Features (2023) by Sclar et al](https://arxiv.org/abs/2310.11324)
- [HarmBench (2024) by Mazeika et al](https://arxiv.org/abs/2402.04249)
- [BOLD (2021) by Dhamala et al](https://arxiv.org/abs/2101.11718)
- [HaluEval (2023) by Li et al](https://arxiv.org/abs/2305.11747)
- [BEIR (2021) by Thakur et al](https://arxiv.org/abs/2104.08663)
- [FActScore (2023) by Min et al](https://arxiv.org/abs/2305.14251)
- [RAGAS (2023) by Es et al](https://arxiv.org/abs/2309.15217)
- [Self-RAG (2023) by Asai et al](https://arxiv.org/abs/2310.11511)
- [ARES (2023) by Saad-Falcon et al](https://arxiv.org/abs/2311.09476)
- [SAFE (2024) by Wei et al](https://arxiv.org/abs/2403.18802)
- [SimpleQA (2024) by Wei et al](https://arxiv.org/abs/2411.04368)
- [HalluLens (2025) by Bang et al](https://arxiv.org/abs/2504.17550)
- [MMBench (2023) by Liu et al](https://arxiv.org/abs/2307.06281)
- [MM-Vet (2023) by Yu et al](https://arxiv.org/abs/2308.02490)
- [MMMU (2023) by Yue et al](https://arxiv.org/abs/2311.16502)
- [BLINK (2024) by Fu et al](https://arxiv.org/abs/2404.12390)
- [Vibe-Eval (2024) by Padlewski et al](https://arxiv.org/abs/2405.02287)
- [Video-MME (2024) by Fu et al](https://arxiv.org/abs/2405.21075)
- [MMMU-Pro (2024) by Yue et al](https://arxiv.org/abs/2409.02813)
- [ChartQA (2022) by Masry et al](https://arxiv.org/abs/2203.10244)
- [DocVQA (2020) by Mathew et al](https://arxiv.org/abs/2007.00398)
- [API-Bank (2023) by Li et al](https://arxiv.org/abs/2304.08244)
- [Gorilla (2023) by Patil et al](https://arxiv.org/abs/2305.15334)
- [ToolLLM/ToolBench (2023) by Qin et al](https://arxiv.org/abs/2307.16789)
- [SPADE (2024) by Shankar et al](https://arxiv.org/abs/2401.03038)
- [RewardBench (2024) by Lambert et al](https://arxiv.org/abs/2403.13787)
- [MixEval (2024) by Ni et al](https://arxiv.org/abs/2406.06565)
- [WildBench (2024) by Lin et al](https://arxiv.org/abs/2406.04770)
- [InfoBench (2024) by Qin et al](https://arxiv.org/abs/2401.03601)
- [IFEval (2023) by Zhou et al](https://arxiv.org/abs/2311.07911)
- [FollowBench (2023) by Jiang et al](https://arxiv.org/abs/2310.20410)
- [Data-Centric AI for LLMs (2024) by Xu et al](https://arxiv.org/abs/2406.14473)
- [HumanEval (2021) by Chen et al](https://arxiv.org/abs/2107.03374)
- [MBPP (2021) by Austin et al](https://arxiv.org/abs/2108.07732)
- [APPS (2021) by Hendrycks et al](https://arxiv.org/abs/2105.09938)
- [HumanEval+/MBPP+ (2023) by Liu et al](https://arxiv.org/abs/2305.01210)
- [LiveCodeBench (2024) by Jain et al](https://arxiv.org/abs/2403.07974)
- [BigCodeBench (2024) by Zhuo et al](https://arxiv.org/abs/2406.15877)
- [RepoBench (2023) by Liu, Xu, McAuley](https://arxiv.org/abs/2306.03091)
- [CRUXEval (2024) by Gu et al](https://arxiv.org/abs/2401.03065)
- [ClassEval (2023) by Du et al](https://arxiv.org/abs/2308.01861)
- [Multi-SWE-bench (2025) by Zan et al](https://arxiv.org/abs/2504.02605)
- [CLIP (2021) by Radford et al](https://arxiv.org/abs/2103.00020)
- [FID (Two Time-Scale Update Rule) (2017) by Heusel et al](https://arxiv.org/abs/1706.08500)
- [SimCSE (2021) by Gao, Yao, Chen](https://arxiv.org/abs/2104.08821)
- [WMT24 Findings](https://aclanthology.org/2024.wmt-1.1/)

### Tool documentation

> Vendor and open-source eval platform documentation.

- [Langfuse Docs](https://langfuse.com/docs)
- [LangSmith Evaluation Concepts](https://docs.langchain.com/langsmith/evaluation-concepts)
- [Langfuse Scores](https://langfuse.com/docs/scores/overview)
- [Braintrust Docs](https://www.braintrust.dev/docs)
- [Phoenix Tutorials/Evals](https://github.com/Arize-ai/phoenix/tree/main/tutorials/evals)
- [Phoenix Cookbook Evaluation](https://arize.com/docs/phoenix/cookbook/evaluation)
- [Inspect AI Docs](https://inspect.aisi.org.uk/)
- [RAGAS Docs](https://docs.ragas.io/)
- [RAGAS Getting Started: RAG eval](https://docs.ragas.io/en/stable/getstarted/rag_eval/)
- [RAGAS Faithfulness metric](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/faithfulness/)
- [TruLens Docs](https://www.trulens.org/)
- [Promptfoo Docs](https://www.promptfoo.dev/)
- [Promptfoo Red-team](https://www.promptfoo.dev/docs/red-team/)
- [Evaluate an LLM Application by LangSmith](https://docs.langchain.com/langsmith/evaluate-llm-application)

### Standards and regulations

> NIST, EU AI Act, OWASP, ITU, MLCommons.

- [ITU-T P.808. Subjective MOS via crowdsourcing](https://www.itu.int/rec/T-REC-P.808)
- [MLCommons AILuminate](https://mlcommons.org/benchmarks/ailuminate/)
- [NIST AI Risk Management Framework](https://www.nist.gov/itl/ai-risk-management-framework)
- [EU AI Act official portal](https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai)
- [OWASP Top 10 for LLM Applications](https://owasp.org/www-project-top-10-for-large-language-model-applications/)

### Other sources

- [Applied LLMs by Yan, Bischof, Frye, Husain, Liu, Shankar](https://applied-llms.org/)
- [LLM Powered Autonomous Agents by Lilian Weng](https://lilianweng.github.io/posts/2023-06-23-agent/)
- [Extrinsic Hallucinations in LLMs by Lilian Weng](https://lilianweng.github.io/posts/2024-07-07-hallucination/)
- [Adversarial Attacks on LLMs by Lilian Weng](https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/)
- [Specification Gaming Examples by Victoria Krakovna](https://vkrakovna.wordpress.com/2018/04/02/specification-gaming-examples-in-ai/)
- [LLM Evaluation Guidebook by Clémentine Fourrier](https://huggingface.co/blog/clefourrier/llm-evaluation)
- [Evaluating MMLU by HuggingFace](https://huggingface.co/blog/evaluating-mmlu-leaderboard)
- [Hard Stuff Nobody Talks About by Honeycomb](https://www.honeycomb.io/blog/hard-stuff-nobody-talks-about-llm)
- [Building Effective Agents by Anthropic](https://www.anthropic.com/engineering/building-effective-agents)
- [SWE-bench Sonnet by Anthropic](https://www.anthropic.com/research/swe-bench-sonnet)
- [Measuring Faithfulness in CoT by Anthropic](https://www.anthropic.com/research/measuring-faithfulness-in-chain-of-thought-reasoning)
- [Reasoning Models Don't Say What They Think by Anthropic](https://www.anthropic.com/research/reasoning-models-dont-say-think)
- [Measuring Persuasiveness of LMs by Anthropic](https://www.anthropic.com/research/measuring-model-persuasiveness)
- [Anthropic Responsible Scaling Policy](https://www.anthropic.com/responsible-scaling-policy)
- [Agentic Misalignment by Anthropic](https://www.anthropic.com/research/agentic-misalignment)
- [Adding Error Bars to Evals (blog) by Anthropic](https://www.anthropic.com/research/statistical-approach-to-model-evals)
- [Anthropic Tool Use Overview (strict mode)](https://platform.claude.com/docs/en/docs/agents-and-tools/tool-use/overview)
- [Anthropic Evaluation Tool (Console)](https://platform.claude.com/docs/en/test-and-evaluate/eval-tool)
- [Define success criteria and build evaluations by Anthropic](https://platform.claude.com/docs/en/docs/test-and-evaluate/develop-tests)
- [OpenAI Evals (GitHub)](https://github.com/openai/evals)
- [Introducing SWE-bench Verified by OpenAI](https://openai.com/index/introducing-swe-bench-verified/)
- [OpenAI Preparedness Framework](https://openai.com/safety/preparedness/)
- [A Practical Guide to Building Agents by OpenAI](https://openai.com/business/guides-and-resources/a-practical-guide-to-building-ai-agents/)
- [Getting Started with OpenAI Evals by OpenAI Cookbook](https://developers.openai.com/cookbook/examples/evaluation/getting_started_with_openai_evals)
- [Eval LLMs for SQL Generation by OpenAI Cookbook](https://github.com/openai/openai-cookbook/blob/main/examples/evaluation/How_to_evaluate_LLMs_for_SQL_generation.ipynb)
- [Evaluate RAG with LlamaIndex by OpenAI Cookbook](https://github.com/openai/openai-cookbook/blob/main/examples/evaluation/Evaluate_RAG_with_LlamaIndex.ipynb)
- [OpenAI Preparedness GitHub (MLE-bench, SWE-Lancer, PaperBench)](https://github.com/openai/preparedness)
- [OpenAI SWE-Lancer Benchmark](https://github.com/openai/swelancer-benchmark)
- [Needle in a Haystack by Greg Kamradt](https://github.com/gkamradt/LLMTest_NeedleInAHaystack)
- [BFCL (2025) by Patil, Mao et al, ICML 2025](https://gorilla.cs.berkeley.edu/leaderboard.html)
- [BIRD-bench](https://bird-bench.github.io/)
- [Spider by Yu et al](https://yale-lily.github.io/spider)
- [PartiPrompts by Google Research](https://github.com/google-research/parti)
- [MTEB Benchmark](https://github.com/embeddings-benchmark/mteb)
- [Whisper repo (Common Voice 15, Fleurs WER) by OpenAI](https://github.com/openai/whisper)
- [Comet Opik](https://github.com/comet-ml/opik)
- [Inspect Evals Registry](https://github.com/UKGovernmentBEIS/inspect_evals)
- [Inspect AI GitHub](https://github.com/UKGovernmentBEIS/inspect_ai)
- [RAGAS GitHub](https://github.com/explodinggradients/ragas)
- [DeepEval GitHub](https://github.com/confident-ai/deepeval)
- [Giskard GitHub](https://github.com/Giskard-AI/giskard)
- [LlamaIndex Evaluation](https://developers.llamaindex.ai/python/framework/module_guides/evaluating/)
- [Outlines GitHub](https://github.com/outlines-dev/outlines)
- [Instructor GitHub](https://github.com/jxnl/instructor)
- [HuggingFace LLM-as-judge Cookbook](https://huggingface.co/learn/cookbook/en/llm_judge)
- [HarmBench GitHub](https://github.com/centerforaisafety/HarmBench)
- [NVIDIA garak](https://github.com/NVIDIA/garak)
- [WebArena GitHub](https://github.com/web-arena-x/webarena)
- [VisualWebArena GitHub](https://github.com/web-arena-x/visualwebarena)
- [OSWorld site](https://os-world.github.io/)
- [SWE-bench site](https://www.swebench.com/)
- [SWE-bench Verified](https://www.swebench.com/verified.html)
- [τ-bench GitHub](https://github.com/sierra-research/tau-bench)
- [AgentBench GitHub](https://github.com/THUDM/AgentBench)
- [NVIDIA RULER](https://github.com/NVIDIA/RULER)
- [BABILong HF dataset](https://huggingface.co/datasets/RMT-team/babilong)
- [OpenAI HumanEval](https://github.com/openai/human-eval)
- [BigCodeBench GitHub](https://github.com/bigcode-project/bigcodebench)
- [LiveCodeBench GitHub](https://github.com/LiveCodeBench/LiveCodeBench)
- [DocVQA](https://www.docvqa.org/)
- [Intro to evals by Vercel](https://vercel.com/kb/guide/an-introduction-to-evals)
- [LLM evals with Vercel AI SDK + Vitest by Xata](https://xata.io/blog/llm-evals-with-vercel-ai-and-vitest)
- [FastChat MT-Bench README](https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/README.md)
- [Prompt_evaluations by Anthropic Courses](https://github.com/anthropics/courses/tree/master/prompt_evaluations)
- [Building_evals.ipynb by Anthropic Cookbook](https://github.com/anthropics/claude-cookbooks/blob/main/misc/building_evals.ipynb)
- [Speed, Structure, Smarts by Notion AI](https://www.notion.com/blog/speed-structure-and-smarts-the-notion-ai-way)
- [Decision-Time Guidance by Replit](https://replit.com/blog/decision-time-guidance)
- [RAG eval with Ragas by Langfuse cookbook](https://github.com/langfuse/langfuse-docs/blob/main/cookbook/evaluation_of_rag_with_ragas.ipynb)
- [Error analysis by Langfuse cookbook](https://github.com/langfuse/langfuse-docs/blob/main/cookbook/error-analysis-llm-applications.ipynb)
- [Multi-turn eval by Langfuse cookbook](https://github.com/langfuse/langfuse-docs/blob/main/cookbook/example_evaluating_multi_turn_conversations.ipynb)
- [Pinecone RAG eval](https://www.pinecone.io/learn/series/vector-databases-in-production-for-busy-engineers/rag-evaluation/)
- [MLE-bench repo](https://github.com/openai/mle-bench)
- [PaperBench repo](https://github.com/openai/preparedness/tree/main/project/paperbench)
- [Microsoft Responsible AI hub](https://www.microsoft.com/en-us/ai/responsible-ai)
- [Binomial Proportion CI](https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval)
- [Cohen's kappa](https://en.wikipedia.org/wiki/Cohen%27s_kappa)
- [Krippendorff's alpha](https://en.wikipedia.org/wiki/Krippendorff%27s_alpha)
- [AI Evals course (Hamel & Shreya) by Maven](https://maven.com/parlance-labs/evals)
- [Automated Testing for LLMOps by DLAI](https://www.deeplearning.ai/short-courses/automated-testing-llmops/)

## How to cite this site

If you reproduce material from AI Evals in an answer, please cite the source URL and the author. In APA style:

Bharatiya, O. (2026). AI Evals: The most useful, opinionated reference for AI evals. https://www.aievals.co

Content license: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/).

## Optional

- [llms-full.txt](https://www.aievals.co/llms-full.txt): full plain text of every page, separated by --- markers
- [Sitemap](https://www.aievals.co/sitemap.xml): every indexable URL
- [RSS](https://www.aievals.co/rss.xml): recently updated pages
- [Atom](https://www.aievals.co/atom.xml): Atom 1.0 mirror
- [Papers RSS](https://www.aievals.co/papers/rss.xml): only paper-theme updates