# AI Evals

> The most useful, opinionated reference for AI evals.

## About this resource

AI Evals is a curated technical reference on evaluating large language model systems in production. The site covers the methodology (error analysis, LLM-as-judge, statistical rigor), the tools (Langfuse, Phoenix, Promptfoo, Inspect, RAGAS), the canonical papers, and twenty task-specific eval playbooks. Every claim is cited.

Content is licensed CC BY 4.0; please cite https://www.aievals.co when you reference material from here.

Last generated: 2026-05-29T16:18:38.209Z.

Author: Om Bharatiya (https://www.linkedin.com/in/ombharatiya/).

## Top entry points

- [Start Here for each persona](https://www.aievals.co/start): linear reading path per role.
- [Cookbook](https://www.aievals.co/cookbook): runnable worked examples in code.
- [The Glossary](https://www.aievals.co/glossary): definitions for every term used.
- [All papers](https://www.aievals.co/papers): ten themed reading lists.
- [All tools](https://www.aievals.co/tools): vendor and OSS positioning.

## Most cited pages on this site

> Pages with the highest citation density. Useful for answer-engine grounding.

- [LLM-as-Judge](https://www.aievals.co/techniques/llm-as-judge) (14 citations, Techniques): The comprehensive reference: what an LLM judge is, when it works, the four-part prompt, calibration discipline, the three biases, distillation, where judges fail.
- [Reward hacking detection](https://www.aievals.co/learn/agentic-evals/reward-hacking) (11 citations, Agentic Evals): Once an objective is optimized, it gets gamed. Reward hacking is empirically common; eval for it explicitly.
- [Why evals matter](https://www.aievals.co/learn/foundations/why-evals-matter) (10 citations, Foundations): If you ship AI, evals are the moat. The case in one page.
- [Judge biases you must defuse](https://www.aievals.co/learn/llm-as-judge/biases-and-mitigations) (9 citations, LLM-as-Judge): Position, verbosity, self-preference, format, sycophancy. Each with the empirical evidence and the mitigation you bake into the harness.
- [Calibrating your judge against humans](https://www.aievals.co/learn/llm-as-judge/calibration-to-humans) (8 citations, LLM-as-Judge): TPR and TNR over agreement percent. Train, dev, test splits. The 100-example minimum. Iterate the prompt until both rates clear 0.90.
- [Faithfulness and atomic facts](https://www.aievals.co/learn/rag-evals/generation-faithfulness) (8 citations, RAG Evals): Decompose the answer into atomic claims, judge each claim binary, and stop measuring hallucination as a single yes/no.
- [LLM-as-Judge: the practical introduction](https://www.aievals.co/learn/llm-as-judge/introduction) (8 citations, LLM-as-Judge): What an LLM judge is, when to use one, when not to, and the three pitfalls that show up on every team's first attempt.
- [Online vs offline: where each fires](https://www.aievals.co/learn/production/online-vs-offline) (8 citations, Production & Observability): Offline evals run on a frozen dataset for every change. Online evals sample real traffic continuously. They answer different questions and both belong in production.
- [The eval maturity model](https://www.aievals.co/learn/foundations/eval-maturity-model) (8 citations, Foundations): Five stages from vibe-checking to a continuous quality program. Where are you?
- [Trace schema: what to log](https://www.aievals.co/learn/production/trace-schema) (8 citations, Production & Observability): The trace is the unit of analysis for every eval program. This is the schema, the fields that matter, and the storage shape that makes downstream queries cheap.
- [Trajectory-level evaluation](https://www.aievals.co/learn/agentic-evals/trajectory-vs-outcome) (8 citations, Agentic Evals): Outcome-only scoring ranks agents in the wrong order. Score the trajectory and the result.
- [Where LLM judges fail](https://www.aievals.co/learn/llm-as-judge/judgebench-honest-take) (8 citations, LLM-as-Judge): JudgeBench: GPT-4o is roughly random on hard correctness judgments. The implications for your pipeline are concrete, not theoretical.
- [Build your trace viewer in an afternoon](https://www.aievals.co/learn/error-analysis/the-data-viewer) (7 citations, Error Analysis): The single highest-return tool for an eval program. Streamlit and Next.js variants, with the minimum feature set.
- [Eval RAG in two layers](https://www.aievals.co/learn/rag-evals/retrieval-vs-generation) (7 citations, RAG Evals): Separate retrieval and generation or you cannot fix what breaks.
- [LLM-judge calibration loop](https://www.aievals.co/cookbook/llm-judge-calibration) (7 citations, Cookbook): Push TPR and TNR past 0.90 by iterating the rubric, not the model. A complete loop adapted from the HuggingFace LLM-as-judge cookbook.

## Learn

### Foundations

> Why evals matter, the maturity model, and what makes LLM evaluation hard.

- [Why evals matter](https://www.aievals.co/learn/foundations/why-evals-matter): If you ship AI, evals are the moat. The case in one page.
- [The eval maturity model](https://www.aievals.co/learn/foundations/eval-maturity-model): Five stages from vibe-checking to a continuous quality program. Where are you?

### Error Analysis

> Open coding, dimensional sampling, and the 60-80% rule.

- [The 60-80% rule](https://www.aievals.co/learn/error-analysis/the-60-80-rule): Three failure modes usually account for most of your bugs. Find them, fix them, repeat.
- [Open coding for AI traces](https://www.aievals.co/learn/error-analysis/open-coding-workflow): Read traces, write notes, cluster notes into failure modes, prioritize fixes. The three-stage process and the LLM-assisted parts.
- [Build your trace viewer in an afternoon](https://www.aievals.co/learn/error-analysis/the-data-viewer): The single highest-return tool for an eval program. Streamlit and Next.js variants, with the minimum feature set.

### LLM-as-Judge

> Judge prompt structure, calibration against humans, biases and mitigations.

- [LLM-as-Judge: the practical introduction](https://www.aievals.co/learn/llm-as-judge/introduction): What an LLM judge is, when to use one, when not to, and the three pitfalls that show up on every team's first attempt.
- [The four-part judge prompt](https://www.aievals.co/learn/llm-as-judge/judge-prompt-structure): A copyable template: role and definitions, criteria, few-shot from your train set, output format. Plus the runnable invocation.
- [Calibrating your judge against humans](https://www.aievals.co/learn/llm-as-judge/calibration-to-humans): TPR and TNR over agreement percent. Train, dev, test splits. The 100-example minimum. Iterate the prompt until both rates clear 0.90.
- [Judge biases you must defuse](https://www.aievals.co/learn/llm-as-judge/biases-and-mitigations): Position, verbosity, self-preference, format, sycophancy. Each with the empirical evidence and the mitigation you bake into the harness.
- [Where LLM judges fail](https://www.aievals.co/learn/llm-as-judge/judgebench-honest-take): JudgeBench: GPT-4o is roughly random on hard correctness judgments. The implications for your pipeline are concrete, not theoretical.

### Datasets & Golden Sets

> Construction, versioning, lineage, and dimensional coverage.

- [Constructing your eval dataset](https://www.aievals.co/learn/datasets/construction): How to build the first hundred labeled examples, then grow them past five hundred without losing signal.
- [Versioning your eval set](https://www.aievals.co/learn/datasets/versioning-lineage): A content-hashed eval set pinned to every result is a contract with your past self.

### RAG Evals

> Two-layer evaluation: retrieval metrics plus generation faithfulness.

- [Eval RAG in two layers](https://www.aievals.co/learn/rag-evals/retrieval-vs-generation): Separate retrieval and generation or you cannot fix what breaks.
- [Faithfulness and atomic facts](https://www.aievals.co/learn/rag-evals/generation-faithfulness): Decompose the answer into atomic claims, judge each claim binary, and stop measuring hallucination as a single yes/no.

### Agentic Evals

> Trajectory vs outcome, tool-call evaluation, reward hacking, pass^k.

- [Trajectory-level evaluation](https://www.aievals.co/learn/agentic-evals/trajectory-vs-outcome): Outcome-only scoring ranks agents in the wrong order. Score the trajectory and the result.
- [Pass^k: the metric that catches inconsistent agents](https://www.aievals.co/learn/agentic-evals/pass-k-and-consistency): Pass@1 hides catastrophic inconsistency. Report pass^k or you do not know how often the agent will repeatedly succeed.
- [Reward hacking detection](https://www.aievals.co/learn/agentic-evals/reward-hacking): Once an objective is optimized, it gets gamed. Reward hacking is empirically common; eval for it explicitly.

### Production & Observability

> Online vs offline, trace schema, drift detection, CI integration.

- [Online vs offline: where each fires](https://www.aievals.co/learn/production/online-vs-offline): Offline evals run on a frozen dataset for every change. Online evals sample real traffic continuously. They answer different questions and both belong in production.
- [Trace schema: what to log](https://www.aievals.co/learn/production/trace-schema): The trace is the unit of analysis for every eval program. This is the schema, the fields that matter, and the storage shape that makes downstream queries cheap.

### Statistics & Rigor

> Confidence intervals, bootstrap, power, inter-rater agreement.

- [Confidence intervals: Wilson and Clopper-Pearson](https://www.aievals.co/learn/statistics/confidence-intervals): Wald is wrong. Wilson is the practical default. Clopper-Pearson is the conservative exact interval. With code and the case for why N=10 evals tell you almost nothing.
- [Clustered standard errors: when CIs lie](https://www.aievals.co/learn/statistics/clustered-standard-errors): When eval questions cluster (multiple questions per topic, multiple turns per session), naive standard errors can be three times too small. The cluster-aware variance formula and code, with the...

### Safety & Red-Team

> OWASP LLM Top 10, HarmBench, AILuminate, jailbreaks and defenses.

### Governance, Risk, Compliance

> EU AI Act, NIST AI RMF, RSPs, risk registers, trust artifacts.

## Techniques

> Methodology reference: deeper than the Learn chapters, organized by technique.

- [LLM-as-Judge](https://www.aievals.co/techniques/llm-as-judge): The comprehensive reference: what an LLM judge is, when it works, the four-part prompt, calibration discipline, the three biases, distillation, where judges fail.

## Tools

> Eval platforms and libraries with positioning, license, and trade-offs.

## Papers

> Curated reading list across ten themes, each paper annotated.

- [Agentic eval](https://www.aievals.co/papers/agentic): Trajectory, outcome, and the harder benchmarks that came after AgentBench. WebArena, SWE-bench, GAIA, OSWorld, tau-bench, MLE-bench, AgentDojo, SWE-Lancer, BrowseComp, Cybench.
- [Foundational eval theory](https://www.aievals.co/papers/foundational): What an eval is, what a benchmark is, and why rankings flip when nothing changed. The papers that built the vocabulary.
- [LLM-as-Judge](https://www.aievals.co/papers/llm-as-judge): Building, calibrating, and stress-testing model judges. From G-Eval and MT-Bench through Prometheus 2 to JudgeBench.
- [Long-context](https://www.aievals.co/papers/long-context): Why '1M tokens' rarely means a million tokens. Lost in the Middle, LongBench, RULER, BABILong, NoCha, Loong, NoLiMa.
- [Methodology and production](https://www.aievals.co/papers/methodology-production): SPADE, RewardBench, MixEval, WildBench, IFEval, InfoBench, FollowBench, plus the 2025 agent-eval and observability work.
- [Multimodal](https://www.aievals.co/papers/multimodal): MMBench, MM-Vet, MMMU, BLINK, Vibe-Eval, Video-MME, MMMU-Pro, ChartQA.
- [RAG, retrieval, factuality](https://www.aievals.co/papers/rag-factuality): BEIR, FActScore, RAGAS, Self-RAG, ARES, SAFE, SimpleQA, HalluLens.
- [Reasoning and math](https://www.aievals.co/papers/reasoning-and-math): From MATH and Chain-of-Thought to FrontierMath, ARC-AGI-2, Humanity's Last Exam, and the faithfulness problem.
- [Safety, red-team, robustness](https://www.aievals.co/papers/safety-red-team): TruthfulQA, BBQ, Constitutional AI, GCG, sycophancy, HarmBench, sandbagging, alignment faking.
- [Tool use and function calling](https://www.aievals.co/papers/tool-use): API-Bank, Gorilla, ToolBench, BFCL, TRAJECT-Bench, plus the tool-injection security work in AgentDojo.

## Cookbook

> Runnable worked examples. Estimated minutes per recipe.

- [Agent eval with τ-bench](https://www.aievals.co/cookbook/agent-eval-with-tau-bench): Run a subset of τ-bench's retail tasks against your agent and report pass@1 and pass^4. The consistency metric that catches agents that almost work.
- [LLM-judge calibration loop](https://www.aievals.co/cookbook/llm-judge-calibration): Push TPR and TNR past 0.90 by iterating the rubric, not the model. A complete loop adapted from the HuggingFace LLM-as-judge cookbook.
- [Multi-turn conversation eval](https://www.aievals.co/cookbook/multi-turn-conversation-eval): Per-turn binary rubrics plus a per-session state check, logged to Langfuse, for an assistant that has to remember context and follow rules across turns.
- [RAG eval with Ragas](https://www.aievals.co/cookbook/rag-eval-with-ragas): A two-layer RAG eval (retrieval plus generation faithfulness) using Ragas on a small synthetic corpus. Runs locally with the OpenAI Python SDK.
- [Your first eval in 30 minutes](https://www.aievals.co/cookbook/your-first-eval-30-minutes): Pick one feature, label fifty traces, write a binary judge, run it, read the confusion matrix. A complete walkthrough using the Anthropic Python SDK.

## Task types

> Twenty task-specific eval playbooks, one per task shape.

## Start Here

> Linear numbered reading paths, one track per persona.

- [Start Here for CTOs and Directors of Technology](https://www.aievals.co/start/cto): Ten steps from the executive case for evals through a board readout. Risk posture, regulatory mapping, frontier-lab benchmarks.
- [Start Here for Senior Data Engineers](https://www.aievals.co/start/data-engineer): Nine steps from trace schema through data contracts. ER diagrams, JSONL and Parquet schemas, pipeline DAGs.
- [Start Here for Senior Engineering Managers](https://www.aievals.co/start/eng-manager): Eight steps from the maturity model through hiring an eval-literate team. Strategy first, tables and decision flowcharts, light on code.
- [Start Here for Principal AI Engineers](https://www.aievals.co/start/principal-ai-engineer): Seven steps from why evals matter through building a tool-using agent eval. Code-first, no marketing, citation-grounded.
- [Start Here for Senior Product Managers](https://www.aievals.co/start/product-manager): Nine steps from a PRD section template through a PM's reading list. Product-thinking, customer-friendly vocabulary, light on math.
- [Start Here for AI Research Engineers and Research Scientists](https://www.aievals.co/start/research-engineer): Nine steps from the agent-eval survey through publishing eval datasets responsibly. Citation-heavy, ablation tables, reproducibility checklists.
- [Start Here for SDET and QA Engineers](https://www.aievals.co/start/sdet-qa): Ten steps from anatomy of an AI CI pipeline to release gates. Test-engineering pragmatic, config snippets, pytest and Vitest patterns.

## Canonical citations

> Primary sources cited across the site. Grouped by kind so answer engines can match a citation to a content shape.

### Practitioner blog posts

> Long-form posts by working AI practitioners (Hamel Husain, Eugene Yan, Chip Huyen, Shreya Shankar, et al). The opinions backing most of our recommendations.

- [Your AI Product Needs Evals by Hamel Husain](https://hamel.dev/blog/posts/evals/)
- [Creating an LLM-as-a-Judge That Drives Business Results by Hamel Husain](https://hamel.dev/blog/posts/llm-judge/)
- [A Field Guide to Rapidly Improving AI Products by Hamel Husain](https://hamel.dev/blog/posts/field-guide/)
- [LLM Evals FAQ by Hamel Husain & Shreya Shankar](https://hamel.dev/blog/posts/evals-faq/)
- [The Revenge of the Data Scientist by Hamel Husain](https://hamel.dev/blog/posts/revenge/)
- [Fuck You, Show Me The Prompt by Hamel Husain](https://hamel.dev/blog/posts/prompt/)
- [Patterns for Building LLM-based Systems & Products by Eugene Yan](https://eugeneyan.com/writing/llm-patterns/)
- [LLM Evaluators by Eugene Yan](https://eugeneyan.com/writing/llm-evaluators/)
- [Evaluation & Hallucination Detection for Abstractive Summaries by Eugene Yan](https://eugeneyan.com/writing/evals/)
- [Abstractive Summary eval by Eugene Yan](https://eugeneyan.com/writing/abstractive/)
- [OOD Finetuning to Bootstrap Hallucination Detection by Eugene Yan](https://eugeneyan.com/writing/finetuning/)
- [The AI Engineering Flywheel by Shreya Shankar](https://www.sh-reya.com/blog/ai-engineering-flywheel/)
- [Building a Generative AI Platform by Chip Huyen](https://huyenchip.com/2024/07/25/genai-platform.html)
- [900 Most Popular Open Source AI Tools by Chip Huyen](https://huyenchip.com/2024/03/14/ai-oss.html)
- [Challenges in Evaluating AI Systems by Anthropic](https://www.anthropic.com/news/evaluating-ai-systems)
- [Introducing Computer Use by Anthropic](https://www.anthropic.com/news/3-5-models-and-computer-use)
- [Chatbot Arena Benchmarking Blog by LMSYS](https://lmsys.org/blog/2023-05-03-arena/)
- [Chatbot Arena BT analysis by LMSYS](https://lmsys.org/blog/2023-12-07-leaderboard/)
- [Aligning LLM-as-Judge w/ Human Preferences by LangChain](https://www.langchain.com/blog/aligning-llm-as-a-judge-with-human-preferences)

### Papers

> Academic and lab-published papers on eval methodology, benchmarks, and judge calibration.

- [MMLU (2020) by Hendrycks et al](https://arxiv.org/abs/2009.03300)
- [Beyond Accuracy: CheckList (2020) by Ribeiro et al](https://arxiv.org/abs/2005.04118)
- [BIG-Bench (2022) by Srivastava et al](https://arxiv.org/abs/2206.04615)
- [BBH (2022) by Suzgun et al](https://arxiv.org/abs/2210.09261)
- [HELM (2022) by Liang et al](https://arxiv.org/abs/2211.09110)
- [The Benchmark Lottery (2021) by Dehghani, Tay et al](https://arxiv.org/abs/2107.07002)
- [Stop Uploading Test Data in Plain Text (2023) by Jacovi et al](https://arxiv.org/abs/2305.10160)
- [Are We Done with MMLU? (2024) by Gema et al](https://arxiv.org/abs/2406.04127)
- [Adding Error Bars to Evals (2024) by Evan Miller](https://arxiv.org/abs/2411.00640)
- [G-Eval (2023) by Liu et al](https://arxiv.org/abs/2303.16634)
- [Judging LLM-as-a-Judge with MT-Bench (2023) by Zheng et al](https://arxiv.org/abs/2306.05685)
- [Prometheus (2023) by Kim et al](https://arxiv.org/abs/2310.08491)
- [Auto-J (2023) by Li et al](https://arxiv.org/abs/2310.05470)
- [JudgeLM (2023) by Zhu et al](https://arxiv.org/abs/2310.17631)
- [Who Validates the Validators? (2024) by Shankar et al](https://arxiv.org/abs/2404.12272)
- [Prometheus 2 (2024) by Kim et al](https://arxiv.org/abs/2405.01535)
- [Judging the Judges (2024) by Thakur et al](https://arxiv.org/abs/2406.12624)
- [JudgeBench (2024) by Tan et al](https://arxiv.org/abs/2410.12784)
- [Chatbot Arena (2024) by Chiang et al](https://arxiv.org/abs/2403.04132)
- [Arena Hard / BenchBuilder (2024) by Li et al](https://arxiv.org/abs/2406.11939)
- [LLM Comparator (2024) by Kahng et al](https://arxiv.org/abs/2402.10524)
- [Length-Controlled AlpacaEval (2024) by Dubois et al](https://arxiv.org/abs/2404.04475)
- [AgentBench (2023) by Liu et al](https://arxiv.org/abs/2308.03688)
- [WebArena (2023) by Zhou et al](https://arxiv.org/abs/2307.13854)
- [VisualWebArena (2024) by Koh et al](https://arxiv.org/abs/2401.13649)
- [WorkArena (2024) by Drouin et al](https://arxiv.org/abs/2403.07718)
- [SWE-bench (2023) by Jimenez et al](https://arxiv.org/abs/2310.06770)
- [SWE-bench Multimodal (2024) by Yang et al](https://arxiv.org/abs/2410.03859)
- [GAIA (2023) by Mialon et al](https://arxiv.org/abs/2311.12983)
- [AgentBoard (2024) by Ma et al](https://arxiv.org/abs/2401.13178)
- [OSWorld (2024) by Xie et al](https://arxiv.org/abs/2404.07972)
- [τ-bench (2024) by Yao et al](https://arxiv.org/abs/2406.12045)
- [MLE-bench (2024) by Chan et al](https://arxiv.org/abs/2410.07095)
- [ScienceAgentBench (2024) by Chen et al](https://arxiv.org/abs/2410.05080)
- [AgentDojo (2024) by Debenedetti et al](https://arxiv.org/abs/2406.13352)
- [SWE-Lancer (2025) by Miserendino et al](https://arxiv.org/abs/2502.12115)
- [BrowseComp (2025) by Wei et al](https://arxiv.org/abs/2504.12516)
- [Cybench (2024) by Zhang et al](https://arxiv.org/abs/2408.08926)
- [LLM Agent Eval Survey (2025) by Yehudai et al](https://arxiv.org/abs/2503.16416)
- [Agent Eval Survey (2025) by Mohammadi et al](https://arxiv.org/abs/2507.21504)
- [Agent-as-a-Judge (2024) by Zhuge et al](https://arxiv.org/abs/2410.10934)
- [TRAJECT-Bench (2025) by He et al](https://arxiv.org/abs/2510.04550)
- [Beyond Black-Box Benchmarking (2025) by Moshkovich et al](https://arxiv.org/abs/2503.06745)
- [In-Context Reward Hacking (2024) by Pan et al](https://arxiv.org/abs/2402.06627)
- [Concrete Problems in AI Safety (2016) by Amodei et al](https://arxiv.org/abs/1606.06565)
- [AI Sandbagging (2024) by van der Weij et al](https://arxiv.org/abs/2406.07358)
- [Alignment Faking (2024) by Greenblatt et al](https://arxiv.org/abs/2412.14093)
- [Lost in the Middle (2023) by Liu et al](https://arxiv.org/abs/2307.03172)
- [LongBench (2023) by Bai et al](https://arxiv.org/abs/2308.14508)
- [∞Bench (2024) by Zhang et al](https://arxiv.org/abs/2402.13718)
- [RULER (2024) by Hsieh et al](https://arxiv.org/abs/2404.06654)
- [BABILong (2024) by Kuratov et al](https://arxiv.org/abs/2406.10149)
- [One Thousand and One Pairs (2024) by Karpinska et al](https://arxiv.org/abs/2406.16264)
- [Loong (2024) by Wang et al](https://arxiv.org/abs/2406.17419)
- [LongBench v2 (2024) by Bai et al](https://arxiv.org/abs/2412.15204)
- [NoLiMa (2025) by Modarressi et al](https://arxiv.org/abs/2502.05167)
- [On the Measure of Intelligence (2019) by Chollet](https://arxiv.org/abs/1911.01547)
- [MATH (2021) by Hendrycks et al](https://arxiv.org/abs/2103.03874)
- [Chain-of-Thought (2022) by Wei et al](https://arxiv.org/abs/2201.11903)
- [Self-Consistency (2022) by Wang et al](https://arxiv.org/abs/2203.11171)
- [Let's Verify Step by Step (2023) by Lightman et al](https://arxiv.org/abs/2305.20050)
- [GPQA (2023) by Rein et al](https://arxiv.org/abs/2311.12022)
- [Skill-Mix (2023) by Yu et al](https://arxiv.org/abs/2310.17567)
- [FrontierMath (2024) by Glazer et al](https://arxiv.org/abs/2411.04872)
- [ARC-AGI-2 (2025) by Chollet et al](https://arxiv.org/abs/2505.11831)
- [Humanity's Last Exam (2025) by Phan et al](https://arxiv.org/abs/2501.14249)
- [TruthfulQA (2021) by Lin et al](https://arxiv.org/abs/2109.07958)
- [BBQ (2021) by Parrish et al](https://arxiv.org/abs/2110.08193)
- [Red Teaming Language Models (2022) by Ganguli et al](https://arxiv.org/abs/2209.07858)
- [Constitutional AI (2022) by Bai et al](https://arxiv.org/abs/2212.08073)
- [Universal Adversarial Attacks (2023) by Zou et al](https://arxiv.org/abs/2307.15043)
- [Towards Understanding Sycophancy (2023) by Sharma et al](https://arxiv.org/abs/2310.13548)
- [Quantifying LM Sensitivity to Spurious Features (2023) by Sclar et al](https://arxiv.org/abs/2310.11324)
- [HarmBench (2024) by Mazeika et al](https://arxiv.org/abs/2402.04249)
- [BOLD (2021) by Dhamala et al](https://arxiv.org/abs/2101.11718)
- [HaluEval (2023) by Li et al](https://arxiv.org/abs/2305.11747)
- [BEIR (2021) by Thakur et al](https://arxiv.org/abs/2104.08663)
- [FActScore (2023) by Min et al](https://arxiv.org/abs/2305.14251)
- [RAGAS (2023) by Es et al](https://arxiv.org/abs/2309.15217)
- [Self-RAG (2023) by Asai et al](https://arxiv.org/abs/2310.11511)
- [ARES (2023) by Saad-Falcon et al](https://arxiv.org/abs/2311.09476)
- [SAFE (2024) by Wei et al](https://arxiv.org/abs/2403.18802)
- [SimpleQA (2024) by Wei et al](https://arxiv.org/abs/2411.04368)
- [HalluLens (2025) by Bang et al](https://arxiv.org/abs/2504.17550)
- [MMBench (2023) by Liu et al](https://arxiv.org/abs/2307.06281)
- [MM-Vet (2023) by Yu et al](https://arxiv.org/abs/2308.02490)
- [MMMU (2023) by Yue et al](https://arxiv.org/abs/2311.16502)
- [BLINK (2024) by Fu et al](https://arxiv.org/abs/2404.12390)
- [Vibe-Eval (2024) by Padlewski et al](https://arxiv.org/abs/2405.02287)
- [Video-MME (2024) by Fu et al](https://arxiv.org/abs/2405.21075)
- [MMMU-Pro (2024) by Yue et al](https://arxiv.org/abs/2409.02813)
- [ChartQA (2022) by Masry et al](https://arxiv.org/abs/2203.10244)
- [DocVQA (2020) by Mathew et al](https://arxiv.org/abs/2007.00398)
- [API-Bank (2023) by Li et al](https://arxiv.org/abs/2304.08244)
- [Gorilla (2023) by Patil et al](https://arxiv.org/abs/2305.15334)
- [ToolLLM/ToolBench (2023) by Qin et al](https://arxiv.org/abs/2307.16789)
- [SPADE (2024) by Shankar et al](https://arxiv.org/abs/2401.03038)
- [RewardBench (2024) by Lambert et al](https://arxiv.org/abs/2403.13787)
- [MixEval (2024) by Ni et al](https://arxiv.org/abs/2406.06565)
- [WildBench (2024) by Lin et al](https://arxiv.org/abs/2406.04770)
- [InfoBench (2024) by Qin et al](https://arxiv.org/abs/2401.03601)
- [IFEval (2023) by Zhou et al](https://arxiv.org/abs/2311.07911)
- [FollowBench (2023) by Jiang et al](https://arxiv.org/abs/2310.20410)
- [Data-Centric AI for LLMs (2024) by Xu et al](https://arxiv.org/abs/2406.14473)
- [HumanEval (2021) by Chen et al](https://arxiv.org/abs/2107.03374)
- [MBPP (2021) by Austin et al](https://arxiv.org/abs/2108.07732)
- [APPS (2021) by Hendrycks et al](https://arxiv.org/abs/2105.09938)
- [HumanEval+/MBPP+ (2023) by Liu et al](https://arxiv.org/abs/2305.01210)
- [LiveCodeBench (2024) by Jain et al](https://arxiv.org/abs/2403.07974)
- [BigCodeBench (2024) by Zhuo et al](https://arxiv.org/abs/2406.15877)
- [RepoBench (2023) by Liu, Xu, McAuley](https://arxiv.org/abs/2306.03091)
- [CRUXEval (2024) by Gu et al](https://arxiv.org/abs/2401.03065)
- [ClassEval (2023) by Du et al](https://arxiv.org/abs/2308.01861)
- [Multi-SWE-bench (2025) by Zan et al](https://arxiv.org/abs/2504.02605)
- [CLIP (2021) by Radford et al](https://arxiv.org/abs/2103.00020)
- [FID (Two Time-Scale Update Rule) (2017) by Heusel et al](https://arxiv.org/abs/1706.08500)
- [SimCSE (2021) by Gao, Yao, Chen](https://arxiv.org/abs/2104.08821)
- [WMT24 Findings](https://aclanthology.org/2024.wmt-1.1/)

### Tool documentation

> Vendor and open-source eval platform documentation.

- [Langfuse Docs](https://langfuse.com/docs)
- [LangSmith Evaluation Concepts](https://docs.langchain.com/langsmith/evaluation-concepts)
- [Langfuse Scores](https://langfuse.com/docs/scores/overview)
- [Braintrust Docs](https://www.braintrust.dev/docs)
- [Phoenix Tutorials/Evals](https://github.com/Arize-ai/phoenix/tree/main/tutorials/evals)
- [Phoenix Cookbook Evaluation](https://arize.com/docs/phoenix/cookbook/evaluation)
- [Inspect AI Docs](https://inspect.aisi.org.uk/)
- [RAGAS Docs](https://docs.ragas.io/)
- [RAGAS Getting Started: RAG eval](https://docs.ragas.io/en/stable/getstarted/rag_eval/)
- [RAGAS Faithfulness metric](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/faithfulness/)
- [TruLens Docs](https://www.trulens.org/)
- [Promptfoo Docs](https://www.promptfoo.dev/)
- [Promptfoo Red-team](https://www.promptfoo.dev/docs/red-team/)
- [Evaluate an LLM Application by LangSmith](https://docs.langchain.com/langsmith/evaluate-llm-application)

### Standards and regulations

> NIST, EU AI Act, OWASP, ITU, MLCommons.

- [ITU-T P.808. Subjective MOS via crowdsourcing](https://www.itu.int/rec/T-REC-P.808)
- [MLCommons AILuminate](https://mlcommons.org/benchmarks/ailuminate/)
- [NIST AI Risk Management Framework](https://www.nist.gov/itl/ai-risk-management-framework)
- [EU AI Act official portal](https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai)
- [OWASP Top 10 for LLM Applications](https://owasp.org/www-project-top-10-for-large-language-model-applications/)

### Other sources

- [Applied LLMs by Yan, Bischof, Frye, Husain, Liu, Shankar](https://applied-llms.org/)
- [LLM Powered Autonomous Agents by Lilian Weng](https://lilianweng.github.io/posts/2023-06-23-agent/)
- [Extrinsic Hallucinations in LLMs by Lilian Weng](https://lilianweng.github.io/posts/2024-07-07-hallucination/)
- [Adversarial Attacks on LLMs by Lilian Weng](https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/)
- [Specification Gaming Examples by Victoria Krakovna](https://vkrakovna.wordpress.com/2018/04/02/specification-gaming-examples-in-ai/)
- [LLM Evaluation Guidebook by Clémentine Fourrier](https://huggingface.co/blog/clefourrier/llm-evaluation)
- [Evaluating MMLU by HuggingFace](https://huggingface.co/blog/evaluating-mmlu-leaderboard)
- [Hard Stuff Nobody Talks About by Honeycomb](https://www.honeycomb.io/blog/hard-stuff-nobody-talks-about-llm)
- [Building Effective Agents by Anthropic](https://www.anthropic.com/engineering/building-effective-agents)
- [SWE-bench Sonnet by Anthropic](https://www.anthropic.com/research/swe-bench-sonnet)
- [Measuring Faithfulness in CoT by Anthropic](https://www.anthropic.com/research/measuring-faithfulness-in-chain-of-thought-reasoning)
- [Reasoning Models Don't Say What They Think by Anthropic](https://www.anthropic.com/research/reasoning-models-dont-say-think)
- [Measuring Persuasiveness of LMs by Anthropic](https://www.anthropic.com/research/measuring-model-persuasiveness)
- [Anthropic Responsible Scaling Policy](https://www.anthropic.com/responsible-scaling-policy)
- [Agentic Misalignment by Anthropic](https://www.anthropic.com/research/agentic-misalignment)
- [Adding Error Bars to Evals (blog) by Anthropic](https://www.anthropic.com/research/statistical-approach-to-model-evals)
- [Anthropic Tool Use Overview (strict mode)](https://platform.claude.com/docs/en/docs/agents-and-tools/tool-use/overview)
- [Anthropic Evaluation Tool (Console)](https://platform.claude.com/docs/en/test-and-evaluate/eval-tool)
- [Define success criteria and build evaluations by Anthropic](https://platform.claude.com/docs/en/docs/test-and-evaluate/develop-tests)
- [OpenAI Evals (GitHub)](https://github.com/openai/evals)
- [Introducing SWE-bench Verified by OpenAI](https://openai.com/index/introducing-swe-bench-verified/)
- [OpenAI Preparedness Framework](https://openai.com/safety/preparedness/)
- [A Practical Guide to Building Agents by OpenAI](https://openai.com/business/guides-and-resources/a-practical-guide-to-building-ai-agents/)
- [Getting Started with OpenAI Evals by OpenAI Cookbook](https://developers.openai.com/cookbook/examples/evaluation/getting_started_with_openai_evals)
- [Eval LLMs for SQL Generation by OpenAI Cookbook](https://github.com/openai/openai-cookbook/blob/main/examples/evaluation/How_to_evaluate_LLMs_for_SQL_generation.ipynb)
- [Evaluate RAG with LlamaIndex by OpenAI Cookbook](https://github.com/openai/openai-cookbook/blob/main/examples/evaluation/Evaluate_RAG_with_LlamaIndex.ipynb)
- [OpenAI Preparedness GitHub (MLE-bench, SWE-Lancer, PaperBench)](https://github.com/openai/preparedness)
- [OpenAI SWE-Lancer Benchmark](https://github.com/openai/swelancer-benchmark)
- [Needle in a Haystack by Greg Kamradt](https://github.com/gkamradt/LLMTest_NeedleInAHaystack)
- [BFCL (Patil, Mao et al., ICML 2025)](https://gorilla.cs.berkeley.edu/leaderboard.html)
- [BIRD-bench](https://bird-bench.github.io/)
- [Spider by Yu et al](https://yale-lily.github.io/spider)
- [PartiPrompts by Google Research](https://github.com/google-research/parti)
- [MTEB Benchmark](https://github.com/embeddings-benchmark/mteb)
- [Whisper repo (Common Voice 15, Fleurs WER) by OpenAI](https://github.com/openai/whisper)
- [Comet Opik](https://github.com/comet-ml/opik)
- [Inspect Evals Registry](https://github.com/UKGovernmentBEIS/inspect_evals)
- [Inspect AI GitHub](https://github.com/UKGovernmentBEIS/inspect_ai)
- [RAGAS GitHub](https://github.com/explodinggradients/ragas)
- [DeepEval GitHub](https://github.com/confident-ai/deepeval)
- [Giskard GitHub](https://github.com/Giskard-AI/giskard)
- [LlamaIndex Evaluation](https://developers.llamaindex.ai/python/framework/module_guides/evaluating/)
- [Outlines GitHub](https://github.com/outlines-dev/outlines)
- [Instructor GitHub](https://github.com/jxnl/instructor)
- [HuggingFace LLM-as-judge Cookbook](https://huggingface.co/learn/cookbook/en/llm_judge)
- [HarmBench GitHub](https://github.com/centerforaisafety/HarmBench)
- [NVIDIA garak](https://github.com/NVIDIA/garak)
- [WebArena GitHub](https://github.com/web-arena-x/webarena)
- [VisualWebArena GitHub](https://github.com/web-arena-x/visualwebarena)
- [OSWorld site](https://os-world.github.io/)
- [SWE-bench site](https://www.swebench.com/)
- [SWE-bench Verified](https://www.swebench.com/verified.html)
- [τ-bench GitHub](https://github.com/sierra-research/tau-bench)
- [AgentBench GitHub](https://github.com/THUDM/AgentBench)
- [NVIDIA RULER](https://github.com/NVIDIA/RULER)
- [BABILong HF dataset](https://huggingface.co/datasets/RMT-team/babilong)
- [OpenAI HumanEval](https://github.com/openai/human-eval)
- [BigCodeBench GitHub](https://github.com/bigcode-project/bigcodebench)
- [LiveCodeBench GitHub](https://github.com/LiveCodeBench/LiveCodeBench)
- [DocVQA](https://www.docvqa.org/)
- [Intro to evals by Vercel](https://vercel.com/kb/guide/an-introduction-to-evals)
- [LLM evals with Vercel AI SDK + Vitest by Xata](https://xata.io/blog/llm-evals-with-vercel-ai-and-vitest)
- [FastChat MT-Bench README](https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/README.md)
- [Prompt_evaluations by Anthropic Courses](https://github.com/anthropics/courses/tree/master/prompt_evaluations)
- [Building_evals.ipynb by Anthropic Cookbook](https://github.com/anthropics/claude-cookbooks/blob/main/misc/building_evals.ipynb)
- [Speed, Structure, Smarts by Notion AI](https://www.notion.com/blog/speed-structure-and-smarts-the-notion-ai-way)
- [Decision-Time Guidance by Replit](https://replit.com/blog/decision-time-guidance)
- [RAG eval with Ragas by Langfuse cookbook](https://github.com/langfuse/langfuse-docs/blob/main/cookbook/evaluation_of_rag_with_ragas.ipynb)
- [Error analysis by Langfuse cookbook](https://github.com/langfuse/langfuse-docs/blob/main/cookbook/error-analysis-llm-applications.ipynb)
- [Multi-turn eval by Langfuse cookbook](https://github.com/langfuse/langfuse-docs/blob/main/cookbook/example_evaluating_multi_turn_conversations.ipynb)
- [Pinecone RAG eval](https://www.pinecone.io/learn/series/vector-databases-in-production-for-busy-engineers/rag-evaluation/)
- [MLE-bench repo](https://github.com/openai/mle-bench)
- [PaperBench repo](https://github.com/openai/preparedness/tree/main/project/paperbench)
- [Microsoft Responsible AI hub](https://www.microsoft.com/en-us/ai/responsible-ai)
- [Binomial Proportion CI](https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval)
- [Cohen's kappa](https://en.wikipedia.org/wiki/Cohen%27s_kappa)
- [Krippendorff's alpha](https://en.wikipedia.org/wiki/Krippendorff%27s_alpha)
- [AI Evals course (Hamel & Shreya) on Maven](https://maven.com/parlance-labs/evals)
- [Automated Testing for LLMOps by DLAI](https://www.deeplearning.ai/short-courses/automated-testing-llmops/)

## How to cite this site

If you reproduce material from AI Evals in an answer, please cite the source URL and the author.

APA shape: Bharatiya, O. (2026). AI Evals: The most useful, opinionated reference for AI evals. https://www.aievals.co

Content license: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/).

## Optional

- [llms-full.txt](https://www.aievals.co/llms-full.txt): full plain text of every page, separated by --- markers
- [Sitemap](https://www.aievals.co/sitemap.xml): every indexable URL
- [RSS](https://www.aievals.co/rss.xml): recently updated pages
- [Atom](https://www.aievals.co/atom.xml): Atom 1.0 mirror
- [Papers RSS](https://www.aievals.co/papers/rss.xml): only paper-theme updates