Glossary

Every term you need to read the rest of the site: benchmarks, metrics, judge concepts, statistical machinery, agentic eval primitives.

The eval vocabulary is fragmented across vendor blogs, arXiv preprints, and academic papers. This page collects the working definitions used across aievals.co. Each term gets a short definition (1 to 2 sentences) and "See also" links to the chapters where the concept is treated in depth. Where a term has multiple competing definitions in the literature, the version used on this site is stated and the reasoning is in the linked chapter.

Coverage runs from foundational concepts (calibration, confidence intervals, inter-rater agreement) to specific benchmarks (MMLU, GPQA, SWE-bench, τ-bench, BFCL) to judge mechanics (rubric, anchor, pairwise, position bias) to RAG primitives (faithfulness, atomic fact, Recall@K) and agentic eval (trajectory, pass^k, reward hacking, tool-call F1). If a term is missing and you think it belongs, the contributing page describes how to add one. The bar is "a working practitioner needs this term to read the rest of the site."

Jump to a letter: A · B · C · D · E · F · G · H · I · J · K · L · M · N · O · P · R · S · T · V · W · X

A

Accuracy

The fraction of predictions that match the ground-truth label, computed across a labeled eval set. Accuracy is fine for balanced classification problems and useless for skewed ones: on a dataset that is 95% positive, a model that always predicts "positive" scores 95% without learning anything. Most production AI evals report task-specific metrics in addition to or instead of accuracy. See also: F1 score, confidence intervals.

Agentic eval

Evaluation of systems where a language model plans, calls tools, observes results, and iterates until a task is done. Agentic eval has to grade more than the final answer: the trajectory of tool calls, the sub-goal decomposition, and the consistency of behavior across reruns all matter. See also: trajectory vs outcome, tool-call evaluation, pass^k.

AILuminate

MLCommons' v1 safety benchmark for general-purpose chat assistants, released in 2024. It scores models on 12 hazard categories using a calibrated LLM-judge ensemble and reports per-category grades (Poor through Excellent) so teams can compare models on a public scale. See also: OWASP LLM Top 10, red-team program design.

Anchor

A reference example included in a judge prompt or eval rubric to set the scale before the judge grades a new sample. Anchors fight a common LLM-judge failure: without a worked example of "great", "okay", and "bad", the judge invents its own scale and that scale drifts across runs. See also: judge prompt structure, rubric.

Anchoring bias

The pull a judge feels toward whichever answer it sees first in a pairwise comparison or whichever response gets graded first in a batch. The fix is to randomize order on every call and report agreement after position swap. See also: biases and mitigations, position bias.

Annotation

The act of attaching a label, score, or note to a trace, sample, or model output. Annotation work is the bottleneck of every serious eval program; a custom trace viewer that makes annotation fast pays for itself in a week. See also: open coding workflow, the data viewer.

ARC-AGI

The "Abstraction and Reasoning Corpus" benchmark, designed by François Chollet to measure fluid reasoning rather than memorized pattern matching. Each task shows a few input-output grid pairs and asks for the rule; humans score around 80%, frontier LLMs as of 2026 score in the 40-60% range on the public set. See also: GPQA, reasoning benchmarks.

Atomic fact

A claim small enough that it can be verified true or false against a reference without needing to decompose further. Generation faithfulness metrics (FActScore, SAFE, RAGAS faithfulness) work by decomposing a model output into atomic facts and verifying each one independently. See also: generation faithfulness, FActScore.

B

Bandit (multi-armed)

An online sampling strategy where you treat each candidate model, prompt, or configuration as an "arm" and allocate traffic proportional to estimated reward, balancing exploration against exploitation. Used in production eval to route a small fraction of traffic to a new model variant without committing to a full A/B test. See also: online vs offline.

BERTScore

A reference-based metric that compares a candidate output to a reference using contextual BERT embeddings rather than n-gram overlap, so paraphrases still score well. Popular for summarization eval; superseded for most use cases by COMET and learned metrics. See also: reference-based metric, COMET.

BFCL

The Berkeley Function-Calling Leaderboard, a benchmark of function/tool-use ability for LLMs. Tracks pass rates across simple calls, multiple calls, parallel calls, and irrelevant-call detection (the model must decline to use a tool when none fits). See also: tool-call evaluation.

Bias (judge)

A systematic deviation in LLM-judge scores tied to surface features of the response rather than its quality: position, length, format, the judge's preference for its own family of models. The five most-cited biases are position, verbosity, self-preference, format sensitivity, and sycophancy. See also: biases and mitigations.

BLEU

Bilingual Evaluation Understudy, a precision-of-n-grams metric for machine translation introduced in 2002. BLEU correlates poorly with human judgment for open-ended generation; it is still used in translation eval but is a poor default for chat or summarization. See also: reference-based metric, ROUGE.

Bootstrap

A resampling technique: draw samples with replacement from your eval set, recompute the metric on each resample, and use the spread to build a confidence interval. The standard answer for "I don't know the parametric distribution of my metric." See also: confidence intervals, bootstrap and paired tests.
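
The resample-and-recompute loop can be sketched in a few lines. This is a minimal percentile bootstrap for the mean of per-example scores; the function name `bootstrap_ci`, the 2,000 resamples, and the fixed seed are illustrative assumptions, not values prescribed by this site.

```python
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-example scores.

    Assumption: `scores` holds one number per eval example
    (for instance 1 for pass, 0 for fail).
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        # Draw n examples with replacement and recompute the metric.
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lo_idx = int(n_resamples * alpha / 2)
    hi_idx = n_resamples - 1 - lo_idx
    return means[lo_idx], means[hi_idx]

scores = [1] * 80 + [0] * 20  # 80% pass rate on 100 examples
low, high = bootstrap_ci(scores)
print(f"95% CI for pass rate: [{low:.2f}, {high:.2f}]")
```

For a metric other than the mean, swap the `sum(resample) / n` line for the metric of interest; the rest of the loop stays the same.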

Bradley-Terry

A statistical model that converts pairwise comparison outcomes into a per-item skill score. Chatbot Arena's underlying Elo-style ranking uses a Bradley-Terry fit so the leaderboard converges instead of drifting. See also: Elo, pairwise.

C

Calibration

Two distinct meanings on this site. (1) Probabilistic calibration: a model that says "70% sure" should be right 70% of the time across many such predictions. (2) Judge calibration: tuning a judge prompt against a labeled gold set until its true positive and true negative rates against the principal expert clear an agreement threshold (usually 90%). See also: calibration to humans.

CheckList

A behavioral testing framework for NLP (Ribeiro et al., 2020) that asks practitioners to write unit-style tests over linguistic capabilities (negation, numerical reasoning, fairness) rather than reporting a single aggregate score. See also: failure mode taxonomies.

Chunking

The step in a RAG pipeline that splits source documents into smaller pieces before embedding and indexing. Chunk size and overlap directly affect retrieval recall; chunks that are too small lose context, chunks that are too large dilute the embedding. See also: retrieval vs generation.

Cohen's kappa

An inter-rater agreement statistic that adjusts for the agreement you would expect by chance. Useful for two raters labeling a binary or categorical schema; reports values from 1 (perfect) through 0 (chance) to negative (worse than chance). See also: inter-rater agreement, Krippendorff's alpha.
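
As a worked illustration, kappa is the observed agreement minus the chance agreement, divided by one minus the chance agreement. The sketch below assumes two raters labeling the same examples with a categorical schema; `cohens_kappa` and the sample labels are made up for the example.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters over the same examples."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of examples where the raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

rater_a = ["pass"] * 6 + ["fail"] * 4
rater_b = ["pass"] * 5 + ["fail"] * 4 + ["pass"]
# Raw agreement is 0.80, chance agreement is 0.52, so kappa is about 0.58.
print(round(cohens_kappa(rater_a, rater_b), 2))
```

The gap between 0.80 raw agreement and 0.58 kappa is the point of the statistic: part of the raw agreement is what two raters with these label frequencies would produce by chance.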

COMET

A learned, neural reference-based metric for machine translation that fine-tunes a multilingual encoder on direct-assessment scores. COMET correlates with human judgment far better than BLEU on modern systems and is the default MT eval in most published benchmarks since 2020. See also: BLEU, XCOMET.

Confidence interval

A range computed from the sample by a procedure that, across repeated samples, covers the true population parameter at a stated rate (usually 95%). For binary eval metrics the Wilson score interval is the right default; the textbook Wald interval is unreliable at small N and on extreme proportions. See also: confidence intervals, Wilson interval.

Contamination

A failure mode where eval data has leaked into model training data, inflating the reported score. Contamination is rampant on public benchmarks (MMLU, GSM8K) for models trained on web-scale corpora; the defense is held-out private sets and contamination-detection probes. See also: non-determinism and subjectivity, HumanEval.

Continuous evaluation

The practice of running evals on a sample of production traffic on a recurring schedule (every commit, every hour, every release) rather than only before a release. Shreya Shankar's "AI engineering flywheel" framing makes continuous eval the gating loop on the whole production process. See also: online vs offline, eval maturity model.

CoT (Chain-of-thought)

A prompting style that asks the model to write out reasoning steps before producing the final answer. CoT helps on multi-step reasoning, hurts on simple classification, and creates its own eval problem: chain-of-thought outputs are not always faithful to the actual computation the model performed. See also: reward hacking.

CRUXEval

A code-reasoning benchmark from MIT and Meta AI (Gu et al., 2024) that asks models to predict outputs of short Python programs (output prediction) and inputs that produce a given output (input prediction). Distinguishes generation ability from reasoning ability. See also: HumanEval, SWE-bench.

D

Dataset versioning

The practice of treating your eval dataset as a contract: every version is content-hashed, every reported metric is pinned to a version, and migrations are documented. A v3.1 eval set with 500 examples is a different thing from v3.2 with 502 examples, and the historical numbers should never mix. See also: versioning and lineage.
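
One way to content-hash an eval set is sketched below. It assumes examples are JSON-serializable dicts and that example order should not change the version; both are choices made for this illustration, and `dataset_fingerprint` is a hypothetical helper name.

```python
import hashlib
import json

def dataset_fingerprint(examples):
    """Order-independent SHA-256 fingerprint of JSON-serializable examples."""
    # Canonical form: sorted keys and fixed separators, so formatting
    # differences do not change the hash.
    rows = sorted(
        json.dumps(ex, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
        for ex in examples
    )
    digest = hashlib.sha256()
    for row in rows:
        digest.update(row.encode("utf-8"))
        digest.update(b"\n")  # row delimiter
    return digest.hexdigest()

v1 = [{"id": 1, "input": "2+2?", "label": "4"},
      {"id": 2, "input": "capital of France?", "label": "Paris"}]
v2 = v1 + [{"id": 3, "input": "3*3?", "label": "9"}]

print(dataset_fingerprint(v1)[:12], dataset_fingerprint(v2)[:12])
```

Storing the fingerprint next to every reported metric makes it obvious when two numbers were computed on different versions of the set.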

Distillation (judge)

Training a smaller, cheaper model to imitate a frontier judge so it can be deployed inline in production. Patronus Lynx, Galileo Luna, Vectara HHEM, Prometheus, and JudgeLM are all distilled judges; the architecture pattern is inline distilled judge plus periodic frontier calibration plus a human gold set. See also: Prometheus, JudgeLM.

Drift

A change in input distribution, model behavior, or both that degrades eval scores over time. Two categories: data drift (the inputs your users send shift), and model drift (the upstream provider silently updates the model). Both are caught by continuous eval, not by point-in-time benchmarks. See also: online vs offline, environmental drift.

E

Effect size

The magnitude of a difference between two systems, separate from whether that difference is statistically significant. A model that scores 0.79 vs 0.78 may be "significantly" better at N=10,000 but the effect size is small; report Cohen's d, accuracy delta, or win rate rather than a p-value alone. See also: statistical power.

Elo

A relative skill rating system originally designed for chess, adapted to LLM eval by LMArena and others. After every pairwise match the two ratings update by a function of the expected vs actual outcome; the system converges to a ranking even without an absolute ground truth. See also: LMArena, Bradley-Terry.
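
The update rule can be sketched as follows. The 400-point scale and K=32 are the conventional chess constants and are assumptions here; a given leaderboard may use different values.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a, rating_b, score_a, k=32):
    """Return updated (rating_a, rating_b) after one match.

    score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie.
    """
    e_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - e_a)  # actual minus expected, scaled by K
    return rating_a + delta, rating_b - delta

# Two equally rated models; A wins the pairwise vote.
print(elo_update(1000, 1000, 1.0))  # (1016.0, 984.0)
```

An upset moves ratings more than an expected result: the update is proportional to how surprising the outcome was.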

Error analysis

The discipline of reading actual model traces, writing notes on what went wrong, clustering those notes into failure modes, and prioritizing fixes by frequency times severity. Hamel Husain's "60-80% rule" claims that three categories usually account for the bulk of bugs. See also: the 60-80% rule, open coding workflow.

Eval (offline)

An eval run on a static labeled dataset, typically on a developer machine or in CI, before a release. Fast feedback, controlled inputs, fully reproducible. The complement of online eval; both are required. See also: online vs offline.

Eval (online)

An eval run on a sample of live production traffic, often using lightweight judges or rule-based checks. Catches drift and bugs that offline sets miss; latency-bounded and cost-bounded by design. See also: online vs offline.

F

F1 score

The harmonic mean of precision and recall, a default summary metric for binary classification with skewed classes. F1 is more honest than accuracy on imbalanced data but hides which side (precision or recall) is hurting; report both. See also: accuracy.
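
A small sketch of why reporting precision and recall alongside F1 matters on skewed data. The helper name and the toy labels are illustrative assumptions.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 5 positives in 100 examples; the model never predicts the positive class.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, precision_recall_f1(y_true, y_pred))  # 0.95 (0.0, 0.0, 0.0)
```

Accuracy looks healthy at 0.95 while F1 is zero, and the separate precision and recall values show which side is failing.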

FActScore

A factuality metric for long-form generation (Min et al., 2023) that decomposes a generated passage into atomic facts and checks each against a reference knowledge base. Standard tool for evaluating long-form biographies, encyclopedic summaries, and similar reference-grounded outputs. See also: generation faithfulness, atomic fact.

Faithfulness

In RAG eval, the property of a generation that says only what its retrieved context supports, without adding unsupported claims. Distinct from accuracy: a generation can be 100% faithful and still wrong if the context is wrong. RAGAS faithfulness implements an atomic-fact decomposition plus per-fact verification. See also: generation faithfulness, FActScore.

FFRR

Format-following refusal rate. A measure of how often a model refuses to produce a requested format (JSON, XML, a specific schema) even when the request is benign. Distinct from a content-driven refusal; FFRR shows up on instruction-following benchmarks like IFEval. See also: IFEval.

FrontierMath

A 2024 benchmark of expert-written math problems (Glazer et al.) designed to be hard for frontier models and resistant to contamination because the problems are unpublished. Prevents the "GSM8K is saturated" problem and exposes large gaps in current LLM math ability. See also: GPQA, saturation.

G

G-Eval

A reference-free LLM-judge framework (Liu et al., 2023) that asks the judge to score on multiple dimensions using a CoT-style structured prompt and to output token probability-weighted scores. The original paper claimed strong correlation with humans; subsequent work flagged calibration problems on harder tasks. See also: LLM-as-Judge, judge prompt structure.

GAIA

A 2023 benchmark for general AI assistants (Mialon et al., Meta + HuggingFace) covering web browsing, multi-modal reasoning, and tool use. Designed to be easy for humans (~92%) and hard for assistants (GPT-4 with plugins around 15% at release). See also: agentic eval.

Golden set

A small, hand-labeled dataset used as ground truth for judge calibration and regression testing. The golden set is the contract with your future self: it does not grow much over time, every example has been reviewed by the principal domain expert, and judge agreement is measured against it. The first one is usually a byproduct of judge calibration, not a separate project. See also: construction, truth set, evals without a gold dataset.

GPQA

Graduate-Level Google-Proof Q&A (Rein et al., 2023), a benchmark of expert-written physics, chemistry, and biology questions where the answer requires graduate-level domain knowledge that cannot be googled trivially. Used to measure frontier reasoning ability without contamination from web text. See also: FrontierMath, contamination.

Guardrail

A real-time check that fires before or after a model call to block or rewrite content that violates a policy (PII, profanity, jailbreak detection, schema validation). Different from an eval: a guardrail runs on every request and has a strict latency budget; an eval runs offline or asynchronously and can afford to be more expensive. See also: eval vs monitoring vs guardrails.

H

HarmBench

A safety benchmark from the Center for AI Safety and academic collaborators (Mazeika et al., 2024) that measures attack success rates of various jailbreak techniques against a target model. Used both to grade defenses and as a substrate for red-team automation. See also: red-team program design, OWASP LLM Top 10.

Hit rate

In retrieval eval, the fraction of queries for which at least one relevant document appears in the top-K results. Simpler than Recall@K (which counts how many relevant docs you got) and useful when only one relevant doc exists per query. See also: Recall@K, retrieval vs generation.

HumanEval

OpenAI's 2021 Python-function code benchmark: 164 hand-written problems where the model writes a function body and is graded by passing hidden unit tests. Now contaminated for most public models; LiveCodeBench and SWE-bench are the modern replacements for code eval. See also: SWE-bench, contamination.

I

IFEval

Instruction-Following Eval (Zhou et al., 2023), a benchmark that tests whether models obey 25 categories of verifiable instructions (use exactly 3 paragraphs, end with "P.S.", respond in JSON). Verifiable by rule, no judge needed. See also: JSON Schema.

Inter-rater agreement

The degree to which multiple human raters (or judge runs) assign the same label to the same example. Measured by Cohen's kappa, Krippendorff's alpha, or pairwise agreement; without measuring it you cannot tell whether a "60% accuracy" judge is broken or just noisy. See also: Cohen's kappa, Krippendorff's alpha.

Inspect AI

The open-source eval harness from the UK AI Safety Institute (since renamed the AI Security Institute), popular for serious safety and capabilities eval. Inspect ships with strong primitives for tool use, multi-step solver pipelines, sandboxed code execution, and async batched judging. See also: LLM-as-Judge introduction.

J

JSON Schema

A vocabulary for declaring the structure of JSON data: required fields, types, enums, nested objects, length constraints. Used as the contract for structured-output evals and as the validation layer in tool-call pipelines. See also: tool-call evaluation, IFEval.

JudgeBench

A benchmark (Tan et al., 2024) that tests whether LLM judges can correctly grade hard reasoning, math, and code outputs. The headline result: GPT-4o performs near random on these categories. The takeaway is not to retire LLM judges but to stop using them alone on hard reasoning. See also: judgebench honest take.

JudgeLM

A family of distilled, fine-tuned LLM judges (Zhu et al., 2023) trained on synthetic judge data and released as small (7B-33B) open models for inline production use. See also: Prometheus, distillation (judge).

K

Krippendorff's alpha

A general inter-rater agreement statistic that handles multiple raters, missing data, and any level of measurement (nominal, ordinal, interval). Default choice when you have three or more annotators or partial annotation. See also: Cohen's kappa, inter-rater agreement.

L

LangSmith

LangChain's hosted observability and eval platform, tightly integrated with the LangChain SDK. Strong if your stack is already LangChain; the typical alternative is Langfuse for a more SDK-neutral approach. See also: Langfuse.

Langfuse

An open-source observability platform for LLM apps with traces, datasets, prompts, and an LLM-judge harness. Self-hostable, SDK-neutral, and the default recommendation on this site for production observability where you want to own your data. See also: LangSmith, Phoenix.

Likert

A scale that asks a rater to score on a fixed numeric range, usually 1 to 5 or 1 to 7. Likert scales feel intuitive and are usually wrong for LLM eval: humans and judges drift, the granularity is illusory, and binary pass/fail with a written critique recovers more signal. See also: judge prompt structure.

LLM-as-Judge

The pattern of using a language model to score outputs from another (or the same) language model against a written rubric. Works for open-ended generation, RAG faithfulness, tone, and instruction following; fails for hard reasoning, math, and code, where executable verification is the right primary signal. See also: LLM-as-Judge introduction, biases and mitigations.

LMArena

A crowd-sourced pairwise eval platform (formerly Chatbot Arena) where users vote on side-by-side anonymous model responses and a Bradley-Terry fit converts the votes into a public Elo-style leaderboard. The largest single source of human pairwise preference data. See also: Elo, pairwise.

Long-context

Workloads where the input prompt approaches or exceeds the model's full context window, typically 100K tokens and up. Long-context eval has its own failure modes (lost-in-the-middle, needle-in-a-haystack misses, attention dilution) that short-context evals do not catch. See also: Needle-in-a-Haystack.

Loss

In training, the scalar function the optimizer minimizes; in eval, sometimes used loosely to mean the model's negative log-likelihood on a holdout set (perplexity is the exponential of its per-token average). Rarely the right primary eval metric for product-facing LLM systems. See also: accuracy.

M

MMLU

Massive Multitask Language Understanding (Hendrycks et al., 2020), a 57-subject multiple-choice benchmark covering high-school and college-level material. The most-cited general knowledge benchmark; widely contaminated and now considered close to saturated for frontier models. See also: MMMU, saturation, contamination.

MMMU

Massive Multi-discipline Multimodal Understanding (Yue et al., 2023), a multimodal successor to MMLU covering 30 subjects with mixed text and image questions at college level. Standard reference for multimodal capability eval. See also: MMLU.

MRR

Mean Reciprocal Rank, a retrieval metric: the reciprocal of the rank at which the first relevant document appears, averaged across queries. Strict on the top result; weak signal when relevant docs cluster in the lower ranks. See also: MRR chapter, Recall@K, NDCG.
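
A minimal sketch of the computation, assuming each query comes with a ranked list of document ids and a set of relevant ids (the data layout and names are illustrative):

```python
def mean_reciprocal_rank(rankings, relevant_sets):
    """MRR over queries; a query with no relevant hit contributes 0."""
    total = 0.0
    for ranking, relevant in zip(rankings, relevant_sets):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(rankings)

rankings = [["d1", "d2", "d3"],   # first relevant doc at rank 1
            ["d4", "d5", "d6"],   # first relevant doc at rank 3
            ["d7", "d8", "d9"]]   # no relevant doc retrieved
relevant_sets = [{"d1"}, {"d6"}, {"d0"}]
print(round(mean_reciprocal_rank(rankings, relevant_sets), 3))  # 0.444
```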

N

NDCG

Normalized Discounted Cumulative Gain, a retrieval metric that rewards correct ordering and applies a logarithmic discount to lower positions. Default ranking metric for graded relevance judgments. See also: NDCG chapter, MRR, Recall@K.
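
The discount and normalization can be sketched as below. This version uses linear gains (the graded relevance value itself); some implementations use 2^rel - 1 instead. It also assumes the input list contains every judged document for the query, so the ideal ordering can be derived by sorting it.

```python
import math

def dcg(gains):
    """Discounted cumulative gain: gain at rank i is divided by log2(i + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg_at_k(gains_in_ranked_order, k):
    """NDCG@k given graded relevance in the order the system ranked the docs."""
    actual = dcg(gains_in_ranked_order[:k])
    ideal = dcg(sorted(gains_in_ranked_order, reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

print(ndcg_at_k([3, 2, 1, 0], k=4))            # 1.0, already in ideal order
print(round(ndcg_at_k([0, 3], k=2), 3))        # 0.631, best doc ranked second
```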

Needle-in-a-Haystack (NIAH)

A long-context probe (Greg Kamradt's variant from 2023): plant a specific fact ("the magic number is 1729") inside a long document of unrelated text, then ask the model to retrieve it. Reveals attention failure across context lengths and positions. See also: long-context.

O

(no entries this letter; see P for production-related terms.)

P

Pairwise

A comparison pattern where the judge sees two responses side by side and picks the better one (or "tie"). Pairwise is the right default for subjective qualities (tone, helpfulness, style) where a Likert score would drift; the cost is twice the judge calls per example. See also: Bradley-Terry, Elo.

Pass@k

A code-eval metric: the fraction of problems where at least one of k sampled solutions passes the hidden tests. Pass@1 is the strict version; pass@10 and pass@100 reward sampling-time compute. Standard reporting metric on HumanEval and most code benchmarks. See also: pass^k, HumanEval.
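
In practice pass@k is usually estimated from n samples per problem with n larger than k, using the unbiased estimator from the HumanEval paper (Chen et al., 2021): one minus the probability that a random size-k subset of the n samples contains no passing solution. A minimal sketch for a single problem (the function name is illustrative):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one pass
    # 1 - P(all k drawn samples are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples, 3 passed.
print(round(pass_at_k(10, 3, 1), 3), round(pass_at_k(10, 3, 5), 3))
```

The benchmark-level number is the mean of this value across problems.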

Pass^k

An agentic-eval metric (Yao et al., τ-bench): the fraction of problems where all k independent reruns succeed. Pass^k catches catastrophic inconsistency that pass@1 hides; if your agent succeeds 90% of the time per try, pass^4 is about 0.66 and pass^8 is about 0.43. See also: pass^k chapter.
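
The arithmetic behind those figures, as a sketch: with independent tries at per-try success rate p, the chance that all k succeed is p to the power k. The second helper counts tasks where every recorded rerun succeeded; both function names and the data layout are assumptions for illustration.

```python
def pass_hat_k_from_rate(p, k):
    """Probability that k independent tries all succeed at per-try rate p."""
    return p ** k

def pass_hat_k_empirical(run_outcomes):
    """Fraction of tasks whose recorded reruns all succeeded.

    run_outcomes: one list of booleans per task, each of length k.
    """
    return sum(all(runs) for runs in run_outcomes) / len(run_outcomes)

for k in (1, 4, 8):
    print(k, round(pass_hat_k_from_rate(0.9, k), 4))  # 0.9, 0.6561, 0.4305

outcomes = [[True, True, True, True],
            [True, False, True, True],   # one flaky rerun fails the task
            [True, True, True, True]]
print(pass_hat_k_empirical(outcomes))
```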

Phoenix

Arize's open-source LLM observability platform, sibling to Langfuse with stronger integration into the broader Arize feature set for tracing, datasets, and inline evals. See also: Langfuse, LangSmith.

Position bias

The tendency of a pairwise LLM judge to prefer whichever response appears first (or sometimes second), independent of quality. The standard fix is to swap order on every comparison and require agreement under both orderings; a judge that flips its answer under position swap is not trustworthy. See also: biases and mitigations.
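
The swap-and-require-agreement check can be sketched as follows. `judge` is a hypothetical callable standing in for whatever LLM call you use; it is assumed to take the two responses in presentation order and return "first", "second", or "tie".

```python
def swap_consistent_verdict(judge, response_a, response_b):
    """Return "A", "B", or "tie" if the judge agrees with itself under
    position swap, else None (verdict not trustworthy)."""
    forward = judge(response_a, response_b)    # A shown first
    backward = judge(response_b, response_a)   # B shown first
    # Map positional answers back to the underlying responses.
    forward_winner = {"first": "A", "second": "B", "tie": "tie"}[forward]
    backward_winner = {"first": "B", "second": "A", "tie": "tie"}[backward]
    return forward_winner if forward_winner == backward_winner else None

# Stub judge with pure position bias: always prefers whatever it sees first.
def always_first(first, second):
    return "first"

print(swap_consistent_verdict(always_first, "short answer", "longer answer"))  # None
```

Tracking the share of comparisons that come back `None` gives a direct position-bias rate for the judge.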

PRM (Process Reward Model)

A reward model that scores intermediate reasoning steps, not just final answers. Used in math and code RL pipelines (OpenAI PRM800K, Math-Shepherd, Let's Verify Step by Step). Distinct from an outcome reward model. See also: reward hacking.

Prompt sensitivity

The empirical fact that small, semantically irrelevant changes to a prompt (extra whitespace, paraphrased instructions, the order of options) can swing a benchmark score by tens of points. Demonstrated by Sclar et al. 2023 and others; the implication is that single-prompt benchmark numbers are noisy. See also: non-determinism and subjectivity.

Prometheus

A family of open, instruction-tuned judge models (Kim et al., 2023 and 2024) trained on a large synthetic dataset of rubric-grounded judgments. Used as a distilled judge for production deployment. See also: distillation (judge), JudgeLM.

R

RAG

Retrieval-Augmented Generation: an architecture pattern where a retrieval step fetches relevant documents from a corpus and the language model conditions its generation on them. Eval has to grade retrieval and generation separately; conflating them hides which layer broke. See also: retrieval vs generation, generation faithfulness.

Ragas

A Python library for RAG-specific evals: faithfulness, answer relevance, context precision, context recall, and synthetic query generation. Lightweight, judge-based, and easy to drop into an existing RAG pipeline. See also: retrieval vs generation.

Recall@K

A retrieval metric: the fraction of relevant documents that appear in the top K results. Recall@K is the right default when you care about coverage; pair with precision or MRR when you care about ordering as well. See also: Recall@K chapter, Hit rate, NDCG.
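
A minimal per-query sketch that also shows how Recall@K differs from hit rate. The document ids and function names are illustrative.

```python
def recall_at_k(ranking, relevant, k):
    """Fraction of the relevant docs that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranking[:k]) & relevant) / len(relevant)

def hit_at_k(ranking, relevant, k):
    """1.0 if at least one relevant doc appears in the top k, else 0.0."""
    return 1.0 if set(ranking[:k]) & relevant else 0.0

ranking = ["d3", "d7", "d1", "d9"]
relevant = {"d1", "d2", "d7"}
# Top 3 contains d7 and d1 but misses d2.
print(round(recall_at_k(ranking, relevant, 3), 3), hit_at_k(ranking, relevant, 3))
```

Averaging each function across queries gives the dataset-level Recall@K and hit rate.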

Red-team

The structured practice of probing an AI system for failure: jailbreaks, prompt injection, harmful content elicitation, policy bypass. A red-team program produces reproducible attacks, a triage workflow, and a feed of new failure modes back into the eval set. See also: red-team program design, HarmBench.

Reference-based metric

Any eval metric that compares a model output to a reference (or set of references) and produces a similarity score. BLEU, ROUGE, BERTScore, COMET, and XCOMET are reference-based; LLM-as-Judge and pairwise human preference are reference-free. See also: BLEU, COMET, reference-free evaluation, evals without a gold dataset.

Reference-free evaluation

Any eval that scores a model output without a per-input gold answer: code assertions, an LLM judge applying a written rubric, faithfulness checks against retrieved context, pairwise preference. The standard starting point for open-ended tasks where gold is expensive and ill-defined; on this site it is treated as the bootstrap path to a gold dataset, not a permanent alternative to one. See also: evals without a gold dataset, reference-based metric, LLM-as-Judge.

Regression set

A small, stable eval set that every release runs against to catch regressions on known-good behaviors. The regression set grows slowly and only when a real bug is fixed; it is not the place to dump every example you have ever labeled. See also: regression sets.

Reward hacking

A failure mode where a model finds an unintended shortcut that scores high on the proxy metric (an LLM judge, an automated grader) without actually solving the task. Pan et al. 2024 catalog in-context hacking patterns; Krakovna's spec-gaming taxonomy from DeepMind is the canonical broader catalog. See also: reward hacking chapter.

ROUGE

Recall-Oriented Understudy for Gisting Evaluation, an n-gram overlap metric introduced for summarization in 2004. Still cited for backwards compatibility; superseded by learned metrics and LLM-judge for almost all modern summarization eval. See also: BLEU, BERTScore.

RPM (Reward Process Monitoring)

Reward Process Monitoring, an emerging family of evals that grade whether a model's chain of thought is faithful to the action it eventually takes, surfacing reward hacking inside reasoning chains. See also: reward hacking.

Rubric

A written scoring guide used by judges (LLM or human) to convert a free-form output into a label or score. A good rubric specifies the dimensions, the binary criteria for pass on each, and ideally a worked anchor example. The rubric is the single most consequential artifact in any LLM-judge eval; a vague rubric produces vague scores. See also: judge prompt structure, anchor.

S

Sandbagging

The pattern where a model deliberately under-performs on capability evals (especially safety evals) to evade scrutiny, then performs at full capability in deployment. A live concern in capability and alignment eval; defenses include diverse prompts, behavioral observation across long horizons, and held-out probes. See also: reward hacking.

Saturation (benchmark)

The state where a benchmark stops separating models because the top models all score near the ceiling. MMLU, HumanEval, and GSM8K are widely considered saturated for frontier models as of 2026; FrontierMath, GPQA, and ARC-AGI were built to delay saturation by raising difficulty and resisting contamination. See also: FrontierMath, GPQA, contamination.

Schema (output)

A formal contract for model output structure (usually JSON Schema or zod / pydantic). Schema-conformance is the cheapest eval you can run: parse the output, fail closed. Pair with a content-grade eval for the substantive correctness. See also: JSON Schema, tool-call evaluation.
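
"Parse the output, fail closed" can be sketched with the standard library alone. This is a hand-rolled stand-in for a real validator (a JSON Schema library or pydantic would replace it in practice); the `REQUIRED_FIELDS` contract is a hypothetical example.

```python
import json

# Hypothetical contract: field name -> required Python type after parsing.
REQUIRED_FIELDS = {"answer": str, "sources": list}

def conforms(raw_output, required=REQUIRED_FIELDS):
    """True only if the output parses as a JSON object with every required
    field present and correctly typed. Anything else fails closed."""
    try:
        data = json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        return False  # not valid JSON (or not a string at all)
    if not isinstance(data, dict):
        return False
    return all(
        name in data and isinstance(data[name], expected)
        for name, expected in required.items()
    )

outputs = ['{"answer": "42", "sources": ["doc-1"]}',
           'Sure! Here is the JSON: {"answer": "42"}',
           '{"answer": 42, "sources": []}']
print([conforms(o) for o in outputs])  # [True, False, False]
```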

Self-consistency

A sampling technique (Wang et al., 2022): generate K reasoning chains at high temperature, then take a majority vote across their final answers and report the winner. Improves accuracy on math and reasoning benchmarks; expensive at large K. See also: CoT.
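
The voting step, as a sketch. It assumes the final answers have already been extracted from each sampled chain and normalized to comparable strings; the sampling itself is out of scope here and `majority_answer` is an illustrative name.

```python
from collections import Counter

def majority_answer(final_answers):
    """Most common final answer and the share of chains that voted for it.

    Ties go to the answer that appeared first in the input.
    """
    counts = Counter(final_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(final_answers)

# Final answers extracted from K=5 sampled reasoning chains.
print(majority_answer(["42", "42", "41", "42", "17"]))  # ('42', 0.6)
```

The vote share is worth logging: a 3-of-5 win and a 5-of-5 win are the same answer with very different stability.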

Sphinx

A category of agentic challenges where the model has to extract structured information from messy real-world documents (PDFs, scanned forms, semi-structured HTML) and answer questions over them. Sphinx-style evals stress OCR, layout understanding, and multi-hop reasoning. See also: GAIA.

Spider

A text-to-SQL benchmark (Yu et al., 2018) covering 200 databases across 138 domains. Spider evaluates whether models can translate natural-language questions into executable SQL queries; standard primary eval for text-to-SQL systems alongside execution accuracy. See also: tool-call evaluation.

Statistical power

The probability that a comparison correctly detects a real difference of a given effect size. Underpowered evals miss real wins (false negatives); the fix is more samples, paired tests when applicable, or a larger effect-size target. See also: confidence intervals, effect size.

SWE-bench

A real-world software engineering benchmark from Princeton (Jimenez et al., 2024): the model is given a GitHub issue and the relevant repository, and graded on whether its patch makes the hidden tests pass. SWE-bench Verified is the human-validated subset of 500 problems used as the production benchmark. See also: HumanEval.

Sycophancy

The tendency of a model (or judge) to agree with the user's stated position even when the position is wrong. Sycophancy is one of the standard judge biases; the mitigation is anchored rubrics with explicit "disagree if the user is wrong" framing. See also: biases and mitigations.

T

τ-bench

A 2024 benchmark from Sierra (Yao et al.) for tool-use agents in customer-service scenarios. Each task is a multi-step dialogue with a simulated user and a backend API; the agent must execute the right tool calls in the right order and respect policy. τ-bench introduced the pass^k metric. See also: pass^k, tool-call evaluation.

Token

The smallest unit of text the model processes: roughly 0.75 English words per token, but the exact split depends on the tokenizer. Per-token cost and per-token latency are the cost-side primary metrics for any production LLM system. See also: trace.

Tool-call F1

A composite metric for agentic eval: precision and recall over the set of (tool, arguments) calls the agent makes, combined as their harmonic mean. Reported in BFCL and similar benchmarks. See also: tool-call evaluation, BFCL.
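
A sketch of the set-based computation. It assumes each call is a dict with `tool` and `arguments` keys and that a call counts as correct only on an exact match of both; real benchmarks may match arguments more leniently, and the tool names below are made up.

```python
import json

def _call_key(call):
    # Serialize arguments with sorted keys so dict ordering does not matter.
    return (call["tool"], json.dumps(call["arguments"], sort_keys=True))

def tool_call_f1(predicted, expected):
    """F1 over the sets of (tool, arguments) calls, ignoring call order."""
    pred = {_call_key(c) for c in predicted}
    gold = {_call_key(c) for c in expected}
    if not pred and not gold:
        return 1.0  # correctly made no calls
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

expected = [{"tool": "lookup_order", "arguments": {"order_id": "A1"}},
            {"tool": "refund", "arguments": {"order_id": "A1", "amount": 20}}]
predicted = [{"tool": "lookup_order", "arguments": {"order_id": "A1"}},
             {"tool": "refund", "arguments": {"order_id": "A1", "amount": 25}},
             {"tool": "send_email", "arguments": {"to": "user"}}]
# One exact match: precision 1/3, recall 1/2, F1 0.4.
print(round(tool_call_f1(predicted, expected), 2))
```

Because the metric is set-based it says nothing about call order; pair it with a trajectory check when order matters.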

Trace

A structured record of one model interaction (or one multi-step agent run), capturing inputs, outputs, intermediate tool calls, latencies, costs, and any annotations. Traces are the substrate of every serious eval program; without them, error analysis is anecdotal. See also: trace schema, the data viewer.

Trajectory

The ordered sequence of states, actions, and observations in an agent run. Trajectory eval grades the path itself (did the agent take a sensible route?) and not only the final outcome (did it reach the goal?). See also: trajectory vs outcome.

Truth set

A small set of examples whose correct answers are not in dispute. Used as the ground floor for judge calibration and as a sanity check for any new eval pipeline. See also: golden set, construction.

V

Verifier

Any component, LLM or rule-based, that takes a candidate output and returns a verdict (pass/fail or a graded score). LLM judges, unit tests, schema validators, and atomic-fact checkers are all verifiers. The right verifier choice depends on whether you have a programmatic ground truth or not. See also: LLM-as-Judge.

W

Wilson interval

The Wilson score confidence interval for a binomial proportion: the right default for "how confident am I that this judge has 85% true-positive rate" when N is moderate. Better-behaved than the textbook Wald interval at small N and at extreme proportions (close to 0 or 1). See also: confidence intervals.
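
The interval has a closed form, sketched below for a 95% level (z = 1.96). The function name is illustrative.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Judge agreed with the expert on 85 of 100 true positives.
lo, hi = wilson_interval(85, 100)
print(f"[{lo:.3f}, {hi:.3f}]")  # [0.767, 0.907]

# At 0 of 10 the Wald interval collapses to zero width; Wilson does not.
lo0, hi0 = wilson_interval(0, 10)
print(f"[{lo0:.3f}, {hi0:.3f}]")
```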

X

XCOMET

A learned multilingual MT metric (Guerreiro et al., 2023), a successor to COMET that adds fine-grained error span detection in addition to a quality score. Strong correlation with human ratings across language pairs. See also: COMET, BLEU.