Why LLM evals are hard

Non-determinism, subjectivity, prompt sensitivity, and the reasons classical ML eval frameworks do not transfer.

Classical ML evaluation assumes a fixed model, a fixed input shape, a deterministic forward pass, and ground truth two annotators would agree on. LLM evaluation breaks every one of those assumptions. This page names the breaks specifically so the rest of the site can refer back to them.

Non-determinism

Two runs of the same prompt often produce different answers. Even at temperature zero, outputs can differ: providers route requests across data centers, KV-cache states differ, and floating-point variation accumulates through decoding. The output is a sample from a distribution, not a fixed mapping ^[1].

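A single eval run is one realization of a noisy process. A 2-point movement on a 100-example set is usually inside the run-to-run variance of the same model on the same prompt, and well inside the sampling error of the set itself: at 70 percent accuracy on 100 examples, the binomial standard error is about 4.6 points. The fix is to treat eval runs as samples rather than point measurements: report confidence intervals, compare models on paired examples, and replicate runs.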
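A minimal sketch of the paired comparison, assuming each run yields one 0/1 correctness score per example and that both runs cover the same examples in the same order. The function name `paired_bootstrap_ci` and its parameters are illustrative, and the percentile bootstrap is one reasonable choice of interval rather than the only one.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_b) - mean(scores_a).

    Scores are paired by position: scores_a[i] and scores_b[i] must come
    from the same eval example. Returns (observed_delta, ci_low, ci_high).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired comparison needs the same examples in both runs")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores_a)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]

    # Resample examples with replacement and record the mean difference each time.
    resampled_means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(sum(sample) / n)
    resampled_means.sort()

    low = resampled_means[int((alpha / 2) * n_resamples)]
    high = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, low, high


if __name__ == "__main__":
    # Hypothetical runs: the candidate fixes two examples out of 100.
    run_a = [1] * 70 + [0] * 30
    run_b = [1] * 72 + [0] * 28
    print(paired_bootstrap_ci(run_a, run_b))
```

If the interval contains zero, the 2-point movement is not distinguishable from noise at this sample size.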

Subjectivity and criteria drift

"Helpful" is not a property two annotators agree on without a shared rubric. The harder finding from Shankar et al. is that the rubric itself only becomes visible once the team has read outputs. Criteria emerge from the act of grading and keep moving as the grader sees more examples ^[2]. They call this criteria drift and document it across teams who thought they had a stable rubric.

The rubric is a downstream artifact of error analysis, not an upstream input. Teams who try to write the rubric first, then label data, end up rewriting it after the first hundred labels. Treat rubric revisions as version-controlled changes that invalidate prior labels until the team relabels.
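One way to make "a rubric revision invalidates prior labels" mechanical is to stamp every label with the rubric version it was produced under. The sketch below assumes an integer version counter; `Label`, `valid_labels`, and `needs_relabel` are hypothetical names, not part of any existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    example_id: str
    verdict: bool          # pass/fail under the rubric
    rubric_version: int    # version of the rubric the grader used


def valid_labels(labels, current_version):
    """Keep only labels produced under the current rubric version."""
    return [lab for lab in labels if lab.rubric_version == current_version]


def needs_relabel(labels, current_version):
    """Example ids that have no label under the current rubric version."""
    all_ids = {lab.example_id for lab in labels}
    fresh_ids = {lab.example_id for lab in valid_labels(labels, current_version)}
    return sorted(all_ids - fresh_ids)


if __name__ == "__main__":
    labels = [
        Label("ex-1", True, rubric_version=1),
        Label("ex-2", False, rubric_version=1),
        Label("ex-1", False, rubric_version=2),  # relabeled after the revision
    ]
    print(needs_relabel(labels, current_version=2))
```

Metrics computed only from `valid_labels` never mix verdicts from two different rubrics.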

Prompt sensitivity

LLM accuracy on a fixed task swings with surface-level prompt changes that humans would treat as equivalent. Sclar et al. showed swings of up to 76 accuracy points on LLaMA-2-13B from prompt-format perturbations that preserved the question's meaning (different separators, option-label styles, whitespace) ^[3]. The effect persists across model scales and families, and the rank ordering of models changes with the prompt format.

Benchmark rankings are a function of the benchmark plus the prompt-format choice. Reporting a single number per model without the prompt format is reporting a fraction of the experiment.
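A small sketch of reporting the format alongside the number, assuming accuracy has already been measured per model and per prompt format. The model names, format names, and accuracies below are placeholder values invented for illustration; they are not measurements from any paper or benchmark.

```python
def format_spread(acc_by_format):
    """Gap between the best and worst accuracy across prompt formats."""
    values = list(acc_by_format.values())
    return max(values) - min(values)


def rank_models(results, fmt):
    """Model names ordered best-first under a single prompt format."""
    return sorted(results, key=lambda model: results[model][fmt], reverse=True)


# Placeholder numbers for illustration only.
results = {
    "model_x": {"fmt_colon": 0.62, "fmt_newline": 0.48},
    "model_y": {"fmt_colon": 0.57, "fmt_newline": 0.55},
}

if __name__ == "__main__":
    for fmt in ("fmt_colon", "fmt_newline"):
        print(fmt, rank_models(results, fmt))
    for model, accs in results.items():
        print(model, "spread:", round(format_spread(accs), 2))
```

Reporting the per-format table and the spread, rather than a single cell, shows whether a ranking survives a format change.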

Format sensitivity at small scales

The Anthropic team flagged a smaller, related effect: switching multiple-choice option labels from (A) ... (D) to (1) ... (4) shifts MMLU scores by roughly five percentage points ^[4]. Five points is the size of the generation-over-generation improvement that vendor blogs announce as a leap.

If a five-point swing is downstream of a label-format choice, a five-point movement on a public benchmark could be a model improvement or a format change in the eval rig. Without the rig code pinned, the number is not interpretable.

(A) Paris        (1) Paris
(B) London       (2) London
(C) Berlin       (3) Berlin
(D) Madrid       (4) Madrid

Same question. Different score.
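Pinning the rig means the label style is an explicit, recorded parameter rather than an accident of the prompt template. A minimal sketch, assuming a hypothetical `render_options` helper with two named styles:

```python
import string


def render_options(options, style):
    """Render multiple-choice options under an explicit, named label style."""
    if style == "letters":
        labels = list(string.ascii_uppercase[: len(options)])
    elif style == "numbers":
        labels = [str(i) for i in range(1, len(options) + 1)]
    else:
        raise ValueError(f"unknown label style: {style!r}")
    return "\n".join(f"({label}) {option}" for label, option in zip(labels, options))


if __name__ == "__main__":
    cities = ["Paris", "London", "Berlin", "Madrid"]
    print(render_options(cities, "letters"))
    print(render_options(cities, "numbers"))
```

Logging the `style` value with every score makes it possible to tell a format change from a model change after the fact.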

Multi-dimensional quality

A response can be factually correct and badly written, well-written and miss the question, or answer the question and ignore the policy disclaimer. Classical ML metrics (accuracy, F1, ROC-AUC) collapse to a single scalar; LLM quality has at least three independent axes (correctness, instruction-following, safety) and often more for a given product ^[1]. Aggregating them into one score hides the failure mode you care about.

Score each axis separately and decide release gates per axis. A 2-point bump in correctness that comes with a 4-point regression in instruction-following is not a win. A scorecard that hides the trade-off ships the regression.
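A per-axis gate can be as small as the sketch below. It assumes scores are expressed in points on a common scale and that the team has chosen a maximum tolerated regression per axis; the function name, axis names, and thresholds are illustrative.

```python
def gate_release(baseline, candidate, max_regression):
    """Return {axis: delta} for every axis that regressed more than allowed.

    baseline, candidate: {axis: score in points}
    max_regression: {axis: largest tolerated drop in points}
    An empty result means every axis passed its gate.
    """
    failures = {}
    for axis, allowed_drop in max_regression.items():
        delta = candidate[axis] - baseline[axis]
        if delta < -allowed_drop:
            failures[axis] = delta
    return failures


if __name__ == "__main__":
    # Hypothetical scorecard: +2 correctness, -4 instruction-following.
    baseline = {"correctness": 80.0, "instruction_following": 90.0, "safety": 99.0}
    candidate = {"correctness": 82.0, "instruction_following": 86.0, "safety": 99.0}
    limits = {"correctness": 1.0, "instruction_following": 1.0, "safety": 0.0}
    print(gate_release(baseline, candidate, limits))
```

The gate never averages across axes, so an improvement on one axis cannot pay for a regression on another.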

Why classical ML frameworks fail

Three assumptions break. The test set is supposed to be a fixed sample from the production distribution; for LLMs the production distribution drifts and the model itself changes under provider revisions. Ground truth is supposed to be a label two annotators agree on; for open-ended generation the label is a rubric output and rubrics drift ^[2]. Public benchmarks are supposed to be uncontaminated; by 2026 the major ones are contaminated, and MMLU-Redux found 6.5 percent of MMLU items have label errors or ambiguity ^[5].

One mitigation for the ground-truth break is to ask graders for binary pass/fail judgments with written critiques, which reduces the agreement problem to a coarser but more stable signal ^[6]. None of the breaks alone is fatal; together they require a different discipline, which the rest of the site builds.