Every LLM feature passes a vibe check on the first ten examples. The eleventh breaks it. Evals tell you which eleventh you shipped and how often it happens [1]. The rest of this site is about how to build them. This page is here to convince you, before any tooling, that the work is worth it.
The three reasons LLM features rot
Three properties of LLM systems make ship-and-pray feel fine for a week and ruin a product over a quarter.
Non-determinism. Same prompt, two runs, two answers. Even at temperature zero, different providers, different time-of-day routing, and different KV-cache states produce drift you cannot reproduce on demand. Classical ML eval frameworks assume a fixed model and a fixed input shape. For an LLM product, both assumptions are false [2].
Subjectivity. "Helpful" is not a label two annotators will agree on without a shared rubric. The criteria themselves only become visible when you read outputs; Shankar et al. call this criteria drift and show that even careful teams cannot fully specify their grading rubric until they have looked at the outputs they want to grade [3]. So the rubric is a downstream artifact of error analysis, not an upstream input.
Drift. Users behave differently from the synthetic seed dataset on day one, differently again on day thirty as your marketing changes, and again on day ninety as a competitor launches. The model under the hood drifts too as providers swap minor revisions in. Huyen's framing is that evaluation is not a stage, it is a continuous concern at every step of the platform [4].
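The subjectivity point can be made measurable. A minimal sketch, assuming two annotators have labeled the same examples pass or fail: Cohen's kappa corrects raw agreement for the agreement you would expect by chance. The function name and the sample labels are illustrative and do not come from any of the cited sources.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items where both annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if each annotator labeled at random with their own base rates.
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(counts_a) | set(counts_b)
    )
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators on four traces.
annotator_1 = ["pass", "pass", "fail", "fail"]
annotator_2 = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on three of four traces, but half of that agreement is expected by chance, so kappa comes out at 0.5. A low kappa is usually a sign that the rubric is underspecified rather than that one annotator is careless.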
NOTE
If you only remember one thing: LLM evaluation is not the QA pass after a build. It is the closed loop that makes the build possible at all.
Put together, the three reasons mean that traditional QA assumptions break on three fronts. You cannot write a deterministic test that fails reliably on a regression. You cannot write a static rubric that captures the failure modes you have not seen yet. You cannot freeze the input distribution because the users are the ones generating it.
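One way to see the non-determinism problem in your own system is to send the same prompt several times and measure how often the answers agree. The sketch below assumes a `generate` callable that wraps your model client; that name, `repeat_agreement`, and the canned stand-in model are hypothetical. Exact string matching is a crude proxy, since two differently worded answers can mean the same thing.

```python
import itertools
from collections import Counter
from typing import Callable


def repeat_agreement(generate: Callable[[str], str], prompt: str, runs: int = 10) -> float:
    """Fraction of runs that return the most common answer (1.0 = fully repeatable)."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    outputs = [generate(prompt).strip() for _ in range(runs)]
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / runs


# Stand-in for a real model call: cycles through canned answers to mimic drift.
_canned = itertools.cycle(["Yes.", "Yes.", "No.", "Yes."])


def fake_model(prompt: str) -> str:
    return next(_canned)


rate = repeat_agreement(fake_model, "Is asparagus keto-friendly?", runs=8)
```

With the canned answers above, six of eight runs agree, so `rate` is 0.75. Against a real endpoint the number will vary between invocations, which is the point.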
Why ship-and-pray fails
The default workflow goes: someone writes a prompt, clicks around the demo, fixes the obvious problems, and ships. Then a Slack thread accumulates. "The bot suggested asparagus to the keto user." "It said it would check on something and then never did." "It rendered markdown in an SMS." Each report is patched. Each patch breaks something else. Six weeks in, nobody on the team can tell whether yesterday's release is better or worse than last Tuesday's [5].
The pattern is familiar to anyone who has shipped without unit tests. It is worse for LLMs because the surface area is bigger and the regressions are harder to spot. A failed unit test is a red line in CI. A regression in helpfulness shows up as a slow trickle of churn three weeks later.
flowchart LR
A[Idea] --> B[Prompt change]
B --> C[Click around demo]
C --> D[Ship]
D --> E[User reports]
E --> F[Patch the bug]
F --> B
D -.-> G[Reputation damage]
G -.-> H[Churn]
subgraph eval-gated
P[Idea] --> Q[Prompt change]
Q --> R[Run eval suite]
R --> S{Score worse?}
S -- yes --> Q
S -- no --> T[Ship behind flag]
T --> U[Online eval on sampled traces]
U --> V[Hold or roll out]
end
The eval-gated loop is not magic. It is the same loop statistical process control has run since the 1920s: measure the thing, gate the change on the measurement, keep the measurements honest. The hard part is not the gate. The hard part is the measurement.
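The "Score worse?" decision in the eval-gated loop can be sketched in a few lines. This assumes you already have per-example pass/fail results for the current prompt and the candidate prompt on the same fixed dataset; `eval_gate` and its result keys are illustrative names, not part of any tool.

```python
def eval_gate(baseline: list[bool], candidate: list[bool]) -> dict:
    """Compare per-example pass/fail for two prompt versions on one fixed dataset."""
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("both runs must cover the same non-empty dataset")
    pairs = list(zip(baseline, candidate))
    # Examples that passed before and fail now are the ones worth reading first.
    regressions = [i for i, (old, new) in enumerate(pairs) if old and not new]
    fixes = [i for i, (old, new) in enumerate(pairs) if new and not old]
    baseline_rate = sum(baseline) / len(baseline)
    candidate_rate = sum(candidate) / len(candidate)
    return {
        "baseline_pass_rate": baseline_rate,
        "candidate_pass_rate": candidate_rate,
        "regressions": regressions,
        "fixes": fixes,
        # Mirrors the flowchart: a worse score goes back to the prompt.
        "ship_behind_flag": candidate_rate >= baseline_rate,
    }


# Hypothetical results on a five-example dataset.
baseline_run = [True, True, False, True, True]
candidate_run = [True, False, False, True, True]
result = eval_gate(baseline_run, candidate_run)
```

A raw comparison like this does not say whether the difference is larger than run-to-run noise. That question belongs to the statistics section.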
Vibe-checking is fine until it isn't
The most common eval at most teams is one person clicking around and saying "feels good." That is fine for the first prototype. The day it stops working is the day two people disagree on what counts as broken. From that day on, every prompt change becomes a debate, and the debate is unresolvable because there is no shared artifact to point at [6].
The fix is not a dashboard. The fix is a tiny labeled dataset and a binary judgment per example. Thirty examples, three failure categories, one column of pass/fail. That is a stage-1 eval. It will catch most of what vibe-checking caught and end the unresolvable debates.
The pages in error analysis cover how to find the right thirty examples and the right three categories. The pages in LLM-as-Judge cover how to scale judgment past what one person can read.
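A stage-1 eval needs nothing more than a small table and a tally. The sketch below assumes each hand-labeled example carries a pass/fail judgment and, for failures, one category; the class name, field names, and categories are hypothetical, loosely modeled on the Slack reports quoted earlier.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class LabeledExample:
    example_id: str
    passed: bool
    failure_category: Optional[str] = None  # set only when passed is False


def summarize(examples: list[LabeledExample]) -> dict:
    """Pass rate plus a count of failures per category."""
    if not examples:
        raise ValueError("need at least one labeled example")
    failures = [e for e in examples if not e.passed]
    return {
        "pass_rate": (len(examples) - len(failures)) / len(examples),
        "failures_by_category": Counter(e.failure_category for e in failures),
    }


# Five illustrative rows; a real stage-1 set would have around thirty.
labeled = [
    LabeledExample("t01", True),
    LabeledExample("t02", False, "ignored_dietary_restriction"),
    LabeledExample("t03", True),
    LabeledExample("t04", False, "markdown_in_sms"),
    LabeledExample("t05", False, "ignored_dietary_restriction"),
]
summary = summarize(labeled)
```

The category counts are what end the debate: instead of arguing about whether the release "feels" worse, the team can point at which failure mode grew.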
What an eval actually looks like
Strip away the tooling. An eval is a function from a trace to a score, run against a fixed dataset.
from dataclasses import dataclass
from typing import Literal


@dataclass
class Trace:
    user_input: str
    tool_calls: list[dict]
    final_response: str


@dataclass
class Rubric:
    name: str
    binary_question: str  # e.g. "Did the bot honor the dietary restriction?"


@dataclass
class Score:
    label: Literal["pass", "fail"]
    critique: str


def judge(trace: Trace, rubric: Rubric, llm) -> Score:
    """One-line judge. Pass/fail plus a one-sentence reason."""
    prompt = f"""You are grading an assistant trace against a single criterion.
CRITERION: {rubric.binary_question}
USER INPUT: {trace.user_input}
ASSISTANT RESPONSE: {trace.final_response}
Reply in two lines:
LABEL: pass or fail
CRITIQUE: one sentence explaining why."""
    raw = llm(prompt, temperature=0)
    # Drop blank lines and pad, so a short or malformed reply cannot raise.
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    label_line, critique_line = (lines + ["", ""])[:2]
    # Anything that does not clearly say "pass" is treated as a fail.
    label = "pass" if "pass" in label_line.lower() else "fail"
    # partition() tolerates a missing "CRITIQUE:" prefix; fall back to the raw line.
    _, _, critique = critique_line.partition(":")
    return Score(label=label, critique=critique.strip() or critique_line)
The example is intentionally crude. Three properties promote it from toy to useful artifact.
First, the criterion is binary. Not "rate helpfulness 1 to 5." Pass or fail, and a one-sentence critique. Binary forces clarity, raises inter-rater agreement, and lowers the sample size you need to detect a real change [1]. The 1-to-5 scale invites parking on 3 and hides disagreement under aggregated means.
Second, the judge is calibrated against a human expert on a held-out set. You do not trust the judge until its true-positive rate and true-negative rate are both above some bar you set, on labels a human gave you. Pages in the LLM-as-Judge section cover the calibration loop. The short version: 100 hand-labeled examples, three iterations, TPR and TNR each above 0.9.
Third, the score correlates with whatever the product actually optimizes for. A judge that says "the response is helpful" while users churn is decoration. Shankar's flywheel framing puts it this way: scores have to tie back to a product KPI you can move [7].
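The calibration check in the second property comes down to two ratios. A minimal sketch, assuming "pass" is treated as the positive class and that human and judge labels are aligned by position; the function name and the sample labels are illustrative.

```python
def judge_rates(human: list[str], judge_labels: list[str]) -> tuple[float, float]:
    """Return (TPR, TNR) of the judge, treating human labels as ground truth."""
    if len(human) != len(judge_labels):
        raise ValueError("label lists must be the same length")
    pairs = list(zip(human, judge_labels))
    human_pass = [j for h, j in pairs if h == "pass"]
    human_fail = [j for h, j in pairs if h == "fail"]
    if not human_pass or not human_fail:
        raise ValueError("need at least one human pass and one human fail")
    # TPR: of the traces a human passed, how many did the judge also pass?
    tpr = sum(j == "pass" for j in human_pass) / len(human_pass)
    # TNR: of the traces a human failed, how many did the judge also fail?
    tnr = sum(j == "fail" for j in human_fail) / len(human_fail)
    return tpr, tnr


# Hypothetical held-out labels; a real calibration set would have about 100.
human_labels = ["pass", "pass", "pass", "fail", "fail"]
judge_output = ["pass", "pass", "fail", "fail", "pass"]
tpr, tnr = judge_rates(human_labels, judge_output)
trusted = tpr >= 0.9 and tnr >= 0.9  # the 0.9 bar mentioned above
```

Reporting the two rates separately matters because a judge that passes everything has a perfect TPR and a TNR of zero, which plain accuracy would hide on a mostly passing dataset.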
The eval gap in your org
Most teams I have seen are at one of three stages of denial. The pattern is consistent enough that you can self-diagnose.
The first stage is "we have evals." Pressed on what that means, the answer is a screenshot of a one-time benchmark run from six months ago against the team's first prompt. There is no dataset, no judge, no cadence. The "eval" is a number nobody can reproduce.
The second stage is "we are about to start evals." The team has bought a vendor platform, integrated the SDK, and pointed it at production traces. No one has read the traces. There is no labeled set. The dashboard has thirty metrics and zero failure modes. The team is rate-limited by reading, not tooling [5].
The third stage is "we have one judge in CI." The judge runs on every PR against a fixed set of fifty examples. It catches the obvious regressions. It misses the subtle ones because the rubric was written before the team understood the failure modes. The fix is not a smarter judge. The fix is recurring error analysis where the rubric itself is on the table [3].
The way out of all three stages is the same: read the traces. Hamel's field guide says it most plainly: "remove all friction from looking at data" [6]. The pages in error analysis teach the discipline.
Who owns evals
The default org chart puts evals on engineering. That is wrong. Engineers know whether the code works. They do not, on average, know whether the answer is right for the user. The teams shipping the best AI products have product managers and domain experts personally reviewing hundreds of traces a week. Engineers build the viewer. Product owns the rubric [6].
QA engineers fit naturally too. The mindset of "systematically try to break this, then write down what broke" is exactly what error analysis needs. The difference from classical QA is that the expected output is a moving target, so the test is a judgment, not an assertion.
TIP
If you cannot name the single domain expert who has read 100 traces this month, your eval program is not under way. It is queued.
The Anthropic team is direct about this in their own writeup on evaluating AI systems: useful evaluations require domain expertise, and the people closest to the use case have to be in the room [8].
Public benchmarks are not your eval
Most teams new to evals reach first for a public benchmark. Public benchmarks are useful for model selection at the macro level. They are not useful for your product.
Three reasons. Public benchmarks are contaminated; anything published can end up in training data, so headline scores overstate capability and partly measure memorization. Public benchmarks measure capability on tasks that look nothing like your inputs; your customer-service bot is not solving GPQA. Public benchmarks aggregate; a 2% MMLU swing tells you nothing about whether your specific failure mode got worse [9].
Worse, public benchmarks are fragile in ways that make naive interpretation dangerous. Sclar et al. showed that swapping plausible prompt formats produces accuracy swings of up to 76 points on the same task [10]. Anthropic flagged that changing option labels from (A) to (1) shifts MMLU accuracy by roughly 5% [8]. So if a public benchmark score moves by 5%, you cannot tell whether the model improved or someone changed the option-label format.
The lesson is not that benchmarks are useless. It is that a public benchmark plus a private regression set with your inputs is the minimum. Public for the headline, private for the decision.
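The option-label sensitivity described above is easy to probe on your own private set. The sketch below renders the same multiple-choice items with two label styles and reports accuracy per style. Everything here is an assumption for illustration: the style names, the `answer_fn` interface (prompt in, chosen option index out), and the deliberately brittle stand-in model.

```python
from typing import Callable

# Two label styles for the same options; the styles are illustrative.
LABEL_STYLES = {
    "letters": ["(A)", "(B)", "(C)", "(D)"],
    "numbers": ["(1)", "(2)", "(3)", "(4)"],
}

Item = tuple[str, list[str], int]  # question, options, index of the correct option


def render(question: str, options: list[str], labels: list[str]) -> str:
    """Lay out one question with the given option labels."""
    lines = [question] + [f"{label} {option}" for label, option in zip(labels, options)]
    return "\n".join(lines)


def accuracy_by_style(items: list[Item], answer_fn: Callable[[str], int]) -> dict[str, float]:
    """Score the same items once per label style."""
    if not items:
        raise ValueError("need at least one item")
    scores = {}
    for style, labels in LABEL_STYLES.items():
        correct = sum(
            answer_fn(render(question, options, labels)) == answer
            for question, options, answer in items
        )
        scores[style] = correct / len(items)
    return scores


def fake_answer_fn(prompt: str) -> int:
    # Stand-in model that is only right when it sees letter labels.
    return 1 if "(A)" in prompt else 0


items = [
    ("Which is a prime number?", ["4", "7", "9", "12"], 1),
    ("Which is a noble gas?", ["Iron", "Neon", "Zinc", "Lead"], 1),
]
scores = accuracy_by_style(items, fake_answer_fn)
spread = max(scores.values()) - min(scores.values())
```

The stand-in exaggerates the effect on purpose. With a real model the spread is the number to watch: if it is comparable to the improvement you are claiming, the improvement is not yet established.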
What the rest of this site covers
Read the maturity model next to locate your team on the ladder. Skip ahead to error analysis if you already know where you are and want to start reading traces. The LLM-as-Judge section teaches the technique that turns hand labels into automated scores. The statistics section covers the math that keeps you honest about whether a score change is real. The production section covers what changes when the judge has to live in the request lifecycle.
If you came for a recipe, the cookbook has runnable starting points. If you came for a framework comparison, the tools section has it. The rest of this site is the toolbox.