Foundations

Why evals matter, the maturity model, and what makes LLM evaluation hard. Also available as a free PDF download.

The six pages here lay out the case for evals, the maturity ladder you climb, the reasons LLM evaluation does not look like classical ML eval, and where to start when nothing is labeled. Read them in order if you are new. Skim if you are already shipping.

Most teams arrive thinking the gap is tooling. The gap is usually conceptual: a half-built mental model of what evals are for, and which rung of the ladder pays off this quarter. The pages below fix that before you touch a framework. When you get to guardrails, the condensed reference is the guardrails versus evals architecture page.

Chapters:

Why evals matter. The case for evals as the moat, the three reasons LLM features rot, and what a useful eval actually looks like.
The eval maturity model. Five rungs from vibe-checking to a continuous quality program, with the smallest useful artifact at each rung.
AI development as scientific method. The observe-hypothesize-experiment-measure loop mapped onto an AI product team.
Why LLM evals are hard. Non-determinism, subjectivity, prompt sensitivity, and why classical ML eval frameworks fail.
Eval vs monitoring vs guardrails. Three concepts that share traces but do different jobs, and where each fires in a request lifecycle.
Evals without a gold dataset. Which tasks deserve gold from day one, which start reference-free, and how calibration labels become the gold set you were missing.

After this section, read error analysis, or pick a role-based track at /start.