Start Here for AI Engineers

Seven steps from why evals matter through building a tool-using agent eval. Code-first, no marketing, citation-grounded. Also available as a free PDF download.

AI engineers ship the production code that runs the eval, the judge, the agent loop, and the trace store. This track assumes Python and TypeScript fluency, basic statistics, and an ML background. It is heavy on code, light on marketing, and every claim is cited. The seven steps move you from why evals matter at all to building a tool-using agent eval that catches consistency failures.

The order is deliberate. The first step argues for the work; skip it and you will be left arguing for headcount you do not yet have. Steps two and three give you a runnable artifact in an afternoon. Steps four and five teach the rubric-design discipline that prevents drift. Steps six and seven cover the two task shapes most of you ship: RAG and agents.

The track

1. Why evals are the moat (12 min). The three reasons LLM features rot (non-determinism, subjectivity, drift) and why ship-and-pray fails for AI. Hamel Husain's argument is the canonical one¹.
2. The 60-80 percent rule (10 min). The NurtureBoss case study: date errors cut from 66% to 5% through error analysis. The prioritization framework (frequency times severity) and a weekly cadence².
3. Your first eval in 30 minutes (30 min). The runnable artifact: an Anthropic SDK example with fifty labeled traces, a binary judge, and a confusion matrix. Skip it if you have shipped a calibrated judge before.
4. Binary vs Likert: pick binary (10 min). Why binary plus critique wins: it forces falsifiable criteria, resists criteria drift, and decomposes multi-dimensional quality into multiple binaries instead of one mushy scalar³.
5. Building an LLM judge (12 min). The four-part judge prompt (task framing, rubric, reference, output format) with a copyable template⁴.
6. RAG: split retrieval and generation (11 min). The two-layer principle, the retrieval metrics catalog, the generation metrics catalog, and the diagnostic table for working out which layer broke⁵.
7. Tool-using agent eval (10 min). Trajectory-level scoring, tool-call AST equivalence, and pass^k for consistency. The τ-bench finding on consistency gaps is the headline⁶.
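To make the pass^k idea from the last step concrete, here is a minimal sketch. It assumes the τ-bench definition: the probability that k independent trials of the same task all succeed, estimated per task as C(c, k) / C(n, k) from c successes in n trials, then averaged over tasks. The function names, task names, and trial outcomes are hypothetical and exist only for illustration; the page for that step may implement it differently.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k independent trials of one task all succeed.

    Uses C(successes, k) / C(trials, k), the estimator assumed in the lead-in.
    """
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    # math.comb returns 0 when k > successes, so the task scores 0 in that case.
    return comb(successes, k) / comb(trials, k)


def mean_pass_hat_k(results: dict[str, list[bool]], k: int) -> float:
    """Average the per-task estimate over every task in the eval set."""
    scores = [
        pass_hat_k(sum(outcomes), len(outcomes), k)
        for outcomes in results.values()
    ]
    return sum(scores) / len(scores)


# Hypothetical outcomes: four trials per task, True means the trial passed.
results = {
    "refund_order": [True, True, True, True],
    "change_flight": [True, True, False, True],
    "cancel_booking": [False, True, False, False],
}

for k in (1, 2, 4):
    print(f"pass^{k} = {mean_pass_hat_k(results, k):.3f}")
```

On this toy data the score falls from about 0.667 at k = 1 to 0.500 at k = 2 and about 0.333 at k = 4. That widening gap is the consistency signal: an agent that usually succeeds on a task is not the same as one that succeeds every time.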

What comes after the track

Once you have read these seven, the natural next pages are multi-turn conversation eval, CI integration patterns, and judge drift detection. The cookbook has the runnable recipes; the techniques reference pages cover the methodology in depth. For per-metric depth on step 6, the retrieval metrics and generation metrics catalogs give each metric its own page.

TIP

Read step 1 even if you have already done eval work. The argument is what you cite when someone on the team asks why this gets headcount.

¹ Hamel Husain, "Your AI Product Needs Evals."
² Hamel Husain, "A Field Guide to Rapidly Improving AI Products." The NurtureBoss case study.
³ Zheng et al., MT-Bench. NeurIPS 2023 Datasets and Benchmarks.
⁴ Hamel Husain, "Creating an LLM-as-a-Judge That Drives Business Results."
⁵ Es et al., "RAGAS." arXiv:2309.15217.
⁶ Yao et al., τ-bench. arXiv:2406.12045.