Principal AI engineers ship the production code that runs the eval, the judge, the agent loop, and the trace store. This track assumes Python and TypeScript fluency, basic statistics, and an ML background. It is heavy on code, light on marketing, and every claim is cited. The seven steps move you from why evals matter at all to building a tool-using agent eval that catches consistency failures.
The order is deliberate. The first step argues for the work; skip it and you have no case to make for headcount you do not yet have. Steps two and three get you to a runnable artifact in an afternoon. Steps four and five cover the rubric-design discipline that prevents drift. Steps six and seven cover the two task shapes most of you ship: RAG and agents.
The track
- Why evals are the moat (12 min). The three reasons LLM features rot (non-determinism, subjectivity, drift) and why ship-and-pray fails for AI. Hamel Husain's argument is the canonical one [1].
- The 60-80 percent rule (10 min). The NurtureBoss case study: date errors cut from 66% to 5% through error analysis. The prioritization framework (frequency times severity) and a weekly cadence [2].
- Your first eval in 30 minutes (30 min). The runnable artifact: an Anthropic SDK example with fifty labeled traces, a binary judge, and a confusion matrix. Skip it if you have shipped a calibrated judge before.
- Binary vs Likert: pick binary (10 min). Why binary plus critique wins on three dimensions: required N, inter-rater agreement, and the forcing function on rubric authors [3].
- Building an LLM judge (12 min). The four-part judge prompt (task framing, rubric, reference, output format) with a copyable template [4].
- RAG: split retrieval and generation (11 min). The two-layer principle, a retrieval metrics catalog, a generation metrics catalog, and a diagnostic table for deciding which layer broke [5].
- Tool-using agent eval (10 min). Trajectory-level scoring, tool-call AST equivalence, and pass^k for consistency. The τ-bench finding on consistency gaps is the headline [6].
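As a preview of the consistency metric named in the last step, here is a minimal sketch of pass^k: the chance that k independent attempts at the same task all succeed, averaged over tasks. It assumes the combinatorial estimator C(c, k) / C(n, k), where a task was run n times and succeeded c times. The function names, task names, and trial outcomes below are hypothetical and exist only for illustration; the step itself is the reference for how the track computes the metric.

```python
from math import comb
from statistics import mean


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k independent attempts at one task all succeed.

    Uses C(successes, k) / C(trials, k): the fraction of size-k subsets of the
    recorded trials in which every trial passed.
    """
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    # math.comb returns 0 when k > successes, so the estimate drops to 0.
    return comb(successes, k) / comb(trials, k)


def pass_hat_k_over_tasks(results: dict[str, list[bool]], k: int) -> float:
    """Average the per-task estimate over all tasks."""
    return mean(pass_hat_k(sum(runs), len(runs), k) for runs in results.values())


# Hypothetical outcomes: four trials per task, True means the trial passed.
results = {
    "refund_order": [True, True, True, True],
    "change_flight": [True, True, False, True],
    "cancel_booking": [True, False, False, False],
}

for k in (1, 2, 4):
    print(f"pass^{k} = {pass_hat_k_over_tasks(results, k):.3f}")
# pass^1 = 0.667
# pass^2 = 0.500
# pass^4 = 0.333
```

The drop from 0.667 at k=1 to 0.333 at k=4 is the kind of gap the metric is meant to expose: an agent that usually succeeds on a task is not the same as one that succeeds every time.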
What comes after the track
Once you have worked through these seven steps, the natural next pages are multi-turn conversation eval, CI integration patterns, and judge drift detection. The cookbook has the runnable recipes; the techniques reference pages cover the methodology in depth.
TIP
Read step 1 even if you have already done eval work. The argument is what you cite when someone on the team asks why this gets headcount.