Start Here for Senior Engineering Managers

Eight steps from the maturity model through the skill bar for eval-literate engineers. Strategy first, built on tables and decision flowcharts, and light on code.

Engineering managers shipping AI products own decisions that engineers do not: headcount, vendor selection, the postmortem template, the release call. This track is strategy-first. It is heavy on case studies (NurtureBoss, Notion AI, Replit), light on code, and assumes technical literacy without requiring you to read implementations line by line. The eight steps move from the conceptual ladder (where is your team?) to the operational artifacts you ship (platform decision, postmortem template, release gate).

The first three steps make the case for the program. Read them before staffing the next quarter. Steps four through seven are the operational practice. Step eight covers the discipline you screen for as the team grows.

The track

1. The eval maturity model (14 min). Five stages from vibe-checking to continuous quality. Self-assessment checklist. The Shankar flywheel framing for the highest stage¹.
2. The error-analysis ritual (16 min). Open coding for AI traces, the three-stage process, choosing the principal domain expert. The single weekly meeting that catches everything².
3. Who owns evals: team shape (12 min). Who actually owns the eval, and the argument for product and domain experts in the loop³. Notion AI's "AI Data Specialist" role is the worked example⁴.
4. Buy vs build: the eval platform decision (10 min). Four platforms vs a JSONL in git. The source-of-truth principle, the three decision axes, and what stays out of the vendor regardless of which you pick.
5. Postmortem template for AI incidents (12 min). The four-section template, and why traditional postmortem templates underperform when the root cause is non-deterministic.
6. Running the quality loop on a cadence (10 min). Observe, hypothesize, experiment, measure, iterate, mapped to a product team. Cadence beats heroics; the flywheel tightens as it runs¹.
7. Release gates: the ship or hold rule (12 min). Four-band gates, override discipline, and the handoff to incident response. The decision rule you sign off on.
8. The skill bar: trace-reading discipline (12 min). The 60-80 percent rule and the weekly cadence. The work sample to screen engineers against: data fluency, statistical literacy, willingness to read 100 traces a week. The Applied LLMs essay is the field manual⁵.

What comes after the track

After reading the eight pages, the natural next moves are to build the case-study deck (NurtureBoss, Replit's 90% cost reduction via decision-time guidance⁶), write the eval-program charter for the next quarter, and pick a vendor to pilot with one team.

TIP

The skill bar matters most. A team of strong engineers who refuse to look at traces will not ship a reliable eval program; a team of mid-level engineers who read 100 traces a week will.

¹ Shreya Shankar, "The AI Engineering Flywheel."
² Hamel Husain, "A Field Guide to Rapidly Improving AI Products."
³ Hamel Husain, "Your AI Product Needs Evals."
⁴ Notion AI, "Speed, Structure, and Smarts."
⁵ Yan, Bischof, Frye, Husain, Liu, Shankar, "What We Learned from a Year of Building with LLMs."
⁶ Replit, "Decision-Time Guidance."