Start Here for Senior Product Managers

Nine steps from the AI quality bar through a foundational reading list. Product-thinking, customer-friendly vocabulary, light on math.

Product managers shipping AI features write PRDs against non-deterministic systems, brief customer-facing teams on what "good" means when the model varies, and own the decision to ship or hold a release. This track is about product thinking: it ships worked examples of real quality bars, frameworks for acceptance criteria and quarterly goals, and the vocabulary you need with both customers and engineers. It is light on math and heavy on conversation patterns.

The first three steps build a shared language with engineering, customers, and the domain expert. Steps four through six cover the planning artifacts (acceptance criteria, eval reports, quality goals). The last three are operational (beta instrumentation, the ship-or-hold call, and the reading list).

The track

The case for an AI quality bar (12 min). Why acceptance criteria for AI features must be expressed as evals, not as feature checklists. The argument to open your next PRD review with, and why it is non-negotiable ¹.
Talking to customers about non-deterministic AI (12 min). The vocabulary your customers can use. What "good" means when the model varies. The honest explanation of why two identical questions can produce two slightly different answers.
Working with your domain expert (14 min). The principal domain expert pattern. How to recruit, brief, and run weekly sessions. Notion AI's "AI Data Specialist" role is a worked example ².
Writing scenario-level acceptance criteria (12 min). Feature times scenario times persona. The dimensional coverage framework that turns "the bot should be helpful" into a checkable list.
Reading an eval report as a PM (14 min). What the columns mean. TPR, TNR, and why agreement percent misleads. Where to push back on the engineering team's headline number.
Quality goals by maturity stage (14 min). The five-stage maturity model as the scaffold for quarterly quality goals: locate your team's stage, pick the next investment, and know what gets cut when the quarter compresses.
Instrumenting the beta: online eval (14 min). Online eval as the beta instrument: sampling policy, the metric ledger, and the feedback events the rollout must capture. The flywheel framing for the rollout ³.
Quality regressions: when to ship vs hold (12 min). The decision framework. The role of error bars in the call. Why "trending down" by two points may be inside the noise band.
The foundational reading list (8 min). Ten papers with one-line takes: what an eval is, what a benchmark is, why rankings flip when nothing changed, and the Anthropic error-bars paper from step 8. Pair it with the three blogs every AI PM should follow: Hamel, Eugene Yan, Applied LLMs ⁴.
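
Steps 5 and 8 each rest on a small piece of arithmetic, and a sketch may make both easier to explain in a review. The Python below is illustrative only: every number is made up, the function names `judge_rates` and `drop_interval` are hypothetical, "positive" is assumed to mean the domain expert passed the response, and the interval uses a simple normal approximation that treats the two eval runs as independent samples. The first part shows a judge that passes everything scoring 90% agreement while catching none of the real failures. The second shows a two-point drop on 200 test cases sitting well inside the noise band.

```python
import math


def judge_rates(tp, fn, tn, fp):
    """Agreement, TPR, and TNR for an automated judge scored against expert labels.

    Assumption: "positive" means the expert passed the response.
    tp: both pass, fn: judge fails what the expert passed,
    tn: both fail, fp: judge passes what the expert failed.
    """
    total = tp + fn + tn + fp
    return {
        "agreement": (tp + tn) / total,
        "tpr": tp / (tp + fn),  # share of expert-passed responses the judge passes
        "tnr": tn / (tn + fp),  # share of expert-failed responses the judge fails
    }


def drop_interval(pass_before, pass_after, n_before, n_after, z=1.96):
    """Approximate 95% interval for the change in pass rate between two runs.

    Assumption: independent samples and a normal approximation.
    """
    diff = pass_after - pass_before
    se = math.sqrt(
        pass_before * (1 - pass_before) / n_before
        + pass_after * (1 - pass_after) / n_after
    )
    return diff - z * se, diff + z * se


# Step 5: 100 responses, 10 of them real failures, and a judge that passes everything.
rates = judge_rates(tp=90, fn=0, tn=0, fp=10)
print(f"agreement={rates['agreement']:.0%} tpr={rates['tpr']:.0%} tnr={rates['tnr']:.0%}")

# Step 8: pass rate moves from 86% to 84% on 200 test cases per run.
low, high = drop_interval(0.86, 0.84, n_before=200, n_after=200)
inside_noise = low <= 0.0 <= high
print(f"change is between {low:+.1%} and {high:+.1%}; inside noise band: {inside_noise}")
```

If both runs score the same test cases, a paired comparison would usually give a tighter interval than this sketch does, so treat the result as a conservative first check and not as the final call.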

What comes after the track

Once you have read the nine pages, the natural next moves are to write the AI section of your next PRD, recruit a domain expert and book the weekly session, and draft the beta program for the next feature. The Hamel field guide is the single best deep-read for a PM on what good looks like at a real team ⁵. The LLM-judge essay is the right read on critique shadowing ⁶.

TIP

The quality-bar argument (step 1) and the domain expert pattern (step 3) are the two highest-impact things a PM brings to an AI feature. Everything else builds on them.

¹ Hamel Husain, "Your AI Product Needs Evals."
² Notion AI, "Speed, Structure, and Smarts."
³ Shreya Shankar, "The AI Engineering Flywheel."
⁴ Yan, Bischof, Frye, Husain, Liu, Shankar, "What We Learned from a Year of Building with LLMs."
⁵ Hamel Husain, "A Field Guide to Rapidly Improving AI Products."
⁶ Hamel Husain, "Creating an LLM-as-a-Judge That Drives Business Results."