Start Here for CTOs and Directors of Technology

Ten steps from the executive case for evals through a board readout. Risk posture, regulatory mapping, frontier-lab benchmarks.

CTOs and directors of technology own the risk story, not the implementation detail. This track is executive-grade. It is dense, but every page leads with a one-line TL;DR, every regulatory citation has an effective date, and every claim is sourced. The ten steps move from the board-level case for the program to the actual board readout template you reuse.

The first five steps build the strategic posture. The last five give you the artifacts your team produces to support it. Read in order on the first pass; come back to specific steps when the board asks a question.

The track

1. Executive overview: why evals are board-level (12 min). The case for the eval program as defensible risk posture, not engineering hygiene. The non-determinism, subjectivity, and drift arguments framed for non-engineers ¹.
2. EU AI Act timeline (16 min). GPAI obligations live as of August 2, 2025; full applicability August 2026. The document you actually file ².
3. NIST AI RMF mapped to eval activities (16 min). The four functions (Govern, Map, Measure, Manage) cross-walked to specific eval evidence. What goes in an audit binder ³.
4. OWASP LLM Top 10: your exposure (14 min). Each item: what it is, how evals detect it, what to file in the risk register ⁴.
5. What frontier-lab evals look like (14 min). Anthropic's Responsible Scaling Policy ⁵ and OpenAI's Preparedness Framework ⁶ are the vocabulary your peers use and the standard your investors and customers will reference.
6. Building an AI risk register (12 min). The template, scoring methodology, and how each row links to a product-risk artifact.
7. Customer-facing trust artifacts (12 min). Trust portal page content, model card framing, what to share publicly and what to gate behind an NDA.
8. Red-team program design (14 min). Internal vs external, cadence, scope. What goes in the contract with an external red-team vendor.
9. Cost and latency in the eval scorecard (10 min). Per-request cost and p95 latency tracked in the same table as quality, with alert thresholds and the Pareto framing for model choice. The numbers your eval program should put in front of you before a model swap.
10. Talking to your board about AI risk (16 min). The board-readout template: one-pager, risk matrix, mitigation-status table. What to leave out.
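For readers who want to see what the scorecard in the cost-and-latency step might look like in practice, here is a minimal Python sketch. It is an illustration under stated assumptions, not the track's template: the names `ScorecardRow`, `pareto_front`, and `threshold_alerts`, the nearest-rank definition of p95, and the three alert thresholds are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ScorecardRow:
    """One candidate model in the scorecard (hypothetical schema)."""
    model: str
    quality: float                 # eval pass rate in [0, 1]; higher is better
    cost_per_request: float        # USD per request; lower is better
    latencies_ms: Sequence[float]  # per-request latencies from the eval run

    @property
    def p95_latency_ms(self) -> float:
        # Nearest-rank p95: the smallest sample with at least 95% of
        # samples at or below it. Integer math avoids float rounding.
        ordered = sorted(self.latencies_ms)
        rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
        return ordered[rank - 1]


def dominates(a: ScorecardRow, b: ScorecardRow) -> bool:
    """True if a is no worse than b on every axis and better on at least one."""
    no_worse = (
        a.quality >= b.quality
        and a.cost_per_request <= b.cost_per_request
        and a.p95_latency_ms <= b.p95_latency_ms
    )
    better = (
        a.quality > b.quality
        or a.cost_per_request < b.cost_per_request
        or a.p95_latency_ms < b.p95_latency_ms
    )
    return no_worse and better


def pareto_front(rows: List[ScorecardRow]) -> List[ScorecardRow]:
    """Keep only the rows that no other row dominates."""
    return [r for r in rows if not any(dominates(o, r) for o in rows)]


def threshold_alerts(
    row: ScorecardRow, min_quality: float, max_cost: float, max_p95_ms: float
) -> List[str]:
    """Return the names of the metrics that breach their (assumed) thresholds."""
    breached = []
    if row.quality < min_quality:
        breached.append("quality")
    if row.cost_per_request > max_cost:
        breached.append("cost_per_request")
    if row.p95_latency_ms > max_p95_ms:
        breached.append("p95_latency_ms")
    return breached
```

The Pareto framing is what makes the table useful before a model swap: any model that is not on the front is worse than some alternative on every axis you track, so the decision reduces to a trade-off among the models that remain.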

What comes after the track

Once you have read the ten pages, the natural next moves are to write the AI risk register for the current quarter, pick the trust-artifact rollout date, and commission an external red team for the next major release. Anthropic's writeup on the challenges of evaluating AI systems is a useful supplement on what frontier-lab honesty looks like ⁷.
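As a starting point for the quarterly risk register, the sketch below shows one possible row structure in Python. It is an assumption-heavy illustration rather than the template from the track: the field names, the 1-to-5 likelihood and impact scales, the multiplicative score, and the band cut-offs are all hypothetical and should be replaced by whatever scoring methodology your organization adopts. The one element taken from the track is that every row carries a link to a supporting artifact.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class RiskRow:
    """One row of an AI risk register (hypothetical schema)."""
    risk_id: str
    description: str
    likelihood: int   # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int       # assumed scale: 1 (minor) to 5 (severe)
    owner: str
    artifact: str     # link or path to the eval evidence behind this row
    review_by: date

    def __post_init__(self) -> None:
        # Reject scores outside the assumed 1..5 scale.
        for name in ("likelihood", "impact"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")

    @property
    def score(self) -> int:
        # Assumed methodology: likelihood times impact on a 5x5 matrix.
        return self.likelihood * self.impact

    @property
    def band(self) -> str:
        # Assumed cut-offs; tune these to your own risk appetite.
        if self.score >= 15:
            return "high"
        if self.score >= 8:
            return "medium"
        return "low"


def board_view(rows: List[RiskRow]) -> List[RiskRow]:
    """Order rows by descending score so the highest risks lead the readout."""
    return sorted(rows, key=lambda r: -r.score)
```

Keeping the artifact link as a required field means a row cannot be added without pointing at the evidence that supports it, which is the property an auditor or board member is most likely to probe.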

CAUTION

The EU AI Act timelines are real. If your product touches EU users, the GPAI obligations are live and the full-applicability date is firm. Steps 2 and 3 are not optional; they are the audit-binder material your legal team will request.

1. Hamel Husain, "Your AI Product Needs Evals."
2. European Commission, EU AI Act official portal.
3. NIST, AI Risk Management Framework.
4. OWASP, Top 10 for LLM Applications.
5. Anthropic, Responsible Scaling Policy.
6. OpenAI, Preparedness Framework.
7. Anthropic, "Challenges in evaluating AI systems."