A risk register is the operational artifact that translates a high-level AI policy into a list of specific risks an engineering team works against. The format is borrowed from operational risk management in financial services and adapted for AI-specific failure modes. Done well, it is the single document that connects the eval program to the governance program: each row points to an eval that probes for the risk, and each eval gates a release. Done poorly, it is a spreadsheet that nobody reads after the audit.
The argument: keep the register small enough to read in fifteen minutes, which means twenty to forty rows for a typical mid-sized product. Each row points to a specific eval. The register is reviewed quarterly with the same people who own the eval results.
Columns that matter
| Column | What it holds |
|---|---|
| ID | Stable identifier; rows do not get renumbered when rows are added or removed |
| Risk | One-sentence description of the failure mode |
| Source | NIST AI RMF function, OWASP LLM category, or product-specific |
| Inherent severity | 1 to 5 score before mitigation |
| Likelihood | 1 to 5 score of occurrence frequency in production |
| Inherent risk | Severity times likelihood |
| Mitigation | Specific controls (eval suite row, guardrail, training data filter, system prompt clause) |
| Residual severity | 1 to 5 after mitigation |
| Residual likelihood | 1 to 5 after mitigation |
| Residual risk | Residual severity times residual likelihood |
| Eval task | Specific row in the standing eval suite that probes this risk |
| Owner | Single named person |
| Last reviewed | Date of last quarterly review |
The two columns most teams skip and most need: Eval task and Owner. Without the eval-task link, the register drifts from what the engineering team actually tests. Without a named owner per row, ownership defaults to nobody.
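One way to keep the column list honest is to encode it as a typed record and lint for the two columns that tend to get skipped. The sketch below is illustrative only: the `RegisterRow` class, its field names, and the `missing_links` helper are assumptions made for this example, not part of any standard tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class RegisterRow:
    """One register row; fields mirror the column list above (names are hypothetical)."""
    row_id: str                 # stable identifier, never renumbered
    risk: str                   # one-sentence failure mode
    source: str                 # framework category or "product-specific"
    inherent_severity: int      # 1 to 5, before mitigation
    likelihood: int             # 1 to 5, occurrence frequency in production
    mitigation: str
    residual_severity: int      # 1 to 5, after mitigation
    residual_likelihood: int    # 1 to 5, after mitigation
    eval_task: Optional[str]    # None or "" means no eval probes this risk yet
    owner: Optional[str]        # None or "" means nobody is accountable yet
    last_reviewed: Optional[date] = None

    @property
    def inherent_risk(self) -> int:
        # Derived column: severity times likelihood.
        return self.inherent_severity * self.likelihood

    @property
    def residual_risk(self) -> int:
        # Derived column: residual severity times residual likelihood.
        return self.residual_severity * self.residual_likelihood


def missing_links(rows: List[RegisterRow]) -> List[str]:
    """Return one message per row that lacks an eval task or a named owner."""
    problems = []
    for row in rows:
        if not row.eval_task:
            problems.append(f"{row.row_id}: no eval task")
        if not row.owner:
            problems.append(f"{row.row_id}: no owner")
    return problems
```

Computing the two risk columns as properties rather than storing them means they cannot drift from the severity and likelihood scores they are derived from.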
The scoring scale
A consistent five-point scale across severity and likelihood is enough. Adding more granularity (ten points, twenty points) does not produce more actionable answers; it produces more arguments. The scale below has worked in practice.
| Score | Severity | Likelihood |
|---|---|---|
| 1 | Minor user inconvenience; no compliance issue | Rare; less than once per quarter |
| 2 | Notable user impact; no compliance issue | Occasional; once per quarter |
| 3 | Significant user impact or minor compliance issue | Monthly |
| 4 | Major user harm, regulatory exposure, or material breach | Weekly |
| 5 | Catastrophic; existential business or user impact | Daily or more |
The product (severity times likelihood) gives a 1 to 25 score per row. Rows over 15 are red; 6 to 15 are amber; under 6 are green. The thresholds are conventions; pick them once and stick with them.
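The scoring rule is small enough to write down as code. This is a minimal sketch using the thresholds stated above; the function names `risk_score` and `risk_band` are made up for illustration.

```python
def risk_score(severity: int, likelihood: int) -> int:
    """Multiply two 1-to-5 scores into a 1-to-25 risk score."""
    for name, value in (("severity", severity), ("likelihood", likelihood)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    return severity * likelihood


def risk_band(score: int) -> str:
    """Map a score to a colour band: over 15 red, 6 to 15 amber, under 6 green."""
    if score > 15:
        return "red"
    if score >= 6:
        return "amber"
    return "green"
```

For example, `risk_band(risk_score(4, 3))` is `"amber"` and `risk_band(risk_score(4, 1))` is `"green"`. If a team picks different thresholds, the two comparisons in `risk_band` are the only lines that change.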
A worked row
| ID | R-007 |
|---|---|
| Risk | Customer service bot discloses another customer's PII through prompt injection in shared context |
| Source | OWASP LLM01 (Prompt Injection) plus LLM02 (Sensitive Information Disclosure) |
| Inherent severity | 4 |
| Likelihood | 3 |
| Inherent risk | 12 (amber) |
| Mitigation | Input filter for injection patterns, output filter scanning for PII canaries, per-conversation session isolation, audit logging |
| Residual severity | 4 |
| Residual likelihood | 1 |
| Residual risk | 4 (green) |
| Eval task | safety/pii-canary-injection.json (run weekly) |
| Owner | @principal-ai-eng |
| Last reviewed | 2026-04-01 |
The row above is detailed enough to be actionable and short enough to be reviewable. Multiply by twenty to forty risks; that is the register.
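In a spreadsheet, the two derived columns are typed by hand and can fall out of step with the scores they come from. A rough sketch of a consistency check, using the numbers from R-007; the dictionary keys and the `derived_column_errors` helper are assumptions for this example.

```python
from typing import Dict, List

# Numeric columns of the worked row R-007, copied from the table above.
worked_row = {
    "id": "R-007",
    "inherent_severity": 4,
    "likelihood": 3,
    "inherent_risk": 12,
    "residual_severity": 4,
    "residual_likelihood": 1,
    "residual_risk": 4,
}


def derived_column_errors(row: Dict) -> List[str]:
    """Recompute both risk columns and report any that disagree with the stated value."""
    checks = [
        ("inherent_risk", "inherent_severity", "likelihood"),
        ("residual_risk", "residual_severity", "residual_likelihood"),
    ]
    errors = []
    for derived, severity_key, likelihood_key in checks:
        expected = row[severity_key] * row[likelihood_key]
        if row[derived] != expected:
            errors.append(
                f"{row['id']}: {derived} is {row[derived]}, expected {expected}"
            )
    return errors
```

Calling `derived_column_errors(worked_row)` returns an empty list. Note that in R-007 the mitigations lower likelihood from 3 to 1 while severity stays at 4, and that alone moves the row from amber to green.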
What goes in, what stays out
The register is for risks where the cost of a single occurrence is non-trivial. Slow-rolling quality issues live in the standing eval suite, not in the register. The register's job is to give a single accountable owner per risk and to point at the specific eval that probes for it. Standard engineering bugs do not need a register row.
Common categories of risks that should be in the register:
- All ten OWASP LLM categories, at minimum [1]
- Demographic-bias risks where regulatory exposure is real
- Cost-overrun risks (token consumption, runaway loops)
- IP-disclosure risks
- Regulatory-specific risks per jurisdiction your product is sold in
- Reputational risks where a single news story would be material
The Anthropic discussion of evaluation challenges has useful framing on which risks are well-served by current eval techniques and which are not [2]. Risks where current eval is weak (subtle long-tail behaviors, low-frequency catastrophic failures) need a register row even if the eval is heuristic, because the existence of the row is what triggers the conversation about better eval next quarter.
Review cadence
Quarterly review with the eval, security, product, and engineering leads. Sixty minutes. The agenda has three parts:
- Status walk-through of red and amber rows (fifteen minutes). What changed; what is blocking the residual risk score from coming down.
- New rows (ten minutes). Anything from production incidents, red-team findings, regulatory updates, or new product surfaces.
- Eval-task updates (twenty minutes). Which rows need a better eval; which rows have eval drift; which eval results moved this quarter.

The remaining fifteen minutes go to owner reassignments and out-of-band items.
The quarterly review is the meeting that anchors the program. Skipping a quarter is fine once; skipping two quarters means the register is not load-bearing and you have governance theatre.
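The status walk-through list can be generated from the register instead of assembled by hand. The sketch below assumes the walk-through is driven by residual risk, that scores of 6 and above count as amber or red, and that the register is available as a mapping from row ID to residual score; `status_walkthrough` is a hypothetical helper.

```python
from typing import Dict, List


def status_walkthrough(residual_scores: Dict[str, int]) -> List[str]:
    """Return IDs of red and amber rows, highest residual risk first.

    Ties are broken by row ID so the agenda order is stable between runs.
    """
    flagged = [
        (score, row_id)
        for row_id, score in residual_scores.items()
        if score >= 6  # amber starts at 6; red is anything over 15
    ]
    flagged.sort(key=lambda pair: (-pair[0], pair[1]))
    return [row_id for _, row_id in flagged]
```

With scores `{"R-001": 4, "R-002": 16, "R-003": 9, "R-004": 9}`, the function returns `["R-002", "R-003", "R-004"]`: the red row leads, the two amber rows follow in ID order, and the green row is left off the agenda.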
Mapping to the eval suite
Every row should point to a specific eval. The eval can be:
- A row in the standing offline suite (run on every release)
- A safety-bench probe (HarmBench, AILuminate slice)
- A red-team standing corpus check
- A production monitoring threshold
- A pre-deployment manual review for risks too rare or contextual to test automatically
The manual-review case is legitimate; it is also the case most likely to decay. If a row has a manual-review eval task for two consecutive quarters without an attempt to automate, something is wrong: either the row is not as material as the score implies, or the eval needs more investment.
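The two-quarter rule can be checked mechanically if each review records what kind of eval task a row had. A minimal sketch, assuming a per-row history of eval-task kinds, oldest first; the `"manual-review"` label and the `stale_manual_rows` helper are invented for this example. The code cannot tell whether anyone attempted automation, so it flags candidates for discussion and leaves the verdict to the review.

```python
from typing import Dict, List

MANUAL = "manual-review"  # hypothetical label for a manual pre-deployment review


def stale_manual_rows(history: Dict[str, List[str]], quarters: int = 2) -> List[str]:
    """Return row IDs whose last `quarters` reviews all recorded a manual eval task.

    `history` maps a row ID to its eval-task kind at each quarterly review,
    oldest first. Rows with fewer than `quarters` reviews are never flagged.
    """
    stale = []
    for row_id, kinds in history.items():
        recent = kinds[-quarters:]
        if len(recent) == quarters and all(kind == MANUAL for kind in recent):
            stale.append(row_id)
    return sorted(stale)
```

A row that moved from manual review to an automated check last quarter is not flagged, and neither is a row that has only been reviewed once.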
The register as audit evidence
The risk register, when kept current, is the single best document to hand to an auditor, a customer's procurement team, or a board sub-committee asking about AI risk. The NIST AI RMF Manage function specifically calls out a risk-tracking artifact as evidence of the management function being in place [3]. The register is that artifact.
What to do this month
- Create the spreadsheet (or table in your governance tool). Use the column list above verbatim until you have a reason to change it.
- Seed with the OWASP Top 10 categories (ten rows). Add three to five product-specific risks.
- For each row, identify the eval task that probes it. If none exists, mark "TODO: build" and assign an owner.
- Schedule the first quarterly review for ninety days out. Put the four named leads on the invite.
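The first two steps can be scripted. The sketch below generates a starter CSV with the column list above and ten placeholder rows, one per OWASP LLM category ID. The `R-001` numbering scheme, the placeholder strings, and the `seed_register` function are assumptions for illustration; risk descriptions are deliberately left as TODOs to be filled in from the OWASP list itself.

```python
import csv
import io

COLUMNS = [
    "ID", "Risk", "Source", "Inherent severity", "Likelihood", "Inherent risk",
    "Mitigation", "Residual severity", "Residual likelihood", "Residual risk",
    "Eval task", "Owner", "Last reviewed",
]


def seed_register() -> str:
    """Return CSV text: a header plus ten placeholder rows, LLM01 through LLM10."""
    buffer = io.StringIO()
    # restval="" leaves unscored columns blank rather than inventing numbers.
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, restval="")
    writer.writeheader()
    for n in range(1, 11):
        writer.writerow({
            "ID": f"R-{n:03d}",
            "Risk": "TODO: describe the failure mode",
            "Source": f"OWASP LLM{n:02d}",
            "Eval task": "TODO: build",
            "Owner": "TODO: assign",
        })
    return buffer.getvalue()
```

Writing the result of `seed_register()` to a file gives a spreadsheet that opens in any tool; the scores are left blank on purpose so that they get filled in by the people who will own the rows.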
The register exists in working form within a month. The quarterly cadence sustains it. The next chapters cover the customer-facing artifacts (trust artifacts) and the executive surface (board readouts).