Building an AI risk register

A practical template: columns, scoring scale, review cadence, and the mapping from register row to eval task.

A risk register is the operational artifact that translates a high-level AI policy into a list of specific risks an engineering team works against. The format is borrowed from operational risk management in financial services and adapted for AI-specific failure modes. Done well, it is the single document that connects the eval program to the governance program: each row points to an eval that probes for the risk, and each eval gates a release. Done poorly, it is a spreadsheet that nobody reads after the audit.

The argument: keep the register small enough to read in fifteen minutes. Twenty to forty rows for a typical mid-sized product. Each row owns a specific eval. The register is reviewed quarterly with the same people who own the eval results.

Columns that matter

Column	What it holds
ID	Stable identifier; rows do not get renumbered when rows are added or removed
Risk	One-sentence description of the failure mode
Source	NIST AI RMF function, OWASP LLM category, or product-specific
Inherent severity	1 to 5 score before mitigation
Likelihood	1 to 5 score of occurrence frequency in production, before mitigation
Inherent risk	Severity times likelihood
Mitigation	Specific controls (eval suite row, guardrail, training data filter, system prompt clause)
Residual severity	1 to 5 after mitigation
Residual likelihood	1 to 5 after mitigation
Residual risk	Residual severity times residual likelihood
Eval task	Specific eval that probes this risk, usually a row in the standing eval suite
Owner	Single named person
Last reviewed	Date of last quarterly review

The two columns most teams skip and most need: Eval task and Owner. Without the eval-task link, the register drifts from what the engineering team actually tests. Without a named owner per row, ownership defaults to nobody.
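
As a rough sketch of how the column list could be held in code, the dataclass below mirrors the table and derives the two product columns instead of storing them. The names `RegisterRow` and `missing_links` are hypothetical, chosen for illustration; nothing here is a prescribed schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class RegisterRow:
    """One register row; fields mirror the column list (hypothetical schema)."""
    id: str
    risk: str
    source: str
    inherent_severity: int      # 1 to 5, before mitigation
    likelihood: int             # 1 to 5, before mitigation
    mitigation: str
    residual_severity: int      # 1 to 5, after mitigation
    residual_likelihood: int    # 1 to 5, after mitigation
    eval_task: Optional[str] = None
    owner: Optional[str] = None
    last_reviewed: Optional[date] = None

    @property
    def inherent_risk(self) -> int:
        # Severity times likelihood, before mitigation.
        return self.inherent_severity * self.likelihood

    @property
    def residual_risk(self) -> int:
        # Residual severity times residual likelihood.
        return self.residual_severity * self.residual_likelihood


def missing_links(rows: list[RegisterRow]) -> dict[str, list[str]]:
    """Map row ID to whichever of eval_task / owner is empty or blank."""
    gaps: dict[str, list[str]] = {}
    for row in rows:
        missing = [name for name in ("eval_task", "owner")
                   if not (getattr(row, name) or "").strip()]
        if missing:
            gaps[row.id] = missing
    return gaps
```

Running `missing_links` over the register before each review gives a short list of rows that have drifted away from an eval or an owner.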

The scoring scale

A consistent five-point scale across severity and likelihood is enough. Adding more granularity (ten points, twenty points) does not produce more actionable answers; it produces more arguments. The scale below has worked in practice.

Score	Severity	Likelihood
1	Minor user inconvenience; no compliance issue	Rare; less than once per quarter
2	Notable user impact; no compliance issue	Occasional; once per quarter
3	Significant user impact or minor compliance issue	Monthly
4	Major user harm, regulatory exposure, or material breach	Weekly
5	Catastrophic; existential business or user impact	Daily or more

The product (severity times likelihood) gives a 1 to 25 score per row. Rows over 15 are red; 6 to 15 are amber; under 6 are green. The thresholds are conventions; pick them once and stick with them.
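
The scoring rule is small enough to write down directly. The sketch below assumes the thresholds stated above (over 15 red, 6 to 15 amber, under 6 green); the function names `risk_score` and `band` are illustrative, not part of any standard.

```python
def risk_score(severity: int, likelihood: int) -> int:
    """Multiply two 1-to-5 scores into a 1-to-25 risk score."""
    for name, value in (("severity", severity), ("likelihood", likelihood)):
        if value not in range(1, 6):
            raise ValueError(f"{name} must be between 1 and 5, got {value!r}")
    return severity * likelihood


def band(score: int) -> str:
    """Map a 1-to-25 score to a colour band using the thresholds in the text."""
    if not 1 <= score <= 25:
        raise ValueError(f"score must be between 1 and 25, got {score!r}")
    if score > 15:
        return "red"
    if score >= 6:
        return "amber"
    return "green"
```

Keeping the thresholds in one function makes "pick them once and stick with them" a property of the tooling and not of each reviewer's memory.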

A worked row

ID	R-007
Risk	Customer service bot discloses another customer's PII through prompt injection in shared context
Source	OWASP LLM01 (Prompt Injection) plus LLM02 (Sensitive Information Disclosure)
Inherent severity	4
Likelihood	3
Inherent risk	12 (amber)
Mitigation	Input filter for injection patterns, output filter scanning for PII canaries, per-conversation session isolation, audit logging
Residual severity	4
Residual likelihood	1
Residual risk	4 (green)
Eval task	safety/pii-canary-injection.json (run weekly)
Owner	@principal-ai-eng
Last reviewed	2026-04-01

The row above is detailed enough to be actionable and short enough to be reviewable. Multiply by twenty to forty risks; that is the register.
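
Hand-entered cells such as "12 (amber)" can drift from the scores beside them. A minimal lint, assuming cells are written as a number followed by a colour in parentheses as in the worked row, might look like this; `check_recorded_risk` is a hypothetical helper name.

```python
import re


def check_recorded_risk(severity: int, likelihood: int, recorded: str) -> list[str]:
    """Return problems found in a recorded cell such as '12 (amber)'."""
    match = re.fullmatch(r"\s*(\d+)\s*\((red|amber|green)\)\s*", recorded)
    if match is None:
        return [f"unparseable cell: {recorded!r}"]
    score, colour = int(match.group(1)), match.group(2)
    expected = severity * likelihood
    # Same thresholds as the scoring section: >15 red, 6-15 amber, <6 green.
    expected_colour = "red" if expected > 15 else "amber" if expected >= 6 else "green"
    problems = []
    if score != expected:
        problems.append(f"score {score} should be {expected}")
    if colour != expected_colour:
        problems.append(f"colour {colour} should be {expected_colour}")
    return problems
```

For R-007, both `check_recorded_risk(4, 3, "12 (amber)")` and `check_recorded_risk(4, 1, "4 (green)")` return an empty list, which confirms the worked row is internally consistent.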

What goes in, what stays out

The register is for risks where the cost of a single occurrence is non-trivial. Slow-rolling quality issues live in the standing eval suite, not in the register. The register's job is to give a single accountable owner per risk and to point at the specific eval that probes for it. Standard engineering bugs do not need a register row.

Common categories of risks that should be in the register:

All ten OWASP LLM categories at minimum ¹
Demographic-bias risks where regulatory exposure is real
Cost-overrun risks (token consumption, runaway loops)
IP-disclosure risks
Regulatory-specific risks per jurisdiction your product is sold in
Reputational risks where a single news story would be material

The Anthropic discussion of evaluation challenges has useful framing on which risks are well-served by current eval techniques and which are not ². Risks where current eval is weak (subtle long-tail behaviors, low-frequency catastrophic failures) need a register row even if the eval is heuristic, because the existence of the row is what triggers the conversation about better eval next quarter.

Review cadence

Quarterly review with the eval, security, product, and engineering leads. Sixty minutes. The agenda has three parts:

Status walk-through of red and amber rows (fifteen minutes). What changed; what is blocking the residual risk from coming down.
New rows (ten minutes). Anything from production incidents, red-team findings, regulatory updates, or new product surfaces.
Eval-task updates (twenty minutes). Which rows need a better eval; which rows have eval drift; which eval results moved this quarter.

The remaining fifteen minutes go to owner reassignments and out-of-band items.

The quarterly review is the meeting that anchors the program. Skipping a quarter is fine once; skipping two quarters means the register is not load-bearing and you have governance theatre.
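
The "once is fine, twice is theatre" rule can be checked mechanically from the date of the last review. This sketch assumes a quarter is 91 days with no grace period, which is a simplification; `skipped_reviews` and `cadence_status` are hypothetical names.

```python
from datetime import date

QUARTER_DAYS = 91  # assumption: a quarter approximated as 91 days, no grace period


def skipped_reviews(last_review: date, today: date) -> int:
    """Count quarterly reviews that came due and passed since last_review."""
    days_since = (today - last_review).days
    if days_since < 0:
        raise ValueError("last_review is in the future")
    # A review held exactly 91 days after the previous one counts as on time.
    return max(0, (days_since - 1) // QUARTER_DAYS)


def cadence_status(last_review: date, today: date) -> str:
    """Translate the skipped-review count into the three cases in the text."""
    skipped = skipped_reviews(last_review, today)
    if skipped == 0:
        return "on cadence"
    if skipped == 1:
        return "one quarter skipped"
    return "two or more quarters skipped"
```

A team with a fiscal calendar would likely replace the 91-day approximation with its actual quarter boundaries.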

Mapping to the eval suite

Every row should point to a specific eval. The eval can be:

A row in the standing offline suite (run on every release)
A safety-bench probe (HarmBench, AILuminate slice)
A red-team standing corpus check
A production monitoring threshold
A pre-deployment manual review for risks too rare or contextual to test automatically

The manual-review case is legitimate; it is also the case most likely to decay. If a row has a manual-review eval task for two consecutive quarters without an attempt to automate, something is wrong: either the row is not as material as the score implies, or the eval needs more investment.
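
One way to catch that decay is to record, at each quarterly review, what kind of eval each row points to, and flag rows that have been manual for the last two entries. The history format and the label `"manual"` below are assumptions for illustration; the sketch also does not model whether an automation attempt was made, which still needs a human answer.

```python
def stale_manual_rows(history: dict[str, list[str]], limit: int = 2) -> list[str]:
    """Flag rows whose eval task was 'manual' for the last `limit` reviews.

    `history` maps row ID to the eval-task kind recorded at each quarterly
    review, oldest first (hypothetical format).
    """
    flagged = []
    for row_id, kinds in history.items():
        recent = kinds[-limit:]
        if len(recent) == limit and all(kind == "manual" for kind in recent):
            flagged.append(row_id)
    return sorted(flagged)
```

Flagged rows are candidates for the eval-task part of the review agenda, where the team decides between lowering the score and investing in automation.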

The register as audit evidence

The risk register, when kept current, is the single best document to hand to an auditor, a customer's procurement team, or a board sub-committee asking about AI risk. The Manage function of the NIST AI RMF expects risks to be prioritized, responded to, and documented ³, and a current register is direct evidence that this is happening.

What to do this month

Create the spreadsheet (or table in your governance tool). Use the column list above verbatim until you have a reason to change it.
Seed with the OWASP Top 10 categories (ten rows). Add three to five product-specific risks.
For each row, identify the eval task that probes it. If none exists, mark "TODO: build" and assign an owner.
Schedule the first quarterly review for ninety days out. Put the four named leads on the invite.
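
The first two steps can be bootstrapped with a short script that writes a CSV carrying the column list verbatim and one placeholder row per OWASP category. The category names below are taken from the 2025 edition of the OWASP list as an assumption; check them against the OWASP page before relying on them. The `R-001` style IDs and the function names are likewise illustrative.

```python
import csv
import io

COLUMNS = [
    "ID", "Risk", "Source", "Inherent severity", "Likelihood", "Inherent risk",
    "Mitigation", "Residual severity", "Residual likelihood", "Residual risk",
    "Eval task", "Owner", "Last reviewed",
]

# Assumption: names as published in the 2025 edition of the OWASP list.
OWASP_LLM_CATEGORIES = [
    "Prompt Injection",
    "Sensitive Information Disclosure",
    "Supply Chain",
    "Data and Model Poisoning",
    "Improper Output Handling",
    "Excessive Agency",
    "System Prompt Leakage",
    "Vector and Embedding Weaknesses",
    "Misinformation",
    "Unbounded Consumption",
]


def seed_rows() -> list[dict[str, str]]:
    """Build one mostly empty row per OWASP category, marked for eval work."""
    rows = []
    for number, name in enumerate(OWASP_LLM_CATEGORIES, start=1):
        row = dict.fromkeys(COLUMNS, "")
        row["ID"] = f"R-{number:03d}"
        row["Source"] = f"OWASP LLM{number:02d} ({name})"
        row["Eval task"] = "TODO: build"
        rows.append(row)
    return rows


def to_csv(rows: list[dict[str, str]]) -> str:
    """Serialize rows to CSV text with the register columns as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

The output of `to_csv(seed_rows())` can be pasted into a spreadsheet; the Risk, scores, and Owner cells are left blank on purpose so that each gets filled in by a person.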

Following these steps, the register exists in working form within a month, and the quarterly cadence sustains it. The next chapters cover the customer-facing artifacts (trust artifacts) and the executive surface (board readouts).

¹ OWASP, "Top 10 for LLM Applications." https://owasp.org/www-project-top-10-for-large-language-model-applications/
² Anthropic, "Challenges in Evaluating AI Systems." https://www.anthropic.com/news/evaluating-ai-systems
³ NIST, "AI Risk Management Framework." https://www.nist.gov/itl/ai-risk-management-framework