Incident response for AI products

An AI postmortem looks different from a service postmortem. The root cause is rarely a line of code. The actions are usually eval-set changes.

A service-outage postmortem identifies a code change, a config error, or an infrastructure failure, then commits to a fix. An AI quality incident rarely looks like that. The root cause is usually a distribution shift, a prompt regression, a tool-API behaviour change, or a calibration drift in a judge nobody updated. The action items are usually eval-set additions, not code fixes. The template needs to acknowledge this or the team writes service-postmortem documents that fail to prevent the next AI regression.

The opinion: AI incidents need their own template, separate from the service-incident one, with four sections that map to where AI regressions actually originate: detection (how did we learn), scope (which users and slices), root cause (which axis: model, prompt, judge, data, environment), and remediation (which evals get added, which gates get tightened) ¹.

What counts as an AI incident

Three categories of incident all need a postmortem.

The first is a user-visible quality regression: complaints, escalations, support volume, a public reaction. The trigger is downstream of the dashboards; the work is to align the dashboards so they catch the next one upstream.

The second is a release that shipped past the release gate (an override, a missed slice, an underspecified suite) and produced a measurable regression after the fact. The trigger is internal; the work is to tighten the gate and the eval set.

The third is a silent regression detected by the drift dashboards before users surfaced it. The trigger is the dashboard; the work is to verify the dashboard was right (sometimes it was the judge, not the model) and to add the failing pattern to the regression suite.

The four-section template

Use this template for every AI incident, even the small ones. Small incidents fill the sections quickly; the discipline pays off when a large incident overwhelms the team's improvisation.

Detection. When did the regression start? How did we learn? What was the time-to-detect, measured from the start of the regression to the first credible alarm? If the dashboard caught it, which signal fired? If the user caught it, which signal should have? If the alarm fired but was ignored or paged the wrong rotation, write that down.
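
The time-to-detect figure asked for above is simple arithmetic on two timestamps, but writing it as code makes the definition explicit. A minimal sketch, assuming both timestamps are recorded in UTC; the function name and the example timeline are hypothetical:

```python
from datetime import datetime, timedelta, timezone


def time_to_detect(regression_start: datetime, first_credible_alarm: datetime) -> timedelta:
    """Elapsed time from the start of the regression to the first credible alarm."""
    if first_credible_alarm < regression_start:
        # An alarm before the regression usually means the timeline is wrong.
        raise ValueError("alarm precedes the regression start; re-check the timeline")
    return first_credible_alarm - regression_start


# Hypothetical timeline, both values in UTC.
start = datetime(2025, 1, 6, 9, 0, tzinfo=timezone.utc)
alarm = datetime(2025, 1, 6, 15, 30, tzinfo=timezone.utc)
ttd = time_to_detect(start, alarm)
print(ttd)  # 6:30:00
```

The hard part is not the subtraction but agreeing on `regression_start`, which should come from traces rather than from when someone first noticed.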

Scope. Which users and which slices were affected? What was the magnitude on each slice (a quality drop, a refusal swing, a tool-call failure rate)? Use the same per-slice scorecard the cost-latency page lays out. Scope is what determines severity; do not assume "global" or "all users" without checking ².
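
The per-slice magnitude check can be sketched as a small table computation. This is an illustration only: the `SliceRow` type, the slice names, the metric values, and the 0.02 threshold are all hypothetical, and the real scorecard may carry more columns.

```python
from dataclasses import dataclass


@dataclass
class SliceRow:
    slice_name: str
    baseline: float       # metric in the reference window
    incident: float       # same metric during the incident
    users_affected: int

    @property
    def delta(self) -> float:
        return self.incident - self.baseline


def affected_slices(rows, min_abs_delta=0.02):
    """Return slices whose metric moved by at least min_abs_delta, worst drop first."""
    moved = [r for r in rows if abs(r.delta) >= min_abs_delta]
    return sorted(moved, key=lambda r: r.delta)


# Hypothetical scorecard rows.
rows = [
    SliceRow("en-short", baseline=0.91, incident=0.90, users_affected=1200),
    SliceRow("es-long", baseline=0.88, incident=0.71, users_affected=340),
    SliceRow("tool-calls", baseline=0.95, incident=0.83, users_affected=90),
]
affected = affected_slices(rows)
for r in affected:
    print(f"{r.slice_name}: {r.delta:+.2f} ({r.users_affected} users)")
```

In this made-up data the largest slice barely moved, which is the case the "do not assume global" warning is about.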

Root cause. Which axis caused the regression? The five common axes:

Model. The model under the hood changed (provider retrain, version bump, fine-tune).
Prompt. The prompt or system message changed, on purpose or accidentally.
Judge. The judge calibration drifted; the score moved without the system moving.
Data. The retrieval index changed, the dataset version drifted, a feature flag enabled a new path.
Environment. A tool API changed; a third-party service started erroring.

A single root cause is rare in practice; usually two or three of these moved at once and the question is which moved first. Tie the answer to evidence in traces, not memory.
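
One way to force the axis selection and the "which moved first" question is to record each observed change with its axis, a timestamp, and a pointer to evidence. A minimal sketch; the `Change` type, the trace identifiers, and the timestamps are hypothetical, and only the five axis names come from the list above:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class Axis(Enum):
    MODEL = "model"
    PROMPT = "prompt"
    JUDGE = "judge"
    DATA = "data"
    ENVIRONMENT = "environment"


@dataclass(frozen=True)
class Change:
    axis: Axis
    observed_at: datetime   # taken from traces, not from memory
    evidence: str           # link or trace identifier

    def __post_init__(self):
        if not self.evidence.strip():
            raise ValueError("every recorded change needs evidence from traces")


def first_mover(changes):
    """Return the change with the earliest timestamp."""
    return min(changes, key=lambda c: c.observed_at)


# Hypothetical incident in which two axes moved on the same day.
changes = [
    Change(Axis.DATA, datetime(2025, 1, 6, 11, 0, tzinfo=timezone.utc), "trace-0002"),
    Change(Axis.PROMPT, datetime(2025, 1, 6, 9, 0, tzinfo=timezone.utc), "trace-0001"),
]
print(first_mover(changes).axis.value)  # prompt
```

Rejecting an empty `evidence` field is a deliberate choice here: it makes "the model got worse" impossible to record without a trace behind it.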

Remediation. What changes ship? Three categories:

Code changes (prompt revert, judge rollback, tool-API adapter).
Eval-set changes (add the failing pattern as a regression case; calibrate the judge against new labels; update the reference window).
Gate or threshold changes (tighten the indifference window on the regressed axis; add a slice to the per-slice scorecard; add an alarm).

The Field Guide and the Evals FAQ both argue that the eval-set change is usually the most important of the three because it is what prevents the same regression from recurring ³ ⁴.
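
An eval-set change can be as small as appending one case to a regression file. The sketch below assumes the suite is stored as JSON Lines and that each case carries four fields; the file name, the field names, and the example case are all hypothetical, not a format the sources prescribe.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_FIELDS = {"incident_id", "axis", "input", "expected_behaviour"}


def append_regression_case(path: Path, case: dict) -> None:
    """Append one regression case to a JSON Lines suite, rejecting incomplete cases."""
    missing = REQUIRED_FIELDS - case.keys()
    if missing:
        raise ValueError(f"regression case is missing fields: {sorted(missing)}")
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(case, ensure_ascii=False) + "\n")


def load_cases(path: Path) -> list:
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo against a temporary file; a real suite would live in the repo.
suite = Path(tempfile.mkdtemp()) / "regression_suite.jsonl"
append_regression_case(suite, {
    "incident_id": "INC-EXAMPLE",
    "axis": "prompt",
    "input": "What is the refund window?",
    "expected_behaviour": "States the refund window and does not refuse.",
})
cases = load_cases(suite)
print(len(cases))  # 1
```

Keeping `incident_id` on every case makes the dataset diff traceable back to the postmortem that produced it.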

A scaffold

# Incident <id>: <one-line title>

- **Severity:** S1 | S2 | S3
- **Detection time:** <UTC>
- **Resolution time:** <UTC>
- **Time-to-detect:** <Δ>
- **Time-to-mitigate:** <Δ>
- **Owner:** @<handle>

## Detection
- How: <dashboard alert / user report / internal observation>
- Which signal: <name and link>
- What we missed: <signals that should have fired earlier>

## Scope
| Slice | Baseline metric | Incident metric | Delta | Users affected |
|---|---|---|---|---|
| ... | ... | ... | ... | ... |

## Root cause
- **Axis:** model | prompt | judge | data | environment
- **What changed:** <link to commit / config / provider notice>
- **Why we did not catch it:** <eval coverage gap, threshold band, calibration drift>

## Remediation
- Code: <PR links>
- Eval set: <new regression cases added; link to dataset diff>
- Gates and thresholds: <changes to the release gate or drift alarms>
- Calibration: <judge re-calibration plan, if applicable>

## Action items
| # | Action | Owner | Due |
|---|---|---|---|
| 1 | ... | @... | YYYY-MM-DD |

The action-item table is the part that gets checked at retros. If most action items get marked "done" without an artefact (a PR, a dataset diff, a calibration record), the incident did not actually prevent the next one.
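
The retro check described above can be automated if action items are tracked as structured records. A minimal sketch, assuming each item lists its artefact links; the `ActionItem` type and the example items are hypothetical:

```python
from dataclasses import dataclass, field


@dataclass
class ActionItem:
    number: int
    action: str
    done: bool
    artefacts: list[str] = field(default_factory=list)  # PR, dataset diff, calibration record


def done_without_artefact(items):
    """Return items marked done that have no artefact attached."""
    return [i for i in items if i.done and not i.artefacts]


# Hypothetical action items; the artefact strings stand in for real links.
items = [
    ActionItem(1, "Add failing pattern to regression suite", done=True, artefacts=["dataset-diff"]),
    ActionItem(2, "Re-calibrate judge", done=True),
    ActionItem(3, "Tighten gate on regressed slice", done=False),
]
flagged = done_without_artefact(items)
print([i.number for i in flagged])  # [2]
```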

Why traditional postmortem templates underperform here

Three patterns recur when an AI team uses a service-incident template.

The first is "root cause: model regression" with no further specificity. The model is the box; the question is which axis inside the box. Without a forced selection from the five axes above, the team writes "the model got worse" and leaves no path to prevention.

The second is "action item: add an alert." Alerts without evals are notifications. The right action is usually a regression case plus an alarm; the alarm without the case fires once and goes silent.

The third is "we will calibrate the judge." Calibration without a labelled set is wishful. The right action is a calibration record (twenty labels, kappa, TPR, TNR) tied to the incident, with the calibration set added to the regression suite ⁵ ⁶.
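
A calibration record of the kind described can be computed from paired human and judge labels. The sketch below assumes binary pass/fail labels, takes "kappa" to mean Cohen's kappa, and treats a human "pass" as the positive class; the function name and the twenty example labels are hypothetical.

```python
def calibration_record(human: list[bool], judge: list[bool]) -> dict:
    """Agreement statistics for a judge against human labels (True = pass)."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    tp = sum(h and j for h, j in zip(human, judge))
    tn = sum((not h) and (not j) for h, j in zip(human, judge))
    fp = sum((not h) and j for h, j in zip(human, judge))
    fn = sum(h and (not j) for h, j in zip(human, judge))
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("human labels must contain both passes and fails")
    n = len(human)
    observed = (tp + tn) / n
    # Chance agreement from the marginal pass/fail rates of each rater.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return {"n": n, "kappa": kappa, "tpr": tp / (tp + fn), "tnr": tn / (tn + fp)}


# Twenty hypothetical labels: 12 human passes, 8 human fails.
human = [True] * 12 + [False] * 8
judge = [True] * 10 + [False] * 2 + [False] * 6 + [True] * 2
record = calibration_record(human, judge)
print({k: round(v, 3) for k, v in record.items()})
# {'n': 20, 'kappa': 0.583, 'tpr': 0.833, 'tnr': 0.75}
```

With only twenty labels these figures are coarse, so the record is better read as a tripwire for judge drift than as a precise measurement.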

What about model-provider incidents

A class of incident not covered by your code: the upstream provider quietly retrained, deprecated a model, or changed safety behaviour. The detection is identical; the remediation is partly procedural. Pin the model version at the request level (not just the SDK level). Maintain a rollback path to a previous version. Subscribe to provider release notes, and run a daily smoke test against a canonical prompt to detect behaviour changes.
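
The pinned version and the daily smoke test can be sketched together. Everything named here is an assumption for illustration: `call_model` stands in for whatever client the team uses (no real provider SDK is implied), the version string is made up, and the checks are placeholders for behaviours the team actually cares about.

```python
from typing import Callable

# Hypothetical pinned version string, passed explicitly with every request.
PINNED_MODEL = "example-model-2025-01-01"


def smoke_check(
    call_model: Callable[[str, str], str],
    prompt: str,
    checks: dict[str, Callable[[str], bool]],
) -> list[str]:
    """Run one canonical prompt and return the names of the checks that failed."""
    output = call_model(PINNED_MODEL, prompt)
    return [name for name, check in checks.items() if not check(output)]


# Stand-in for a real client; simulates a provider-side behaviour change.
def fake_call_model(model: str, prompt: str) -> str:
    return "I can't help with that."


checks = {
    "mentions_refund_window": lambda out: "30 days" in out,
    "does_not_refuse": lambda out: "can't help" not in out.lower(),
}
failed = smoke_check(fake_call_model, "What is the refund window?", checks)
print(failed)  # ['mentions_refund_window', 'does_not_refuse']
```

Behavioural checks are used instead of exact string comparison because model output is rarely byte-identical from day to day, even on a pinned version.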

The Anthropic and OpenAI status pages catch the loud incidents; the quiet ones (a fine-tune that nudged a behaviour) only surface through your own evals. A regression caught by the customer before it shows up in your evals is the most expensive class of incident; the regression suite is what closes the gap ².

TIP

An incident that does not produce at least one new eval-set entry was misunderstood. The point of the postmortem is to make the next instance of the same failure block CI, not to assign blame ³.

What to do this week

Write the template above in your repo's incident folder. Replace the service-incident template for AI workflows.
Run the template against the last incident you handled. If you cannot fill the root-cause axis, the eval program is underspecified on that axis.
Audit open action items from past incidents. Each one should have either a PR link or a regression case added; if it has neither, the action item is decoration.

Every postmortem produces numbers (a failure rate before, a failure rate after), and telling a real recovery from noise in those numbers is the work of Statistics and Rigor.

¹ Chip Huyen, "Building A Generative AI Platform" (2024).
² Applied LLMs, "What We Learned from a Year of Building with LLMs."
³ Hamel Husain, "A Field Guide to Rapidly Improving AI Products."
⁴ Hamel Husain and Shreya Shankar, "LLM Evals FAQ."
⁵ Moshkovich et al., "Beyond Black-Box Benchmarking" (2025). arXiv:2503.06745.
⁶ Shreya Shankar, "The AI Engineering Flywheel."