Drift detection for production AI

Track input and output distributions over time. Page on changes that matter; ignore the ones that do not. The hardest part is deciding which is which.

A drift alarm that fires every day is not an alarm; it is noise. A drift alarm that never fires is not protection; it is a sleeping switch. Useful drift detection lives in the middle, paging only when the distribution change is large enough to matter for the metric the page targets. The work is mostly in the threshold, not in the math.

The pattern that holds up: track three families of signals (input shape, output shape, judge score), compare a rolling recent window against a frozen reference window, and page when the change exceeds a threshold tuned to your false-positive tolerance. The reference window is what you ship against; the rolling window is what production is doing now. The threshold is the contract between you and on-call.
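As a minimal sketch of that pattern, assuming a single numeric signal such as prompt length: the class name `RollingDriftCheck`, the `relative_median_shift` distance, and the threshold value are all illustrative choices, not part of any particular library.

```python
from collections import deque
from statistics import median
from typing import Callable, Sequence

Distance = Callable[[Sequence[float], Sequence[float]], float]


def relative_median_shift(reference: Sequence[float], current: Sequence[float]) -> float:
    """Hypothetical distance: median shift as a fraction of the reference median.

    Assumes the reference median is non-zero (true for lengths and counts).
    """
    ref_median = median(reference)
    return abs(median(current) - ref_median) / abs(ref_median)


class RollingDriftCheck:
    """Compare a rolling window of recent values against a frozen reference."""

    def __init__(self, reference: Sequence[float], distance: Distance,
                 threshold: float, window_size: int) -> None:
        self.reference = tuple(reference)        # frozen: never updated in place
        self.window = deque(maxlen=window_size)  # rolling: oldest values fall off
        self.distance = distance
        self.threshold = threshold

    def observe(self, value: float) -> None:
        self.window.append(value)

    def should_page(self) -> bool:
        # Stay quiet until the rolling window is full, so a handful of
        # early traces cannot trigger a page on their own.
        if len(self.window) < self.window.maxlen:
            return False
        return self.distance(self.reference, tuple(self.window)) > self.threshold
```

The distance function is pluggable on purpose: the same wrapper can hold a KS statistic or a PSI, and only the threshold changes.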

What to track

Three families cover most production drift in practice. Each is cheap to log on every trace and informative on its own.

Input drift. What users send. Useful signals: prompt length distribution (median and p95), language mix, intent mix (if classified), retrieved-context document age (for RAG), input embedding centroid shift. The most common production-bug signal here is "a new traffic source started sending a kind of input the system was never tested on." The Field Guide makes the case that this kind of input shift is usually the first thing that surfaces a real failure cluster ¹.
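One way to turn "input embedding centroid shift" into a number, sketched under the assumption that each trace already has an embedding vector stored as a list of floats. Cosine distance between the two centroids is one reasonable choice among several; the function names are illustrative.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> list[float]:
    """Component-wise mean of a non-empty set of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def centroid_shift(reference: Sequence[Vector], current: Sequence[Vector]) -> float:
    """Cosine distance between the reference and current embedding centroids."""
    return cosine_distance(centroid(reference), centroid(current))
```

A centroid only captures a shift in the average direction; a new cluster that is small relative to total traffic can leave it nearly unchanged, which is one reason to pair it with the intent-mix signal.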

Output drift. What the system returns. Useful signals: response length distribution, refusal rate, tool-call rate (for agents), output structure validity rate, language mix, parse-error rate. Output drift detached from input drift is the loudest "the model changed under us" signal, usually triggered by an upstream provider quietly retraining.

Judge-score drift. What the online judge thinks of the outputs. Useful signals: per-eval score distribution, per-stratum pass rate, distribution of low-confidence judgments. Judge-score drift is the closest thing to a quality canary, but it is downstream of judge stability; pin the judge version and watch separately for judge regressions ².

Reference windows are versioned artefacts

The reference window is not "yesterday" or "the last week"; it is the window the current production model was validated against. Pin it when you cut a release. When a release ships, the previous reference moves to a "historical" slot, the new reference is the validation window of the new release, and the rolling window starts comparing against the new reference. Treat the reference window like a dataset version: content-hashed, immutable, tied to a release tag.
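A minimal sketch of a reference window handled as a versioned artefact, assuming the window is a list of numeric values for one signal. `ReferenceWindow`, `pin_reference`, and `verify` are hypothetical names, and the release tags in the usage note are made up.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Sequence


def _hash_values(values: Sequence[float]) -> str:
    # Hash the content only, so the same data always yields the same hash.
    payload = json.dumps(list(values)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after pinning
class ReferenceWindow:
    release_tag: str
    content_hash: str
    values: tuple[float, ...]


def pin_reference(release_tag: str, values: Sequence[float]) -> ReferenceWindow:
    """Freeze a validation window and tie it to a release tag."""
    frozen = tuple(values)
    return ReferenceWindow(release_tag, _hash_values(frozen), frozen)


def verify(window: ReferenceWindow) -> bool:
    """Recompute the hash to confirm the stored values were not altered."""
    return _hash_values(window.values) == window.content_hash
```

Calling `pin_reference("v1.2.0", lengths)` at release time and `verify(...)` before every comparison would catch a reference that was silently rebuilt from newer data.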

Shankar's flywheel framing captures this discipline: production traces feed back into the offline dataset and the reference window in lockstep, and a release that changes either without acknowledging the other is a process bug ³.

Detection that works in practice

For most signals, a population-stability index (PSI) or a two-sample Kolmogorov-Smirnov test on the rolling versus reference window is enough. For a pass rate or refusal rate, a two-proportion interval built from Wilson score intervals (Newcombe's method) gives you both the point estimate and the interval in two lines. None of these methods is special; the discipline is in the threshold tuning and in the reweighted comparison.

"""
Three drift checks: KS on a numeric signal, PSI on a categorical signal,
and a two-proportion Wilson comparison on a rate. No external deps
beyond numpy, scipy, and statsmodels.
"""
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import confint_proportions_2indep

def ks_drift(reference: list[float], current: list[float]) -> dict:
    stat, p = stats.ks_2samp(reference, current)
    return {"stat": float(stat), "p_value": float(p)}

def psi(reference: list[str], current: list[str], categories: list[str]) -> float:
    """Population Stability Index over a categorical distribution."""
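    def proportions(xs: list[str]) -> np.ndarray:
        # Additive smoothing keeps empty categories away from log(0). Adding eps
        # per category and eps * k to the total keeps the proportions summing
        # to 1 when every value falls in `categories`.
        eps = 1e-6
        counts = np.array([xs.count(c) + eps for c in categories])
        return counts / (len(xs) + eps * len(categories))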
    p_ref = proportions(reference)
    p_cur = proportions(current)
    return float(np.sum((p_cur - p_ref) * np.log(p_cur / p_ref)))

def rate_diff(
    ref_success: int, ref_n: int,
    cur_success: int, cur_n: int,
    alpha: float = 0.05,
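) -> dict:
    # "newcomb" is Newcombe's hybrid interval, built from the Wilson score
    # interval of each proportion; "score" would select Miettinen-Nurminen.
    low, high = confint_proportions_2indep(
        cur_success, cur_n, ref_success, ref_n, alpha=alpha, method="newcomb"
    )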
    p_cur = cur_success / cur_n
    p_ref = ref_success / ref_n
    return {"delta": p_cur - p_ref, "ci": (low, high), "p_cur": p_cur, "p_ref": p_ref}

Two thresholds are worth committing to as defaults, with the caveat that you should tune them against your own history of false alarms and real incidents rather than picking them off a page.

PSI greater than 0.2 on any monitored categorical distribution is a strong drift signal. PSI between 0.1 and 0.2 is investigation territory.
A rate delta whose 95% interval excludes zero is a candidate page; the size of the delta determines severity.
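The two defaults can be written down as a small decision helper. This is a sketch: the label strings and function names are illustrative, and treating the 0.1 and 0.2 boundaries as inclusive on the "investigate" side is an assumption the text does not settle.

```python
def classify_psi(psi_value: float) -> str:
    """Map a PSI value onto the default bands described above."""
    if psi_value > 0.2:
        return "strong_drift"
    if psi_value >= 0.1:  # assumption: both band edges count as "investigate"
        return "investigate"
    return "ok"


def is_candidate_page(ci_low: float, ci_high: float) -> bool:
    """True when the interval on the rate delta excludes zero."""
    return ci_low > 0.0 or ci_high < 0.0


def decide(psi_value: float, ci_low: float, ci_high: float) -> dict:
    """Combine both checks into one record for the alerting layer."""
    return {
        "psi_band": classify_psi(psi_value),
        "rate_candidate_page": is_candidate_page(ci_low, ci_high),
    }
```

Keeping the thresholds in one function makes them easy to tune later from the alarm history without touching the detection code.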

Why naive alarms fail

Three failure modes that show up in production drift programs reliably.

The first is alarming only on the global mean while the strata move in opposite directions. Two intents shift by similar amounts in opposite directions; the global mean stays flat; the program is on fire. Run drift detection per stratum, not just on the aggregate, and report each per-stratum interval ⁴.
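A worked example of strata cancelling out, using made-up counts for two hypothetical intents, `billing` and `search`. The interval is Newcombe's hybrid interval built from per-proportion Wilson intervals, written with the standard library only; treat it as a sketch rather than a replacement for a tested statistics package.

```python
import math

Z_95 = 1.959964  # two-sided 95% normal quantile


def wilson(successes: int, n: int, z: float = Z_95) -> tuple[float, float]:
    """Wilson score interval for a single proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def newcombe_diff(cur_s: int, cur_n: int, ref_s: int, ref_n: int) -> dict:
    """Interval for (current rate - reference rate) from two Wilson intervals."""
    p_cur, p_ref = cur_s / cur_n, ref_s / ref_n
    l_cur, u_cur = wilson(cur_s, cur_n)
    l_ref, u_ref = wilson(ref_s, ref_n)
    delta = p_cur - p_ref
    low = delta - math.sqrt((p_cur - l_cur) ** 2 + (u_ref - p_ref) ** 2)
    high = delta + math.sqrt((u_cur - p_cur) ** 2 + (p_ref - l_ref) ** 2)
    return {"delta": delta, "low": low, "high": high}


# Illustrative counts: (current passes, current n, reference passes, reference n)
STRATA = {
    "billing": (450, 500, 400, 500),  # pass rate 0.80 -> 0.90
    "search": (350, 500, 400, 500),   # pass rate 0.80 -> 0.70
}


def drift_report(strata: dict) -> tuple[dict, dict]:
    """Return the aggregate comparison and one comparison per stratum."""
    totals = [sum(row[i] for row in strata.values()) for i in range(4)]
    per_stratum = {name: newcombe_diff(*row) for name, row in strata.items()}
    return newcombe_diff(*totals), per_stratum
```

With these counts the aggregate pass rate is 0.80 in both windows, so its interval straddles zero, while each stratum's interval excludes zero in opposite directions.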

The second is treating the judge as a fixed reference. Judges drift too: model version updates, prompt edits, calibration churn. A page that reads "judge score dropped" must distinguish a model-quality drop from a judge regression. The fix is to pin the judge version on every score row and to run a smoke-test set of held-out human labels against the current judge on the same cadence as the production sample.
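A sketch of that fix, assuming score rows are stored with a `judge_version` field and that a small set of human gold labels exists keyed by trace id. The `ScoreRow` shape, the function names, and the 0.05 tolerance are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Mapping


@dataclass(frozen=True)
class ScoreRow:
    trace_id: str
    judge_version: str  # pinned on every row so scores stay attributable
    passed: bool


def agreement_with_gold(rows: Iterable[ScoreRow], gold: Mapping[str, bool],
                        judge_version: str) -> float:
    """Share of gold-labelled traces where this judge version matches the human label."""
    matches = [
        row.passed == gold[row.trace_id]
        for row in rows
        if row.judge_version == judge_version and row.trace_id in gold
    ]
    if not matches:
        raise ValueError(f"no gold-labelled rows for judge version {judge_version!r}")
    return sum(matches) / len(matches)


def judge_regressed(current: float, baseline: float, tolerance: float = 0.05) -> bool:
    """True when agreement fell by more than the tolerance (an assumed default)."""
    return baseline - current > tolerance
```

Running this on the same cadence as the production sample gives the page a second number to read: if agreement with gold fell, the score drop points at the judge first.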

The third is reporting a non-clustered standard error on a stream where users contribute multiple traces. The variance is larger than the naive formula reports, sometimes by a factor of three ⁵. Use a clustered standard error or a paired comparison; the clustered standard errors page covers the correction.
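To make the gap concrete, here is a minimal sketch comparing the naive standard error of a mean score with a cluster-robust one, where each user is a cluster. It uses the basic sandwich form without a small-sample correction; the function names and the toy data in the usage note are illustrative.

```python
import math
from collections import defaultdict
from typing import Hashable, Sequence


def naive_se(scores: Sequence[float]) -> float:
    """Standard error of the mean assuming every trace is independent."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return math.sqrt(variance / n)


def clustered_se(scores: Sequence[float], cluster_ids: Sequence[Hashable]) -> float:
    """Cluster-robust standard error of the mean.

    Residuals are summed within each cluster before squaring, so correlated
    traces from the same user no longer count as independent evidence.
    """
    n = len(scores)
    mean = sum(scores) / n
    residual_sums: dict = defaultdict(float)
    for score, cluster in zip(scores, cluster_ids):
        residual_sums[cluster] += score - mean
    return math.sqrt(sum(r * r for r in residual_sums.values())) / n
```

For four users who each contribute three identical traces (two users all-pass, two all-fail), the clustered error is 0.25 while the naive formula reports roughly 0.15, so the naive interval is too narrow.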

What to do when an alarm fires

The runbook is simple and worth committing before the first page:

Cross-reference with the input-drift signals. If input changed too, the question is "did the model handle the new input correctly," not "did the model regress."
Check judge stability. Run the held-out judge calibration set against the current judge version. If judge agreement against the gold labels dropped, the issue is the judge, not the model.
Sample twenty drifted traces and read them. Open-coding is the only way to tell whether the drift represents a real quality change or a benign distribution shift. The Field Guide pattern of reading raw traces remains the highest-signal step in any drift investigation ¹.
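The three steps above can be sketched as a small triage helper. Everything here is illustrative: the function names, the note wording, and the fixed seed are assumptions, and the two boolean inputs stand in for whatever your input-drift and judge-calibration checks actually return.

```python
import random
from typing import Sequence


def triage_notes(input_drifted: bool, judge_agreement_dropped: bool) -> list[str]:
    """Turn the first two runbook checks into notes for the investigator."""
    notes = []
    if input_drifted:
        notes.append("input changed: ask whether the model handled the new input correctly")
    if judge_agreement_dropped:
        notes.append("judge agreement dropped: suspect the judge, not the model")
    # Reading traces is always the last step, whatever the first two found.
    notes.append("read the sampled traces before reaching a verdict")
    return notes


def sample_for_review(trace_ids: Sequence[str], k: int = 20, seed: int = 0) -> list[str]:
    """Pick up to k drifted traces for open coding, reproducibly."""
    rng = random.Random(seed)  # fixed seed so two reviewers see the same sample
    ids = list(trace_ids)
    return rng.sample(ids, min(k, len(ids)))
```

A fixed seed is a design choice: it lets a second reviewer reproduce the exact sample when the first reading is disputed.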

TIP

A drift alarm is the start of an investigation, not a verdict. The point of the alarm is to direct attention; the verdict comes from reading data ⁶.

What to do this week

Pin a reference window for each production workflow. Tie it to the release tag.
Stand up per-stratum drift detection on input length, intent mix, refusal rate, and judge score. Default thresholds at PSI 0.2 and rate-delta confidence intervals excluding zero.
Write the runbook above on the page that owns the alarm. Drift detection without a runbook is a notification, not a control.

¹ Hamel Husain, "A Field Guide to Rapidly Improving AI Products."
² Moshkovich et al., "Beyond Black-Box Benchmarking" (2025). arXiv:2503.06745.
³ Shreya Shankar, "The AI Engineering Flywheel."
⁴ Applied LLMs, "What We Learned from a Year of Building with LLMs."
⁵ Miller, "Adding Error Bars to Evals" (Anthropic). arXiv:2411.00640.
⁶ Chip Huyen, "Building A Generative AI Platform" (2024).