Production & Observability

Online vs offline, trace schema, drift detection, CI integration. Also available as a free PDF download.

Once a system has real users, evals stop being a notebook exercise and become a production discipline. The work shifts from "did the model do well on a held-out set" to "is the model still doing well on the traffic we got today, on the slices we care about, at the cost and latency we promised." That shift is mostly about plumbing: a trace schema that captures every input the model saw, a sampling policy that decides which traces get judged, drift detectors that alarm before users complain, and a CI pipeline that runs the offline suite on every change.

The pages here treat production evals as the steady state and offline evals as the prequel. Online runs continuously and samples; offline runs on a frozen dataset for every change. Both write into the same metric ledger so a regression in CI and a degradation in production share a vocabulary. Get the trace schema right first; everything downstream depends on it. The condensed references are online evaluation patterns and guardrails versus evals architecture.

Offline: gates and CI

Online vs offline: where each fires. Frozen-dataset runs on every change vs continuous sampling of live traffic, and the questions each answers.
Cost and latency in the scorecard. Quality, cost, and latency on one scorecard, with the pick made on the Pareto frontier.
Wiring evals into CI. Smoke, full, and nightly tiers, cost caps, sharding, and results published to the PR.
Release gates from eval results. Define the bands, automate the comparison, and hold the line when a regression candidate wants out.

Online: monitoring and response

Trace schema: what to log. The fields that matter and the storage shape that keeps downstream queries cheap.
Sampling strategies for production traces. Why uniform sampling shortchanges tail slices; stratify, oversample failures, reweight.
Drift detection. Track input and output distributions, and decide which shifts deserve a page.
Incident response and the AI postmortem. Why the root cause is rarely a line of code and the actions are usually eval-set changes.