Once a system has real users, evals stop being a notebook exercise and become a production discipline. The work shifts from "did the model do well on a held-out set" to "is the model still doing well on the traffic we got today, on the slices we care about, at the cost and latency we promised." That shift is mostly about plumbing: a trace schema that captures every input the model saw, a sampling policy that decides which traces get judged, drift detectors that alarm before users complain, and a CI pipeline that runs the offline suite on every change.
The pages here treat production evals as the steady state and offline evals as the prequel. Online evals run continuously on a sample of live traffic; offline evals run against a frozen dataset on every change. Both write into the same metric ledger, so a regression in CI and a degradation in production share a vocabulary. Get the trace schema right first; everything downstream depends on it.
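To make the plumbing concrete, here is a minimal sketch of three of the pieces named above: a trace record, a sampling policy, and a shared ledger row. Every name in it (`Trace`, `MetricRow`, `should_judge`, and all field names) is a hypothetical chosen for illustration, not a schema these pages prescribe, and hash-based sampling is only one possible policy.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    """One model call, recording every input the model saw (hypothetical fields)."""
    trace_id: str
    model: str
    prompt: str
    context: tuple[str, ...]  # retrieved documents, tool outputs, and so on
    output: str
    latency_ms: float
    cost_usd: float


@dataclass(frozen=True)
class MetricRow:
    """One ledger entry; offline and online runs write the same shape."""
    source: str   # "offline" (CI, frozen dataset) or "online" (sampled traffic)
    metric: str   # e.g. "faithfulness"
    slice: str    # e.g. "billing-questions"
    value: float
    trace_id: str


def should_judge(trace_id: str, rate: float) -> bool:
    """Deterministic sampling: hash the trace id into [0, 1) and compare to rate.

    The same trace id always gets the same decision, so a re-run judges
    the same traces.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # exact float in [0, 1)
    return bucket < rate


if __name__ == "__main__":
    trace = Trace(
        trace_id="t-0001",
        model="example-model",
        prompt="How do I reset my password?",
        context=("doc: account recovery",),
        output="Open Settings, then choose Reset password.",
        latency_ms=412.0,
        cost_usd=0.0008,
    )
    if should_judge(trace.trace_id, rate=0.05):
        row = MetricRow("online", "faithfulness", "account", 0.9, trace.trace_id)
        print(row)
```

Because `source` is the only field that separates a CI result from a production result, one query over the ledger can compare the two for the same metric and slice.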
Chapters: