Online evaluation runs on production traffic. The offline program tells you whether a candidate meets your bar before deploy; the online program tells you whether it keeps meeting that bar after. Huyen frames it as "evaluation is necessary at every step" with the heavy lifting in production shifting to observability 1. The four patterns are sampling, sampled judging, drift detection, and A/B with intervals.
Trace sampling
Every production trace is logged with input, output, model version, prompt version, latency, token counts, and metadata. The schema is covered in the trace schema chapter.
Sampling answers two questions: which traces go to human review (the open-coding pool) and which go to the online judge (the quality-drift signal).
Uniform random for unbiased aggregates. Standard pick when the eval is computing an overall metric. Sampling rate 1% to 5%.
Stratified for coverage. When high-stakes segments are under-represented (enterprise, regulated markets), stratify and oversample. Reweight when aggregating.
Failure-biased for error analysis. When the base failure rate is low (1-2%), uniform sampling yields only one or two failures per 100 traces. Oversample failure-flagged traces (by guardrail trigger, user thumbs-down, retry signal) to make the human-review queue worth opening 2.
A bias-free aggregate needs uniform; a useful error-analysis queue needs failure-biased. Run both, label them differently, never mix the samples when computing an aggregate.
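The three modes and the reweighting step can be sketched as follows. This is a minimal illustration, not a reference implementation: the rates, the segment names, the `failure_flag` and `passed` fields, and both function names are assumptions made for the example.

```python
import random

# Hypothetical rates; tune them to review capacity and judge budget.
UNIFORM_RATE = 0.02
STRATUM_RATES = {"enterprise": 0.20, "self_serve": 0.02}
FAILURE_RATE = 0.50


def choose_samples(trace, rng=random):
    """Return the set of sample labels a trace belongs to.

    Each mode draws independently, so a trace can land in several samples.
    The labels keep the samples separate when aggregating.
    """
    labels = set()
    if rng.random() < UNIFORM_RATE:
        labels.add("uniform")
    if rng.random() < STRATUM_RATES.get(trace["segment"], UNIFORM_RATE):
        labels.add("stratified")
    if trace.get("failure_flag") and rng.random() < FAILURE_RATE:
        labels.add("failure_biased")
    return labels


def reweighted_pass_rate(stratified_sample):
    """Estimate the overall pass rate from the stratified sample.

    Each trace is weighted by 1 / its stratum's sampling rate, so
    oversampled segments do not dominate the aggregate.
    """
    num = den = 0.0
    for trace in stratified_sample:
        weight = 1.0 / STRATUM_RATES.get(trace["segment"], UNIFORM_RATE)
        num += weight * trace["passed"]
        den += weight
    return num / den
```

Only traces labeled `uniform` (or the reweighted `stratified` sample) should feed an aggregate; the `failure_biased` label exists so that queue can be excluded from any headline number.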
Sampled judging at production
An LLM judge on every production trace is usually too expensive. The standard pattern is to judge a sampled subset (1-5%) and report the quality metric on the sampled population.
The judge runs asynchronously off the request path. The request returns immediately; the trace is queued; the judge runs in the background; the score lands within minutes 1. Langfuse, Phoenix, and Braintrust ship the pattern as a first-class concept 3 4.
The judge is the offline-calibrated one. TPR and TNR from offline calibration carry to production as long as the distribution matches; if it does not, the online metric is biased by an unknown amount and needs periodic recalibration against fresh human labels.
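```python
import random

SAMPLE_RATE = 0.02  # example value; 1-5% of traffic is typical

# generate, log_trace, enqueue_judge, dequeue, load_trace, run_judge, and
# save_score are application-provided helpers.

def handle_request(query):
    response = generate(query)
    trace_id = log_trace(query=query, response=response)
    # Enqueueing is cheap; the judge itself never runs on the request path.
    if random.random() < SAMPLE_RATE:
        enqueue_judge(trace_id)
    return response


def judge_worker():
    # Background worker: drains the queue and scores one trace at a time.
    while True:
        trace_id = dequeue()  # assumed to block until a trace is available
        trace = load_trace(trace_id)
        score = run_judge(trace.query, trace.response)
        save_score(trace_id, score)
```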
The judge's latency does not affect the user; only queue depth affects metric freshness. Judge cost lands on the eval team's budget, not the product team's, which is the right incentive structure. Judge version is pinned in the trace; recalibration is a new judge version, not a silent update.
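One way to put the offline TPR and TNR to work online is to correct the raw judged pass rate for the judge's known error rates. The sketch below uses the Rogan-Gladen estimator, a standard misclassification correction; the source does not prescribe it, and the function name is hypothetical. It is only valid while the production distribution matches the calibration set.

```python
def corrected_pass_rate(observed_rate, tpr, tnr):
    """Rogan-Gladen correction of a judged pass rate.

    observed_rate: fraction of sampled traces the judge marked as passing.
    tpr: P(judge says pass | human says pass), from offline calibration.
    tnr: P(judge says fail | human says fail), from offline calibration.
    """
    denom = tpr + tnr - 1.0
    if denom <= 0:
        raise ValueError("judge is no better than chance; correction undefined")
    estimate = (observed_rate + tnr - 1.0) / denom
    # Sampling noise can push the estimate outside [0, 1]; clamp it.
    return min(1.0, max(0.0, estimate))
```

With a perfect judge (TPR = TNR = 1) the function returns the observed rate unchanged; the further the judge is from perfect, the larger the adjustment.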
Drift detection
Three signals are commonly tracked.
Input distribution drift. Length, language, topic shifting over time. KS test or PSI on the relevant features. A shift indicates the offline calibration set is no longer representative; rerun calibration.
Output distribution drift. Refusal rate, average length, judge-score distribution shifting. Shifts can mean the model changed (silent vendor update), the prompt changed, or the user mix changed.
Per-dimension score drift. Track the daily rolling pass rate per dimension against a baseline window. Wilson intervals on each daily proportion; flag when the daily estimate sits outside the baseline interval for three consecutive days 5.
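For the input-drift signal, PSI on a numeric feature such as input length can be computed in a few lines. This is a simplified sketch: it uses equal-width bins over the baseline range (quantile bins are also common), and the function name, bin count, and `eps` floor are assumptions for illustration.

```python
import math


def psi(baseline, current, n_bins=10, eps=1e-6):
    """Population stability index between two samples of a numeric feature.

    Bin edges are equal-width over the baseline range; current values
    outside that range are clamped into the end bins.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant baseline

    def fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(n_bins - 1, max(0, idx))] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    base, cur = fractions(baseline), fractions(current)
    return sum((c - b) * math.log(c / b) for b, c in zip(base, cur))
```

Identical distributions score 0, and the value grows as mass moves between bins. The alert threshold is a policy choice and should be set against historical week-over-week variation.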
| Signal | What it detects | Standard test |
|---|---|---|
| Input length | distribution shift | KS test, PSI |
| Output refusal rate | model behavior change | Wilson interval on daily rate |
| Judge score | quality drift | Wilson interval on daily pass rate |
| Per-feature pass rate | targeted regression | clustered SE; see statistical methods |
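The per-dimension rule (a daily estimate outside the baseline interval for three consecutive days) can be sketched with a Wilson interval on the baseline window. The function names and the `(passes, total)` tuple layout are assumptions for the example.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half, center + half)


def drift_alert(baseline, daily, run_length=3):
    """True if the daily pass rate sits outside the baseline Wilson
    interval for `run_length` consecutive days.

    baseline: (passes, total) over the baseline window.
    daily: list of (passes, total) tuples, oldest first.
    """
    low, high = wilson_interval(*baseline)
    streak = 0
    for passes, total in daily:
        if total == 0:
            streak = 0  # no judged traces that day; reset the run
            continue
        rate = passes / total
        streak = streak + 1 if (rate < low or rate > high) else 0
        if streak >= run_length:
            return True
    return False
```

Requiring a run of days rather than a single excursion trades detection latency for fewer false alarms on noisy daily samples.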
A/B at production scale
Treat online A/B as an experiment. Random assignment at request time; persistent assignment per user or per session; logging of the arm; eval on the per-arm population.
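Persistent assignment is often implemented by hashing a stable identifier rather than storing a lookup table. The sketch below is one such approach, not the only one; `assign_arm`, the arm names, and the `experiment` salt are hypothetical.

```python
import hashlib


def assign_arm(unit_id, experiment, arms=("control", "treatment")):
    """Deterministically map a user or session ID to an arm.

    Hashing the experiment name together with the ID gives an assignment
    that is stable across requests and uncorrelated across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    return arms[int(digest, 16) % len(arms)]
```

The returned arm should be written into the trace at request time so that per-arm evaluation never has to recompute it.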
For binary metrics, Wilson or Clopper-Pearson per arm; difference-of-proportions for the headline delta. For continuous metrics, bootstrap on per-arm scores and the paired delta. The statistical methods page covers mechanics. Every A/B headline number ships with an interval.
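For a continuous metric, a percentile bootstrap on the difference in means might look like the sketch below. It resamples each arm independently, which suits the usual case where arms contain different users; a paired design would resample pairs instead. The function name, resample count, and fixed seed are assumptions for the example.

```python
import random
import statistics


def bootstrap_delta_ci(control, treatment, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(treatment) - mean(control)."""
    rng = random.Random(seed)  # fixed seed so the report is reproducible
    deltas = []
    for _ in range(n_resamples):
        c = rng.choices(control, k=len(control))      # resample with replacement
        t = rng.choices(treatment, k=len(treatment))
        deltas.append(statistics.fmean(t) - statistics.fmean(c))
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_resamples)]
    hi = deltas[int((1 - alpha / 2) * n_resamples) - 1]
    point = statistics.fmean(treatment) - statistics.fmean(control)
    return point, (lo, hi)
```

The headline number is then reported as the point estimate together with `(lo, hi)`; an interval that spans zero means the experiment has not separated the arms.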
Multiple-comparison correction applies when reporting more than one sub-metric. Pick the headline metric before assignment; report the rest with FDR correction. Chatbot Arena ships bootstrap intervals on every model's score for this reason 6.
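FDR correction for the secondary metrics is commonly done with the Benjamini-Hochberg step-up procedure. A minimal sketch, assuming one p-value per sub-metric has already been computed (the function name is hypothetical):

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where the hypothesis is rejected
    at FDR level q (Benjamini-Hochberg step-up procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * q.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected
```

The headline metric chosen before assignment stays outside this list; only the remaining sub-metrics go through the correction.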
Three failure modes. Stopping early at the first significant result inflates the false-positive rate; pre-register the analysis plan. Running an A/B without enough power detects nothing but spends user-trust budget; size for the smallest delta of interest. Letting the A/B run indefinitely postpones the decision and multiplies the looks at the data, which again inflates false positives; pre-register a stop date.
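Sizing for the smallest delta of interest can be done up front with the standard normal-approximation formula for comparing two proportions. The sketch below assumes a two-sided test and equal arm sizes; the function name and default `alpha` and `power` are illustrative choices.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, min_delta, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate + min_delta
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / min_delta ** 2)
```

For a 90% baseline pass rate and a two-point minimum delta, `sample_size_per_arm(0.90, 0.02)` comes out a little over 3,000 judged traces per arm. Dividing that by the daily judged volume gives a stop date that can be pre-registered alongside the analysis plan.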
Where guardrails fit
Online eval observes; guardrails intervene. They share the trace stream but run at different latencies and have different consequences. The guardrails versus evals page covers the architecture. Rule of thumb: guardrails block the response on the critical path; online eval scores the response off the critical path.
A guardrail trigger is an excellent online-eval input. Failure-biased sampling that oversamples guardrail-triggered traces gives the open-coding pool the highest-information traces.
The minimum production stack
A trace store using the schema from the trace schema chapter. A sampling layer with uniform and failure-biased modes. A judge worker pool running offline-calibrated judges on 1-5% of traffic. A drift-detection cron job comparing daily statistics to a baseline window. An A/B framework that reports an interval on every headline number. A weekly human-review queue for open coding 7.
Langfuse and Phoenix ship the first four out of the box 3 4. The A/B framework is usually homegrown or wrapped around the product team's experimentation platform. The human-review queue almost never works without a custom data viewer; the data viewer chapter covers the build.
What good looks like
A production online-eval program ships sampled judging at 1-5% of traffic with the offline-calibrated judge, a drift dashboard with Wilson intervals on daily aggregates alerting on three-day deviations, an A/B framework with pre-registered analysis plans and bootstrap intervals on the headline delta, and a weekly cadence pulling failure-biased samples into the open-coding queue. At production scale, evaluation looks like observability with judgment plugged in 1.