Offline evals run a frozen dataset through a candidate system and ask "did this change make the metric go up." Online evals sample live traffic through the deployed system and ask "is the metric still where we left it." Teams that conflate the two end up with one half of the picture: a CI green light that masks a production regression, or a production dashboard that drifts without ever catching a bad PR.
The distinction matters because the two answer questions the other cannot. Offline isolates the variable you changed; online tells you whether the change holds on the inputs users actually send. Build both. Wire them to the same metric vocabulary. Let the offline suite block the release and the online suite catch the slow leaks 1.
The two regimes
Offline runs on a curated set of inputs you control: a regression suite, a golden set, a labeled dev split. The inputs are fixed, the outputs are scored against a known rubric or reference, and the comparison is a candidate model or prompt against a baseline. Latency and throughput do not enter the score unless you put them there. The point is to decide whether to merge a change 2.
Online runs on a sample of live traffic. The inputs are whatever real users send, the outputs are whatever the deployed system produces, and the scores feed dashboards and alerts. Cost, latency, and quality all share the same trace. The point is to detect drift between releases, catch slices that the offline suite missed, and gather inputs for the next dataset refresh 1.
The two regimes should share the same judge code wherever possible. The same score_faithfulness(query, response, context, judge_version) function should run in CI and on a 2% sample of prod. If the two copies diverge, every gap between an offline pass and an online failure becomes ambiguous: is it the inputs, the judge version, or the model?
Where each fires in a request lifecycle
sequenceDiagram
participant Dev as Developer
participant CI as CI pipeline
participant Gate as Release gate
participant User as User
participant App as Application
participant Guard as Guardrails
participant Trace as Trace store
participant Eval as Online judge
Dev->>CI: open PR
CI->>CI: offline eval on golden set
CI->>Gate: pass/fail + delta
Gate->>Gate: human sign-off
Gate-->>Dev: merged
User->>App: request
App->>Guard: pre-call checks
Guard->>App: allow / block
App->>App: model call + tools
App->>Trace: write trace
App->>User: response
Trace->>Eval: sampled async
Eval->>Trace: judge labels + scores
Guardrails fire on every request before the response leaves the box: PII redaction, schema validation, refusal on policy violations, max-token cuts 1. They are runtime checks, not evals. Evals are observations on a sample after the fact. Confusing the two leads to teams asking a judge to enforce policy synchronously, which is slow and expensive, or skipping guardrails because "the eval will catch it," which it will not because by then the bad response shipped.
When to use which
Offline is the only place to compare a candidate prompt against the production prompt without exposing users to the candidate. It is the only place you can re-run the same inputs across model upgrades to get a clean A vs B delta. It is the only place where statistical comparisons are clean enough to report a confidence interval that means what it looks like 3.
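Because the offline inputs are fixed, candidate and baseline can be scored on exactly the same examples, which makes the comparison a paired one. The following is a minimal sketch of a percentile bootstrap over per-example score differences, assuming numeric scores aligned by example. The function name, its parameters, and the sample scores are illustrative, not taken from any library or from a real run.

```python
import random
from statistics import mean


def paired_bootstrap_ci(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-example delta (candidate - baseline).

    Assumes baseline[i] and candidate[i] are scores for the same input.
    Returns (mean_delta, ci_low, ci_high).
    """
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    deltas = [c - b for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)  # fixed seed so repeated CI runs agree
    n = len(deltas)
    # Resample the deltas with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(deltas, k=n)) for _ in range(n_resamples))
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(deltas), low, high


# Hypothetical pass/fail scores on the same six golden-set examples.
baseline_scores = [0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
candidate_scores = [1.0, 1.0, 1.0, 0.0, 1.0, 1.0]
delta, ci_low, ci_high = paired_bootstrap_ci(baseline_scores, candidate_scores)
print(f"delta={delta:.3f} ci=[{ci_low:.3f}, {ci_high:.3f}]")
```

Resampling the differences rather than the two score lists separately is what keeps the pairing intact; six examples is far too few for a real decision and is used here only to keep the sketch short.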
Online is the only place to detect that your retrieval started returning out-of-date documents, that users started sending a new kind of question, or that the model is suddenly verbose because the upstream provider quietly retrained. It is the only place where the input distribution matches reality.
A useful rule from Huyen: instrument the prod loop daily, run the offline loop on every change 1. The cadence asymmetry mirrors the question asymmetry. Online drifts on a continuous clock and gets sampled accordingly; offline stays frozen until a change forces a rerun.
The metric ledger
Both regimes write into one ledger. Each row is (trace_id, eval_name, score, version_of_judge, version_of_dataset_or_window, timestamp). The dataset version is a content hash for offline runs and a window identifier (2026-W21) for online runs. Pinning every score to a judge version and a dataset or window lets you answer the only question that matters when a number moves: did the score move because the system got worse, or because we changed what we were measuring.
# Same judge runs in offline CI and online sampling.
# The only difference is the source of traces and the cadence.
# score_faithfulness, load_dataset, read_traces_from_warehouse, and
# write_to_ledger are project helpers assumed to be defined elsewhere.
import random
from typing import Iterable, Iterator


def evaluate_traces(traces: Iterable[dict], judge_version: str) -> list[dict]:
    rows = []
    for trace in traces:
        score = score_faithfulness(
            query=trace["query"],
            response=trace["response"],
            context=trace["retrieved_chunks"],
            judge_version=judge_version,
        )
        rows.append({
            "trace_id": trace["trace_id"],
            "eval_name": "faithfulness",
            "score": score["label"],
            "judge_version": judge_version,
            # Dataset version offline; falls back to "online" when the trace has none.
            "context_version": trace.get("context_version", "online"),
            "ts": trace["ts"],
        })
    return rows


# Offline: deterministic set, every commit.
offline_traces = load_dataset("rag-golden@v17")
offline_rows = evaluate_traces(offline_traces, judge_version="faithfulness@2026-05-12")
write_to_ledger("offline", offline_rows)


# Online: sampled prod, continuous.
def online_sampler(stream: Iterable[dict], sample_rate: float = 0.02) -> Iterator[dict]:
    for trace in stream:
        # Keep each trace independently with probability sample_rate.
        if random.random() < sample_rate:
            yield trace


prod_stream = read_traces_from_warehouse(window="last_24h")
online_rows = evaluate_traces(online_sampler(prod_stream), judge_version="faithfulness@2026-05-12")
write_to_ledger("online", online_rows)
The judge code is one function. The harness around it changes, the trace source changes, the dataset or window identifier changes. Keep it that way. Two separate judges that drift apart are the most common source of "the offline suite said this was a 12-point win but production held flat."
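One way to put the pinned versions to work is to refuse to compare two batches of ledger rows unless they share an eval name and a judge version. The sketch below assumes rows are dicts with the eval_name, judge_version, and score keys used above, and that score is numeric; mean_score_delta and the sample rows are hypothetical.

```python
from statistics import mean


def mean_score_delta(rows_before, rows_after):
    """Return mean(after) - mean(before), or raise if the batches are not comparable."""
    def pinned(rows):
        # Every (eval_name, judge_version) pair present in a batch of rows.
        return {(r["eval_name"], r["judge_version"]) for r in rows}

    before, after = pinned(rows_before), pinned(rows_after)
    # Comparable only if each batch has exactly one pair and the pairs match.
    if len(before) != 1 or before != after:
        raise ValueError(f"not comparable: {sorted(before)} vs {sorted(after)}")
    return mean(r["score"] for r in rows_after) - mean(r["score"] for r in rows_before)


def make_rows(scores, judge_version):
    """Build minimal ledger rows for the example (hypothetical helper)."""
    return [
        {"eval_name": "faithfulness", "judge_version": judge_version, "score": s}
        for s in scores
    ]


week_20 = make_rows([1.0, 1.0, 0.0, 1.0], "faithfulness@2026-05-12")
week_21 = make_rows([1.0, 0.0, 0.0, 1.0], "faithfulness@2026-05-12")
print(mean_score_delta(week_20, week_21))
```

If the judge version changed between the two windows, the function raises instead of returning a number, which forces a re-score under one judge before anyone reads the delta as a quality change.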
Sampling policy is a design decision
Offline does not sample; it runs the whole dataset every time. Online has to sample or the judge cost will exceed the model cost. The default is uniform random at 1% to 5%, which is cheap and unbiased. That default breaks the moment you care about a tail slice: on a low-volume day, a 1% sample of all traffic can contain zero traces from a 0.5% segment that matters 4.
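To put a number on that risk, assume each trace is sampled independently: the chance that the sample contains no trace from a segment is (1 - rate) raised to the number of traces in that segment. The daily volume below is a made-up figure for illustration, and prob_segment_missed is a hypothetical helper.

```python
def prob_segment_missed(total_traces, segment_share, sample_rate):
    """Probability that independent uniform sampling picks zero traces from a segment."""
    segment_traces = round(total_traces * segment_share)
    return (1.0 - sample_rate) ** segment_traces


# Hypothetical day: 10,000 traces, a 0.5% segment (50 traces), 1% uniform sampling.
p_miss = prob_segment_missed(10_000, 0.005, 0.01)
print(f"chance the sample has no trace from the segment: {p_miss:.1%}")
```

Under these assumptions the segment is absent from the sample on roughly six days out of ten; the figure falls quickly as volume grows, which is why the problem bites hardest on small products and rare slices.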
Stratified sampling fixes that. Bucket each trace by the dimension you care about (intent, persona, language, latency bucket) and sample uniformly within each bucket, at a higher rate for the rare buckets. That gives you a usable signal on each stratum at the cost of a slight bias on the overall mean, which you correct by reweighting each trace by the inverse of its bucket's sampling rate. The Langfuse and Phoenix observability platforms expose enough trace metadata to make stratification a few-line filter 5 6.
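A sketch of per-stratum sampling with inverse-probability reweighting, assuming each trace carries the stratum key in its metadata. The intent field, the rates, and both helper names are illustrative; none of this is the Langfuse or Phoenix API.

```python
import random


def stratified_sample(traces, key, rates, default_rate=0.01, seed=None):
    """Yield (trace, weight) pairs, where weight = 1 / the stratum's sampling rate."""
    rng = random.Random(seed)
    for trace in traces:
        rate = rates.get(trace[key], default_rate)
        # Sample independently within the stratum at that stratum's own rate.
        if rng.random() < rate:
            yield trace, 1.0 / rate


def reweighted_mean(scored):
    """Weighted mean over a list of (score, weight) pairs.

    Weighting by inverse sampling rate approximately undoes the
    over-representation of the strata that were sampled more heavily.
    """
    total_weight = sum(weight for _, weight in scored)
    if total_weight == 0:
        raise ValueError("no sampled traces")
    return sum(score * weight for score, weight in scored) / total_weight


# Hypothetical rates: judge every "refund" trace, 1% of everything else.
rates = {"refund": 1.0}
traces = [{"intent": "refund"}] * 3 + [{"intent": "chitchat"}] * 500
sampled = list(stratified_sample(traces, "intent", rates, seed=7))
```

Each stratum's own mean can be read directly from its sampled traces; the weights only matter when the strata are rolled up into one overall number.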
CITATION
Chip Huyen, "Building A Generative AI Platform" (huyenchip.com, July 2024). The canonical reference for treating monitoring, guardrails, and offline evals as three different mechanisms that share the same trace data. The piece argues explicitly for daily inspection of production traces as the work that drives the next eval. huyenchip.com/2024/07/25/genai-platform.html
The wrong way to do this
Three patterns recur and all three are worth flagging.
The first is running the offline suite on production and calling that online eval. The offline set is curated; the production stream is not. A pass on the offline set means the model handles 200 known inputs; it tells you nothing about the new ones you got yesterday. If your "online eval" is just the offline judge on the offline inputs, you have one eval running twice, not two evals.
The second is sampling at 100% in online mode because "judge calls are cheap." They are cheap relative to a frontier-model call until you are doing a million a day. A 2% sample of a million traces is twenty thousand judge calls; at ten cents apiece that is two thousand dollars a day, and judging all million would be a hundred thousand. Sample on purpose. Stratify on purpose. Use a distilled judge for the online path when scale demands it 7.
The third is treating the online judge score as if it had the same statistical properties as the offline score. Online traces are not i.i.d.: the same user often sends correlated queries, and the same conversation contains correlated turns. Naive standard errors on online aggregates understate the variance, sometimes by a factor of three 3. The clustered standard errors page covers the correction.
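As a rough illustration of that correction, the sketch below computes a naive and a cluster-robust standard error for a mean score, clustering on user. It assumes numeric scores and one cluster id per trace; the function name and the toy data are hypothetical, and the G / (G - 1) factor is one common small-sample adjustment among several.

```python
import math
from collections import defaultdict


def naive_and_clustered_se(scores, cluster_ids):
    """Return (naive_se, clustered_se) for the mean of scores.

    The clustered version sums residuals within each cluster before squaring,
    so correlated traces from one user count as one noisy unit, not many.
    """
    n = len(scores)
    if n != len(cluster_ids) or n < 2:
        raise ValueError("need at least two scores and one cluster id per score")
    mean = sum(scores) / n
    residuals = [s - mean for s in scores]

    # Naive: sample variance divided by n, treating every trace as independent.
    naive_var = sum(r * r for r in residuals) / (n - 1) / n

    cluster_totals = defaultdict(float)
    for r, cid in zip(residuals, cluster_ids):
        cluster_totals[cid] += r
    g = len(cluster_totals)
    if g < 2:
        raise ValueError("need at least two clusters")
    clustered_var = (g / (g - 1)) * sum(t * t for t in cluster_totals.values()) / (n * n)
    return math.sqrt(naive_var), math.sqrt(clustered_var)


# Toy data: two users, three traces each, perfectly correlated within a user.
scores = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
users = ["a", "a", "a", "b", "b", "b"]
naive_se, clustered_se = naive_and_clustered_se(scores, users)
```

In this toy case the six traces carry only two users' worth of information, and the clustered error comes out more than twice the naive one.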
The handoffs
A working production eval program runs four loops, and the handoffs between them matter more than any single component.
flowchart LR
A[Live traffic] --> B[Trace store]
B --> C[Online judge<br/>sampled, 1-5%]
C --> D[Quality dashboard]
D --> E[Drift alert]
E --> F[Error analysis<br/>weekly]
F --> G[Dataset refresh]
G --> H[Offline judge<br/>every PR]
H --> I[Release gate]
I --> A
Live traffic writes to the trace store. The online judge samples and scores. The dashboard surfaces drift. Drift triggers an error-analysis pass on a weekly cadence 8. Error analysis refreshes the offline dataset with the new failure modes. The offline judge runs on every PR against the refreshed set. The release gate blocks merges that drop the score. The cycle repeats.
If any one of these handoffs is broken, the others stop being useful. A trace store with no online judge is a search index, not an eval program. An online judge with no error-analysis cadence becomes ambient noise. Error analysis with no dataset refresh feeds back into nothing. The offline suite with no release gate is theatre.
Where to go next
The trace schema page covers what fields make all four loops possible. The sampling strategies page covers the policies that decide which traces the online judge sees. The drift detection page covers the alarms. The CI integration page covers the offline release gate. The cost and latency page argues for putting non-quality dimensions in the same scorecard as quality.