Online vs offline: where each fires

Offline evals run on a frozen dataset for every change. Online evals sample real traffic continuously. They answer different questions and both belong in production.

Offline evals run a frozen dataset through a candidate system and ask "did this change make the metric go up." Online evals sample live traffic through the deployed system and ask "is the metric still where we left it." Teams that conflate the two end up with one half of the picture: a CI green light that masks a production regression, or a production dashboard that drifts without ever catching a bad PR.

The distinction matters because the two answer questions the other cannot. Offline isolates the variable you changed; online tells you whether the change holds on the inputs users actually send. Build both. Wire them to the same metric vocabulary. Let the offline suite block the release and the online suite catch the slow leaks ¹.

The two regimes

Offline runs on a curated set of inputs you control: a regression suite, a golden set, a labeled dev split. The inputs are fixed, the outputs are scored against a known rubric or reference, and the comparison is a candidate model or prompt against a baseline. Latency and throughput do not enter the score unless you put them there. The point is to decide whether to merge a change ².

Online runs on a sample of live traffic. The inputs are whatever real users send, the outputs are whatever the deployed system produces, and the scores feed dashboards and alerts. Cost, latency, and quality all share the same trace. The point is to detect drift between releases, catch slices that the offline suite missed, and gather inputs for the next dataset refresh ¹.

The two regimes should share judge code wherever possible. The same score_faithfulness(trace, context) function should run in CI and on a 2% sample of prod. If the offline and online judges diverge, every gap between an offline pass and an online failure becomes ambiguous: is it the inputs, the judge version, or the model?

Where each fires in a request lifecycle

sequenceDiagram
    participant Dev as Developer
    participant CI as CI pipeline
    participant Gate as Release gate
    participant User as User
    participant App as Application
    participant Guard as Guardrails
    participant Trace as Trace store
    participant Eval as Online judge

    Dev->>CI: open PR
    CI->>CI: offline eval on golden set
    CI->>Gate: pass/fail + delta
    Gate->>Gate: human sign-off
    Gate-->>Dev: merged

    User->>App: request
    App->>Guard: pre-call checks
    Guard->>App: allow / block
    App->>App: model call + tools
    App->>Trace: write trace
    App->>User: response
    Trace->>Eval: sampled async
    Eval->>Trace: judge labels + scores

Guardrails fire on every request before the response leaves the box: PII redaction, schema validation, refusal on policy violations, max-token cuts ¹. They are runtime checks, not evals. Evals are observations on a sample after the fact. Confusing the two leads to teams asking a judge to enforce policy synchronously, which is slow and expensive, or skipping guardrails because "the eval will catch it," which it will not because by then the bad response shipped.
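To make the split concrete, here is a minimal sketch of a request handler in which the guardrail runs synchronously and the eval only sees a sample afterwards. Everything in it is an illustrative assumption rather than a reference implementation: `violates_policy` is a toy check, `trace_queue` is an in-process stand-in for a trace store, and `call_model` is whatever callable produces the response.

```python
import queue
import random

# Hypothetical in-process queue standing in for a real trace store.
trace_queue: "queue.Queue[dict]" = queue.Queue()

def violates_policy(text: str) -> bool:
    """Toy guardrail: a cheap, deterministic check that runs on every request."""
    return "ssn:" in text.lower()

def handle_request(query: str, call_model) -> str:
    # Guardrail: synchronous, on the request path, able to block the response.
    if violates_policy(query):
        return "Request blocked by policy."
    response = call_model(query)
    if violates_policy(response):
        return "Response withheld by policy."
    # Eval: nothing is scored here. The trace is only recorded for later.
    trace_queue.put({"query": query, "response": response})
    return response

def drain_and_sample(sample_rate: float, rng: random.Random) -> list[dict]:
    """Runs off the request path and picks the traces a judge would score."""
    sampled = []
    while not trace_queue.empty():
        trace = trace_queue.get()
        if rng.random() < sample_rate:
            sampled.append(trace)
    return sampled
```

The blocked responses never reach the user, while the judge only ever observes traces that have already shipped, which is why neither mechanism can substitute for the other.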

When to use which

Offline is the only place to compare a candidate prompt against the production prompt without exposing users to the candidate. It is the only place you can re-run the same inputs across model upgrades to get a clean A vs B delta. It is the only place where statistical comparisons are clean enough to report a confidence interval that means what it looks like ³.
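Because the offline set holds the inputs fixed, the candidate and the baseline can be compared example by example. The sketch below shows one way to turn that into a delta with an interval: a paired difference with a normal-approximation 95% interval. The function name and the choice of a z-interval are assumptions for illustration; with small datasets a t-interval or a bootstrap would be the safer choice.

```python
import math
import statistics

def paired_delta_ci(baseline: list[float], candidate: list[float], z: float = 1.96):
    """Mean per-example score difference (candidate - baseline) and its z-interval.

    Both lists must be scores on the same inputs in the same order.
    """
    if len(baseline) != len(candidate) or len(baseline) < 2:
        raise ValueError("need two equal-length score lists with at least 2 examples")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    mean_diff = statistics.fmean(diffs)
    # Standard error of the mean difference, using the sample standard deviation.
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff, (mean_diff - z * se, mean_diff + z * se)

# Toy scores on eight shared inputs (1 = pass, 0 = fail).
baseline_scores = [0, 0, 1, 1, 0, 1, 0, 1]
candidate_scores = [1, 0, 1, 1, 1, 1, 0, 1]
delta, (low, high) = paired_delta_ci(baseline_scores, candidate_scores)
```

In this toy case the mean delta is 0.25 but the interval still spans zero, so eight examples are not enough to call the change a win.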

Online is the only place to detect that your retrieval started returning out-of-date documents, that users started sending a new kind of question, or that the model is suddenly verbose because the upstream provider quietly retrained. It is the only place where the input distribution matches reality.

A useful rule from Huyen: instrument the prod loop daily, run the offline loop on every change ¹. The cadence asymmetry mirrors the question asymmetry. Online drifts on a continuous clock and gets sampled accordingly; offline stays frozen until a change forces a rerun.

The metric ledger

Both regimes write into one ledger. Each row is (trace_id, eval_name, score, version_of_judge, version_of_dataset_or_window, timestamp). The dataset version is a content hash for offline runs and a window identifier (2026-W21) for online runs. Pinning every score to a judge version and a dataset or window lets you answer the only question that matters when a number moves: did the score move because the system got worse, or because we changed what we were measuring.

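# Same judge runs in offline CI and online sampling.
# The only difference is the source of traces and the cadence.
# score_faithfulness, load_dataset, write_to_ledger, and read_traces_from_warehouse
# are project helpers assumed to be defined elsewhere.
import random
from typing import Iterable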

def evaluate_traces(traces: Iterable[dict], judge_version: str) -> list[dict]:
    rows = []
    for trace in traces:
        score = score_faithfulness(
            query=trace["query"],
            response=trace["response"],
            context=trace["retrieved_chunks"],
            judge_version=judge_version,
        )
        rows.append({
            "trace_id": trace["trace_id"],
            "eval_name": "faithfulness",
            "score": score["label"],
            "judge_version": judge_version,
            "context_version": trace.get("context_version", "online"),
            "ts": trace["ts"],
        })
    return rows

# Offline: deterministic set, every commit.
offline_traces = load_dataset("rag-golden@v17")
offline_rows = evaluate_traces(offline_traces, judge_version="faithfulness@2026-05-12")
write_to_ledger("offline", offline_rows)

# Online: sampled prod, continuous.
def online_sampler(stream, sample_rate=0.02):
    for trace in stream:
        if random.random() < sample_rate:
            yield trace

prod_stream = read_traces_from_warehouse(window="last_24h")
online_rows = evaluate_traces(online_sampler(prod_stream), judge_version="faithfulness@2026-05-12")
write_to_ledger("online", online_rows)

The judge code is one function. The harness around it changes, the trace source changes, the dataset or window identifier changes. Keep it that way. Two separate judges that drift apart are the most common source of "the offline suite said this was a 12-point win but production held flat."
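The ledger row and its two kinds of version identifier can be sketched as follows. This is one possible shape, not a prescribed schema: the `LedgerRow` field names, the truncated SHA-256 content hash, and the ISO-week window format are all assumptions chosen to match the description above.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class LedgerRow:
    trace_id: str
    eval_name: str
    score: float
    judge_version: str
    dataset_or_window: str  # content hash (offline) or window id (online)
    ts: str

def dataset_content_hash(examples: list[dict]) -> str:
    """Offline identifier: changes whenever any example's content changes.

    Keys are sorted so dict key order does not matter; the order of the
    examples themselves still does.
    """
    canonical = json.dumps(examples, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return "sha256:" + digest[:12]

def iso_week_window(day: date) -> str:
    """Online identifier in the form 2026-W21, based on the ISO calendar."""
    iso_year, iso_week, _ = day.isocalendar()
    return f"{iso_year}-W{iso_week:02d}"

row = LedgerRow(
    trace_id="t-001",
    eval_name="faithfulness",
    score=1.0,
    judge_version="faithfulness@2026-05-12",
    dataset_or_window=iso_week_window(date(2026, 5, 20)),
    ts="2026-05-20T12:00:00Z",
)
```

With both identifiers stored on every row, a query that groups by judge version and dataset or window can separate a real regression from a change in what was measured.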

Sampling policy is a design decision

Offline does not sample; it runs the whole dataset every time. Online has to sample or the judge cost will exceed the model cost. The default is uniform random at 1% to 5%, which is cheap and unbiased. That default breaks the moment you care about a tail slice: at modest traffic volumes, a 1% sample of all traffic can easily contain zero traces from a 0.5% segment that matters ⁴.
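The size of that risk is simple arithmetic. Assuming each trace is sampled independently and uniformly, the sketch below computes how many traces from a small segment a sample is expected to contain and how likely it is to contain none. The function names and the 10,000-traces-per-day figure are illustrative assumptions.

```python
def expected_segment_traces(total_traces: int, segment_share: float, sample_rate: float) -> float:
    """Expected number of sampled traces that belong to the segment."""
    return total_traces * segment_share * sample_rate

def prob_segment_missed(total_traces: int, segment_share: float, sample_rate: float) -> float:
    """Probability that a uniform sample contains no trace from the segment.

    Each trace independently lands in the sample AND in the segment with
    probability segment_share * sample_rate.
    """
    p_hit = segment_share * sample_rate
    return (1.0 - p_hit) ** total_traces

# 10,000 traces in the window, a 0.5% segment, a 1% uniform sample.
expected = expected_segment_traces(10_000, 0.005, 0.01)
p_missed = prob_segment_missed(10_000, 0.005, 0.01)
```

Under those assumptions the sample is expected to hold half a trace from the segment, and it misses the segment entirely about 61% of the time.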

Stratified sampling fixes that. Bucket the trace by the dimension you care about (intent, persona, language, latency bucket) and sample uniformly within each bucket. That gives you a usable signal on each stratum at the cost of a slight bias on the overall mean, which you correct by reweighting. The Langfuse and Phoenix observability platforms expose enough trace metadata to make stratification a few-line filter ⁵ ⁶.
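A minimal sketch of stratify-then-reweight, assuming the traces for a window fit in memory and each trace is a dict carrying the stratification field. Taking a fixed number of traces per bucket is one reading of "sample uniformly within each bucket"; the function names, the `weight` field, and the fixed-count policy are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(traces: list[dict], key: str, per_stratum: int, rng: random.Random) -> list[dict]:
    """Take up to per_stratum traces from each bucket, uniformly within the bucket."""
    buckets = defaultdict(list)
    for trace in traces:
        buckets[trace[key]].append(trace)
    sample = []
    for items in buckets.values():
        k = min(per_stratum, len(items))
        # Weight is the inverse of the inclusion probability within the bucket.
        weight = len(items) / k
        for trace in rng.sample(items, k):
            sample.append({**trace, "weight": weight})
    return sample

def weighted_mean(rows: list[dict], score_key: str = "score") -> float:
    """Reweighted estimate of the overall mean across all traffic."""
    total_weight = sum(r["weight"] for r in rows)
    return sum(r["weight"] * r[score_key] for r in rows) / total_weight

# Toy window: 990 English traces that pass, 10 French traces that fail.
window = [{"lang": "en", "score": 1.0} for _ in range(990)]
window += [{"lang": "fr", "score": 0.0} for _ in range(10)]
sample = stratified_sample(window, key="lang", per_stratum=5, rng=random.Random(0))
overall = weighted_mean(sample)
```

The unweighted mean of this sample would be 0.5, which overstates the failing segment; the reweighted mean recovers the population value of 0.99 while the per-stratum rows still give five French traces to inspect.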

CITATION

Chip Huyen, "Building A Generative AI Platform" (huyenchip.com, July 2024). The canonical reference for treating monitoring, guardrails, and offline evals as three different mechanisms that share the same trace data. The piece argues explicitly for daily inspection of production traces as the work that drives the next eval. huyenchip.com/2024/07/25/genai-platform.html

The wrong way to do this

Three patterns recur and all three are worth flagging.

The first is running the offline suite on production and calling that online eval. The offline set is curated; the production stream is not. A pass on the offline set means the model handles 200 known inputs; it tells you nothing about the new ones you got yesterday. If your "online eval" is just the offline judge on the offline inputs, you have one eval running twice, not two evals.

The second is sampling at 100% in online mode because "judge calls are cheap." They are cheap relative to a frontier-model call until you are doing a million a day. At ten cents apiece, judging every one of a million daily traces costs a hundred thousand dollars a day; even a 2% sample is twenty thousand judge calls, or two thousand dollars a day. Sample on purpose. Stratify on purpose. Use a distilled judge for the online path when scale demands it ⁷.
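The cost arithmetic is worth keeping as a small function next to the sampling config so the budget is explicit when someone changes the rate. This is a sketch with assumed names; prices are kept in integer cents to avoid floating-point rounding in the totals.

```python
def daily_judge_cost_cents(traces_per_day: int, sample_rate: float, cents_per_call: int) -> int:
    """Daily judge spend in cents for a given uniform sample rate."""
    judged = round(traces_per_day * sample_rate)
    return judged * cents_per_call

# One million traces a day at ten cents per judge call.
cost_at_2_percent = daily_judge_cost_cents(1_000_000, 0.02, 10)   # 200_000 cents = $2,000
cost_at_100_percent = daily_judge_cost_cents(1_000_000, 1.0, 10)  # 10_000_000 cents = $100,000
```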

The third is treating the online judge score as if it had the same statistical properties as the offline score. Online traces are not i.i.d. across users: the same user often sends correlated queries, and the same conversation contains correlated turns. Naive standard errors on online aggregates understate the uncertainty, sometimes by a factor of three ³. The clustered standard errors page covers the correction.
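To show what the correction looks like in practice, here is a minimal sketch that compares the naive standard error of a mean with a cluster-robust one, clustering on a user or conversation id. It uses the basic cluster-robust estimator with no small-sample adjustment, which is an assumption made here for brevity; the page referenced above may use a different variant.

```python
import math
from collections import defaultdict

def naive_se(scores: list[float]) -> float:
    """Standard error of the mean, treating every trace as independent."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return math.sqrt(variance / n)

def clustered_se(scores: list[float], cluster_ids: list[str]) -> float:
    """Cluster-robust standard error of the mean (no small-sample correction).

    Residuals are summed within each cluster before squaring, so correlated
    traces from the same user or conversation stop counting as independent.
    """
    n = len(scores)
    mean = sum(scores) / n
    residual_sums = defaultdict(float)
    for score, cluster in zip(scores, cluster_ids):
        residual_sums[cluster] += score - mean
    return math.sqrt(sum(r * r for r in residual_sums.values())) / n

# Four users, three traces each, perfectly correlated within a user.
scores = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0]
users = ["a", "a", "a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
se_naive = naive_se(scores)
se_clustered = clustered_se(scores, users)
```

In this toy case the twelve traces carry roughly four users' worth of information, and the clustered standard error comes out noticeably larger than the naive one.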

The handoffs

A working production eval program runs four loops, and the handoffs between them matter more than any single component.

flowchart LR
    A[Live traffic] --> B[Trace store]
    B --> C[Online judge<br/>sampled, 1-5%]
    C --> D[Quality dashboard]
    D --> E[Drift alert]
    E --> F[Error analysis<br/>weekly]
    F --> G[Dataset refresh]
    G --> H[Offline judge<br/>every PR]
    H --> I[Release gate]
    I --> A

Live traffic writes to the trace store. The online judge samples and scores. The dashboard surfaces drift. Drift triggers an error-analysis pass on a weekly cadence ⁸. Error analysis refreshes the offline dataset with the new failure modes. The offline judge runs on every PR against the refreshed set. The release gate blocks merges that drop the score. The cycle repeats.

If any one of these handoffs is broken, the others stop being useful. A trace store with no online judge is a search index, not an eval program. An online judge with no error-analysis cadence becomes ambient noise. Error analysis with no dataset refresh feeds back into nothing. The offline suite with no release gate is theatre.

Where to go next

The trace schema page covers what fields make all four loops possible. The sampling strategies page covers the policies that decide which traces the online judge sees. The drift detection page covers the alarms. The CI integration page covers the offline release gate. The cost and latency page argues for putting non-quality dimensions in the same scorecard as quality.

¹ Chip Huyen, "Building A Generative AI Platform," July 2024.
² Hamel Husain, "Your AI Product Needs Evals."
³ Miller, "Adding Error Bars to Evals" (Anthropic), arXiv:2411.00640.
⁴ Yan, Bischof, Howard, Shankar, Husain, "What We Learned from a Year of Building with LLMs."
⁵ Langfuse documentation, online evaluation and sampling.
⁶ Arize Phoenix tutorials, production observability.
⁷ Shreya Shankar, "The AI Engineering Flywheel."
⁸ Hamel Husain, "A Field Guide to Rapidly Improving AI Products."