Eval vs monitoring vs guardrails

Three concepts that share traces but do different jobs, with the architectural placement of each in a request lifecycle.

Teams often collapse three different jobs into one word ("evals") and then find that the dashboard they built does not do what they need. The three jobs are evaluation, monitoring, and guardrails. They share request traces and overlap in tooling, but they fire at different points in the lifecycle and answer different questions. Confusing them is a common reason an eval program looks busy and produces no decisions.

The three concepts

Evaluation is offline measurement on a fixed labeled dataset. It answers "did this change make the system better?" Monitoring is online observation of production traces over time. It answers "is the live system drifting?" Guardrails are per-request inline checks that block or transform individual responses. They answer "should this specific output go out?" All three are necessary; teams that build one and skip the others ship blind spots in different places ^[1].

Where each fires in the lifecycle

sequenceDiagram
    participant U as User
    participant G as Guardrail (inline)
    participant L as LLM system
    participant T as Trace store
    participant M as Monitor (online)
    participant E as Eval (offline, batch)

    U->>L: request
    L->>G: candidate output
    G-->>L: pass / block / transform
    L->>U: response
    L->>T: write trace
    T->>M: sampled stream
    M-->>M: score, alert on drift
    T->>E: nightly batch on labeled set
    E-->>E: compare against baseline

Guardrails are the only layer that runs synchronously with the request, so they are the only one of the three that can stop a response from reaching the user. Monitoring runs after the response is sent and cannot retract a bad answer. Evaluation runs detached from the request; its result gates merges and releases, not individual responses.

What each is for

Evaluation is for decision-making: did the new prompt beat the old one on the labeled set, should we ship it, and what regressed? The dataset is fixed so that any score difference is attributable to the change and not to a shift in inputs. The output is a scorecard delta with confidence intervals. Without an offline eval gating merges, the team is shipping on vibes ^[2].
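One way to produce a scorecard delta with confidence intervals is a paired bootstrap over per-example scores. The sketch below is illustrative, not a prescribed method: the function name `paired_bootstrap_delta`, the per-example float scores, the 2,000 resamples, and the 95% percentile interval are all assumptions.

```python
import random
from statistics import mean


def paired_bootstrap_delta(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean_delta, lower, upper) for candidate minus baseline.

    Both lists hold per-example scores on the same fixed labeled set,
    in the same order, so each delta compares like with like.
    """
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("score lists must be non-empty and aligned")

    deltas = [c - b for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)  # fixed seed keeps the scorecard reproducible
    n = len(deltas)

    # Resample the per-example deltas with replacement and record each mean.
    resampled = sorted(mean(rng.choices(deltas, k=n)) for _ in range(n_resamples))

    # Percentile interval, with indices clamped to the valid range.
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return mean(deltas), resampled[lo_idx], resampled[hi_idx]
```

A merge gate built on this might require the lower bound to be above zero before a change ships. That threshold is a policy choice for the team, not something the method dictates.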

Monitoring is for situational awareness: is the production distribution shifting, are failure modes that the offline set does not contain showing up live, is the cost or latency budget being burned? It cannot tell you whether to ship the next change, because production traffic has no labeled baseline to compare against. It can tell you when the assumption underlying the last decision has broken.
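As a rough illustration of "alert on drift", the sketch below keeps a rolling window of sampled online scores and compares the window mean against a reference value. The class name `DriftMonitor`, the window size, and the fixed tolerance are hypothetical; the reference value is assumed to be whatever the same online score read at the time of the last release decision.

```python
from collections import deque


class DriftMonitor:
    """Rolling-window check on a stream of sampled production scores."""

    def __init__(self, reference_mean, tolerance, window=200):
        self.reference_mean = reference_mean  # assumed: score level at last release
        self.tolerance = tolerance            # assumed: allowed absolute deviation
        self.scores = deque(maxlen=window)    # oldest scores fall out automatically

    def observe(self, score):
        """Record one score; return True once a full window has drifted."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data yet to judge
        window_mean = sum(self.scores) / len(self.scores)
        return abs(window_mean - self.reference_mean) > self.tolerance
```

A fixed tolerance is the simplest possible rule. A real monitor would likely track several signals (quality score, latency, cost) and use a statistical test instead of a hand-set threshold.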

Guardrails are for catastrophe prevention: PII redaction, policy refusal on certain intents, output-schema validation, and prompt-injection detection at the boundary. The bar is "this specific response is unsafe, malformed, or non-compliant; do not return it." Guardrails must be fast and conservative because they sit on the critical path of every request.
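A minimal sketch of an inline guardrail with the three outcomes from the lifecycle diagram (pass, block, transform) might look like the following. Everything specific here is an assumption made for illustration: the `Verdict` type, the `guard` function, the JSON output contract with a string `answer` field, and the deliberately simple email pattern standing in for PII detection.

```python
import json
import re
from dataclasses import dataclass
from typing import Optional

# Simplified email pattern; real PII detection needs far broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


@dataclass
class Verdict:
    action: str            # "pass", "block", or "transform"
    output: Optional[str]  # text to return to the user; None when blocked
    reason: str = ""


def guard(candidate: str) -> Verdict:
    """Check one candidate output synchronously, before it is returned."""
    # Schema validation: output must be a JSON object with a string "answer".
    try:
        payload = json.loads(candidate)
    except json.JSONDecodeError:
        return Verdict("block", None, "not valid JSON")
    if not isinstance(payload, dict) or not isinstance(payload.get("answer"), str):
        return Verdict("block", None, "missing string field 'answer'")

    # PII redaction: rewrite the answer instead of rejecting the response.
    redacted = EMAIL.sub("[REDACTED_EMAIL]", payload["answer"])
    if redacted != payload["answer"]:
        payload["answer"] = redacted
        return Verdict("transform", json.dumps(payload), "email redacted")

    return Verdict("pass", candidate)
```

The checks are cheap string and structure operations on purpose: anything slower adds latency to every request.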

All three feed from the same trace schema: input, context, tool calls, intermediate states, final output, and metadata (model version, prompt version, latency, token count, cost). The same trace can be sampled into monitoring scores, pulled into the offline labeled set, and inspected after a guardrail-block event. Traces are the substrate of the entire eval program ^[3].

Design the trace store first. Bolting monitoring onto an eval pipeline that does not log traces consistently is harder than designing trace ingestion at the start.
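The trace fields listed above could be written down as a small typed record before any store is chosen. This is a sketch under stated assumptions: the class and field names, the units (`latency_ms`, `cost_usd`), and the `trace_id` and `guardrail_action` fields are illustrative additions, not a schema the text prescribes.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class TraceMetadata:
    model_version: str
    prompt_version: str
    latency_ms: float   # assumed unit: milliseconds
    token_count: int
    cost_usd: float     # assumed unit: US dollars


@dataclass
class Trace:
    trace_id: str       # assumed: unique key for joining across systems
    input: str
    final_output: str
    metadata: TraceMetadata
    context: list[str] = field(default_factory=list)
    tool_calls: list[dict[str, Any]] = field(default_factory=list)
    intermediate_states: list[dict[str, Any]] = field(default_factory=list)
    guardrail_action: str = "pass"  # assumed: lets block events be found later


def to_json(trace: Trace) -> str:
    """Serialize a trace to one JSON line for whatever store is in use."""
    return json.dumps(asdict(trace), sort_keys=True)
```

Writing the same record from every code path is what lets one trace be sampled into monitoring, pulled into the labeled set, and inspected after a guardrail block.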

The common failure modes

The team that builds monitoring first and skips evaluation has graphs that trend down and no way to attribute the movement. Which prompt change caused the regression? The team cannot tell, because there is no baseline labeled set. The fix is an offline eval against a fixed dataset before the next prompt change ships.

The team that builds guardrails first and skips evaluation ships a system where every response passes the schema check, contains the policy disclaimer, and avoids the banned-word list. None of that tells the team whether the answer is right for the user. The fix is the same: a labeled set and an offline eval that scores semantic quality.

The team that builds offline evaluation and skips monitoring ships changes that pass on the dataset and break in production because the production distribution drifted away from the dataset. The fix is a sampled online judge on production traces that flags when offline and online scores disagree ^[1].
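The sampled-judge idea can be sketched in two parts: a deterministic sampler that picks a stable fraction of traces, and a comparison between the offline pass rate and the pass rate the online judge reports on that sample. The function names, the hash-based sampling scheme, and the 0.05 gap threshold are assumptions; the judge itself is out of scope here and is represented only by its boolean verdicts.

```python
import hashlib


def in_sample(trace_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of traces by hashing the ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1]
    return bucket < rate


def disagreement(offline_pass_rate, online_verdicts, max_gap=0.05):
    """Compare the offline pass rate with judge verdicts on sampled traces.

    `online_verdicts` is a list of booleans from the online judge.
    Returns None when there is nothing to compare yet.
    """
    if not online_verdicts:
        return None
    online_rate = sum(online_verdicts) / len(online_verdicts)
    gap = offline_pass_rate - online_rate
    return {
        "online_pass_rate": online_rate,
        "gap": gap,
        "flag": abs(gap) > max_gap,  # large gap: the dataset may be stale
    }
```

Hashing the trace ID, instead of drawing a random number per request, means the same trace is always in or out of the sample, which makes flagged cases easy to find again.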

Where to start

A startup that has none of the three should build evaluation first, then guardrails, then monitoring. Evaluation gives you a baseline; without one, neither monitoring graphs nor guardrail-block counts are interpretable. Guardrails come second because they are the only layer that prevents user-facing harm. Monitoring comes third because it catches the distribution shifts the other two missed, and that signal is only useful once the team can act on it with a curated dataset and a working guardrail.