Guardrails fire on every request and decide whether the response goes out. Evals fire on a sampled population and decide whether the system is meeting its quality bar. Huyen articulates the split: guardrails catch failures on the runtime path, while evals measure quality on a schedule [1]. Both are necessary. Neither substitutes for the other.
What each is for
A guardrail is a runtime check with a binary consequence (pass or block). It runs synchronously on every request or response, on the critical path, so its latency counts against the user-facing budget. Examples: PII redactor on input, toxicity classifier on output, JSON-schema validator, refund-policy regex.
An eval is a measurement of system quality on a sample. It runs asynchronously (offline on a benchmark, or online on sampled production traces) and does not block any individual response. It produces a score that informs human decisions (release gate, drift alert, calibration update) [2]. Examples: a calibrated LLM judge scoring 5% of production traces for faithfulness, a nightly CI run on a 200-example regression set, an A/B test with bootstrap intervals.
The shared substrate is the trace stream. Both read traces, both write to the trace store, both produce labels.
| Property | Guardrail | Eval |
|---|---|---|
| When | every request | sampled, scheduled |
| Latency | tens to hundreds of ms | seconds to minutes |
| Consequence | block or pass | score, no block |
| Owner | platform or safety team | eval team |
| Failure mode | false positive blocks real users | sampling bias, miscalibration |
| Aggregate | per-request precision and recall | per-population pass rate with CI |
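The two aggregates in the last row are computed differently, which a short sketch may make concrete. The function names, the label format (booleans for "blocked" and "block was warranted"), and the percentile bootstrap are assumptions chosen for illustration, not a prescribed implementation.

```python
import random

def precision_recall(decisions, labels):
    """Guardrail aggregate over hand-labeled requests.

    decisions[i] is True when the guardrail blocked request i;
    labels[i] is True when a human says a block was warranted.
    """
    pairs = list(zip(decisions, labels))
    tp = sum(1 for d, y in pairs if d and y)
    fp = sum(1 for d, y in pairs if d and not y)
    fn = sum(1 for d, y in pairs if not d and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def bootstrap_pass_rate(passes, n_resamples=2000, alpha=0.05, seed=0):
    """Eval aggregate: pass rate on a sample with a percentile bootstrap interval.

    passes is a list of 0/1 outcomes from the judge on sampled traces.
    """
    rng = random.Random(seed)
    n = len(passes)
    # Resample with replacement and record the pass rate of each resample.
    rates = sorted(sum(rng.choices(passes, k=n)) / n for _ in range(n_resamples))
    low = rates[int((alpha / 2) * n_resamples)]
    high = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(passes) / n, low, high
```

The guardrail numbers need ground-truth labels per request; the eval number needs only judged outcomes plus an interval that reflects the sample size.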
Where each fires
A typical request path:
1. Input guardrail. PII redaction, jailbreak filter, schema validation on input.
2. Retrieval (RAG only).
3. Generation.
4. Output guardrail. Toxicity, factuality classifier, schema validation on output.
5. Response returned.
6. Trace write to store.
7. Async eval sampler picks a fraction.
8. Async judge scores the trace.
9. Aggregate metrics roll up.
Steps 1 through 6 are the request path, with the guardrails at steps 1 and 4. Steps 7, 8, and 9 are the eval path. These are different processes with different SLAs.
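The request path above can be sketched in a few lines to show where each step sits. Everything here is a hypothetical stand-in: `generate` fakes the model call, the email regex is a toy PII check, the length cap is an arbitrary output rule, and the in-memory lists replace a real trace store and eval queue. Retrieval is omitted.

```python
import random
import re
from dataclasses import dataclass, field

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # toy PII pattern (assumption)
FALLBACK = "Sorry, I can't help with that."

@dataclass
class Trace:
    request: str
    response: str
    guardrail_events: list = field(default_factory=list)

TRACE_STORE = []  # stand-in for the trace store
EVAL_QUEUE = []   # stand-in for the async eval path

def input_guardrail(text):
    """Step 1: redact emails; return the cleaned text and any events."""
    redacted = EMAIL.sub("[EMAIL]", text)
    return redacted, (["pii_redacted"] if redacted != text else [])

def generate(prompt):
    """Step 3: placeholder for the model call."""
    return f"echo: {prompt}"

def output_guardrail(text, max_chars=2000):
    """Step 4: binary pass/block on the draft response."""
    if len(text) > max_chars:
        return False, ["length_block"]
    return True, []

def handle_request(text, sample_rate=0.05, rng=random):
    clean, events = input_guardrail(text)
    draft = generate(clean)
    ok, out_events = output_guardrail(draft)
    response = draft if ok else FALLBACK
    # Step 6: the trace is written before returning here only for simplicity.
    trace = Trace(clean, response, events + out_events)
    TRACE_STORE.append(trace)
    # Step 7: sampling only enqueues work; judging happens off the request path.
    if rng.random() < sample_rate:
        EVAL_QUEUE.append(trace)
    return response  # Step 5
```

Nothing in `handle_request` waits on a judge: the only eval-related cost on the request path is one random draw and a queue append.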
The three failure modes of confusing the two
Eval-as-guardrail. The team puts the LLM judge on the request path and blocks responses below a threshold. User-facing latency doubles, the judge's TPR/TNR become production constraints rather than calibration targets, and any judge drift becomes a user-visible reliability issue. Judge cost scales with traffic, not sampling.
Guardrail-as-eval. The team treats the guardrail pass-through rate as the quality metric. Guardrails only catch what they were designed to catch (PII, toxicity, schema violations); they say nothing about helpfulness, correctness, or brand voice. A 99.9% guardrail pass rate is consistent with a 60% quality pass rate.
Shared rubric, shared owner. The team writes one rubric used for both. The guardrail rubric needs to be cheap and deterministic, with a very low false-positive rate (the FP rate translates directly to blocked users). The eval rubric needs to distinguish "merely acceptable" from "good." Trying to satisfy both constraints produces a rubric that serves neither. Shankar et al. document the same collapse pattern at the assertion level [3].
Architectural fix in all three: separation. Different rubrics, different latency budgets, different owners.
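The cost side of the eval-as-guardrail failure mode is plain arithmetic. The sketch below uses a hypothetical traffic figure and the 5% sampling rate from the earlier example; only the ratio matters.

```python
def daily_judge_calls(requests_per_day: int, judged_fraction: float) -> int:
    """Number of judge invocations per day for a given judged fraction."""
    return round(requests_per_day * judged_fraction)

REQUESTS_PER_DAY = 1_000_000  # hypothetical traffic level

# Judge on the request path: every request is judged.
on_request_path = daily_judge_calls(REQUESTS_PER_DAY, 1.0)

# Judge on the eval path: only the sampled fraction is judged.
sampled_eval = daily_judge_calls(REQUESTS_PER_DAY, 0.05)
```

At a 5% sampling rate the on-path judge makes 20 times as many calls, and that multiplier grows with traffic rather than staying fixed by the sampling policy.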
When each fires (the decision table)
| Failure category | Guardrail | Eval | Why |
|---|---|---|---|
| PII leak in output | yes | also yes (trend) | catastrophic single failure justifies block |
| Toxic language | yes | also yes (trend) | same |
| Schema violation | yes | maybe | block prevents downstream errors |
| Off-brand tone | no | yes | subjective, judge-quality, no block warranted |
| Policy misquoting | yes | also yes | high-cost single failure |
| Faithfulness in RAG | maybe | yes | judge expensive, block hurts UX |
| Conciseness | no | yes | preference signal, never blocks |
| Sycophancy | no | yes | subtle, judge-only |
Five high-stakes single-failure categories (PII, toxicity, schema, policy, hard safety) belong in the guardrail layer; everything quality-related belongs in the eval layer. Overlap rows are where teams confuse themselves; resolve by latency and consequence.
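One way to read "resolve by latency and consequence" is as a two-question rule. The function below is an interpretation, not a rule stated in the text: the name `placement`, the boolean input, and the 500 ms default budget are all assumptions made for illustration.

```python
def placement(single_failure_is_catastrophic: bool,
              check_latency_ms: float,
              guardrail_budget_ms: float = 500.0) -> str:
    """Suggest a layer for a failure category (illustrative heuristic only)."""
    if not single_failure_is_catastrophic:
        # Quality and preference signals never justify blocking a response.
        return "eval only"
    if check_latency_ms <= guardrail_budget_ms:
        # High-cost single failure and a check cheap enough to run inline.
        return "guardrail, plus eval trend"
    # High-cost failure, but the only available check is too slow to run inline.
    return "eval, plus a cheaper guardrail proxy if one exists"
```

Under this reading, a PII leak with a fast detector lands in the guardrail layer, conciseness lands in the eval layer, and a category whose only reliable check is a slow judge stays in the eval layer until a cheaper proxy exists.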
Sharing data, not logic
The guardrail layer and the eval layer share the trace stream; they do not share logic. The eval layer reads guardrail decisions as features (a guardrail-triggered trace is an interesting sample for open coding). The guardrail layer reads aggregated eval results as inputs to threshold tuning (judge identifies a faithfulness regression; team adds a faithfulness-specific guardrail with a higher threshold). Two pipelines, one trace store.
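A minimal sketch of a trace record that both pipelines can write to without sharing logic. The field names (`source`, `blocked`, `score`) and the class names are hypothetical; the point is only that each label records which layer produced it.

```python
from dataclasses import dataclass, field
from typing import List, Literal, Optional

@dataclass
class Label:
    source: Literal["guardrail", "eval"]  # provenance: which layer wrote this
    name: str                             # e.g. "toxicity" or "faithfulness"
    blocked: Optional[bool] = None        # set by guardrails only
    score: Optional[float] = None         # set by evals only

@dataclass
class TraceRecord:
    trace_id: str
    labels: List[Label] = field(default_factory=list)

    def guardrail_triggered(self) -> bool:
        """True if any guardrail blocked; the eval layer can use this as a feature."""
        return any(l.source == "guardrail" and l.blocked for l in self.labels)

    def eval_scores(self) -> dict:
        """Judge scores by dimension; input to aggregation and threshold tuning."""
        return {l.name: l.score for l in self.labels if l.source == "eval"}
```

Because the two kinds of label never share a field, an aggregate over `eval_scores()` cannot silently mix in guardrail decisions, and vice versa.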
The guardrail layer in practice
Four classes of checks. Input sanitizers (PII redaction, prompt-injection detection, schema validation) run at 10-50 ms. Input classifiers (small distilled models flagging restricted topics) run at under 100 ms; the OWASP Top 10 for LLM Applications is the reference categorization [4]. Output classifiers (toxicity, factuality, schema) run in parallel with the streamed output. Hard-coded rules (regex for forbidden promises, blocklist matches, response-length caps) run at sub-millisecond latency. Combined p95 budget: 200-500 ms.
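The hard-coded rules class is the simplest of the four to illustrate. In the sketch below the regex, the blocklist term, and the length cap are all placeholder values, not recommended rules.

```python
import re

# Placeholder rules; a real deployment would own and version these lists.
FORBIDDEN_PROMISES = re.compile(r"\b(guaranteed refund|we promise)\b", re.IGNORECASE)
BLOCKLIST = {"internal-only"}
MAX_CHARS = 4000

def hard_rules(response: str) -> list:
    """Return the names of every rule the response violates (empty list = pass)."""
    reasons = []
    if FORBIDDEN_PROMISES.search(response):
        reasons.append("forbidden_promise")
    lowered = response.lower()
    if any(term in lowered for term in BLOCKLIST):
        reasons.append("blocklist")
    if len(response) > MAX_CHARS:
        reasons.append("too_long")
    return reasons
```

Returning rule names rather than a bare boolean keeps the block decision binary while still leaving a reason in the trace for later review.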
The eval layer in practice
Three loops at three frequencies. Synchronous offline (per release): CI runs the regression set and the release-gate eval [5]. Sampled online (continuous): async judging of 1-5% of traffic with daily aggregation; the online evaluation page covers mechanics. Weekly human cadence: an open-coding loop on 10-20 fresh traces, failure-biased toward guardrail-triggered or low-judge-score traces; the error analysis page covers the discipline. The three loops feed each other; Shankar's flywheel is the formalization [6].
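The failure-biased selection for the weekly loop can be sketched as a two-stage draw: flagged traces first, then random fill. The dictionary keys (`guardrail_triggered`, `judge_score`), the 0.5 score threshold, and the fill strategy are assumptions for illustration.

```python
import random

def weekly_review_sample(traces, k=20, judge_threshold=0.5, seed=0):
    """Pick up to k traces for open coding, preferring likely failures.

    Each trace is a dict with 'guardrail_triggered' (bool) and
    'judge_score' (float, or None if the trace was never judged).
    """
    rng = random.Random(seed)
    flagged, rest = [], []
    for t in traces:
        score = t.get("judge_score")
        low_score = score is not None and score < judge_threshold
        (flagged if t["guardrail_triggered"] or low_score else rest).append(t)
    # Take flagged traces first, then top up with unflagged ones at random.
    picked = rng.sample(flagged, min(k, len(flagged)))
    if len(picked) < k:
        picked += rng.sample(rest, min(k - len(picked), len(rest)))
    return picked
```

Topping up with unflagged traces keeps some exposure to failures that neither the guardrails nor the judge currently detect.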
What good looks like
A guardrail layer with clear ownership, measured per-request precision and recall, p95 latency under 500 ms. An eval layer with separate ownership, measured per-population pass rate with intervals, asynchronous off the request path. A shared trace store with provenance fields that distinguish guardrail decisions from eval scores. Weekly human review on failure-biased samples from guardrail triggers and low-judge-score traces. A decision policy that adds a guardrail rule when the eval program surfaces a regression and adds an eval dimension when a guardrail rule starts producing many borderline cases.