
Guardrails versus evals architecture

Two concepts that share data but differ in latency, intent, and consequence: where each fires in the request path, how they share traces, and the mistake of building one while skipping the other.

Guardrails fire on every request and decide whether the response goes out. Evals fire on a sampled population and decide whether the system is meeting its quality bar. Huyen articulates the split: guardrails catch failures on the runtime path; evals measure quality on a schedule ¹. Both are necessary. Neither substitutes for the other.

Foundations carries the chapter-length treatment; the production plumbing lives in the production section.

What each is for

A guardrail is a runtime check with a binary consequence (pass or block). Synchronous on every request or response. On the critical path. Latency is part of the user-facing budget. Examples: PII redactor on input, toxicity classifier on output, JSON-schema validator, refund-policy regex.
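A minimal sketch of the binary shape of a guardrail, combining two of the examples above (a schema check and a refund-policy regex). The regex, the `answer` field, and the names `check_output` and `GuardrailDecision` are illustrative assumptions, not a reference implementation.

```python
import json
import re
from dataclasses import dataclass

# Hypothetical policy rule: never promise a refund outright.
FORBIDDEN_PROMISE = re.compile(r"\b(guaranteed?|full)\s+refund\b", re.IGNORECASE)


@dataclass(frozen=True)
class GuardrailDecision:
    passed: bool
    rule: str  # which check decided; "none" when every check passed


def check_output(raw: str) -> GuardrailDecision:
    """Return a pass-or-block decision. No scores, no partial credit."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return GuardrailDecision(False, "schema:not_json")
    # Assumed schema: an object with a string field named "answer".
    if not isinstance(payload, dict) or not isinstance(payload.get("answer"), str):
        return GuardrailDecision(False, "schema:missing_answer")
    if FORBIDDEN_PROMISE.search(payload["answer"]):
        return GuardrailDecision(False, "policy:refund_promise")
    return GuardrailDecision(True, "none")
```

Every check here is deterministic and cheap, which is what lets it sit on the critical path. Recording the deciding rule keeps the block explainable when it shows up in a trace.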

An eval is a measurement of system quality on a sample. Asynchronous (offline on a benchmark, or online on sampled production traces). Does not block any individual response. Produces a score that informs human decisions (release-gate, drift alert, calibration update) ². Examples: a calibrated LLM judge scoring 5% of production traces for faithfulness, a nightly CI run on a 200-example regression set, an A/B with bootstrap intervals.
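A sketch of what "a score that informs human decisions" can look like: a pass rate over sampled traces with a percentile-bootstrap interval. The function name, the resample count, and the 200-trace sample are assumptions chosen for illustration.

```python
import random


def bootstrap_pass_rate(labels, n_resamples=2000, alpha=0.05, seed=0):
    """Point estimate and percentile-bootstrap interval for a pass rate.

    labels: list of 1 (pass) / 0 (fail) judgments on sampled traces.
    """
    rng = random.Random(seed)  # seeded so the report is reproducible
    n = len(labels)
    point = sum(labels) / n
    # Resample the traces with replacement and recompute the rate each time.
    rates = sorted(sum(rng.choices(labels, k=n)) / n for _ in range(n_resamples))
    lo = rates[int((alpha / 2) * n_resamples)]
    hi = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return point, lo, hi


# Hypothetical sample: 200 judged traces, 170 of which passed.
point, lo, hi = bootstrap_pass_rate([1] * 170 + [0] * 30)
```

The interval belongs to the population of traces, not to any single response, which is why nothing here can block a request.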

The shared substrate is the trace stream. Both read traces, both write to the trace store, both produce labels.

Property	Guardrail	Eval
When	every request	sampled, scheduled
Latency	tens to hundreds of ms	seconds to minutes
Consequence	block or pass	score, no block
Owner	platform or safety team	eval team
Failure mode	false positive blocks real users	sampling bias, miscalibration
Aggregate	per-request precision and recall	per-population pass rate with CI
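The guardrail aggregate in the last row can be computed from human-labeled decisions. A small sketch, assuming each decision is a `(blocked, truly_bad)` pair; the counts are made up.

```python
def precision_recall(decisions):
    """decisions: iterable of (blocked, truly_bad) boolean pairs."""
    decisions = list(decisions)
    tp = sum(1 for blocked, bad in decisions if blocked and bad)
    fp = sum(1 for blocked, bad in decisions if blocked and not bad)
    fn = sum(1 for blocked, bad in decisions if not blocked and bad)
    # None when the denominator is empty, rather than a misleading 0.0.
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall


# Hypothetical labeled sample of 100 requests.
labeled = (
    [(True, True)] * 8       # correctly blocked
    + [(True, False)] * 2    # false positives: real users blocked
    + [(False, True)] * 2    # false negatives: bad responses let through
    + [(False, False)] * 88  # correctly passed
)
precision, recall = precision_recall(labeled)
```

The false-positive count is the one that maps to the failure mode in the table: each of those two requests is a real user who was blocked.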

Where each fires

A typical request path:

1. Input guardrail. PII redaction, jailbreak filter, schema validation on input.
2. Retrieval (RAG only).
3. Generation.
4. Output guardrail. Toxicity, factuality classifier, schema validation on output.
5. Response returned.
6. Trace write to store.
7. Async eval sampler picks a fraction.
8. Async judge scores the trace.
9. Aggregate metrics roll up.

Steps 1 through 6 are the request path, with the guardrails at steps 1 and 4. Steps 7, 8, and 9 are the eval path. The two run as different processes with different SLAs.
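The two paths can be sketched as two functions that share only a trace store. Everything here is a stand-in: the toy guardrails, the `generate` stub, the stub judge, the in-memory list, and the 5% rate are assumptions for illustration.

```python
import hashlib

SAMPLE_RATE = 0.05  # hypothetical: judge 5% of traces
FALLBACK = "Sorry, I can't help with that."
trace_store = []  # stand-in for a real trace store


def input_guardrail(text: str) -> bool:
    return "ssn" not in text.lower()  # toy PII check


def generate(text: str) -> str:
    return f"echo: {text}"  # stand-in for retrieval plus generation


def output_guardrail(text: str) -> bool:
    return len(text) <= 500  # toy response-length cap


def handle_request(trace_id: str, user_input: str) -> str:
    """Request path: synchronous, on the critical path."""
    decision, output = "pass", None
    if not input_guardrail(user_input):
        decision = "blocked:input"
    else:
        candidate = generate(user_input)
        if output_guardrail(candidate):
            output = candidate
        else:
            decision = "blocked:output"
    # A real service would typically write the trace without delaying the response.
    trace_store.append({"id": trace_id, "input": user_input,
                        "output": output, "guardrail": decision})
    return output if output is not None else FALLBACK


def is_sampled(trace_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministic sampling: hash the id to a 64-bit bucket, compare to the rate."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") < rate * 2**64


def judge(trace: dict) -> float:
    return 1.0 if trace["output"] else 0.0  # stand-in for an LLM judge


def run_eval_batch(rate: float = SAMPLE_RATE) -> dict:
    """Eval path: meant to run in a separate process on its own schedule."""
    sampled = [t for t in trace_store if is_sampled(t["id"], rate)]
    scores = [judge(t) for t in sampled]
    mean = sum(scores) / len(scores) if scores else None
    return {"n_sampled": len(sampled), "mean_score": mean}
```

Hash-based sampling is one design choice among several. It makes the sample reproducible from the trace id alone, so the sampler needs no state and no coordination with the request handler.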

The three failure modes of confusing the two

Eval-as-guardrail. The team puts the LLM judge on the request path and blocks responses below a threshold. User-facing latency can double, the judge's true-positive and true-negative rates (TPR/TNR) become production constraints rather than calibration targets, and any judge drift becomes a user-visible reliability issue. Judge cost now scales with total traffic, not with the sampling rate.

Guardrail-as-eval. The team treats the guardrail pass-through rate as the quality metric. Guardrails only catch what they were designed to catch (PII, toxicity, schema violations); they say nothing about helpfulness, correctness, or brand voice. A 99.9% guardrail pass rate is consistent with a 60% quality pass rate.

Shared rubric, shared owner. The team writes one rubric used for both. The guardrail rubric needs to be cheap, deterministic, and tuned for a very low false-positive rate (every false positive is a blocked user). The eval rubric needs to distinguish "merely acceptable" from "good." Trying to satisfy both ruins the rubric for both. Shankar et al. document the same collapse pattern at the assertion level ³.

Architectural fix in all three: separation. Different rubrics, different latency budgets, different owners.
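Two of these failure modes reduce to simple arithmetic. The sketch below uses made-up numbers: the traffic volume, the per-call judge price, and the synthetic trace labels are all assumptions.

```python
def daily_judge_cost(requests_per_day, cost_per_call, judged_fraction):
    """Judge spend per day; judged_fraction is 1.0 when the judge is inline."""
    return requests_per_day * judged_fraction * cost_per_call


def pass_rates(traces):
    """Guardrail pass-through rate and quality pass rate over the same traces."""
    n = len(traces)
    guardrail = sum(t["guardrail_passed"] for t in traces) / n
    quality = sum(t["quality_ok"] for t in traces) / n
    return guardrail, quality


# Eval-as-guardrail: hypothetical 1M requests/day at $0.002 per judge call.
inline_cost = daily_judge_cost(1_000_000, 0.002, 1.0)    # judge on every request
sampled_cost = daily_judge_cost(1_000_000, 0.002, 0.05)  # judge on a 5% sample

# Guardrail-as-eval: 1000 synthetic traces, 1 blocked, 400 failing on quality.
traces = [{"guardrail_passed": i != 0, "quality_ok": i % 5 < 3} for i in range(1000)]
guardrail_rate, quality_rate = pass_rates(traces)
```

Under these assumptions the inline judge costs twenty times the sampled one, and the same traces report a 99.9% guardrail pass rate next to a 60% quality pass rate.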

When each fires (the decision table)

Failure category	Guardrail	Eval	Why
PII leak in output	yes	also yes (trend)	catastrophic single failure justifies block
Toxic language	yes	also yes (trend)	same
Schema violation	yes	maybe	block prevents downstream errors
Off-brand tone	no	yes	subjective, judge-quality, no block warranted
Policy misquoting	yes	also yes	high-cost single failure
Faithfulness in RAG	maybe	yes	judge expensive, block hurts UX
Conciseness	no	yes	preference signal, never blocks
Sycophancy	no	yes	subtle, judge-only

Five high-stakes single-failure categories (PII, toxicity, schema, policy, hard safety) belong in the guardrail layer; everything quality-related belongs in the eval layer. Overlap rows are where teams confuse themselves; resolve them by asking two questions: does the check fit the request-path latency budget, and does a single failure justify a block?
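One way to encode "resolve by latency and consequence" is a two-condition rule. This is a simplification under stated assumptions: the 100 ms budget and the per-category latencies are invented, and the table's "maybe" cells are judgment calls that a rule like this does not capture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureCategory:
    name: str
    high_cost_single_failure: bool  # would one bad response justify a block?
    check_latency_ms: float         # assumed cost of the cheapest adequate check


def layers_for(cat: FailureCategory, guardrail_budget_ms: float = 100.0) -> list:
    """Every category gets trend measurement; only some also get a block."""
    layers = ["eval"]
    if cat.high_cost_single_failure and cat.check_latency_ms <= guardrail_budget_ms:
        layers.insert(0, "guardrail")
    return layers


# Hypothetical latencies for three rows of the table.
pii = FailureCategory("PII leak in output", True, 20.0)
tone = FailureCategory("Off-brand tone", False, 2000.0)
faithfulness = FailureCategory("Faithfulness in RAG", True, 3000.0)
```

With these inputs, `layers_for(pii)` returns both layers while `layers_for(tone)` and `layers_for(faithfulness)` return only the eval layer: tone fails the consequence test, and faithfulness fails the latency test.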

The guardrail layer and the eval layer share the trace stream; they do not share logic. The eval layer reads guardrail decisions as features (a guardrail-triggered trace is an interesting sample for open coding). The guardrail layer reads aggregated eval results as inputs to threshold tuning (judge identifies a faithfulness regression; team adds a faithfulness-specific guardrail with a higher threshold). Two pipelines, one trace store.
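A sketch of a trace record that keeps the two pipelines distinguishable inside one store. The field names and the example check names (`pii_redactor`, `faithfulness_judge`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass(frozen=True)
class Annotation:
    source: str                # provenance: "guardrail" or "eval"
    name: str                  # which check or judge wrote this
    value: Union[bool, float]  # guardrail: True means passed; eval: a score


@dataclass
class Trace:
    trace_id: str
    annotations: List[Annotation] = field(default_factory=list)

    def guardrail_triggered(self) -> bool:
        """Feature the eval layer can read when choosing samples."""
        return any(a.source == "guardrail" and a.value is False
                   for a in self.annotations)

    def eval_scores(self) -> dict:
        """Scores the guardrail owners can aggregate for threshold tuning."""
        return {a.name: a.value for a in self.annotations if a.source == "eval"}


trace = Trace("t-001")
trace.annotations.append(Annotation("guardrail", "pii_redactor", False))
trace.annotations.append(Annotation("eval", "faithfulness_judge", 0.4))
```

Both layers append to the same record, but neither reads the other's logic, only its labeled output.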

The guardrail layer in practice

Four classes of checks. Input sanitizers (PII redaction, prompt-injection detection, schema validation) at 10-50 ms. Input classifiers (small distilled models flagging restricted topics) at under 100 ms; the OWASP LLM Top 10 is the reference categorization ⁴. Output classifiers (toxicity, factuality, schema) running in parallel with the streamed output. Hard-coded rules (regex for forbidden promises, blocklist matches, response-length caps) at under a millisecond. Combined p95 budget: 200-500 ms.
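Checking the combined budget means measuring the total time each request spends in the guardrail layer, because per-check p95 values do not add up to the p95 of the sum. A sketch using the standard library; the sample latencies are invented.

```python
import statistics


def p95(latencies_ms):
    """95th percentile: the 19th of the 19 cut points that split data into 20 groups."""
    return statistics.quantiles(latencies_ms, n=20)[18]


def within_budget(latencies_ms, budget_ms=500.0):
    """True when the measured p95 of total guardrail time fits the budget."""
    return p95(latencies_ms) <= budget_ms


# Hypothetical per-request totals (all guardrail checks combined), in ms.
mostly_fast = [120.0] * 99 + [2000.0]   # one slow outlier in a hundred
slow_tail = [100.0] * 90 + [900.0] * 10  # ten percent of requests are slow
```

Here `within_budget(mostly_fast)` holds while `within_budget(slow_tail)` does not: a p95 target tolerates a rare outlier but not a slow tail affecting one request in ten.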

The eval layer in practice

Three loops at three frequencies. Synchronous offline (per release): CI runs the regression set and release-gate eval ⁵. Sampled online (continuous): async judging of 1-5% of traffic with daily aggregation; the online evaluation page covers mechanics. Weekly human cadence: open-coding loop on 10-20 fresh traces, failure-biased on guardrail-triggered or low-judge-score traces; the error analysis page covers the discipline. The three loops feed each other; Shankar's flywheel is the formalization ⁶.
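The failure-biased weekly sample can be sketched as a two-pool draw: flagged traces first, then random fill. The field names, the 0.5 score cutoff, and the default of 15 traces are assumptions.

```python
import random


def weekly_review_sample(traces, k=15, low_score=0.5, seed=0):
    """Pick k traces for open coding, preferring likely failures."""
    rng = random.Random(seed)
    flagged, rest = [], []
    for t in traces:
        score = t.get("judge_score")  # None when the trace was never judged
        is_flagged = t.get("guardrail_triggered", False) or (
            score is not None and score < low_score
        )
        (flagged if is_flagged else rest).append(t)
    # Take flagged traces first, then fill the remainder at random.
    picked = rng.sample(flagged, min(k, len(flagged)))
    picked += rng.sample(rest, min(k - len(picked), len(rest)))
    return picked


# Hypothetical week: 3 guardrail hits, 2 low judge scores, 95 unremarkable traces.
week = (
    [{"id": i, "guardrail_triggered": True, "judge_score": None} for i in range(3)]
    + [{"id": i, "guardrail_triggered": False, "judge_score": 0.2} for i in range(3, 5)]
    + [{"id": i, "guardrail_triggered": False, "judge_score": 0.9} for i in range(5, 100)]
)
sample = weekly_review_sample(week)
```

The random fill keeps some unflagged traces in front of reviewers, so failure types that neither the guardrails nor the judge catch still have a chance of being seen.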

What good looks like

A guardrail layer with clear ownership, measured per-request precision and recall, p95 latency under 500ms. An eval layer with separate ownership, measured per-population pass rate with intervals, asynchronous off the request path. A shared trace store with provenance fields that distinguish guardrail decisions from eval scores. Weekly human review on failure-biased samples from guardrail triggers and low-judge-score traces. A decision policy that adds a guardrail rule when the eval program surfaces a regression and adds an eval dimension when a guardrail rule starts producing many borderline cases.

1. Chip Huyen, "Building a Generative AI Platform."
2. Hamel Husain, "Your AI Product Needs Evals."
3. Shankar et al., "Who Validates the Validators?" UIST 2024.
4. OWASP Top 10 for LLM Applications.
5. Hamel Husain and Shreya Shankar, "LLM Evals FAQ."
6. Shreya Shankar, "The AI Engineering Flywheel."