A frontier judge on every production trace is fast to design and slow to operate. Tokens add up, p95 latency creeps, and the bill at one hundred thousand requests per day is rarely defensible. The standard production answer is to distill: train a small, task-specific judge against frontier-judge labels plus human gold, run the small model inline, and reserve the frontier judge for a sampled calibration batch [1].
Why distillation matters in production
Three cost vectors push teams toward distilled judges.
Tokens per request. A frontier judge consumes input tokens (trace, rubric, few-shot examples) and output tokens (verdict and critique). At meaningful volume the eval bill rivals the generation bill.
Latency. An inline guardrail-style judge has to return inside the request budget; a frontier-judge round-trip is usually too slow. A distilled judge that fits on a smaller GPU returns in tens of milliseconds.
Rate limits. Provider quotas cap frontier-judge calls per minute. A distilled judge runs on infrastructure the team controls [2].
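The token cost vector can be made concrete with back-of-envelope arithmetic. The sketch below is illustrative only: the per-token prices, the token counts, and the names `JudgePricing` and `daily_judge_cost` are assumptions chosen for the example, not figures from any provider or from the sources cited here.

```python
from dataclasses import dataclass


@dataclass
class JudgePricing:
    """Hypothetical prices in USD per million tokens (assumed, not quoted)."""
    input_per_mtok: float
    output_per_mtok: float


def daily_judge_cost(requests_per_day, input_tokens, output_tokens, pricing):
    """Return the daily USD cost of judging every request exactly once."""
    per_request = (
        input_tokens * pricing.input_per_mtok
        + output_tokens * pricing.output_per_mtok
    ) / 1_000_000
    return requests_per_day * per_request


# Every number below is an illustrative assumption.
pricing = JudgePricing(input_per_mtok=3.0, output_per_mtok=15.0)
cost = daily_judge_cost(
    100_000, input_tokens=2_000, output_tokens=300, pricing=pricing
)
print(f"${cost:,.2f} per day")
```

Under these assumed numbers the judge alone costs about $1,050 per day. Substituting real prices and measured token counts shows quickly whether the eval bill is approaching the generation bill.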
The shipped judges to know
By 2026 a small set of shipped distilled judges have public benchmarks. Knowing them by name shortens the build-versus-buy conversation.
Prometheus 2. Kim et al. released open judges at two sizes, 7B and 8x7B, built on Mistral-7B and Mixtral-8x7B; they report the closest agreement with human and GPT-4 judgments among open evaluators on custom-rubric judgment, and the same weights support both direct and pairwise scoring [3]. The first Prometheus paper, which trained a thirteen-billion-parameter judge, had as its headline finding that providing a reference is the single largest factor in judge quality [4]; Prometheus 2 carries that forward and merges direct-plus-pairwise capabilities through model-weight averaging.
Auto-J. Li et al. released an open thirteen-billion-parameter judge trained to produce natural-language critiques alongside verdicts [5]. The critique discipline makes the judge inspectable; a verdict without a critique cannot be audited without rerunning the trace.
JudgeLM. Zhu et al. fine-tuned a judge with explicit bias-controlled augmentation (reference-support and reference-drop) and report JudgeLM-7B can judge five thousand samples in three minutes on eight A100s [6]. The throughput figure is the operational claim a production team cares about.
Vendors also ship distilled judges (Galileo Luna, Patronus Lynx, Vectara HHEM) with similar cost-and-latency profiles. The hosted versions trade inspectability for a managed runtime and a fixed taxonomy of failure modes. Before adopting one, inspect where it disagrees with human labels and whether those disagreements cluster on particular failure types; a judge trained on a fixed taxonomy regresses to a default score outside its training distribution.
The inline plus calibration plus gold-set architecture
The pattern that holds up has three layers running together [1].
The inline layer is the distilled judge on every trace. It returns a verdict and a confidence score in tens of milliseconds. High-confidence verdicts are logged; low-confidence verdicts are flagged for the next layer.
The calibration layer is the frontier judge on a sampled fraction (one to five percent) of traces, plus all low-confidence flags. Disagreement between the two layers is logged as a calibration signal that feeds the next distilled-judge retrain.
The gold set is the human-labeled baseline both judges are calibrated against. Without it, the team is comparing the small judge to the big judge and assuming the big judge is right; the gold set breaks that circularity [1].
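The calibration layer's selection rule (a sampled fraction plus every low-confidence flag) can be sketched as follows. This is a minimal sketch under stated assumptions: each trace carries a stable string ID, and the names `in_calibration_sample` and `needs_frontier`, the 2% default rate, and the 0.7 threshold are hypothetical choices for illustration. Hashing the ID rather than drawing a random number makes the choice reproducible, so a retried trace lands in the same bucket.

```python
import hashlib


def in_calibration_sample(trace_id: str, rate: float = 0.02) -> bool:
    """Deterministically select roughly `rate` of traces by hashing the ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Compare the first 8 bytes, read as an integer, against rate * 2**64.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(rate * 2**64)


def needs_frontier(trace_id: str, confidence: float,
                   threshold: float = 0.7, rate: float = 0.02) -> bool:
    """Route to the frontier judge if the trace is low-confidence or sampled."""
    return confidence < threshold or in_calibration_sample(trace_id, rate)
```

Traces selected only by sampling are the ones that give an unbiased estimate of inline-frontier disagreement; traces selected by low confidence are skewed toward hard cases, so it helps to log which rule fired.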
    def judge_trace(trace, inline, frontier, threshold=0.7):
        """Inline distilled judge with frontier fallback on low confidence."""
        verdict, confidence = inline.score(trace)
        if confidence >= threshold:
            return {"verdict": verdict, "source": "inline", "confidence": confidence}
        # Cost is bounded by the fraction of traces that fall below the threshold.
        frontier_verdict = frontier.score(trace)
        return {
            "verdict": frontier_verdict,
            "source": "frontier",
            "inline_disagreed": frontier_verdict != verdict,
            "confidence": confidence,
        }
The threshold is the lever. Because the function keeps the inline verdict only when confidence is at or above the threshold, raising it routes more traces to the frontier judge and raises cost; lowering it keeps cost down but accepts more low-confidence inline verdicts. Tune against the gold set: pick the lowest value at which inline verdicts at or above the threshold disagree with the gold labels less often than the team's accepted error rate.
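One way to tune the threshold against the gold set is a simple sweep. The sketch below assumes the routing rule in `judge_trace` (the inline verdict is kept when confidence is at or above the threshold) and assumes gold-set records are available as `(inline_verdict, confidence, gold_label)` tuples. The function name `pick_threshold`, the candidate grid, and the 5% default target are hypothetical.

```python
def pick_threshold(records, max_error=0.05, candidates=None):
    """Return the lowest threshold whose accepted inline verdicts meet max_error.

    records: iterable of (inline_verdict, confidence, gold_label) tuples.
    Returns None when no candidate threshold meets the target.
    """
    records = list(records)
    if candidates is None:
        candidates = [i / 20 for i in range(21)]  # 0.00, 0.05, ..., 1.00
    for threshold in sorted(candidates):
        # Verdicts the inline judge would keep at this threshold.
        accepted = [(v, g) for v, c, g in records if c >= threshold]
        if not accepted:
            continue  # nothing stays inline at this threshold
        errors = sum(1 for v, g in accepted if v != g)
        if errors / len(accepted) <= max_error:
            return threshold
    return None
```

The lowest qualifying threshold keeps the most traffic on the inline judge while meeting the error target. On a small gold set the per-threshold error estimates are noisy, so the result is a starting point to re-check as the gold set grows.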
When not to use a distilled judge
A distilled judge fails in three situations. It fails when failure modes fall outside its training distribution: a judge trained on a fixed taxonomy (groundedness, refusal, instruction-following, toxicity, PII, off-topic) regresses to a default score on anything novel [3]. It fails on hard reasoning, math, or code correctness: training on frontier-judge labels cannot give the small model a capability the teacher lacks, so use executable verification instead. It fails when the team has not built a gold set: both layers drift together and the team has no anchor [2].
What to ship first
Ship the gold set first (at least one hundred human-labeled traces). Then ship the frontier judge, once its TPR and TNR measured against that gold set both exceed 0.9 [1]. Then ship the inline distilled judge, calibrated against the same gold set. The slow pipeline that measures a known thing is the one that scales.
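The TPR and TNR gate can be checked with a few lines of code. This is a minimal sketch assuming boolean labels where `True` means the trace passes, so TPR is the share of gold passes the judge also passes and TNR is the share of gold failures it also fails. The names `tpr_tnr` and `ready_to_ship` are hypothetical, and which class counts as "positive" is a convention the team has to fix.

```python
def tpr_tnr(judge_labels, gold_labels):
    """Return (TPR, TNR) of judge labels against human gold labels."""
    if len(judge_labels) != len(gold_labels):
        raise ValueError("judge_labels and gold_labels must be the same length")
    tp = fn = tn = fp = 0
    for judged, gold in zip(judge_labels, gold_labels):
        if gold and judged:
            tp += 1
        elif gold and not judged:
            fn += 1
        elif not gold and not judged:
            tn += 1
        else:
            fp += 1
    # NaN marks a rate that is undefined because the gold set lacks that class.
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    return tpr, tnr


def ready_to_ship(judge_labels, gold_labels, floor=0.9):
    """True only when both rates are strictly above the floor."""
    tpr, tnr = tpr_tnr(judge_labels, gold_labels)
    return tpr > floor and tnr > floor
```

Reporting the two rates separately matters when the gold set is imbalanced: a judge that passes everything scores well on raw accuracy when most traces pass, but its TNR is zero.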