
Rubric design

Binary versus Likert, multi-dimensional decomposition, criteria drift, the case against 1-5 scales, the case for written critiques.

A rubric is the falsifiable definition of pass. Most eval programs that fail in the first month fail because the rubric was a vibe, not a definition. The discipline is small (five rules) and the payoff is large.

Binary, not Likert

Use binary PASS or FAIL with a written critique, not a 1-to-5 or 1-to-10 Likert scale. Hamel's line: "1-5 scales are often a sign of a bad eval process" because they collapse the rubric author's job (define what wrong looks like) into the rater's job (pick a middling number) ¹. Binary wins on three measurable axes.

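Sample size. For the same absolute CI half-width, measured in each scale's own units, a binary pass rate needs roughly one fourth the N of a Likert mean whose ratings have a standard deviation near one point. Bernoulli variance is capped at 0.25 (reached at p=0.5); the Likert variance in that case is about 1.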
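The arithmetic behind the sample-size comparison can be sketched as follows. It assumes a normal-approximation interval for a mean, a worst-case binary pass rate of p=0.5, and a hypothetical Likert standard deviation of one point (variance 1.0); the function name and the 0.05 half-width are illustrative, not taken from any library.

```python
import math


def n_for_half_width(variance: float, half_width: float, z: float = 1.96) -> int:
    """Smallest n whose normal-approximation CI for a mean has the given half-width."""
    return math.ceil(z * z * variance / (half_width ** 2))


# Worst-case binary pass rate: p = 0.5, so variance = p * (1 - p) = 0.25.
binary_n = n_for_half_width(variance=0.25, half_width=0.05)

# Assumed Likert ratings with a standard deviation of one point (variance 1.0),
# targeting the same half-width of 0.05 in raw scale units.
likert_n = n_for_half_width(variance=1.0, half_width=0.05)

print(binary_n, likert_n)
```

The ratio tracks the ratio of variances, so the result depends entirely on the assumed Likert spread: tighter ratings shrink the gap, wider ratings grow it.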

Inter-rater agreement. Binary forces both raters to land on the same side of a clean cutoff; Likert lets them slide. A rubric that reaches kappa 0.85 in binary form can drop to around 0.55 on the same content in 1-5 form.
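Agreement of this kind can be measured with Cohen's kappa, one common chance-corrected statistic for two raters (an assumption here, since the text does not say which kappa is meant). A minimal sketch, with illustrative labels:

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labeled at random with their own label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Ten hypothetical traces; the raters disagree on exactly one of them.
rater_a = ["PASS"] * 5 + ["FAIL"] * 5
rater_b = ["PASS"] * 4 + ["FAIL"] * 6
kappa = cohens_kappa(rater_a, rater_b)
```

The same function works on 1-5 labels passed as strings, which makes it easy to compare the two forms of a rubric on the same traces.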

Specificity. Writing PASS and FAIL forces the author to specify what failure looks like; writing 1-5 lets the author punt the boundary cases ("a 3 is when it's mostly OK"). The judge then defaults to the middle and the calibration runs aground.

The exception that proves the rule. If a downstream system consumes a continuous score (an RLHF reward, a ranking model trained on preferences), the upstream eval can collect Likert. For an eval program whose output is a release-gate decision, the decision is binary and the rubric should be too.

Multi-dimensional means multiple binaries

A customer-support reply has to be faithful to the policy, concise, and on-brand. The right rubric is three binaries, not one 1-to-5 ². Each dimension gets its own PASS or FAIL criterion, few-shot examples, and confusion matrix. The system-level pass rate is the share of replies that pass every dimension; when the dimensions are independent, that is the product (or weighted product) of the per-dimension rates.
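A minimal sketch of the three-binary pattern for the customer-support example. The class and field names are hypothetical, and the four verdicts are made-up data chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplyVerdict:
    """One judged reply: an independent PASS (True) or FAIL (False) per dimension."""
    faithful: bool
    concise: bool
    on_brand: bool

    def passes(self) -> bool:
        # A reply passes the system-level gate only if every dimension passes.
        return self.faithful and self.concise and self.on_brand


def pass_rates(verdicts: list[ReplyVerdict]) -> dict[str, float]:
    """Per-dimension pass rates plus the share of replies passing all dimensions."""
    n = len(verdicts)
    return {
        "faithful": sum(v.faithful for v in verdicts) / n,
        "concise": sum(v.concise for v in verdicts) / n,
        "on_brand": sum(v.on_brand for v in verdicts) / n,
        "all_pass": sum(v.passes() for v in verdicts) / n,
    }


verdicts = [
    ReplyVerdict(True, True, True),
    ReplyVerdict(True, False, True),
    ReplyVerdict(False, True, True),
    ReplyVerdict(True, True, True),
]
rates = pass_rates(verdicts)
product = rates["faithful"] * rates["concise"] * rates["on_brand"]
```

On this toy data the measured all-pass rate and the product of the per-dimension rates differ, because the failures are not independent. Counting all-pass replies directly is the safer number to report; the product is a planning approximation.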

Pattern	What it scores	When to use
Single binary	one dimension	one dimension dominates
Multiple binaries	N independent dimensions	recommended default
1-5 Likert	scalar quality	almost never
1-5 per dimension	N scalars	inherits Likert problems N times
Pairwise preference	overall preference	see pairwise

The three-binary pattern also makes failure attribution clean. A drop in faithfulness points at retrieval, a drop in conciseness points at the prompt, a drop in tone points at the system message. A scalar collapses all three and the on-call cannot triage.

Criteria drift

Shankar et al. document a specific failure mode: the human's notion of pass shifts as they look at more outputs ³. The rubric on day one does not match the rubric on day thirty. A judge calibrated against day-one labels underperforms on day-thirty traces, and it looks as if the judge degraded when in fact the ground-truth labels drifted.

Three mitigations. Lock the rubric in a versioned file tied to the calibration set; every recalibration re-reads the rubric verbatim. Have the principal expert blind-label fifty fresh traces against the live rubric every quarter; disagreement above ten percent triggers a rubric revision. Treat rubric versions as artifacts; a judge calibrated against rubric v1 has TPR and TNR scoped to v1.
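The first two mitigations can be sketched in a few lines. This assumes the rubric is stored as text, that a content hash is an acceptable version identifier, and that "disagreement" means the expert's blind labels versus the judge's labels on the same fifty traces; all function names are hypothetical.

```python
import hashlib


def rubric_version(rubric_text: str) -> str:
    """Short content hash, so any edit to the rubric yields a new version id."""
    return hashlib.sha256(rubric_text.encode("utf-8")).hexdigest()[:12]


def disagreement_rate(expert_labels: list[str], judge_labels: list[str]) -> float:
    """Share of traces where the expert's blind label differs from the judge's."""
    if len(expert_labels) != len(judge_labels) or not expert_labels:
        raise ValueError("need two equal-length, non-empty label lists")
    mismatches = sum(e != j for e, j in zip(expert_labels, judge_labels))
    return mismatches / len(expert_labels)


def needs_revision(expert_labels: list[str], judge_labels: list[str],
                   threshold: float = 0.10) -> bool:
    """True when disagreement is strictly above the threshold (ten percent by default)."""
    return disagreement_rate(expert_labels, judge_labels) > threshold


# Fifty hypothetical fresh traces; the expert disagrees with the judge on six.
judge = ["PASS"] * 50
expert = ["FAIL"] * 6 + ["PASS"] * 44
revise = needs_revision(expert, judge)
```

Storing the version id next to each calibration result keeps TPR and TNR scoped to the rubric they were measured against.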

Write the critique alongside the label

A judge that returns only PASS or FAIL is debug-hostile. A judge that returns a one-sentence critique externalizes the criterion, speeds up calibration sessions (the expert reads the critique, not the full trace), and trains junior reviewers by example ⁴. The marginal cost is a few hundred tokens per judge call.

The critique should cite the specific element that drove the call ("the recipe uses cooked quinoa; raw vegan forbids cooking above 118 F"), not a generic restatement of the verdict.

Pin the schema

Sclar et al. show that meaning-preserving format changes can move LLaMA-2-13B accuracy by up to seventy-six points ⁵. For rubric design: "PASS or FAIL" versus "Yes or No" versus "1 or 0" are three different rubrics from the judge's perspective. Pick one and pin it. A strict JSON output schema does double duty: it makes the parser fail loudly on drift and pins the option labels through prompt edits.
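A minimal sketch of a parser that fails loudly, using only the standard library. The two-key schema (`verdict`, `critique`) and the function name are assumptions for illustration; a real program might use a schema library instead.

```python
import json

ALLOWED_VERDICTS = {"PASS", "FAIL"}       # the pinned option labels
REQUIRED_KEYS = {"verdict", "critique"}   # hypothetical schema: exactly these two keys


def parse_judge_output(raw: str) -> dict[str, str]:
    """Parse a judge response, raising ValueError on any drift from the schema."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict) or set(data) != REQUIRED_KEYS:
        raise ValueError(f"expected exactly the keys {sorted(REQUIRED_KEYS)}")
    verdict, critique = data["verdict"], data["critique"]
    if not isinstance(verdict, str) or verdict not in ALLOWED_VERDICTS:
        raise ValueError(f"verdict must be one of {sorted(ALLOWED_VERDICTS)}, got {verdict!r}")
    if not isinstance(critique, str) or not critique.strip():
        raise ValueError("critique must be a non-empty string")
    return {"verdict": verdict, "critique": critique}


parsed = parse_judge_output(
    '{"verdict": "PASS", "critique": "Quotes the refund window from the policy."}'
)
```

A response that answers "Yes" instead of "PASS" raises rather than being silently mapped, which is the point: label drift shows up as a parser error, not as a shifted pass rate.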

The five-minute falsifiability check

Before committing a rubric:

Write PASS in one sentence.
Write FAIL in one sentence.
Write three test traces by hand: one obvious PASS, one obvious FAIL, one edge case.
Have the principal expert label all three independently.
If all three labels match the author's, the rubric is shippable. If any disagree, the rubric is not yet falsifiable.
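The final comparison step can be sketched as a single function. The trace ids and labels below are hypothetical.

```python
def rubric_is_shippable(author_labels: dict[str, str],
                        expert_labels: dict[str, str]) -> bool:
    """True only if the expert's independent label matches the author's on every trace."""
    if set(author_labels) != set(expert_labels):
        raise ValueError("author and expert must label the same traces")
    return all(author_labels[t] == expert_labels[t] for t in author_labels)


author = {"obvious_pass": "PASS", "obvious_fail": "FAIL", "edge_case": "FAIL"}
expert = {"obvious_pass": "PASS", "obvious_fail": "FAIL", "edge_case": "PASS"}
shippable = rubric_is_shippable(author, expert)
```

Here the edge case splits the two labelers, so the rubric goes back for another pass on its FAIL sentence.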

IFEval is a clean example of falsifiable rubrics at scale: every criterion is verifiable by a deterministic script, with zero rater judgment required ⁶. The same discipline applies inside an LLM-judge rubric.
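In the same spirit, some rubric criteria can be checked by a script with no judge at all. The two checks below are hypothetical examples written for this page, not IFEval's own code or API.

```python
def within_word_limit(response: str, limit: int) -> bool:
    """PASS if the response has at most `limit` whitespace-separated words."""
    return len(response.split()) <= limit


def mentions_at_least(response: str, keyword: str, times: int) -> bool:
    """PASS if `keyword` appears at least `times` times, ignoring case."""
    return response.lower().count(keyword.lower()) >= times


reply = "Refund requests are accepted within 30 days. The refund is issued to the original card."
concise_ok = within_word_limit(reply, limit=20)
keyword_ok = mentions_at_least(reply, "refund", times=2)
```

Criteria that can be scripted this way are falsifiable by construction; the LLM judge is then reserved for the dimensions that cannot be.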

When to use a scalar after all

Three legitimate cases. The downstream consumer is an RLHF reward model (preference data with scalars trains stable reward heads). The eval is a ranking, not a release gate (Chatbot Arena fits scalars from binary pairwise preferences, not direct collection). The metric is genuinely continuous (token-overlap, embedding-similarity, latency, cost; covered on reference-based metrics and embedding similarity).
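Fitting scalars from binary pairwise preferences can be sketched with a Bradley-Terry model, one common choice for this job. This is an illustrative fit using the standard iterative update, not a description of any leaderboard's exact pipeline; the function name and the data are hypothetical.

```python
def bradley_terry(wins: dict[tuple[str, str], int], iterations: int = 200) -> dict[str, float]:
    """Fit strengths from pairwise counts, where wins[(a, b)] = times a was preferred over b.

    Assumes every item wins at least one comparison; strengths are normalized to sum to 1.
    """
    items = sorted({name for pair in wins for name in pair})
    strength = {i: 1.0 / len(items) for i in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            total_wins = sum(w for (winner, _loser), w in wins.items() if winner == i)
            denom = 0.0
            for j in items:
                if j == i:
                    continue
                # Total comparisons between i and j, in either direction.
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = total_wins / denom if denom else strength[i]
        norm = sum(updated.values())
        strength = {i: s / norm for i, s in updated.items()}
    return strength


# Model A preferred over model B in 3 of 4 hypothetical comparisons.
scores = bradley_terry({("A", "B"): 3, ("B", "A"): 1})
```

The raters only ever answer a binary question (which of the two is better); the continuous score is a derived quantity, which is why this case does not conflict with the binary-first rule.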

For a production judge program whose output is a release decision, none apply. Stick to binary, write the critique, pin the schema, version the rubric. The judge prompt structure chapter walks through the template.

¹ Hamel Husain, "Your AI Product Needs Evals."
² Hamel Husain and Shreya Shankar, "LLM Evals FAQ."
³ Shankar et al., "Who Validates the Validators?" UIST 2024.
⁴ Hamel Husain, "Creating an LLM-as-a-Judge That Drives Business Results."
⁵ Sclar et al., "Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design," ICLR 2024.
⁶ Zhou et al., "Instruction-Following Evaluation for Large Language Models" (IFEval), 2023.