The eval maturity model

Five stages from vibe-checking to a continuous quality program. Where are you?

Most teams are at stage one and think they are at stage three. The point of the ladder is to pick the next rung honestly, not the one you wish you were on. Each rung produces a small, named artifact. If you cannot produce that artifact within its estimated cost (a week for the first rung, a quarter for the last), the rung is wrong for now.

The five rungs

Hamel's three levels (L1 assertions, L2 human-plus-judge on a labeled set, L3 production A/B) form the spine of the model ^[1]. Stage 0 sits below L1 as the honest description of where most teams start. Stage 4 sits above L3 as the continuous quality program Shankar describes ^[2].

flowchart TD
  S0["Stage 0\nVibe-checking"]
  S1["Stage 1\nAssertions + small seed set\n(Hamel L1)"]
  S2["Stage 2\nLLM-judge on labeled dataset\n(Hamel L2)"]
  S3["Stage 3\nProduction A/B with online judge\n(Hamel L3)"]
  S4["Stage 4\nContinuous quality program\n(Shankar flywheel)"]
  S0 --> S1 --> S2 --> S3 --> S4
  S1 -. skip-ahead failure .-> S3
  S2 -. skip-ahead failure .-> S4

The dashed arrows are the common skip-ahead failures. A team at stage 1 cannot run a useful A/B test because it does not yet know what to measure. A team at stage 2 cannot run a continuous program because it has not decided which segments to sample and why. Each rung exists because its artifacts are what make the rung above it affordable.

Stage 0: Vibe-checking

The product manager clicks through ten examples, declares the new prompt better than the old prompt, and you ship. There is no dataset. There is no metric. There is no recorded baseline. Two weeks later nobody can tell whether a reported regression is real or just another swing caused by a prompt-format change ^[3].

Stage 0 is the right rung for the first week of a prototype. It is the wrong rung the moment two people on the team disagree on what counts as broken.

Stage 1: Unit-test-style assertions

The first useful artifact is a tiny seed set with binary assertions. Thirty hand-curated examples. Three or four code-based checks that catch the obvious classes of failure: did the model output valid JSON, did it include the policy disclaimer, did it stay under the token cap, did it call the right tool with the right arguments. Hamel calls these Level 1; they are "unit tests for AI" ^[1].

The mistake at stage 1 is reaching for a judge before the assertions exist. If json.loads would catch a third of your failures, write json.loads. The judge is for the cases code cannot reach.

import json
from typing import Callable

# A stage-1 eval is a list of cheap, deterministic checks plus a tiny seed set.

def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def has_policy_footer(text: str) -> bool:
    return "Reply STOP to opt out" in text

def no_markdown_in_sms(text: str) -> bool:
    # SMS does not render markdown. Reject responses that contain it.
    return not any(c in text for c in ["**", "## ", "* ", "_ "])

ASSERTIONS: dict[str, Callable[[str], bool]] = {
    "valid_json": is_valid_json,
    "policy_footer": has_policy_footer,
    "no_markdown_in_sms": no_markdown_in_sms,
}

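def run_stage_1(seed: list[dict]) -> dict[str, float]:
    """Run every assertion against every example. Return per-assertion pass rate."""
    if not seed:
        # An empty seed set has no pass rate; fail loudly instead of dividing by zero.
        raise ValueError("seed set is empty")
    # Count passes per assertion name.
    results = {name: 0 for name in ASSERTIONS}
    for example in seed:
        # Each example is expected to carry the model output under the "output" key.
        output = example["output"]
        for name, check in ASSERTIONS.items():
            if check(output):
                results[name] += 1
    return {name: count / len(seed) for name, count in results.items()}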

Stage 2: LLM-judge on a labeled set

When the failures stop being mechanical, the judgment has to come from a human or an LLM that has been calibrated to a human. Stage 2 is one hundred labeled traces, a judge prompt, and a documented agreement rate between the judge and the principal domain expert ^[4].

The rubric is binary, the judge model is at least one tier larger than the generator, and the calibration loop runs until the judge's true-positive rate and true-negative rate on the held-out labels are each above 0.9 ^[5]. If you cannot get there in three iterations, the rubric is the bug, not the judge.
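The calibration check can be sketched in a few lines. This is a minimal illustration, assuming the expert labels and the judge verdicts are parallel lists of booleans where True means "pass". The names `judge_agreement` and `is_calibrated` are hypothetical, not part of any library.

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Return the judge's true-positive and true-negative rates against human labels."""
    if len(human) != len(judge):
        raise ValueError("label lists must be the same length")
    positives = sum(human)
    negatives = len(human) - positives
    if positives == 0 or negatives == 0:
        # Both rates are undefined unless the held-out set has passes and failures.
        raise ValueError("held-out set needs both passing and failing examples")
    true_pos = sum(h and j for h, j in zip(human, judge))
    true_neg = sum((not h) and (not j) for h, j in zip(human, judge))
    return {"tpr": true_pos / positives, "tnr": true_neg / negatives}


def is_calibrated(rates: dict[str, float], threshold: float = 0.9) -> bool:
    """Both rates must be strictly above the threshold (0.9 in the text above)."""
    return rates["tpr"] > threshold and rates["tnr"] > threshold
```

Reporting the two rates separately, rather than a single accuracy number, keeps a judge that passes everything from looking good on a dataset that is mostly passes.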

Stage 2 is also where dataset versioning starts to matter. If you relabel three borderline cases between the run that showed regression and the run that showed improvement, the improvement is in your labeling, not in the model.
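One way to make relabeling visible is to pin every eval run to a content hash of the dataset. The sketch below is an assumption about how that could look, not a prescribed format: it assumes each example is a JSON-serializable dict that includes its label, and `dataset_hash` is a hypothetical helper name.

```python
import hashlib
import json


def dataset_hash(examples: list[dict]) -> str:
    """Return a SHA-256 hex digest that changes whenever any example or label changes."""
    # Canonical JSON (sorted keys, fixed separators) so formatting never affects the hash.
    rows = [
        json.dumps(ex, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
        for ex in examples
    ]
    # Sort the serialized rows so that reordering the file does not change the hash.
    rows.sort()
    digest = hashlib.sha256()
    for row in rows:
        digest.update(row.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()
```

Storing this digest next to every score means two runs can only be compared when their hashes match; a relabeled borderline case produces a new hash and a new baseline.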

Stage 3: Production A/B with an online judge

By stage 3 you have a calibrated offline judge and a labeled dataset that you maintain on a cadence. The next step is to learn what offline cannot teach you: how the system behaves on inputs you did not seed. The online judge runs on a sampled fraction of production traces. Huyen frames it as: offline catches the regressions you can write down, online catches the ones you cannot ^[6].
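A sampled fraction is easiest to reason about when the sampling decision is deterministic. The following is a minimal sketch of uniform sampling, assuming every trace has a stable string ID; `should_judge` is a hypothetical name, and hashing the ID is one possible design among several.

```python
import hashlib


def should_judge(trace_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of traces for the online judge."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Because the decision depends only on the trace ID, the same trace is always in or out of the sample, which makes a judged trace reproducible when someone asks why it was scored.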

Stage 3 needs three things stage 2 did not: a sampling policy (uniform vs oversample-on-failure vs oversample-on-segment), a way to attribute scores to model and prompt versions, and a release-gate decision rule. The decision rule is where most teams' eval programs fall apart. A 2-point quality bump that costs 30% more in tokens is not a win; cost and latency belong in the scorecard ^[6].
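A release-gate decision rule can be as small as a function over two scorecards. The sketch below is illustrative only: the `Scorecard` fields, the `GatePolicy` thresholds, and the function names are all assumptions, and the real thresholds belong in your own written policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorecard:
    quality: float           # judge pass rate on sampled traces, 0.0 to 1.0
    cost_per_request: float  # any consistent unit, e.g. tokens or dollars
    p95_latency_ms: float


@dataclass(frozen=True)
class GatePolicy:
    # Hypothetical thresholds; replace them with the values your team agrees on.
    min_quality_gain: float = 0.01      # absolute gain required in pass rate
    max_cost_increase: float = 0.10     # relative increase tolerated
    max_latency_increase: float = 0.10  # relative increase tolerated


def relative_increase(baseline: float, candidate: float) -> float:
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (candidate - baseline) / baseline


def release_gate(
    baseline: Scorecard, candidate: Scorecard, policy: GatePolicy = GatePolicy()
) -> list[str]:
    """Return the reasons to block rollout. An empty list means the gate passes."""
    reasons = []
    if candidate.quality - baseline.quality < policy.min_quality_gain:
        reasons.append("quality gain below threshold")
    cost = relative_increase(baseline.cost_per_request, candidate.cost_per_request)
    if cost > policy.max_cost_increase:
        reasons.append("cost increase above threshold")
    latency = relative_increase(baseline.p95_latency_ms, candidate.p95_latency_ms)
    if latency > policy.max_latency_increase:
        reasons.append("latency increase above threshold")
    return reasons
```

Under these assumed thresholds, a candidate that gains 2 points of quality while costing 30% more in tokens is blocked on cost alone, which is the case described above.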

Stage 4: Continuous quality program

Stage 4 is the flywheel: trace logging, error analysis on a cadence, dataset refresh, judge recalibration, release gates, and postmortems all referencing the same metric. Shankar's framing is that the eval set itself is a versioned artifact that evolves with the product, and that production examples flow back into the labeled set on a defined cadence ^[2].

At stage 4 the eval program has a sustaining team, a budget line, and a written charter. The artifacts are: a dataset catalog, a judge registry, a calibration log, a release-gate policy, and a published quarterly readout. If you are pre-Series-B and pre-product-market-fit, you do not need stage 4 yet. If you are a public company with regulators in the loop, you needed it last year.

Self-assessment checklist

Run this ten-item checklist honestly. Each yes counts one point. The score maps to the rung you can defend in a room.

#	Question	Yes
1	We have a labeled dataset of at least 30 examples checked into git.
2	Every prompt change is run against that dataset before merge.
3	At least three deterministic assertions run against every output.
4	We have an LLM judge with a documented binary rubric.
5	We have a measured agreement rate between the judge and a human expert.
6	The dataset is content-hashed so we can pin results to a version.
7	The judge runs on a sampled fraction of production traces.
8	Cost per request and p95 latency are columns on the same scorecard as quality.
9	A release-gate policy specifies what scorecard movement blocks rollout.
10	Error analysis runs on a defined cadence and the rubric is on the table.

Score 0-1: stage 0. Score 2-3: stage 1. Score 4-6: stage 2. Score 7-8: stage 3. Score 9-10: stage 4.

The most common honest result is 2 or 3. Most teams that self-report stage 3 score 4 on this checklist. The most actionable output is not the score itself but the lowest-numbered question you answered no to; that item points at the next rung.
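The scoring rule is mechanical enough to write down. This is a small sketch that assumes the ten answers are supplied in checklist order as booleans; the function name `assess` and the shape of the returned dict are illustrative.

```python
# (highest score in the band, stage), taken from the mapping above.
STAGE_BANDS = [(1, 0), (3, 1), (6, 2), (8, 3), (10, 4)]


def assess(answers: list[bool]) -> dict:
    """Map ten yes/no answers to a score, a stage, and the first unmet item."""
    if len(answers) != 10:
        raise ValueError("the checklist has exactly ten items")
    score = sum(answers)
    stage = next(stage for max_score, stage in STAGE_BANDS if score <= max_score)
    # Checklist items are numbered from 1; None means every answer was yes.
    first_no = next((i + 1 for i, yes in enumerate(answers) if not yes), None)
    return {"score": score, "stage": stage, "first_no": first_no}
```

For example, a team that answers yes to the first four questions and no to the rest gets a score of 4, lands at stage 2, and has question 5 as its first unmet item.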

The smallest useful artifact at each rung

If you cannot produce the artifact at roughly the cost listed, the rung is wrong for now. The cost estimate assumes a small team with the principal domain expert reachable on Slack ^[4].

0 -> 1   30 labeled examples, 3 assertions, a CI hook              (1 person-week)
1 -> 2   100 traces, a judge prompt, TPR/TNR vs 1 human            (2 person-weeks)
2 -> 3   sampling policy, online judge, weekly drift report        (4 person-weeks)
3 -> 4   dataset versioning, release gates, charter, postmortem    (1 quarter)

The geometric blow-up is real. Stage 0 to stage 1 is a week. Stage 3 to stage 4 is a quarter. The mistake is not the scope. The mistake is skipping rungs to chase the appearance of maturity.

CAUTION

A team that skips from stage 1 to stage 3 will spend the savings, and then some, debugging an unmeasurable system that nobody can roll back with confidence.

Where to invest next

Pick the next rung by the smallest artifact that would change a decision you are making this quarter.

If you are at stage 1 and about to swap models, you need stage 2 first. Without a calibrated judge on a labeled set, the model swap will show up as a movement of about 4 points in either direction that sits inside the Wilson 95% interval, and you will not be able to tell whether to ship ^[7].
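The Wilson 95% interval referred to here is the standard Wilson score interval for a pass rate. The sketch below computes it from a pass count and a sample size; the function name `wilson_interval` is illustrative, and z = 1.96 is the usual normal quantile for 95% coverage.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval (low, high) for a pass rate of passes / n."""
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("need n > 0 and 0 <= passes <= n")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half_width, center + half_width)
```

With 24 passes out of a 30-example seed set, the interval runs from roughly 0.63 to 0.90. A 4-point change in pass rate is far smaller than that width, so on its own it cannot support a ship decision.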

If you are at stage 2 and seeing regressions reported by users that your offline eval missed, you need stage 3, not a smarter judge. The judge is missing the regressions because the dataset is missing the inputs. Online sampling is the way the dataset learns about the inputs you did not anticipate.

If you are at stage 3 and the eval team is bottlenecking every release, you need stage 4, not more judges. The bottleneck is governance, not throughput. A written release-gate policy unblocks half the conversations the team is having.

TIP

Pick the rung where the next decision is bottlenecked on a missing artifact, not on a missing dashboard.

The point of the model is to keep you honest about the cost. Eval programs that try to land at stage 4 in one quarter have a known failure mode: a half-built stage 2 that nobody trusts, a half-built stage 3 that drifts silently, and an audit binder that does not match the running system. Land at stage 1 properly. Then land at stage 2 properly. Each rung is its own win.

What gets cut

The corollary to picking the next rung is that everything above it gets postponed. The list of things to skip is short and specific.

Skip the dashboard before stage 2. A dashboard without a dataset and a judge is decoration. The graphs will go down and you will not know whether the model got worse or someone changed the input format ^[8].

Skip the multi-judge ensemble before stage 3. A single calibrated judge with documented TPR/TNR is more useful than three uncalibrated judges in a panel. Ensembles compound bias when the bases are biased the same way.

Skip the governance framework before the loop is closed. A risk register that does not reference a measured metric is a Word document, not a control. The governance section of this site covers what changes when you are at stage 4 and a regulator is the audience. Until then, the audience is the team, and the team needs a working judge before it needs a written policy.

The maturity model is not aspirational. It is a sequencing constraint.