Release gates from eval results

A release gate is a contract between the team and the user. Define the bands, automate the comparison, and hold the line when a regressed candidate is pushing to ship.

A release gate is a written rule the team has agreed to follow before code reaches users. It says which eval results count, which deltas trigger which actions, and who can override. Without one, every release becomes a negotiation: the model owner argues the regression is small, the QA owner argues it is not, the engineering manager sides with whoever is louder, and the user gets a worse product on Tuesday.

The opinion: bands, not thresholds. A binary "above 90% ships, below 90% does not" answers the wrong question because it pretends the eval score is more precise than its confidence interval allows. Bands acknowledge the interval and produce four actions (ship, hold, investigate, block) tied to where the candidate sits relative to the baseline.

The four-band gate

For each release-gated metric, define four bands relative to the current production baseline.

Ship. The delta interval lies entirely above zero, or it contains zero and both endpoints fall inside the agreed indifference window. Action: merge.
Hold. The delta interval contains zero but at least one endpoint falls outside the indifference window, so the result is inconclusive. Action: read the slice that drove the move, then reclassify as ship or block.
Investigate. The delta interval lies entirely below zero, but the point estimate of the regression is within the agreed minor-regression window (often 2 to 3 percentage points). Action: read fifty traces from the regressed slice; ship only with a written rationale.
Block. The delta interval lies entirely below zero and the point estimate of the regression exceeds the minor-regression window. Action: do not ship; revert or rescope.

The indifference window is the smallest delta the team has agreed is not worth blocking on, and is informed by the statistical power of the eval set. Setting it too tight makes the gate noisy; setting it too loose lets real regressions through.
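
One way to sanity-check an indifference window is to compare it with the smallest delta the eval set can reliably detect. The sketch below uses the textbook normal approximation for the difference of two independent pass rates. `min_detectable_delta` and its parameters are illustrative names, and a paired design (the same items scored under both systems) will usually resolve smaller deltas than this suggests.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_delta(base_rate: float, n_items: int,
                         alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate smallest absolute change in a pass rate that an eval set
    of n_items can detect with a two-sided test at the given alpha and power.

    Assumption: baseline and candidate are treated as independent samples
    with roughly the same pass rate (normal approximation).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    std_err = sqrt(2 * base_rate * (1 - base_rate) / n_items)
    return (z_alpha + z_power) * std_err


if __name__ == "__main__":
    # How the detectable delta shrinks as the eval set grows.
    for n in (250, 1000, 4000):
        print(n, round(min_detectable_delta(0.90, n), 4))
```

Under these assumptions, a 1,000-item set at a 90% pass rate resolves deltas of roughly 3.8 percentage points, so an indifference window much tighter than that would mostly react to noise.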

Bands per axis

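Quality is not the only axis. The same band structure applies to p95 latency, cost per request, refusal rate, and any other metric the release sheet tracks ¹. For lower-is-better metrics such as latency and cost, flip the sign of the delta so that a negative value always means regression. A candidate that wins on quality and lands in the "block" band on cost is still a regression candidate; the gate decision is the union of per-axis bands, meaning the most severe band on any axis.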

A useful default at the team level:

Quality: minor-regression window of 2 percentage points on the primary eval metric.
p95 latency: minor-regression window of 15% of the baseline.
Cost per request: minor-regression window of 10% of the baseline.
Refusal rate: any change requires investigation; over-refusal is as bad as under-refusal.

The team's product context will move these numbers; the structure stays.
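
The defaults above mix an absolute window (percentage points) with relative ones (percent of baseline), and latency and cost are lower-is-better. The sketch below shows one way to normalise them so that a negative delta always means regression before banding. `AxisSpec`, `oriented_delta`, and `regression_window` are hypothetical names, not part of any library, and refusal rate is left out because its rule is two-sided.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AxisSpec:
    higher_is_better: bool
    minor_regression: float  # window size
    relative: bool           # True: fraction of baseline; False: absolute units


# Illustrative defaults taken from the list above.
DEFAULT_AXES = {
    "quality": AxisSpec(higher_is_better=True, minor_regression=0.02, relative=False),
    "latency_p95": AxisSpec(higher_is_better=False, minor_regression=0.15, relative=True),
    "cost": AxisSpec(higher_is_better=False, minor_regression=0.10, relative=True),
}


def oriented_delta(spec: AxisSpec, baseline: float, candidate: float) -> float:
    """Candidate minus baseline, sign-flipped for lower-is-better axes so
    that a negative result always means the candidate is worse.

    Interval endpoints need the same flip, with low and high swapped.
    """
    raw = candidate - baseline
    return raw if spec.higher_is_better else -raw


def regression_window(spec: AxisSpec, baseline: float) -> float:
    """Minor-regression window expressed in the metric's own units."""
    return spec.minor_regression * baseline if spec.relative else spec.minor_regression
```

For example, a p95 latency that moves from 800 ms to 900 ms gives an oriented delta of -100 ms against a minor-regression window of 120 ms.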

What to compare against

The baseline is the current production system, not the previous merge. Each merge updates the baseline only after a sustained green online window (often two to four days). This prevents a chain of small "within-noise" merges from drifting the system away from the validated version without a single gate ever firing.
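
The promotion rule can be sketched as a check over daily online health results. The three-day default is an assumption picked from the two-to-four-day range mentioned above, and `should_promote_baseline` and its `green_days` input are hypothetical.

```python
from datetime import date, timedelta


def should_promote_baseline(merged_on: date, green_days: set[date],
                            today: date, window_days: int = 3) -> bool:
    """True once the most recent `window_days` days are all green online
    and all of them fall strictly after the merge date.

    green_days: dates on which online monitoring was green (assumed input).
    """
    window = [today - timedelta(days=i) for i in range(window_days)]
    if window[-1] <= merged_on:
        return False  # not enough full days since the merge yet
    return all(day in green_days for day in window)
```

A red day inside the window simply delays promotion until a fresh run of green days accumulates; the baseline stays on the last validated version in the meantime.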

For per-slice comparison, use the same slices the dashboards report. A candidate that improves the global mean and regresses the highest-stakes slice is the most common silent failure of a global-only gate; the per-slice view catches it on the spot ².
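
A minimal sketch of that per-slice check, assuming each slice already has a delta and interval computed the same way as the global one. `regressed_slices` is an illustrative helper and the slice names in the example are made up.

```python
def regressed_slices(per_slice: dict[str, tuple[float, float, float]]) -> list[str]:
    """Return slices whose delta interval lies entirely below zero,
    worst point estimate first.

    per_slice maps slice name -> (delta, ci_low, ci_high), with deltas
    oriented so that negative means the candidate is worse.
    """
    flagged = [
        (delta, name)
        for name, (delta, _low, high) in per_slice.items()
        if high < 0
    ]
    return [name for _delta, name in sorted(flagged)]


if __name__ == "__main__":
    results = {
        "global": (0.012, 0.002, 0.022),             # global mean improves
        "billing_disputes": (-0.06, -0.10, -0.02),   # high-stakes slice regresses
        "smalltalk": (0.03, 0.01, 0.05),
    }
    print(regressed_slices(results))
```

In this made-up example the global interval is entirely positive, yet the check still surfaces the regressed slice.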

Automating the comparison

"""
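Decide a per-axis band from a delta and its interval against an
indifference window and a minor-regression window.

Convention: delta, ci_low and ci_high are candidate minus baseline, oriented
so that positive means better. For lower-is-better metrics (latency, cost),
flip the sign and swap the interval endpoints, and convert relative windows
to the metric's own units before calling gate().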
"""
from dataclasses import dataclass
from enum import Enum

class Band(Enum):
    SHIP = "ship"
    HOLD = "hold"
    INVESTIGATE = "investigate"
    BLOCK = "block"

@dataclass
class Bands:
    indifference: float
    minor_regression: float

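def gate(delta: float, ci_low: float, ci_high: float, bands: Bands) -> Band:
    # Interval entirely below zero: a statistically clear regression.
    # The size of the point estimate decides between INVESTIGATE and BLOCK.
    if ci_high < 0:
        if -delta > bands.minor_regression:
            return Band.BLOCK
        return Band.INVESTIGATE
    # Interval entirely above zero: a clear improvement.
    if ci_low > 0:
        return Band.SHIP
    # Interval contains zero. Ship only if both endpoints sit inside the
    # indifference window; otherwise the result is inconclusive.
    if max(abs(ci_low), abs(ci_high)) <= bands.indifference:
        return Band.SHIP
    return Band.HOLD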

def gate_release(per_axis: dict[str, dict]) -> dict:
    """
    per_axis = {
      "quality": {"delta": ..., "ci": (low, high), "bands": Bands(...)},
      "latency_p95": {...},
      "cost": {...},
    }
    Returns the union: any BLOCK is BLOCK; else any INVESTIGATE is INVESTIGATE;
    else any HOLD is HOLD; else SHIP.
    """
    decisions = {
        axis: gate(v["delta"], v["ci"][0], v["ci"][1], v["bands"])
        for axis, v in per_axis.items()
    }
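    if not decisions:
        raise ValueError("per_axis is empty: nothing to gate")
    # Ordered most to least severe; the overall band is the worst one seen.
    severity = [Band.BLOCK, Band.INVESTIGATE, Band.HOLD, Band.SHIP]
    overall = min(decisions.values(), key=lambda b: severity.index(b))
    return {"per_axis": decisions, "overall": overall}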

The gate runs at three points: the PR check, the merge-candidate check, and the staging-to-production promotion. Same code, three call sites. The output is a per-axis band and the union; the union determines the action; the per-axis bands explain why.
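
Whichever call site runs the gate, it needs a delta and an interval as input. One common way to get them from per-item eval scores is a paired percentile bootstrap, sketched below with the standard library only. This is an assumption about how the interval is produced, not the method the cited sources prescribe, and `paired_bootstrap_delta` is a hypothetical helper.

```python
import random
from statistics import fmean


def paired_bootstrap_delta(baseline: list[float], candidate: list[float],
                           n_resamples: int = 2000, confidence: float = 0.95,
                           seed: int = 0) -> tuple[float, float, float]:
    """Return (delta, ci_low, ci_high) for mean(candidate) - mean(baseline).

    Scores are paired: index i is the same eval item under both systems.
    Uses a percentile bootstrap over the per-item differences.
    """
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)  # fixed seed so reruns of the gate agree
    means = sorted(
        fmean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    tail = (1 - confidence) / 2
    low_idx = int(tail * n_resamples)
    high_idx = min(n_resamples - 1, int((1 - tail) * n_resamples))
    return fmean(diffs), means[low_idx], means[high_idx]
```

The returned tuple maps directly onto the `delta`, `ci_low`, and `ci_high` arguments of `gate`. Resampling differences rather than each system separately keeps the pairing, which typically narrows the interval when both systems are scored on the same items.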

Override discipline

Three rules for overrides that work in practice.

The first: every override carries a written rationale signed by a named individual, posted as a comment on the gate output. "I am overriding the block on cost because we are shipping a launch-week pricing change and will revert if the higher cost persists" is a fine rationale. "Approved" is not.

The second: overrides are logged in the same place the gate decisions are. A retrospective on "how many releases overrode the gate this quarter" should be a single query, not an archaeology dig.

The third: a sustained pattern of overrides on the same axis is a signal to tighten or loosen the bands, not to keep overriding. If 30% of releases override the cost block, the cost band is wrong for the team's product, and the fix is the band, not the override.
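
If gate decisions and overrides are logged as structured records, spotting that pattern is a small aggregation. The record shape below (an `axis` key and an `overridden` flag, one record per release per axis) is an assumption about the log, not a prescribed schema.

```python
from collections import Counter


def override_rate_by_axis(records: list[dict]) -> dict[str, float]:
    """Fraction of logged gate decisions per axis that were overridden.

    Each record is assumed to look like
    {"release": "v1.4.0", "axis": "cost", "band": "block", "overridden": True}.
    """
    totals: Counter = Counter()
    overrides: Counter = Counter()
    for rec in records:
        totals[rec["axis"]] += 1
        if rec["overridden"]:
            overrides[rec["axis"]] += 1
    return {axis: overrides[axis] / totals[axis] for axis in totals}
```

An axis whose rate stays high quarter after quarter is a candidate for a band review.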

TIP

A release gate without override discipline becomes ceremony. The point of the gate is to make the negotiation explicit; making it skippable defeats the point ³.

The handoff to incident response

The gate is upstream of the runbook. When a regression slips past the gate (because the offline suite did not exercise the failing slice, because the indifference window was too wide, because the gate was overridden), the incident response process picks up where the gate stopped. The incident-response page covers that handoff.

Anthropic's error-bars work is the closest published statement of why these gates need intervals, not point estimates ² ⁴. Hamel Husain's pieces make the broader case for using evals as the merge gate in the first place ⁵ ⁶.

What to do this week

Pick the metrics that gate your next release. Write the indifference window and minor-regression window for each as numbers.
Implement the four-band gate above. Wire it into the same workflow as the CI eval results from the ci-integration page.
Add the override log. The gate decision and any override comment land in the same place, queryable by release tag.

¹ Chip Huyen, "Building A Generative AI Platform" (2024).
² Anthropic, "A Statistical Approach to Model Evaluations."
³ Shreya Shankar, "The AI Engineering Flywheel."
⁴ Miller, "Adding Error Bars to Evals" (Anthropic), arXiv:2411.00640.
⁵ Hamel Husain, "Your AI Product Needs Evals."
⁶ Hamel Husain and Shreya Shankar, "LLM Evals FAQ."