A regression set is not your eval set. It is a smaller, focused file whose only job is to catch the bug you already fixed when it comes back. Every team that has shipped AI for more than a quarter rediscovers this distinction the hard way: the general eval set passes, a release ships, and a user files the same complaint you closed two months ago. The fix is to write down the failing case the day you fix it and run it against every release forever.
What it is and is not
The general eval set probes the failure surface: balanced, sized for power, evolving. The regression set probes specific known failures: biased on purpose, unbalanced on purpose, append-mostly. Items get retired only when the underlying capability is structurally impossible to break.
The regression set runs in CI on every PR and should pass at a hundred percent. Not 98, not 99: a hundred. A flake here is a process failure, not a measurement failure. If the regression set is noisy enough that engineers learn to ignore red builds, you have a worse problem than the regressions you are trying to prevent [1].
What goes in
Three categories belong, and almost nothing else.
Reproduced production bugs. Every customer-visible failure gets a one-paragraph postmortem and a labeled case before the fix ships. The case carries the input that triggered the bug, the wrong output that shipped, the right output the fix produces, and a link to the postmortem [2].
Policy boundaries. Every refusal pattern your product is supposed to maintain (no medical advice, no competitor code, no PII to an unauthorized session) gets a small set of probes from inside and outside the boundary. These drift if you do not pin them.
Hard-won correctness invariants. The query that broke when you upgraded models last quarter. The two-turn conversation that surfaced an off-by-one in message-history truncation. The malformed JSON that crashed the tool router.
What does not go in: aspirational examples (if the product fails today, the regression set is not where it goes), examples that depend on live production state, and examples whose correct answer is debatable.
A capture workflow
The bottleneck is routine, not technology. Capture has to be cheaper than the engineer's tolerance for forgetting.
# capture_regression.py
import json, uuid
from datetime import date
from pathlib import Path
REGRESSION_PATH = Path("evals/data/regression/v1/cases.jsonl")
def capture(*, trace_id, input_payload, expected_output, bug_link, category, author):
record = {
"id": f"reg-{date.today().isoformat()}-{uuid.uuid4().hex[:6]}",
"trace_id": trace_id,
"input": input_payload,
"expected_output": expected_output,
"category": category, # "prod-bug" | "policy" | "invariant"
"bug_link": bug_link,
"captured_by": author,
"captured_at": date.today().isoformat(),
}
with REGRESSION_PATH.open("a") as f:
f.write(json.dumps(record, sort_keys=True) + "\n")
A failure produces the trace id, the bug link, and the diff between expected and actual output. The on-call engineer attributes the regression to the offending change in under a minute.
Tiering
Past a few hundred cases, the set starts to slow CI. Tier it: a smoke tier on every PR (top thirty), a full tier nightly and on release candidates, and an extended tier weekly with adversarial probes. Cases that have caught regressions in the last quarter belong in smoke; cases green for a year demote to nightly.
Versioning
The hashing discipline that applies to the general set applies here. Pin the regression hash to every release. When you intentionally retire a case, the migration note explains why; when a case starts failing because the product changed shape, you bump a version and document the intentional change. The bias toward the past is a feature, but it should not become a brake on legitimate product evolution.
What good looks like
A team that has internalized regression discipline ships a postmortem and a regression case together. Every bug closed last quarter has a labeled case in evals/data/regression/. CI green means both general and regression sets passed. Customer complaints about previously-fixed bugs are rare; when one slips through, the response is a new regression case before the chat ticket is closed [3] [4].