Spec generation agents

Evaluating spec and design-doc generation: SpecBench, ambiguity detection, EARS acceptance criteria, and spec2code downstream success.

What "good" looks like

The test of a spec generation agent is whether it catches the omissions, ambiguities, inconsistencies, and incorrect assumptions that real maintainers caught in review. Score the artifact against expert critiques, then confirm the score means something by handing the spec to a coding agent and measuring what builds.

SpecBench (2026) operationalizes the first half. It derives tasks from the RFC processes of Kubernetes, React, Rust, TVM, and vLLM: the agent receives an initial design proposal plus the codebase and is judged against the actual expert critiques those proposals drew in RFC history. The best agent scores 44.4%; more than half of what experienced maintainers caught goes unnoticed^[1]. Be plain about the state of the field: no canonical benchmark exists for free-form product specs. The reusable pattern is SpecBench's method, mining real review comments as ground truth and scoring recall against them.
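The recall scoring described above can be sketched in a few lines. This is a minimal sketch, not SpecBench's actual scorer: the `Critique` schema, the category labels (borrowed from the defect taxonomy used on this page), and the `matched_ids` input are all assumptions. The matching step itself, deciding whether an agent comment covers an expert critique, is left to a judge outside the code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Critique:
    """One defect a human reviewer raised on a proposal (hypothetical schema)."""
    critique_id: str
    category: str  # e.g. "omission", "ambiguity", "inconsistency", "wrong_assumption"


def critique_recall(expert: list[Critique], matched_ids: set[str]) -> float:
    """Fraction of expert critiques that the agent also raised.

    `matched_ids` holds the ids a judge marked as covered by the agent's
    output; producing that set is outside this sketch.
    """
    if not expert:
        raise ValueError("need at least one expert critique to score recall")
    hits = sum(1 for c in expert if c.critique_id in matched_ids)
    return hits / len(expert)


def recall_by_category(expert: list[Critique], matched_ids: set[str]) -> dict[str, float]:
    """Recall split by defect category, to show which kind of defect is missed."""
    grouped: dict[str, list[Critique]] = {}
    for c in expert:
        grouped.setdefault(c.category, []).append(c)
    return {cat: critique_recall(items, matched_ids) for cat, items in grouped.items()}
```

Splitting recall by category is a design choice, not part of the benchmark as described here: an aggregate number hides whether the agent misses omissions or wrong assumptions.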

This page owns the artifact between requirements and code. The implementation loop that consumes the spec is Autonomous coding; the quality of critiques on someone else's change is Code review.

Secondary metrics

Ambiguity detection as precision and recall. ReqEval provides 200 requirement sentences, roughly half ambiguous, with a cross-domain generalization test; detectors that look strong in-domain degrade when the domain shifts^[2].
Testability of acceptance criteria. Kiro's requirements.md format uses EARS-notation acceptance criteria because each clause maps to a test^[3]; score the fraction of criteria that a test can actually be written against.
Intrinsic completeness, consistency, and correctness rubric. Krishna et al. found GPT-4 SRS drafts comparable to those of entry-level engineers, and found the model capable of giving constructive validation feedback on existing drafts^[4].
For the formal-spec angle, SpecEval scores JML-style specification generation; six SOTA LLMs rate below satisfactory and are sensitive to semantics-preserving perturbations. Read it for what it measures, which is code comprehension expressed through specs, not requirements authoring^[5].
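A cheap structural pre-filter for the testability metric is to check whether each acceptance criterion fits one of the basic EARS templates. The sketch below is an approximation under stated assumptions: the regexes are hand-written stand-ins for the EARS patterns, not a parser shipped by Kiro, and matching a template only suggests a criterion is testable. Whether a test can actually be written still needs a human or a test-writing agent.

```python
import re
from typing import Optional

# Rough regexes for the five basic EARS templates. Order matters: the
# keyword-led patterns are tried before the catch-all "ubiquitous" form.
EARS_PATTERNS = {
    "event_driven": re.compile(r"^when\b.+,\s*.+\bshall\b.+", re.IGNORECASE),
    "state_driven": re.compile(r"^while\b.+,\s*.+\bshall\b.+", re.IGNORECASE),
    "unwanted_behaviour": re.compile(r"^if\b.+,\s*then\b.+\bshall\b.+", re.IGNORECASE),
    "optional_feature": re.compile(r"^where\b.+,\s*.+\bshall\b.+", re.IGNORECASE),
    "ubiquitous": re.compile(r"^the\b.+\bshall\b.+", re.IGNORECASE),
}


def classify_ears(criterion: str) -> Optional[str]:
    """Return the name of the first EARS template the criterion fits, or None."""
    text = criterion.strip()
    for name, pattern in EARS_PATTERNS.items():
        if pattern.match(text):
            return name
    return None


def ears_conformance(criteria: list[str]) -> float:
    """Fraction of criteria that fit any EARS template (a proxy, not a proof)."""
    if not criteria:
        raise ValueError("no acceptance criteria to score")
    return sum(classify_ears(c) is not None for c in criteria) / len(criteria)
```

Criteria that return `None` are the ones to send to review first: they are the likeliest to be untestable as written.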

Offline eval design

The extrinsic metric is the headline: feed the generated spec to a coding agent and measure build and test pass rate. Commit0 is the cleanest instrument. Agents implement 54 Python libraries from scratch, working from API specs plus interactive unit tests; no agent fully reproduces any library, and interactive feedback measurably raises pass rates^[6]. The direction of the arrow matters. SWE-bench measures issue-to-patch against an existing codebase^[7]; spec2code runs from spec to implementation, which puts the spec itself under test. If the coding agent fails on the generated spec but succeeds on the human-written one for the same feature, the spec is the defect.
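The attribution rule in the last sentence can be made mechanical once both specs for a feature have been run through the same coding agent and test suite. This is a minimal sketch under assumptions: `Spec2CodeRun` is a hypothetical record shape, not the Commit0 harness's output format, and the `margin` threshold is an arbitrary default to tune on your own data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec2CodeRun:
    """Result of one coding-agent run against one spec (hypothetical schema)."""
    feature: str
    spec_source: str  # "generated" or "human"
    tests_passed: int
    tests_total: int

    @property
    def pass_rate(self) -> float:
        return self.tests_passed / self.tests_total if self.tests_total else 0.0


def spec_attributed_failures(runs: list[Spec2CodeRun], margin: float = 0.2) -> list[str]:
    """Features where the human spec beats the generated spec by more than `margin`.

    Only features with both a "generated" and a "human" run are compared;
    the rest cannot be attributed either way.
    """
    by_feature: dict[str, dict[str, float]] = {}
    for run in runs:
        by_feature.setdefault(run.feature, {})[run.spec_source] = run.pass_rate
    flagged = []
    for feature, rates in sorted(by_feature.items()):
        if "generated" in rates and "human" in rates:
            if rates["human"] - rates["generated"] > margin:
                flagged.append(feature)
    return flagged
```

Holding the coding agent and the test suite fixed across both runs is what lets the gap be blamed on the spec and not on the implementer.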

For the intrinsic half, build a golden-spec set from shipped specs plus their real review threads; start small and grow it as new specs ship. Run both metric families and track their correlation on your own data. Intrinsic rubric scores are cheap and tell you which section of the spec is weak; extrinsic pass rates are expensive and are the only ones that predict shipping. Where the intrinsic score climbs and the extrinsic rate does not, the rubric is being gamed.

Online eval design

Clarification rate first: does the agent flag underspecification instead of guessing? GitHub's Spec Kit operationalizes this with [NEEDS CLARIFICATION] markers in generated specs, plus /speckit.clarify and /speckit.checklist for resolving them and /speckit.analyze for cross-artifact consistency checks^[8]. Track marker density and resolution rate; an agent that never asks is guessing. Then measure downstream rework: the spec-change rate after implementation begins, and the share of implementation rework attributable to a spec defect rather than a coding defect. Sample specs that produced high rework for human review; the defect taxonomy (omission, ambiguity, inconsistency, wrong assumption) mirrors the offline rubric, so the two datasets compound.
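Marker density and resolution rate can be read straight off the spec text. This is a rough sketch, not part of Spec Kit: the regex assumes markers look like `[NEEDS CLARIFICATION]`, optionally with a `: question` suffix inside the brackets. Both metric definitions (markers per 100 non-blank lines, and share of draft markers gone from the final spec) are assumptions chosen for illustration.

```python
import re
from typing import Optional

# Assumed marker shape: "[NEEDS CLARIFICATION]" or "[NEEDS CLARIFICATION: ...]".
MARKER_RE = re.compile(r"\[NEEDS CLARIFICATION(?::[^\]]*)?\]")


def count_markers(spec_text: str) -> int:
    """Number of clarification markers in a spec."""
    return len(MARKER_RE.findall(spec_text))


def marker_density(spec_text: str, per_lines: int = 100) -> float:
    """Markers per `per_lines` non-blank lines of spec."""
    lines = [ln for ln in spec_text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return count_markers(spec_text) * per_lines / len(lines)


def resolution_rate(draft_text: str, final_text: str) -> Optional[float]:
    """Share of draft markers no longer present in the final spec.

    Returns None when the draft had no markers, so "never asked" is kept
    distinct from "asked and resolved nothing".
    """
    opened = count_markers(draft_text)
    if opened == 0:
        return None
    remaining = count_markers(final_text)
    return max(0, opened - remaining) / opened
```

A `None` resolution rate across many specs is itself the warning sign: it means the agent is not flagging anything.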

Top benchmarks

SpecBench, https://arxiv.org/abs/2605.30314. RFC-derived design proposals from Kubernetes, React, Rust, TVM, and vLLM, judged against the real expert critiques from review history^[1].
Commit0, https://arxiv.org/abs/2412.01769. 54 Python libraries built from API specs plus interactive unit tests; the extrinsic spec2code instrument^[6].
ReqEval, https://nlp4re.github.io/2020/reqeval.html. 200 requirement sentences for ambiguity detection, with a cross-domain generalization split^[2].

Tools especially suited

GitHub Spec Kit for the operational loop: [NEEDS CLARIFICATION] markers, checklists, and cross-artifact analysis give you measurable events for free^[8]. Kiro for EARS-notation requirements that map one-to-one to tests, which makes the testability metric mechanical^[3]. The Commit0 harness for the spec2code extrinsic check. Inspect AI for rubric scoring with a judge from a different model family than the generator.

Common pitfalls

No golden set. Because no canonical benchmark for product specs exists, teams that skip building their own golden set end up with judge scores anchored to nothing; the shipped-spec-plus-review-thread set is the fix.
Intrinsic scores standing in for downstream success. LLM-judge rubric scores do not predict build success; always pair them with the extrinsic pass rate, and treat divergence as the finding.
Self-evaluation bias. When the same model writes and reviews the spec, scores inflate; the judge must come from a different model family, and periodic human calibration is not optional.
Assuming ambiguity detectors transfer. ReqEval's cross-domain result is that they degrade^[2]; revalidate on sentences from your own domain before trusting the precision number.
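Revalidating an ambiguity detector on your own domain comes down to computing precision and recall per domain, not pooled. A minimal sketch, assuming each labeled sentence arrives as a `(domain, predicted_ambiguous, truly_ambiguous)` tuple; the tuple layout and function name are hypothetical, not ReqEval's format.

```python
from collections import defaultdict
from typing import Iterable, Optional


def precision_recall_by_domain(
    rows: Iterable[tuple[str, bool, bool]],
) -> dict[str, tuple[Optional[float], Optional[float]]]:
    """Per-domain (precision, recall) for an ambiguity detector.

    A value is None when its denominator is zero, for example precision
    in a domain where the detector flagged nothing.
    """
    counts: dict[str, dict[str, int]] = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for domain, predicted, actual in rows:
        c = counts[domain]  # touch the domain so all-negative domains still appear
        if predicted and actual:
            c["tp"] += 1
        elif predicted:
            c["fp"] += 1
        elif actual:
            c["fn"] += 1
    result = {}
    for domain, c in counts.items():
        flagged = c["tp"] + c["fp"]
        ambiguous = c["tp"] + c["fn"]
        precision = c["tp"] / flagged if flagged else None
        recall = c["tp"] / ambiguous if ambiguous else None
        result[domain] = (precision, recall)
    return result
```

Reporting per domain keeps a strong in-domain number from masking a weak one on the sentences you actually care about.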

Canonical reading

SpecBench is the canonical read because it solves the ground-truth problem that blocks this whole category: instead of asking a judge what a good design doc looks like, it asks whether the agent found what Kubernetes and Rust maintainers actually found^[1]. The best agent's 44.4% is the honest baseline to beat, and the method of mining real reviews is the part to copy for your own specs.

Day-1 starter

Collect 15 shipped specs along with their real review comments; the human-caught defects are your ground truth, no synthetic labels needed.
Have the agent critique or regenerate each spec, then score recall of the human-caught defects with a judge from a different model family.
Run one spec2code extrinsic check: hand a generated spec to your coding agent and score it against a hidden test suite the spec author never saw.

Cross-cutting links

Autonomous coding: the implementation loop that consumes the spec this page evaluates.
Code review: critique quality on code changes; the sibling capability.
Rubric design: building the completeness, consistency, and correctness rubric.
LLM-as-judge biases and mitigations: the self-evaluation bias this page warns about, treated in full.
SWE-bench Verified walkthrough: the coding-agent harness to reuse for spec2code checks.
Inspect AI: the harness for rubric scoring with judge and generator separated.
