Autonomous coding

Given an issue and a repo, autonomously edit files until hidden tests pass. Execution-grounded scoring on SWE-bench Verified is the bar; agent scaffolding matters as much as the model.

What "good" looks like

Autonomous coding is scored by execution: the agent edits files in a repo, and a hidden test patch either passes against its diff or does not. The bar is resolved-issue rate on SWE-bench Verified, the 500-problem human-validated subset that OpenAI shipped in August 2024 to address noise and contamination concerns in the original SWE-bench^[1].

The ship targets:

An issue counts as resolved only when the agent's diff edits the right files and passes the hidden test patch. There is no partial credit.
A minimal Bash+Edit agent reaches 49% on Verified in Anthropic's Claude 3.5 Sonnet report^[2].
Compare models only at a fixed scaffold: a 5-point gap between two models can flip if you change the scaffold.

The score is a property of the agent, not the model. The Sonnet writeup is explicit that performance "varies significantly based on this scaffolding" around the model^[2]. That single sentence is why this metric is hard to interpret in the wild.

So report the scaffold with the score: which tools the model has, what the system prompt says, whether the agent gets one shot or multiple turns with test feedback. Most public leaderboard numbers are not directly comparable because the scaffolds differ.
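
One way to keep the scaffold attached to the score is to emit both in a single record. The sketch below is illustrative only: `ScaffoldSpec`, `RunReport`, and every field name are hypothetical and not part of the SWE-bench harness, and the values in the example are placeholders rather than measured results.

```python
import json
from dataclasses import dataclass, asdict
from typing import Tuple


@dataclass(frozen=True)
class ScaffoldSpec:
    """Everything around the model that can move the score (hypothetical schema)."""
    name: str
    version: str
    tools: Tuple[str, ...]        # e.g. ("bash", "edit")
    system_prompt_sha256: str     # hash of the prompt, so prompt changes are visible
    max_turns: int
    sees_test_feedback: bool      # one shot vs. multiple turns with test output


@dataclass(frozen=True)
class RunReport:
    benchmark: str
    model: str
    scaffold: ScaffoldSpec
    resolved: int
    total: int

    @property
    def resolved_rate(self) -> float:
        # No partial credit: each problem contributes 0 or 1.
        return self.resolved / self.total

    def to_json(self) -> str:
        payload = asdict(self)
        payload["resolved_rate"] = round(self.resolved_rate, 4)
        return json.dumps(payload, sort_keys=True)


# Placeholder values for illustration only.
example_report = RunReport(
    benchmark="SWE-bench Verified",
    model="example-model",
    scaffold=ScaffoldSpec(
        name="minimal-bash-edit",
        version="0.1.0",
        tools=("bash", "edit"),
        system_prompt_sha256="0" * 64,
        max_turns=50,
        sees_test_feedback=True,
    ),
    resolved=245,
    total=500,
)
```

Two reports are comparable only when their `scaffold` fields match; when they differ, the gap between the scores reflects the scaffold as well as the model.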

One caveat on Verified itself: frontier agents are approaching saturation on it. Check swebench.com for live standings rather than a number quoted in a blog post, and know that SWE-bench Pro exists precisely because of the contamination and saturation concerns around Verified.

flowchart TD
    ISSUE["Issue + repo"] --> AGENT["Agent: model + scaffold"]
    AGENT --> DIFF["Model diff"]
    DIFF --> HIDDEN{"Hidden test patch passes?"}
    HIDDEN -->|no| FAIL["Not resolved: no partial credit"]
    HIDDEN -->|yes| REG{"Any previously passing test broken?"}
    REG -->|yes| INCIDENT["Regression: a production incident, not a win"]
    REG -->|no| REVIEW["Human-review 5 to 10% of resolved patches"]
    REVIEW --> SHIP["Report resolved rate + scaffold version"]
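
The decision flow in the diagram can be sketched as a small classifier. This is a minimal sketch under stated assumptions: the function, the enum, and the shape of the inputs (test names mapped to pass/fail booleans) are hypothetical and are not the official harness's API.

```python
from enum import Enum
from typing import Mapping, Set


class Outcome(Enum):
    NOT_RESOLVED = "not_resolved"   # hidden test patch did not pass
    REGRESSION = "regression"       # hidden tests pass, but something else broke
    RESOLVED = "resolved"           # eligible for sampled human review


def classify(
    hidden_tests: Set[str],
    passing_before: Set[str],
    results_after: Mapping[str, bool],
) -> Outcome:
    """Classify one patch attempt from test results gathered after applying the diff.

    A test missing from results_after is treated as failing (assumption).
    """
    if not hidden_tests:
        raise ValueError("a task needs at least one hidden test")

    # Step 1: every hidden test must pass; there is no partial credit.
    if not all(results_after.get(test, False) for test in hidden_tests):
        return Outcome.NOT_RESOLVED

    # Step 2: nothing that passed before the patch may fail after it.
    broken = {t for t in passing_before if not results_after.get(t, False)}
    if broken:
        return Outcome.REGRESSION

    return Outcome.RESOLVED
```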

Secondary metrics

Metric	What it catches	How to score
Token cost per problem	Scores that climb only with longer trajectories	Total tokens spent divided by issues resolved; this is the production cost signal
Trajectory length in turns	Fragility hidden by a 40-turn solution to a 5-turn problem	Count turns per resolved issue
Regression rate	A patch that fixes the target test but breaks a previously passing test elsewhere in the suite^[3]	Re-run the full suite after the patch
Sampled patch quality	Passing patches that are wrong-shaped, monkey-patched, or undo a prior fix	Human review on 5 to 10% of resolved issues
Reliability over trials	Runs that succeed once but not repeatably	pass^k from tau-bench: probability all k independent attempts succeed^[4]
Long-horizon capability	Whether the agent handles longer tasks over time	METR's 50%-task-completion time horizon, doubling roughly every 7 months^[5]
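
Two of the metrics above reduce to short formulas. The sketch below assumes the combinatorial estimator for pass^k, C(c, k) / C(n, k) for a task with c successes in n trials, averaged over tasks; the function names are hypothetical, and unresolved attempts are assumed to count toward token cost.

```python
from math import comb
from typing import Iterable, Optional, Sequence, Tuple


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that k independent attempts all succeed,
    given c successes observed in n trials of one task."""
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    # comb(c, k) is 0 when c < k, so too few successes yields 0.0.
    return comb(c, k) / comb(n, k)


def mean_pass_hat_k(per_task: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass^k over tasks, each given as (n_trials, n_successes)."""
    scores = [pass_hat_k(n, c, k) for n, c in per_task]
    if not scores:
        raise ValueError("no tasks supplied")
    return sum(scores) / len(scores)


def tokens_per_resolved(tokens: Sequence[int], resolved: Sequence[bool]) -> Optional[float]:
    """Total tokens across all attempts divided by the number of resolved issues.

    Failed attempts still cost tokens, so they stay in the numerator.
    Returns None when nothing was resolved.
    """
    if len(tokens) != len(resolved):
        raise ValueError("tokens and resolved must align")
    n_resolved = sum(resolved)
    return sum(tokens) / n_resolved if n_resolved else None
```

With k = 1 the estimator is the plain success rate; it falls quickly as k grows whenever successes are not repeatable, which is the fragility the metric is meant to expose.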

Offline eval design

Run the benchmarks sealed: no benchmark patches may leak into the training pipeline. Run SWE-bench Verified end to end, then add SWE-Lancer Diamond from the OpenAI preparedness repo for higher-difficulty freelance-style tasks as Verified saturates^[6].

Build a "golden 50" from your team's recently merged PRs. Capture the pre-merge state, the failing test, and the merge SHA so the harness can replay deterministically. The replay set is what makes the eval product-specific instead of leaderboard-shaped.
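
A golden-set entry can be as small as the record below. This is a sketch, not a harness format: `GoldenTask` and its field names are hypothetical, and the validation assumes 40-character SHA-1 commit hashes, since abbreviated hashes and branch names do not guarantee a deterministic replay.

```python
import re
from dataclasses import dataclass

# Full SHA-1 commit hash (assumption: the repo does not use SHA-256 object names).
_FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


@dataclass(frozen=True)
class GoldenTask:
    task_id: str
    repo: str
    base_sha: str       # pre-merge state the agent starts from
    merge_sha: str      # merged commit whose diff is kept as the gold patch
    failing_test: str   # test that fails at base_sha and passes at merge_sha

    def validate(self) -> None:
        """Raise ValueError if the record cannot be replayed deterministically."""
        for label, sha in (("base_sha", self.base_sha), ("merge_sha", self.merge_sha)):
            if not _FULL_SHA.match(sha):
                raise ValueError(f"{label} must be a full 40-character commit hash")
        if self.base_sha == self.merge_sha:
            raise ValueError("base_sha and merge_sha must differ")
        if not self.failing_test.strip():
            raise ValueError("failing_test must name a test")
```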

Score with the official SWE-bench harness for parity. Capture the trajectory for every attempt, not just pass/fail, so failures are debuggable later. Pin the model and the scaffold version; both contribute to the score and both move independently.

Online eval design

Log every patch attempt with a fixed schema: diff, files touched, tests added or modified, retry count, latency, tokens. Human-review 5 to 10% of accepted patches each week.
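
The fixed schema and the review sample can be sketched together. The field names below mirror the list above but are otherwise hypothetical, and hash-based sampling is one possible choice (an assumption, not something the text prescribes): it makes selection reproducible, so the same attempt is always in or out of the review queue.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass(frozen=True)
class PatchAttempt:
    attempt_id: str
    diff: str
    files_touched: List[str]
    tests_changed: List[str]    # tests added or modified by the patch
    retry_count: int
    latency_s: float
    tokens: int
    accepted: bool


def to_log_line(attempt: PatchAttempt) -> str:
    """Serialize one attempt as a single JSON line with stable key order."""
    return json.dumps(asdict(attempt), sort_keys=True)


def selected_for_review(attempt_id: str, rate: float = 0.05) -> bool:
    """Deterministically pick roughly `rate` of attempts for human review."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(attempt_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Raising `rate` from 0.05 to 0.10 keeps every attempt that was already selected and adds new ones, so reviewers never lose items when the sample is widened.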

Alert on three patterns:

A regression spike, defined as a week-over-week increase in patches that pass the target test but fail any previously passing test.
The "passes tests but reverts a previous fix" pattern, detectable by git-blame analysis of removed lines.
Cost-per-resolved-issue drift greater than 20% week over week, which usually indicates either an agent regression or a distribution shift in incoming issues.
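
Two of these alerts are plain week-over-week comparisons. The sketch below makes several assumptions: cost drift is treated as a change in either direction, the regression spike is measured on the rate rather than the raw count, and `min_increase` is a hypothetical tuning knob, since the text sets no threshold for a spike. The revert pattern needs repository history and is not covered here.

```python
from typing import Tuple


def relative_change(previous: float, current: float) -> float:
    """Fractional change from the previous week to the current one."""
    if previous <= 0:
        raise ValueError("previous must be positive")
    return (current - previous) / previous


def cost_drift_alert(prev_cost: float, cur_cost: float, threshold: float = 0.20) -> bool:
    """Fire when cost per resolved issue moves by more than `threshold`."""
    return abs(relative_change(prev_cost, cur_cost)) > threshold


def regression_rate(regressing: int, accepted: int) -> float:
    """Share of accepted patches that broke a previously passing test."""
    return regressing / accepted if accepted else 0.0


def regression_spike_alert(
    prev_week: Tuple[int, int],
    cur_week: Tuple[int, int],
    min_increase: float = 0.0,
) -> bool:
    """Each week is (regressing_patches, accepted_patches).

    Fires when the regression rate rises by more than `min_increase`.
    """
    return regression_rate(*cur_week) - regression_rate(*prev_week) > min_increase
```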

A trajectory viewer beats a metrics dashboard. When the resolved rate drops, the on-call's first action should be replaying ten recent failures with full bash and edit history visible, not a Grafana panel.

Top benchmarks

SWE-bench Verified, https://www.swebench.com/verified.html. 500 human-vetted problems; the public standard for ranking coding agents.
SWE-bench (original), https://www.swebench.com/. 2,294 problems from 12 Python repos; noisier and contaminated, used only when historical comparison demands it.
SWE-Lancer Diamond, https://github.com/openai/preparedness. The open-sourced split of SWE-Lancer, a benchmark of 1,400+ Upwork-sourced freelance tasks; harder distribution than Verified.
SWE-bench Pro, https://labs.scale.com/leaderboard/swe_bench_pro_public. The contamination-resistant successor: 1,865 problems from 41 actively maintained repos, split into public, held-out, and commercial sets. Deliberately harder and less contaminated than Verified^[7].
Terminal-Bench, https://www.tbench.ai/. Agent tasks in a real terminal environment; the hard end includes kernel builds and server configuration, which stress a different tool surface than repo editing^[8].
Commit0, https://arxiv.org/abs/2412.01769. Greenfield generation rather than issue fixing: implement 54 Python libraries from scratch against API specs and interactive unit tests. No agent fully reproduces any library, and interactive test feedback measurably raises pass rates^[9].

Tools especially suited

SWE-bench harness: the only way to get comparable Verified numbers.
OpenHands: one of the two most common open-source agent scaffolds; worth comparing on your golden 50 because tool surfaces differ.
Aider: the other common open-source scaffold to compare on your golden 50.
Inspect AI: UK AISI's research-grade harness with strong sandbox primitives; the cleanest path for adding new evaluators alongside the standard ones.
Braintrust or Langfuse: production trajectory logging, including per-step token accounting the official harness does not capture.

Common pitfalls

Pitfall	Why it bites	Fix
Reporting headline SWE-bench instead of Verified	The original set is contaminated and noisy; numbers there mean less every release	Rank on the 500-problem Verified subset
Treating "tests pass" as ground truth	Hidden tests can be trivial; patches can monkey-patch tests or revert unrelated assertions	Sample for human review
Ignoring agent scaffolding when comparing models	The same model, different scaffold, swings the score meaningfully^[2]	Report the scaffold and pin its version
Skipping the regression check	A patch that resolves the issue but breaks another test is a production incident, not a win	Re-run the full suite and alert on regressions

Canonical reading

Anthropic's SWE-bench Sonnet report is the most useful single reading because it makes the scaffolding point explicit and shows a minimal Bash+Edit agent reaching competitive numbers without elaborate planning^[2]. Read it alongside the original Jimenez et al. paper for the benchmark design^[3] and the SWE-bench Verified site notes for what changed in the 500-problem subset^[10].

Day-1 starter

Pick 10 recently merged PRs from your repo. For each, save the pre-merge SHA, the failing test, and the human-authored diff as gold. Replay them as SWE-bench-style tasks via the official harness.
Wire pass/fail to your CI on a small smoke subset (3 to 5 problems) per PR; run the full set nightly. Hamel Husain's field guide framing applies directly: remove friction from looking at failed trajectories^[11].
Stand up a trajectory viewer. At minimum, the bash log and the unified diff for every attempt should be one click from the score. The viewer is the eval; the score is the headline.
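
The smoke subset in the second step can be chosen deterministically. The sketch below is one possible policy, not the only reasonable one: it assumes you want the subset to rotate across PRs so the whole replay set gets exercised over time, and `smoke_subset` and its parameters are hypothetical. A fixed, hand-picked subset is an equally valid choice.

```python
import hashlib
from typing import Iterable, List


def smoke_subset(task_ids: Iterable[str], pr_id: str, size: int = 4) -> List[str]:
    """Pick `size` tasks for a PR, stable for that PR and independent of input order.

    Each task is ranked by a hash of (pr_id, task_id), so re-running CI on the
    same PR replays the same tasks, while different PRs sample different ones.
    """
    if size < 1:
        raise ValueError("size must be at least 1")

    def rank(task_id: str) -> str:
        return hashlib.sha256(f"{pr_id}:{task_id}".encode("utf-8")).hexdigest()

    # De-duplicate first so a repeated id cannot take two slots.
    return sorted(set(task_ids), key=rank)[:size]
```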

Cross-cutting links

Trajectory-level evaluation: scoring the steps that produced the diff, not just the diff.
Wiring evals into CI: the smoke-subset-per-PR, full-set-nightly pattern in detail.
Agent trajectory evaluation: the technique reference behind the trajectory viewer.
Inspect AI: the research-grade harness with the sandbox primitives coding agents need.
Braintrust: trajectory logging with per-step token accounting.
SWE-bench Verified walkthrough: the day-1 starter as a runnable recipe.
Spec generation agents: the artifact upstream of the implementation loop.
Agent memory: reliability across sessions and trials.