What "good" looks like
The bar is resolved-issue rate on SWE-bench Verified, the 500-problem human-validated subset that OpenAI shipped in August 2024 to address noise in the original SWE-bench, chiefly underspecified issues and tests that reject valid fixes[1]. A resolved issue is one where the hidden tests pass once the model's diff is applied: the issue's failing tests now pass, and the harness's set of previously passing tests still does. There is no partial credit. Anthropic's Claude 3.5 Sonnet report puts a minimal Bash+Edit agent at 49% on Verified, and the writeup is explicit that performance "varies significantly based on this scaffolding" around the model[2]. That single sentence is the reason this metric is hard to interpret in the wild: a 5-point delta between two models can flip if you change the scaffold.
Treat the metric as a property of the agent, not the model. Report the scaffold alongside the score: which tools the model has, what the system prompt says, whether the agent gets one shot or multiple turns with test feedback. Most public leaderboard numbers are not directly comparable because the scaffolds differ.
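One way to make "report the scaffold alongside the score" concrete is to store the two together and refuse to compare results whose scaffolds differ. This is a minimal sketch: the class names, field names, model names, and counts are all illustrative assumptions, not part of any harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scaffold:
    """Everything around the model. Field names are illustrative."""
    tools: tuple[str, ...]      # e.g. ("bash", "edit")
    system_prompt_sha: str      # hash of the system prompt text
    max_turns: int              # 1 means single-shot
    sees_test_feedback: bool    # can the agent run tests and react?


@dataclass(frozen=True)
class Result:
    model: str
    scaffold: Scaffold
    resolved: int
    total: int

    @property
    def rate(self) -> float:
        return self.resolved / self.total


def comparable(a: Result, b: Result) -> bool:
    """A score delta isolates the model only if the scaffold and problem set match."""
    return a.scaffold == b.scaffold and a.total == b.total


# Made-up numbers, for illustration only.
minimal = Scaffold(("bash", "edit"), "abc123", 50, True)
single_shot = Scaffold(("edit",), "abc123", 1, False)

r1 = Result("model-a", minimal, 245, 500)
r2 = Result("model-b", minimal, 230, 500)
r3 = Result("model-b", single_shot, 180, 500)

print(comparable(r1, r2))  # same scaffold: the delta says something about the models
print(comparable(r1, r3))  # different scaffold: the delta is not attributable
```

Hashing the system prompt rather than storing it keeps the record small while still flagging any prompt change as a scaffold change.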
Secondary metrics
- Token cost per problem. SWE-bench Verified scores climb with longer trajectories, so the production signal is tokens spent per resolved issue.
- Trajectory length in turns. A 40-turn solution to a 5-turn problem hides a fragility you will see in real PRs.
- Regression rate: the share of patches that fix the target test but break a previously passing test elsewhere in the suite[3].
- Sampled patch quality, from human review of 5 to 10% of resolved issues. A passing patch can still be wrong-shaped, monkey-patched, or undo a prior fix.
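The first three of these metrics fall out of per-attempt records with a few lines of arithmetic. The sketch below assumes a hypothetical `Attempt` record and made-up numbers. It also makes two choices the text leaves open: failed attempts count toward token cost, and the regression rate is taken over patches that fixed the target test.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Attempt:
    target_fixed: bool         # the issue's failing test now passes
    broke_passing_test: bool   # some previously passing test now fails
    tokens: int
    turns: int


def tokens_per_resolved(attempts):
    """Total tokens (failures included) divided by clean resolutions."""
    clean = sum(1 for a in attempts if a.target_fixed and not a.broke_passing_test)
    total_tokens = sum(a.tokens for a in attempts)
    return total_tokens / clean if clean else float("inf")


def regression_rate(attempts):
    """Share of target-fixing patches that break something else."""
    fixed = [a for a in attempts if a.target_fixed]
    if not fixed:
        return 0.0
    return sum(1 for a in fixed if a.broke_passing_test) / len(fixed)


# Made-up attempts, for illustration only.
attempts = [
    Attempt(True, False, 40_000, 8),
    Attempt(True, True, 90_000, 40),
    Attempt(False, False, 60_000, 25),
    Attempt(True, False, 20_000, 5),
]

print(tokens_per_resolved(attempts))
print(regression_rate(attempts))
print(median(a.turns for a in attempts))
```

Counting failed attempts in the numerator is deliberate: an agent that burns tokens on problems it never solves should look expensive.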
Offline eval design
Run SWE-bench Verified end to end, then add SWE-Lancer Diamond from the OpenAI preparedness repo for higher-difficulty freelance-style tasks where the original benchmark saturates[4]. Both should be sealed: no patches leak into the training pipeline. On top of those two, build a "golden 50" from your team's recently merged PRs: capture the pre-merge state, the failing test, and the merge SHA so the harness can replay deterministically. The replay set is what makes the eval product-specific instead of leaderboard-shaped.
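A golden-set entry can be a small immutable record serialized one per line. The sketch below is an assumption about what such a record might hold: `GoldenTask`, its field names, and the sample values are hypothetical. `issue_text` is an added field, on the assumption that the agent needs a problem statement to work from.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GoldenTask:
    task_id: str
    repo: str
    base_sha: str                    # pre-merge state the agent starts from
    merge_sha: str                   # merged result, kept for replay and audit
    failing_tests: tuple[str, ...]   # test ids that fail at base_sha
    issue_text: str                  # problem statement (assumed field)
    gold_diff: str                   # human-authored diff; never shown to the agent

    def agent_view(self) -> dict:
        """Only the fields that are safe to hand to the agent."""
        return {
            "task_id": self.task_id,
            "repo": self.repo,
            "base_sha": self.base_sha,
            "issue_text": self.issue_text,
        }


def dump(task: GoldenTask) -> str:
    """One JSON line per task, with sorted keys so files diff cleanly."""
    return json.dumps(asdict(task), sort_keys=True)


def load(line: str) -> GoldenTask:
    data = json.loads(line)
    data["failing_tests"] = tuple(data["failing_tests"])  # JSON has no tuples
    return GoldenTask(**data)


# Placeholder values, for illustration only.
task = GoldenTask(
    task_id="golden-001",
    repo="example-org/example-repo",
    base_sha="0" * 40,
    merge_sha="1" * 40,
    failing_tests=("tests/test_parser.py::test_empty_input",),
    issue_text="Parser crashes on empty input.",
    gold_diff="--- a/parser.py\n+++ b/parser.py\n",
)

print(task.agent_view())
```

Keeping the agent-facing view as a separate method gives sealing one enforcement point: anything not returned by `agent_view` stays on the harness side.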
Score with the official SWE-bench harness for parity. Capture the trajectory for every attempt, not just pass/fail, so failures are debuggable later. Pin the model and the scaffold version; both contribute to the score and both move independently.
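The sketch below writes a predictions file and a trajectory file per attempt in one pass. It assumes the JSONL predictions format described in the SWE-bench README (`instance_id`, `model_name_or_path`, `model_patch`); check that against the harness version you install. The `write_run` helper, the directory layout, the naming scheme for the pinned model and scaffold, and the sample attempt are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_run(run_dir, model, scaffold_version, attempts):
    """Write harness-style predictions plus one trajectory file per attempt."""
    run_dir = Path(run_dir)
    traj_dir = run_dir / "trajectories"
    traj_dir.mkdir(parents=True, exist_ok=True)
    preds_path = run_dir / "predictions.jsonl"
    with preds_path.open("w", encoding="utf-8") as preds:
        for a in attempts:
            preds.write(json.dumps({
                "instance_id": a["instance_id"],
                # Pin both halves of the agent in the name the score is filed under.
                "model_name_or_path": f"{model}__{scaffold_version}",
                "model_patch": a["patch"],
            }) + "\n")
            # Keep the full step history even when the patch is empty or fails.
            (traj_dir / f"{a['instance_id']}.json").write_text(
                json.dumps(a["steps"], indent=2), encoding="utf-8")
    return preds_path


# Placeholder attempt, for illustration only.
attempts = [{
    "instance_id": "example__repo-1",
    "patch": "--- a/x.py\n+++ b/x.py\n",
    "steps": [{"tool": "bash", "input": "pytest -x", "output": "1 failed"}],
}]

with tempfile.TemporaryDirectory() as tmp:
    preds_path = write_run(tmp, "model-a", "scaffold-v3", attempts)
    lines = [json.loads(l) for l in preds_path.read_text(encoding="utf-8").splitlines()]
    saved_steps = json.loads(
        (Path(tmp) / "trajectories" / "example__repo-1.json").read_text(encoding="utf-8"))

print(lines[0]["model_name_or_path"])
```

Writing the trajectory next to the prediction, keyed by the same instance id, means a failed instance in the harness report maps directly to the steps that produced it.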
Online eval design
Log every patch attempt with a fixed schema: diff, files touched, tests added or modified, retry count, latency, tokens. Human-review 5 to 10% of accepted patches each week. Alert on three patterns:
- A regression spike: a week-over-week increase in patches that pass the target test but fail any previously passing test.
- The "passes tests but reverts a previous fix" pattern, detectable by git-blame analysis of removed lines.
- Cost-per-resolved-issue drift greater than 20% week over week, which usually indicates either an agent regression or a distribution shift in incoming issues.
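The log schema and two of the three alerts can be sketched in plain Python. `PatchLog`, the outcome flags `passed_target` and `broke_passing_test`, and the alert names are assumptions added for illustration, and the sample weeks are made up. The revert-detection alert is left out because it needs repository history. A production version would likely add a noise floor so that a single extra regression does not page anyone.

```python
from dataclasses import dataclass


@dataclass
class PatchLog:
    diff: str
    files_touched: list
    tests_changed: list        # tests added or modified by the patch
    retry_count: int
    latency_s: float
    tokens: int
    passed_target: bool        # assumed outcome flag
    broke_passing_test: bool   # assumed outcome flag


def regression_share(logs):
    if not logs:
        return 0.0
    return sum(1 for p in logs if p.passed_target and p.broke_passing_test) / len(logs)


def cost_per_resolved(logs):
    resolved = sum(1 for p in logs if p.passed_target and not p.broke_passing_test)
    tokens = sum(p.tokens for p in logs)
    return tokens / resolved if resolved else float("inf")


def weekly_alerts(prev, curr, drift_threshold=0.20):
    alerts = []
    if regression_share(curr) > regression_share(prev):
        alerts.append("regression_spike")
    prev_cost, curr_cost = cost_per_resolved(prev), cost_per_resolved(curr)
    # Skip the drift check when last week has no usable baseline.
    if 0 < prev_cost < float("inf"):
        if abs(curr_cost - prev_cost) / prev_cost > drift_threshold:
            alerts.append("cost_drift")
    return alerts


def _log(tokens, passed_target, broke):
    return PatchLog("", [], [], 0, 0.0, tokens, passed_target, broke)


# Made-up weeks, for illustration only.
last_week = [_log(1000, True, False), _log(1000, True, False)]
this_week = [_log(1500, True, False), _log(1500, True, True)]

print(weekly_alerts(last_week, this_week))
```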
A trajectory viewer beats a metrics dashboard. When the resolved rate drops, the on-call's first action should be replaying ten recent failures with full bash and edit history visible, not staring at a Grafana panel.
Top tools
The official SWE-bench harness is the only way to get comparable Verified numbers. OpenHands and Aider are the two most common open-source agent scaffolds; both are worth comparing on your golden 50 because their tool surfaces differ. Inspect AI from UK AISI provides a research-grade harness with strong sandbox primitives and is the cleanest path for adding new evaluators alongside the standard ones. Braintrust or Langfuse handle trajectory logging in production, including per-step token accounting that the official harness does not capture.
Common pitfalls
- Reporting headline SWE-bench instead of Verified. The original set is contaminated and noisy; numbers there mean less every release.
- Treating "tests pass" as ground truth. Hidden tests can be trivial; patches can monkey-patch tests or revert unrelated assertions. Sample for human review.
- Ignoring agent scaffolding when comparing models. The Sonnet writeup makes the point plainly: the same model, different scaffold, swings the score meaningfully[2].
- Skipping the regression check. A patch that resolves the issue but breaks another test is a production incident, not a win.
Canonical reading
Anthropic's SWE-bench Sonnet report is the most useful single reading because it makes the scaffolding point explicit and shows a minimal Bash+Edit agent reaching competitive numbers without elaborate planning[2]. Read it alongside the original Jimenez et al. paper for the benchmark design[3] and the SWE-bench Verified site notes for what changed in the 500-problem subset[5].
Day-1 starter
- Pick 10 recently merged PRs from your repo. For each, save the pre-merge SHA, the failing test, and the human-authored diff as gold. Replay them as SWE-bench-style tasks via the official harness.
- Wire pass/fail to your CI on a small smoke subset (3 to 5 problems) per PR; run the full set nightly. Hamel Husain's field guide framing applies directly: remove friction from looking at failed trajectories[6].
- Stand up a trajectory viewer. At minimum, the bash log and the unified diff for every attempt should be one click from the score. The viewer is the eval; the score is the headline.
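For the per-PR smoke subset, one option is a selection that is stable across runs and machines, so a red CI check always refers to the same problems. This is a sketch under that assumption. The `smoke_subset` helper and the `salt` parameter (which could be rotated to refresh the subset) are hypothetical.

```python
import hashlib


def smoke_subset(task_ids, k=4, salt=""):
    """Pick k task ids by hash order, independent of input order."""
    def key(task_id):
        return hashlib.sha256((salt + task_id).encode("utf-8")).hexdigest()
    return sorted(task_ids, key=key)[:k]


task_ids = [f"golden-{i:03d}" for i in range(10)]
subset = smoke_subset(task_ids)
print(subset)
```

Hash ordering avoids both a hand-picked subset that drifts toward easy problems and a random one that makes CI failures hard to reproduce.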