SWE-bench Verified is the OpenAI-curated 500-issue subset of SWE-bench whose test cases and problem statements have been human-verified for clarity. It is the benchmark the major frontier labs report on for autonomous coding [1, 2]. This recipe will walk through the minimum harness needed to run a 20-issue subset against a coding agent of your choice, log the patch and the test results for each issue, and report resolved-issue rate plus cost and trajectory length.
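As a rough sketch of the reporting step, the three headline numbers can be aggregated from a per-issue log like the one below. The `IssueResult` record and its field names are illustrative assumptions, not part of the official harness; the only thing taken from the recipe is the set of metrics (resolved-issue rate, cost, trajectory length).

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class IssueResult:
    """One row of the run log. All field names here are hypothetical."""
    instance_id: str
    resolved: bool     # True if the scoring harness marked the issue resolved
    cost_usd: float    # total API spend for this issue
    num_steps: int     # trajectory length, counted as agent turns


def summarize(results: list[IssueResult]) -> dict[str, float]:
    """Aggregate resolved-issue rate, mean cost, and mean trajectory length."""
    if not results:
        raise ValueError("no results to summarize")
    return {
        "resolved_rate": sum(r.resolved for r in results) / len(results),
        "mean_cost_usd": mean(r.cost_usd for r in results),
        "mean_steps": mean(r.num_steps for r in results),
    }
```

With only 20 issues the resolved rate moves in steps of 5 percentage points, so it is worth reporting the raw count next to the rate.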
Planned outline
- Why Verified, not the original. The original SWE-bench has noisy tests and ambiguous problem statements; reporting numbers on it is misleading. Verified is the cleaned subset.
- Set up Docker images per repo. Each issue has a frozen environment. The harness pulls a prebuilt image per repo.
- Pick a scaffold. The benchmark measures the agent (model + scaffolding), not the model alone. Options: bare SDK loop, Aider, OpenHands, your in-house scaffold.
- Run the subset. 20 issues from the Verified set, single-attempt.
- Score with the official harness. Apply the patch, run the tests, record PASS/FAIL per issue.
- Read the failures. Three patterns: the patch does not apply; the patch applies but the target tests still fail; the patch applies and the target tests pass, but a previously passing test now fails (a regression).
- Cost and trajectory length per issue. Token cost is the second-tier metric.
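The subset-selection and failure-reading steps in the outline can be sketched as follows. This is a minimal illustration under stated assumptions: `pick_subset` and `classify` are hypothetical helpers, not harness functions, and the per-test result dictionaries are an assumed shape. The two test groups mirror the `FAIL_TO_PASS` and `PASS_TO_PASS` fields that SWE-bench attaches to each instance: tests the fix should make pass, and tests that must keep passing.

```python
import random
from enum import Enum


class Outcome(Enum):
    RESOLVED = "resolved"
    PATCH_DID_NOT_APPLY = "patch does not apply"
    TARGET_TESTS_FAIL = "patch applies but target tests still fail"
    REGRESSION = "target tests pass but a previously passing test breaks"


def pick_subset(instance_ids: list[str], k: int = 20, seed: int = 0) -> list[str]:
    """Draw a reproducible k-issue sample; sorting first removes input-order dependence."""
    return random.Random(seed).sample(sorted(instance_ids), k)


def classify(patch_applied: bool,
             fail_to_pass: dict[str, bool],
             pass_to_pass: dict[str, bool]) -> Outcome:
    """Map one issue's results to an outcome (dict values: True = test passed)."""
    if not patch_applied:
        return Outcome.PATCH_DID_NOT_APPLY
    if not all(fail_to_pass.values()):
        return Outcome.TARGET_TESTS_FAIL
    if not all(pass_to_pass.values()):
        return Outcome.REGRESSION
    return Outcome.RESOLVED
```

Fixing the seed keeps the 20-issue sample identical across scaffolds, so differences in resolved rate come from the agent and not from the draw.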
What this will skip in v1
- Full 500-issue runs (too expensive for a tutorial).
- SWE-Lancer (separate recipe, deferred).
- Pass^k on SWE-bench (the literature reports pass@1; consistency on coding tasks is task-specific).
- Multi-SWE-bench across non-Python repos.
The full walkthrough is queued. For now, the autonomous-coding task type covers the conceptual framing and the canonical paper-to-numbers connection [3]. The Anthropic SWE-bench Sonnet report is the right reading on why the scaffold dominates the model [2]. The official site has the harness and the leaderboard [4].