SWE-bench Verified walkthrough

Run a repo-stratified 25-instance SWE-bench Verified subset with mini-SWE-agent, verify patches with the official Docker harness, and read the report without over-claiming.

SWE-bench Verified is the 500-instance subset of SWE-bench that OpenAI had 93 professional Python developers screen for quality: clear problem statements, fair tests, solvable tasks ¹. It is the number frontier labs most often report for autonomous coding. This recipe runs a 25-instance subset end to end: harness sanity check, defensible instance selection, patch generation with mini-SWE-agent, pass verification with the official Docker harness, and a sober read of what the resulting number means.

The pipeline has two halves that people constantly conflate. The agent half generates a patch per issue. The harness half applies each patch in a frozen Docker environment and runs the repo's tests. Only the harness half decides resolved or unresolved. Any setup where the agent grades itself is not a SWE-bench number.

Why Verified, not the original

The original SWE-bench scraped 2,294 issue-plus-patch pairs from 12 Python repos; the best model at publication, Claude 2, resolved 1.96% ². The scrape quality was mixed. When OpenAI's annotators reviewed a 1,699-sample slice, 38.3% of samples got flagged for underspecified problem statements and 61.1% for unit tests that can reject valid solutions ¹. On the original set, a wrong-looking failure is often the benchmark's fault. On Verified it is almost always your agent's fault, which is the property you want when you are debugging a scaffold. GPT-4o's resolved rate doubled from 16% to 33.2% just by removing broken instances ¹.

One framing to keep in mind throughout: the score belongs to the model plus the scaffold. Anthropic reported 49% on Verified by redesigning the scaffold around Claude 3.5 Sonnet, beating systems built on the same class of model ³. You are benchmarking your harness choices as much as the model.

Step 1: install the harness and validate Docker with a gold patch

The official harness builds a Docker image per instance and runs the tests inside it. Before spending API budget, prove the Docker half works using a gold patch (the repo maintainer's actual fix, which must evaluate to resolved).

git clone https://github.com/SWE-bench/SWE-bench.git
cd SWE-bench
pip install -e .

python -m swebench.harness.run_evaluation \
    --predictions_path gold \
    --max_workers 1 \
    --instance_ids sympy__sympy-20590 \
    --run_id validate-gold

Expected output ends with:

Total instances: 1
Instances submitted: 1
Instances completed: 1
Instances resolved: 1
Report written to gold.validate-gold.json

If resolved: 1 does not appear, fix your Docker setup before going further. On Apple silicon, add --namespace '' so images build locally instead of pulling x86 images ⁴. The harness maintainers recommend an x86_64 machine with 120GB free disk, 16GB RAM, and 8 CPU cores for full runs; a 25-instance subset needs a fraction of the disk, but images accumulate, so check docker system df after runs ⁴.

Step 2: pick a subset with a defensible shape

Verified is not uniform. The repo distribution is skewed: django alone is 231 of 500 instances (46%), sympy is 75, and the tail (flask, seaborn, requests) is single digits ⁵. "First 25 by instance id" gives you an astropy-and-django benchmark. Sample proportionally by repo instead, with a pinned seed so the subset is reproducible:

"""Pick a repo-stratified 25-instance subset of SWE-bench Verified.

Run:
    pip install datasets pandas
    python make_subset.py --n 25 --seed 42
"""
from __future__ import annotations

import argparse

import pandas as pd
from datasets import load_dataset


def main() -> None:
    p = argparse.ArgumentParser()
    p.add_argument("--n", type=int, default=25)
    p.add_argument("--seed", type=int, default=42)
    args = p.parse_args()

    ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
    df = ds.to_pandas()[["instance_id", "repo"]]

    # Proportional allocation per repo, largest-remainder rounding.
    quota = df["repo"].value_counts(normalize=True) * args.n
    alloc = quota.astype(int)
    short = args.n - alloc.sum()
    alloc[(quota - alloc).nlargest(short).index] += 1

    picks = [
        df[df["repo"] == repo].sample(n=k, random_state=args.seed)
        for repo, k in alloc[alloc > 0].items()
    ]
    subset = pd.concat(picks).sort_values("instance_id")

    subset["instance_id"].to_csv("subset_ids.txt", index=False, header=False)
    regex = "^(" + "|".join(subset["instance_id"]) + ")$"
    with open("subset_filter.txt", "w") as f:
        f.write(regex)

    print(alloc[alloc > 0].to_string())
    print(f"\n{len(subset)} instances -> subset_ids.txt, subset_filter.txt")


if __name__ == "__main__":
    main()

Expected allocation at n=25, seed 42:

repo
django/django                12
sympy/sympy                   4
sphinx-doc/sphinx             2
matplotlib/matplotlib         2
scikit-learn/scikit-learn     2
astropy/astropy               1
pydata/xarray                 1
pytest-dev/pytest             1

25 instances -> subset_ids.txt, subset_filter.txt

Commit subset_ids.txt next to your results. A subset number without the subset definition is unreviewable.

Step 3: generate patches with mini-SWE-agent

You need an agent that reads an issue, edits a checkout, and emits a diff. Bring your own if you have one. If not, mini-SWE-agent is the right baseline: about 100 lines of agent logic, bash-only tooling, maintained by the SWE-bench/SWE-agent group, and strong enough that frontier labs use it for reporting ⁶.

pip install mini-swe-agent
export ANTHROPIC_API_KEY=...     # or OPENAI_API_KEY for openai/* models
export MSWEA_GLOBAL_COST_LIMIT=60  # hard stop for the whole batch, USD

mini-extra swebench \
    --model anthropic/claude-sonnet-4-5-20250929 \
    --subset verified \
    --split test \
    --filter "$(cat subset_filter.txt)" \
    --workers 4 \
    --output runs/verified-25

Model ids are litellm format, so openai/gpt-... and friends work unchanged ⁶. The shipped swebench.yaml config caps each instance at 250 steps and $3 of API spend; the global limit above is your batch-level backstop. The run writes runs/verified-25/preds.json plus one <instance_id>.traj.json trajectory per instance. Interrupting with Ctrl-C is safe; re-running the same command resumes from preds.json.

preds.json is the official predictions format, one record per instance:

{
  "django__django-11099": {
    "instance_id": "django__django-11099",
    "model_name_or_path": "anthropic/claude-sonnet-4-5-20250929",
    "model_patch": "diff --git a/django/contrib/auth/validators.py ..."
  }
}

If you bring your own agent, emit exactly these three fields per instance and the rest of this recipe is unchanged.
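For a custom agent, a minimal sketch of emitting that format might look like the following. The helper names (build_predictions, write_predictions) and the "my-agent" label are hypothetical, chosen for illustration; only the three field names come from the format shown above.

```python
from __future__ import annotations

import json
from pathlib import Path

# The three fields each record carries in the predictions format shown above.
REQUIRED_FIELDS = ("instance_id", "model_name_or_path", "model_patch")


def build_predictions(patches: dict[str, str], model_name: str) -> dict[str, dict[str, str]]:
    """Turn {instance_id: unified diff text} into preds.json-shaped records.

    `patches` and `model_name` are hypothetical inputs from your own agent.
    """
    return {
        instance_id: {
            "instance_id": instance_id,
            "model_name_or_path": model_name,
            "model_patch": patch,
        }
        for instance_id, patch in patches.items()
    }


def write_predictions(preds: dict[str, dict[str, str]], path: str) -> None:
    """Write the records as one JSON object keyed by instance id."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(preds, indent=2))


if __name__ == "__main__":
    preds = build_predictions(
        {"django__django-11099": "diff --git a/django/contrib/auth/validators.py ..."},
        model_name="my-agent",
    )
    write_predictions(preds, "runs/verified-25/preds.json")
```

Keeping an empty string as the patch for instances where the agent gave up, rather than dropping the record, keeps those instances visible in the final count.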

Step 4: verify passes with the official harness

cd SWE-bench
python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Verified \
    --predictions_path ../runs/verified-25/preds.json \
    --max_workers 6 \
    --run_id verified-25

Each worker checks out the repo at the issue's base commit inside Docker, applies your patch, and runs two test sets defined by the benchmark: FAIL_TO_PASS (tests that fail before the gold fix and must pass after) and PASS_TO_PASS (tests that already passed and must keep passing) ². An instance is resolved only if both sets are green. If you have no Docker capacity, the maintainers run a free cloud evaluator: sb-cli submit swe-bench_verified test --predictions_path preds.json --run_id verified-25 returns results in about twenty minutes ⁷.
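The resolved rule is simple enough to state as code. This is an illustrative sketch of the rule as described here, not the harness's own implementation; the function name and the dict-of-booleans input shape are assumptions.

```python
from __future__ import annotations


def is_resolved(fail_to_pass: dict[str, bool], pass_to_pass: dict[str, bool]) -> bool:
    """Apply the resolved rule to post-patch test outcomes.

    Each dict maps a test name to True if that test passed after the patch
    was applied. Both input shapes are hypothetical, for illustration only.
    """
    # Guard added for this sketch: with no FAIL_TO_PASS tests there is
    # nothing that demonstrates the issue was fixed.
    if not fail_to_pass:
        return False
    # Every target test must now pass, and no previously passing test may break.
    return all(fail_to_pass.values()) and all(pass_to_pass.values())


if __name__ == "__main__":
    print(is_resolved({"test_fix": True}, {"test_old": True}))   # True
    print(is_resolved({"test_fix": True}, {"test_old": False}))  # False
```

Note that there is no partial credit: one broken PASS_TO_PASS test is enough to mark the instance unresolved.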

Step 5: read the report

The console summary and a report JSON land when the run finishes:

Total instances: 25
Instances submitted: 25
Instances completed: 25
Instances incomplete: 0
Instances resolved: 10
Instances unresolved: 15
Instances with empty patches: 0
Instances with errors: 0
Report written to anthropic__claude-sonnet-4-5-20250929.verified-25.json

The JSON carries the same counts plus the id lists (resolved_ids, unresolved_ids, error_ids, empty_patch_ids). The per-instance evidence is under logs/run_evaluation/verified-25/<model>/<instance_id>/: patch.diff (what was applied), test_output.txt (raw pytest output), report.json (which FAIL_TO_PASS and PASS_TO_PASS tests passed), and run_instance.log.
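A small sketch for turning the report JSON into a reading list follows. It relies only on the four id lists named above; the function names are hypothetical, and taking the union of the lists as the denominator is an assumption of this sketch (it presumes the lists together cover every submitted instance).

```python
from __future__ import annotations

import json

# The id lists the report JSON is described as carrying.
ID_LISTS = ("resolved_ids", "unresolved_ids", "error_ids", "empty_patch_ids")


def summarize_report(report: dict) -> dict:
    """Summarize a report dict into a count, a rate, and ids still to read."""
    buckets = {key: set(report.get(key, [])) for key in ID_LISTS}
    # Assumption: the union of the four lists is the set of submitted instances.
    all_ids = set().union(*buckets.values())
    resolved = len(buckets["resolved_ids"])
    return {
        "n": len(all_ids),
        "resolved": resolved,
        "resolved_rate": resolved / len(all_ids) if all_ids else 0.0,
        # Everything that is not resolved deserves a look at its logs.
        "to_read": sorted(all_ids - buckets["resolved_ids"]),
    }


def load_report(path: str) -> dict:
    """Read a report JSON file from disk."""
    with open(path) as f:
        return json.load(f)


if __name__ == "__main__":
    summary = summarize_report(
        load_report("anthropic__claude-sonnet-4-5-20250929.verified-25.json")
    )
    print(f"{summary['resolved']}/{summary['n']} resolved")
    for instance_id in summary["to_read"]:
        print("read:", instance_id)
```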

Read every unresolved instance. They cluster into three patterns, and the fix differs for each:

Patch does not apply. run_instance.log shows a git apply failure. The agent hallucinated file paths or line context. Scaffold bug; tighten how the agent views files before editing.
Patch applies, FAIL_TO_PASS still fails. The agent fixed a plausible-looking bug that is not the reported bug, or fixed one of three call sites. This is the dominant bucket and the one worth reading trajectory by trajectory.
FAIL_TO_PASS passes, PASS_TO_PASS breaks. The fix is a regression. Agents that edit aggressively without running the existing suite land here; consider making the agent run nearby tests before submitting.

Empty patches mean the agent hit its step or cost cap without committing to an answer; count them against the agent, not the harness.
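The triage above can be sketched as a small classifier. The function name, bucket labels, and input shapes are all hypothetical; you would populate the arguments yourself from patch.diff, run_instance.log, and report.json, whose exact schemas this sketch does not assume.

```python
from __future__ import annotations


def triage(
    patch_text: str,
    applied: bool,
    fail_to_pass_failures: list[str],
    pass_to_pass_failures: list[str],
) -> str:
    """Assign one instance to a failure bucket.

    patch_text: the submitted diff (empty string if the agent produced none).
    applied: whether the patch applied cleanly.
    *_failures: names of tests in each set that did not pass after the patch.
    """
    if not patch_text.strip():
        return "empty_patch"               # agent never committed to an answer
    if not applied:
        return "patch_does_not_apply"      # scaffold bug: bad paths or context
    if fail_to_pass_failures:
        # Checked before regressions: if the target tests still fail,
        # the reported bug was not fixed, whatever else happened.
        return "fail_to_pass_still_fails"
    if pass_to_pass_failures:
        return "regression"                # fix works but breaks existing tests
    return "resolved"


if __name__ == "__main__":
    print(triage("diff --git a/x b/x", True, ["test_bug"], []))
```

Counting instances per bucket across a run gives a quick picture of whether the next change should target the scaffold (apply failures), the agent's diagnosis (target tests still failing), or its testing habits (regressions).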

The cost and wall-clock reality

Budget, not vibes:

Agent phase. Hard worst case is 25 instances times the $3 per-instance cap, so $75; runs land under that because many instances finish or fail early. Anthropic's report is the right calibration for why coding agents are expensive: successful trajectories often run to hundreds of turns and more than 100k tokens each ³. With 4 workers, expect the agent phase to take on the order of an hour.
Evaluation phase. API-free but Docker-heavy. The first run pulls or builds an image per instance, which dominates wall clock; re-runs reuse the cache (--cache_level controls how much is kept) ⁴.
Disk. The 120GB guidance is for full 500-instance runs ⁴. A 25-instance subset stays well under, but prune with docker system prune between experiments.
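The agent-phase ceiling is plain arithmetic, sketched below under the caps used in this recipe. The function name is hypothetical. One detail worth noticing: with the global limit of 60 exported in Step 3, the batch-level backstop sits below the 25 times $3 worst case, so it is the binding limit.

```python
from __future__ import annotations


def agent_budget_ceiling(
    n_instances: int,
    per_instance_cap_usd: float,
    global_cap_usd: float | None = None,
) -> float:
    """Upper bound on agent-phase API spend, in USD.

    The per-instance caps bound spend at n * cap; a global cap, if set,
    can only lower that bound.
    """
    worst_case = n_instances * per_instance_cap_usd
    if global_cap_usd is None:
        return worst_case
    return min(worst_case, global_cap_usd)


if __name__ == "__main__":
    print(agent_budget_ceiling(25, 3.0))        # 75.0
    print(agent_budget_ceiling(25, 3.0, 60.0))  # 60.0
```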

End to end, the first complete run of this recipe is an afternoon. Re-runs with a cached harness are about an hour.

What a subset pass@1 does and does not tell you

The number you produced is single-attempt resolved rate (pass@1) on 25 instances. Be precise about its powers:

It is an interval, not a point. 10/25 resolved has a 95% Wilson interval of roughly [0.23, 0.59]. You cannot distinguish a true 40% agent from a true 50% agent at n=25; do not promote or revert a scaffold change on a 2-instance delta ⁸. Subset runs are for harness debugging and for catching large regressions; full-set runs are for claims.
It measures model plus scaffold. A different prompt, tool set, or step cap is a different system under test ³. When you A/B two scaffolds, hold the subset, seed, and caps fixed.
It is not leaderboard-comparable. The leaderboard evaluates the full 500 with each team's own scaffold ⁹. Your stratified 25 tracks the full set in expectation but carries sampling noise; report it as "n=25 stratified subset, seed 42", never as a SWE-bench Verified score.
It is an upper bound under contamination. Every repo, issue, and gold patch in the benchmark is public GitHub data from before current models' training cutoffs; memorized fixes inflate resolved rates in ways the harness cannot detect ¹⁰. Treat absolute numbers as ceilings and put your trust in deltas measured under identical conditions.
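The interval quoted in the first point can be reproduced with the standard Wilson score formula. This is a minimal sketch using only the standard library; the function name is hypothetical.

```python
from __future__ import annotations

import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to a two-sided 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    lo, hi = wilson_interval(10, 25)
    print(f"10/25 resolved: [{lo:.2f}, {hi:.2f}]")  # [0.23, 0.59]
```

Running the same function on 12/25 gives an interval that overlaps heavily with the one for 10/25, which is the quantitative reason a 2-instance delta at n=25 should not drive a decision.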

TIP

Keep one resolved instance and one unresolved instance from your first run as permanent fixtures. Before any scaffold change ships, re-run just those two with --instance_ids. It is a thirty-cent smoke test that catches harness drift, auth breakage, and prompt regressions before you spend $75 finding out.

What you built

A reproducible coding-agent benchmark rig: a gold-patch Docker sanity check, a seeded repo-stratified 25-instance subset with its definition committed, an agent run with per-instance and global cost caps, official-harness pass verification, and a per-instance failure-pattern read. The artifacts (subset_ids.txt, preds.json, the report JSON, and the run logs) are everything a reviewer needs to re-derive your number.

What to do next

The τ-bench recipe is the tool-using-agent analog, and adds the pass^k consistency lens that single-attempt SWE-bench runs miss. The autonomous-coding task type covers the benchmark family around SWE-bench (Multimodal, Multi-SWE-bench, SWE-Lancer). If your unresolved bucket looks suspiciously low, the contamination detection recipe is the follow-up.

1. OpenAI, "Introducing SWE-bench Verified." 500 screened instances; 93 annotators; 38.3% underspecified and 61.1% unfair-test flag rates; GPT-4o 33.2% vs 16%.
2. Jimenez et al., "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" arXiv 2310.06770. 2,294 instances, 12 repos, FAIL_TO_PASS and PASS_TO_PASS definitions.
3. Anthropic, "Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet." 49% via scaffold redesign; hundreds of turns and >100k tokens per successful run.
4. SWE-bench harness, GitHub. Install, gold validation, ARM namespace flag, 120GB/16GB/8-core guidance, cache levels.
5. SWE-bench Verified dataset, HuggingFace. Per-repo instance distribution.
6. SWE-agent team, mini-SWE-agent GitHub. Batch mode, litellm model ids, step and cost caps in swebench.yaml.
7. SWE-bench, sb-cli cloud evaluation documentation.
8. Evan Miller, "Adding Error Bars to Evals." arXiv 2411.00640.
9. SWE-bench Verified, official site and leaderboard.
10. Jacovi et al., "Stop Uploading Test Data in Plain Text." arXiv 2305.10160.