Code review

The model reads a diff and flags real issues with actionable comments. Score precision together with recall of seeded bugs, not recall alone, which produces nit machines.

What "good" looks like

A code-review eval scores two things that trade off: precision of the comments the model produced (how many were real?) and recall of the bugs we know are in the diff (how many did the model catch?). There is no standard public benchmark, so the primary metric is this composite, built on your own seeded-bug set.

The ship targets on a seeded-bug eval set:

Precision above 0.7: more than 70% of flagged issues are real.
Recall above 0.5: more than half of the seeded bugs are caught.

Both numbers matter. A model optimized for recall alone becomes a "nit machine" that reviewers ignore. A model optimized for precision alone misses real bugs.

Precision is the survival axis: reviewers stop reading suggestions once precision drops below roughly 0.5. Recall is the value-creation axis once precision clears that threshold.

flowchart TD
    DIFF["Diff, seeded-bug set: 15 buggy, 15 clean"] --> MODEL["Model review comments"]
    MODEL --> P["Precision: of comments produced, how many real?"]
    MODEL --> R["Recall: of seeded bugs, how many caught?"]
    P --> GATE{"Precision above 0.7 and recall above 0.5?"}
    R --> GATE
    GATE -->|yes| SHIP["Ship"]
    GATE -->|no| FIX["Fix the weaker axis; precision below 0.5 retires the tool"]
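
The gate in the flowchart can be sketched as a small function. This is a minimal illustration, not a prescribed implementation: the function name `review_gate` and the verdict strings are hypothetical, and "weaker axis" is interpreted here as whichever metric sits further below its ship target, which is an assumption.

```python
def review_gate(precision: float, recall: float,
                ship_precision: float = 0.7,
                ship_recall: float = 0.5,
                survival_precision: float = 0.5) -> str:
    """Return a verdict for one eval run, following the flowchart thresholds."""
    if precision > ship_precision and recall > ship_recall:
        return "ship"
    if precision < survival_precision:
        # Below the survival threshold reviewers stop reading, so this wins.
        return "retire-risk: fix precision first"
    # Assumption: the "weaker axis" is the one further below its target.
    precision_gap = ship_precision - precision
    recall_gap = ship_recall - recall
    return "fix precision" if precision_gap >= recall_gap else "fix recall"
```

For example, `review_gate(0.75, 0.3)` points at recall, while any run with precision under 0.5 is flagged as a retirement risk regardless of recall.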

Secondary metrics

Metric	What it catches	How to score
Nit rate	Style or formatting comments crowding out substantive ones	Share of comments that are style-only rather than substantive bug or design comments
Duplicate comments	The same issue flagged in multiple places in one PR	Count repeated flags of a single issue per PR
False-positive cost	Developer time wasted reading and dismissing wrong suggestions	Time spent per dismissed suggestion
Reviewer acceptance rate	Suggestions humans actually use	Proportion of suggestions a reviewer accepts or applies
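
Three of these metrics can be computed from a flat list of labeled comments. The sketch below assumes a hypothetical `Comment` record in which a human labeler has already assigned `kind` and an `issue_id` that is shared by comments about the same underlying issue; none of these field names come from a real tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Comment:
    pr: str
    issue_id: str    # hypothetical label: same id means same underlying issue
    kind: str        # "style" or "substantive", assigned by a labeler
    accepted: bool   # reviewer accepted or applied the suggestion


def secondary_metrics(comments: list[Comment]) -> dict[str, float]:
    """Nit rate, duplicate count, and acceptance rate over labeled comments."""
    total = len(comments)
    if total == 0:
        return {"nit_rate": 0.0, "duplicates": 0, "acceptance_rate": 0.0}
    nits = sum(c.kind == "style" for c in comments)
    # Every flag of an issue beyond the first one in the same PR is a duplicate.
    per_issue = Counter((c.pr, c.issue_id) for c in comments)
    duplicates = sum(n - 1 for n in per_issue.values())
    accepted = sum(c.accepted for c in comments)
    return {
        "nit_rate": nits / total,
        "duplicates": duplicates,
        "acceptance_rate": accepted / total,
    }
```

False-positive cost is left out because it needs timing data that a comment log alone does not carry.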

Offline eval design

Recall from real fixes. Use SWE-bench Verified PRs as a clean source. The PR has a known fix, so feeding the pre-fix diff to the model and checking whether it flags the right region of code is a credible recall test^[1].
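
"Flags the right region" needs an operational definition. One possible sketch, assuming the gold region is the line range touched by the known fix and that a comment within a few lines of it counts as a hit; the `tolerance` of 3 lines and both function names are illustrative assumptions.

```python
def flags_gold_region(comment_line: int, gold_start: int, gold_end: int,
                      tolerance: int = 3) -> bool:
    """True if a comment lands inside the gold range, padded by `tolerance` lines."""
    return gold_start - tolerance <= comment_line <= gold_end + tolerance


def recall_on_real_fixes(cases: list[tuple[list[int], tuple[int, int]]]) -> float:
    """Each case is (lines the model commented on, (gold_start, gold_end))."""
    if not cases:
        return 0.0
    caught = sum(
        any(flags_gold_region(line, start, end) for line in comment_lines)
        for comment_lines, (start, end) in cases
    )
    return caught / len(cases)
```

A looser tolerance raises measured recall, so it is worth fixing the value once and keeping it constant across releases.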

Precision from seeded bugs. Seed bugs into known-clean PRs using a mutation-testing pattern: rename a variable, invert a condition, swap argument order. Score whether the model flags the seeded bug and how many false positives it generates around it.
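
As one illustration of the mutation pattern, the sketch below inverts the first simple comparison in a Python source string using the standard-library `ast` module (`ast.unparse` needs Python 3.9 or later). The class and function names are hypothetical, and a real seeding tool would also record the mutated line as the gold location.

```python
import ast

# Each comparison operator maps to its logical negation.
_INVERSE = {
    ast.Lt: ast.GtE, ast.GtE: ast.Lt,
    ast.Gt: ast.LtE, ast.LtE: ast.Gt,
    ast.Eq: ast.NotEq, ast.NotEq: ast.Eq,
}


class InvertOneComparison(ast.NodeTransformer):
    """Invert a single two-operand comparison, then leave the rest untouched."""

    def __init__(self) -> None:
        self.done = False

    def visit_Compare(self, node: ast.Compare) -> ast.Compare:
        if not self.done and len(node.ops) == 1 and type(node.ops[0]) in _INVERSE:
            node.ops = [_INVERSE[type(node.ops[0])]()]
            self.done = True
        self.generic_visit(node)
        return node


def seed_inverted_condition(source: str) -> str:
    """Return `source` with one comparison inverted (the seeded bug)."""
    tree = InvertOneComparison().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))
```

Seeding exactly one mutation per PR keeps the gold label unambiguous: any other comment the model makes on that PR is a candidate false positive.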

The minimum useful artifact is 30 PRs:

15 with real bugs, each with a gold line and gold description.
15 clean.

The clean PRs are critical; without them, the precision measurement is undefined. Score precision (true positives over true positives plus false positives) and recall (true positives over true positives plus false negatives) per release, stratified by bug severity.
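
The two formulas can be sketched over the 30-PR set as follows. The record layout (`gold` as a list of `(line, severity)` pairs, `comments` as a list of flagged line numbers) is an assumption for illustration. Recall is stratified by severity; precision is reported overall here, because a false positive has no gold severity to stratify by.

```python
from collections import defaultdict


def score(prs: list[dict], tolerance: int = 0):
    """Return (precision, recall, recall_by_severity) for a labeled PR set.

    Each PR is {"gold": [(line, severity), ...], "comments": [line, ...]}.
    Clean PRs have an empty "gold" list, so every comment on them is a
    false positive.
    """
    tp = fp = 0
    caught = defaultdict(int)
    total = defaultdict(int)
    for pr in prs:
        gold, comments = pr["gold"], pr["comments"]
        for line in comments:
            if any(abs(line - g) <= tolerance for g, _ in gold):
                tp += 1
            else:
                fp += 1
        for g, severity in gold:
            total[severity] += 1
            if any(abs(line - g) <= tolerance for line in comments):
                caught[severity] += 1
    flagged = tp + fp
    n_bugs = sum(total.values())
    precision = tp / flagged if flagged else None   # no comments at all
    recall = sum(caught.values()) / n_bugs if n_bugs else None
    recall_by_severity = {s: caught[s] / total[s] for s in total}
    return precision, recall, recall_by_severity
```

Note that true positives are counted per comment for precision and per bug for recall, so two comments on the same gold line raise precision without raising recall.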

Online eval design

Track the leading indicators. Watch suggestion acceptance rate, "resolved without applying" rate, and comment-to-merge latency.

The most useful single online signal is the false-positive rate over time. Rising FP rate is the leading indicator of review fatigue, which is the failure mode that retires the tool from production.

Keep the number honest. Sample suggestions monthly and hand-label them. A hundred labeled live suggestions per month is enough to keep the precision number honest.
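
To see what a hundred labels buys, it helps to put an interval around the measured precision. The sketch below uses the Wilson score interval at 95% confidence; choosing Wilson over other interval methods is an assumption of this example, not something the guidance above prescribes.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half
```

With 70 of 100 sampled suggestions labeled real, the interval runs from about 0.60 to 0.78: too wide to detect a two-point drift, but wide enough apart from 0.5 to tell a healthy tool from one at the survival threshold.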

Top benchmarks

SWE-bench Verified PRs (pre-fix), https://www.swebench.com/verified.html. Re-purposing a coding benchmark as a code-review benchmark; the gold is the known fix.
Seeded-bug mutation set. Your own; the only way to get controlled precision and recall measurement.
Live PR sample, hand-labeled monthly. The eval that predicts retirement risk.

Tools especially suited

Inspect AI: sandboxed reproducible runs against a fixed PR set.
Custom pytest harness: the seeded-bug pattern in code.
GitHub Copilot Workspace pattern: the agent surface.
Braintrust or Langfuse: production trace logging with suggestion-acceptance tracking.

Common pitfalls

Pitfall	Why it bites	Fix
Recall-only metric	Without precision, the model becomes a nit machine and reviewers stop reading; the two-axis frame is the entire methodology	Score precision alongside recall
Treating all comments as equal	A missed null-pointer is not the same as a missed unused-import	Weight by severity
No "should-be-silent" PRs in the eval	Without clean PRs in the test set, precision is undefined and the headline number is meaningless	Include clean PRs
Skipping the live sample	Static evals drift away from real PR distribution	Hand-label 100 live suggestions monthly to keep the measurement honest
Watching the dashboard instead of reading comments	Looking at the actual flagged comments is the work	Read the comments; Hamel's field guide framing applies^[2]
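
The "weight by severity" fix can be sketched as a weighted recall. The weights below are hypothetical placeholders; the right values depend on what a missed bug of each class costs your team.

```python
# Hypothetical weights: a missed critical bug counts five times a missed minor one.
SEVERITY_WEIGHTS = {"critical": 5.0, "major": 3.0, "minor": 1.0}


def weighted_recall(bugs: list[tuple[str, bool]]) -> float:
    """bugs is a list of (severity, caught) pairs for the known bugs."""
    total = sum(SEVERITY_WEIGHTS[severity] for severity, _ in bugs)
    if total == 0:
        return 0.0
    caught = sum(SEVERITY_WEIGHTS[severity] for severity, was_caught in bugs if was_caught)
    return caught / total
```

A model that catches two minor bugs but misses one critical bug has an unweighted recall of 2/3 and, under these weights, a weighted recall of 2/7, which is the difference the table row is pointing at.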

Canonical reading

Hamel's Field Guide is the canonical reading because the methodology in this category is essentially error analysis applied to a different surface: read the comments the model wrote, group by failure mode, fix the prompt, repeat^[2]. The binary pass/fail with critique framing from Hamel's LLM-judge writeup applies directly to scoring individual flagged comments^[3].
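
One way to hold the output of that loop is a record per flagged comment with a binary verdict, a free-text critique, and a failure-mode label added during grouping. The `Judgment` record and its field names are assumptions made for this sketch, not a format taken from the cited writeups.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Judgment:
    comment_id: str
    passed: bool                         # binary verdict: was the comment real and useful?
    critique: str                        # one or two sentences explaining the verdict
    failure_mode: Optional[str] = None   # assigned while grouping the failures


def failure_mode_counts(judgments: list[Judgment]) -> list[tuple[str, int]]:
    """Count failed comments per failure mode, most frequent first."""
    counts = Counter(
        j.failure_mode or "uncategorized" for j in judgments if not j.passed
    )
    return counts.most_common()
```

The most frequent failure mode is the natural first target for the next prompt change.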

Day-1 starter

30 PRs: 15 with real bugs (gold line and description), 15 clean. The clean set is what makes precision computable.
Precision (share of flagged comments that are real) and recall (share of known bugs caught) per release; weight by bug severity. Track per-category breakdown.
Hand-label a sample of live suggestions monthly; about a hundred is enough. The static eval drifts; the monthly resample is the calibration.

Cross-cutting links

Open coding for AI traces: the monthly hand-label pass, done properly.
Error analysis (open coding): the technique reference for failure-mode discovery.
Inspect AI: the harness for the seeded-bug precision and recall eval.
Braintrust: logging suggestions and acceptance outcomes per release.
SWE-bench Verified walkthrough: mining Verified PRs for known-bug review cases.