What "good" looks like
There is no standard public benchmark for LLM code review. The right primary metric is a composite of two numbers: precision of flagged issues (of the comments the model produced, how many were real?) and recall of seeded bugs (of the bugs we know are in the diff, how many did the model catch?). Both numbers matter, and they trade off. A model optimized for recall alone becomes a "nit machine" that reviewers ignore; a model optimized for precision alone misses real bugs.
The "good" target is precision above 0.7 and recall above 0.5 on a seeded-bug eval set. Reviewers stop reading suggestions once precision drops below roughly 0.5, so that is the survival threshold; above it, recall is where the tool creates value.
Secondary metrics
- Nit rate. The proportion of style or formatting comments versus substantive bug or design comments.
- Duplicate comments. Cases where the model flags the same issue in multiple places in a single PR.
- False-positive cost. Developer time wasted reading and dismissing wrong suggestions.
- Reviewer acceptance rate. The proportion of suggestions a human reviewer accepts or applies.
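As a rough illustration, three of these metrics can be computed from a flat log of review comments. The sketch below is not a standard schema: the `Comment` fields, the `issue_key` label, and the choice to express nit rate and duplicate rate as shares of all comments are assumptions made for the example. False-positive cost is left out because it needs timing data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Comment:
    pr_id: str
    issue_key: str   # hypothetical normalized issue label, e.g. "unused-import"
    kind: str        # "nit" (style/formatting) or "substantive" (bug/design)
    accepted: bool   # True if a human reviewer accepted or applied it


def secondary_metrics(comments: list[Comment]) -> dict[str, float]:
    """Nit rate, duplicate rate, and acceptance rate as shares of all comments."""
    total = len(comments)
    if total == 0:
        return {"nit_rate": 0.0, "duplicate_rate": 0.0, "acceptance_rate": 0.0}

    nits = sum(1 for c in comments if c.kind == "nit")

    # A duplicate is any repeat of the same issue within the same PR.
    per_issue = Counter((c.pr_id, c.issue_key) for c in comments)
    duplicates = sum(n - 1 for n in per_issue.values())

    accepted = sum(1 for c in comments if c.accepted)

    return {
        "nit_rate": nits / total,
        "duplicate_rate": duplicates / total,
        "acceptance_rate": accepted / total,
    }
```

Grouping duplicates by a normalized issue label is the fragile part in practice; the label would have to come from hand-labeling or a separate classifier.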
Offline eval design
Use SWE-bench Verified PRs as a clean source. Each PR has a known fix, so feeding the pre-fix diff to the model and checking whether it flags the right region of code is a credible recall test[1]. For precision, seed bugs into known-clean PRs using a mutation-testing pattern (rename a variable, invert a condition, swap argument order); score whether the model flags the seeded bug and how many false positives it generates around it.
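One of the mutations named above, inverting a condition, can be sketched with Python's standard `ast` module. This is a minimal illustration for Python sources only, not a full mutation-testing tool; the class and function names are hypothetical, and it flips only the first comparison it finds.

```python
import ast
from typing import Optional


class InvertFirstComparison(ast.NodeTransformer):
    """Flip the first comparison operator encountered (e.g. `<` becomes `>=`)."""

    _FLIP = {
        ast.Lt: ast.GtE, ast.GtE: ast.Lt,
        ast.Gt: ast.LtE, ast.LtE: ast.Gt,
        ast.Eq: ast.NotEq, ast.NotEq: ast.Eq,
    }

    def __init__(self) -> None:
        # Line of the seeded bug in the ORIGINAL source (the gold line).
        self.mutated_line: Optional[int] = None

    def visit_Compare(self, node: ast.Compare) -> ast.AST:
        if self.mutated_line is None:
            flipped = self._FLIP.get(type(node.ops[0]))
            if flipped is not None:
                node.ops[0] = flipped()
                self.mutated_line = node.lineno
        return self.generic_visit(node)


def seed_inverted_condition(source: str) -> tuple[str, Optional[int]]:
    """Return (mutated source, gold line); gold line is None if nothing was flipped."""
    tree = ast.parse(source)
    mutator = InvertFirstComparison()
    mutated_tree = mutator.visit(tree)
    return ast.unparse(mutated_tree), mutator.mutated_line


clean = (
    "def in_range(x, lo, hi):\n"
    "    if x < lo:\n"
    "        return False\n"
    "    return x <= hi\n"
)
mutated_source, gold_line = seed_inverted_condition(clean)
```

Note that `ast.unparse` (Python 3.9+) drops comments and normalizes formatting, so the recorded gold line refers to the original source and may shift in the regenerated text. A real seeding pipeline would more likely patch the original file in place to keep the diff minimal.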
The minimum useful artifact is 30 PRs: 15 with real bugs (gold line and gold description for each bug) and 15 clean. The clean PRs are critical; without them, the precision measurement is undefined. Score precision (true positives over true positives plus false positives) and recall (true positives over true positives plus false negatives) per release, stratified by bug severity.
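A scoring pass over such a set might look like the sketch below. The matching rule (same PR, same file, within a few lines of the gold line) and the `tolerance` default are assumptions chosen for illustration; the dataclass names and severity labels are hypothetical. Precision is counted over flagged comments and recall over gold bugs. Only recall is stratified by severity here, since a false positive has no gold severity to stratify on.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GoldBug:
    pr_id: str
    path: str
    line: int
    severity: str  # e.g. "high", "low"; labels are an assumption


@dataclass(frozen=True)
class Flag:
    pr_id: str
    path: str
    line: int


def _matches(flag: Flag, bug: GoldBug, tolerance: int) -> bool:
    return (
        flag.pr_id == bug.pr_id
        and flag.path == bug.path
        and abs(flag.line - bug.line) <= tolerance
    )


def score(gold: list[GoldBug], flags: list[Flag], tolerance: int = 3) -> dict:
    caught: set[GoldBug] = set()
    true_positive_flags = 0
    for flag in flags:
        hits = [bug for bug in gold if _matches(flag, bug, tolerance)]
        if hits:
            true_positive_flags += 1
            caught.update(hits)

    # Precision: share of flagged comments that point at a real bug.
    precision: Optional[float] = true_positive_flags / len(flags) if flags else None
    # Recall: share of gold bugs that at least one flag landed on.
    recall: Optional[float] = len(caught) / len(gold) if gold else None

    recall_by_severity = {}
    for severity in sorted({bug.severity for bug in gold}):
        group = [bug for bug in gold if bug.severity == severity]
        recall_by_severity[severity] = sum(b in caught for b in group) / len(group)

    return {
        "precision": precision,
        "recall": recall,
        "recall_by_severity": recall_by_severity,
    }
```

Any flag raised on a clean PR has no gold bug to match, so it counts as a false positive automatically. Line-proximity matching is a crude proxy; comparing the comment text against the gold description would still need a human or a judge model.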
Online eval design
Track suggestion acceptance rate, "resolved without applying" rate, and comment-to-merge latency. The most useful single online signal is the false-positive rate over time; rising FP rate is the leading indicator of review fatigue, which is the failure mode that retires the tool from production. Sample suggestions monthly and hand-label them; a hundred labeled live suggestions per month is enough to keep the precision number honest.
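To see how much a monthly sample of about a hundred labels can support, it helps to put an interval around the measured precision. The sketch below uses the Wilson score interval, which is a choice made for this example rather than something the methodology above prescribes; the function name is hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 for roughly 95% coverage)."""
    if n <= 0:
        raise ValueError("need at least one labeled suggestion")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# 70 of 100 hand-labeled live suggestions judged correct.
low, high = wilson_interval(70, 100)
```

With 70 correct out of 100, the interval runs from roughly 0.60 to 0.78. That is wide, but narrow enough to separate a tool near the 0.7 target from one sliding toward the 0.5 threshold.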
Top benchmarks
- SWE-bench Verified PRs (pre-fix), https://www.swebench.com/verified.html. Re-purposing a coding benchmark as a code-review benchmark; the gold is the known fix.
- Seeded-bug mutation set. Your own; the only way to get controlled precision and recall measurement.
- Live PR sample, hand-labeled monthly. The eval that predicts retirement risk.
Tooling
- Inspect AI for sandboxed, reproducible runs against a fixed PR set.
- Custom pytest harnesses with the seeded-bug pattern.
- GitHub Copilot Workspace pattern for the agent surface.
- Braintrust or Langfuse for production trace logging with suggestion-acceptance tracking.
Common pitfalls
- Recall-only metric. Without precision, the model becomes a nit machine and reviewers stop reading. The two-axis frame is the entire methodology in this category.
- Treating all comments as equal. A missed null-pointer is not the same as a missed unused-import; weight by severity.
- No "should-be-silent" PRs in the eval. Without clean PRs in the test set, precision is undefined and the headline number is meaningless.
- Skipping the live sample. Static evals drift away from real PR distribution; a monthly hand-label of 100 live suggestions keeps the measurement honest. Hamel's field guide framing applies: looking at the actual flagged comments is the work, not running the dashboard[2].
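The severity weighting called for under "Treating all comments as equal" can be sketched as a weighted recall, where each gold bug contributes its severity weight instead of a flat count. The weights and severity labels below are placeholders, not recommended values.

```python
# Hypothetical weights; calibrate these to your own incident costs.
SEVERITY_WEIGHTS = {"high": 5.0, "medium": 3.0, "low": 1.0}


def weighted_recall(
    bugs: list[tuple[str, bool]],
    weights: dict[str, float] = SEVERITY_WEIGHTS,
) -> float:
    """bugs is a list of (severity, was_caught) pairs, one per gold bug."""
    total = sum(weights[severity] for severity, _ in bugs)
    if total == 0:
        raise ValueError("no gold bugs (or all weights are zero)")
    caught = sum(weights[severity] for severity, was_caught in bugs if was_caught)
    return caught / total


# Two low-severity catches and one missed high-severity bug.
example = [("high", False), ("low", True), ("low", True)]
plain = sum(was_caught for _, was_caught in example) / len(example)
weighted = weighted_recall(example)
```

In this example plain recall is 2/3, while weighted recall is 2/7: the missed high-severity bug dominates the score, which is the behavior the pitfall asks for.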
Canonical reading
Hamel's Field Guide is the canonical reading because the methodology in this category is essentially error analysis applied to a different surface: read the comments the model wrote, group by failure mode, fix the prompt, repeat[2]. The binary pass/fail with critique framing from Hamel's LLM-judge writeup applies directly to scoring individual flagged comments[3].
Day-1 starter
- 30 PRs: 15 with real bugs (gold line and description), 15 clean. The clean set is what makes precision computable.
- Precision (share of flagged comments that are real issues) and recall (share of known bugs caught) per release; weight by bug severity. Track per-category breakdown.
- Each month, hand-label a week's worth of live suggestions (aim for roughly 100). The static eval drifts; the monthly resample is the calibration.