What "good" looks like
A deep research agent is evaluated on the report, not the run: a long cited document scored against per-task weighted rubric criteria and a reference report, plus a separate check that every citation actually supports the claim it is attached to. Free-form "which report is better" judging tends to drift between runs and judges; rubric scoring anchored to a reference is far more stable. The RACE framework from DeepResearch Bench is the current best articulation: it scores comprehensiveness, insight and analytical depth, instruction-following, and readability using criteria generated dynamically per task, weighted, and scored relative to a gold reference report rather than in isolation. The paper reports 71.33% pairwise agreement with human experts, above the 68.44% agreement rate between the human experts themselves, which is the comparison that makes the method usable as a primary metric[1].
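The weighted, reference-relative scoring described above can be sketched in a few lines. This is a minimal illustration, not the RACE implementation: the `Criterion` class, the function names, the example weights, and the judge scores are all hypothetical, and the `target / (target + reference)` normalization is an assumption about how a relative score can be formed; check the paper and the repo scripts for the exact formula[1].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float  # relative importance; normalized inside weighted_score


def weighted_score(criteria, scores):
    """Weighted mean of per-criterion judge scores (any consistent scale)."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("criterion weights must sum to a positive number")
    return sum(c.weight * scores[c.name] for c in criteria) / total_weight


def relative_score(criteria, target_scores, reference_scores):
    """Score the target report relative to the reference report.

    Assumed normalization: target / (target + reference), so 0.5 means
    "as good as the reference" under this rubric.
    """
    target = weighted_score(criteria, target_scores)
    reference = weighted_score(criteria, reference_scores)
    if target + reference == 0:
        return 0.0
    return target / (target + reference)


# Hypothetical rubric and judge scores for one task.
criteria = [
    Criterion("comprehensiveness", 0.4),
    Criterion("insight", 0.3),
    Criterion("instruction_following", 0.2),
    Criterion("readability", 0.1),
]
target = {"comprehensiveness": 6, "insight": 5,
          "instruction_following": 8, "readability": 7}
reference = {name: 8 for name in target}
print(round(relative_score(criteria, target, reference), 3))
```

Scoring both reports with the same judge call and the same criteria is what anchors the number: a harsh or lenient judge shifts target and reference together, and the ratio moves less than either absolute score.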
The benchmark itself is 100 PhD-level research tasks across 22 fields. At paper time, Gemini-2.5-Pro Deep Research led with a RACE score of 48.88 and citation accuracy of 81.44; OpenAI Deep Research scored 46.98 with 77.96 citation accuracy; Perplexity posted the highest citation accuracy at 90.24[1]. Treat these as paper-time anchors, not current standings; check the live leaderboard before quoting them.
Scope guard: this page owns report-shaped outputs. Pass^k and state-check harness design live in long-horizon planning; the browsing execution layer (clicks, page parsing, dead-end recovery) lives in computer and browser use.
Secondary metrics
- Citation precision via the FACT pipeline: extract every claim-citation pair from the report, fetch the cited source, and verify support. Report citation precision and effective citation count; the paper validates FACT at 96% alignment with human judgments on supported claims and 92% on unsupported ones[1].
- Search competence on short-answer sets, as a floor check rather than the headline. BrowseComp is 1,266 questions built to require finding hard-to-locate information; at paper time GPT-4o scored 0.6%, GPT-4o with browsing 1.9%, o1 9.9%, and OpenAI's Deep Research 51.5%, against 29.2% for human trainers[2]. GAIA is 466 questions in three difficulty levels; humans hit 92% while GPT-4 with plugins managed roughly 15% at paper time[3]. Do not quote post-paper saturation claims for either set; the live leaderboards are the source of truth.
- Cost and latency per report, since a slow, expensive report is a different product from a fast, cheap one.
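Once claim-citation pairs have been verified, the two citation numbers reduce to simple counts. The sketch below assumes the verification step has already produced one verdict per pair; the function name, the input shape, and the choice to exclude unfetchable sources from the precision denominator are illustrative assumptions, not the FACT scripts themselves.

```python
def citation_metrics(verdicts):
    """Summarize claim-citation verification results.

    verdicts: iterable of (report_id, supported), where supported is True,
    False, or None when the source could not be fetched or judged.
    """
    report_ids = set()
    supported = unsupported = unverifiable = 0
    for report_id, verdict in verdicts:
        report_ids.add(report_id)
        if verdict is None:
            unverifiable += 1
        elif verdict:
            supported += 1
        else:
            unsupported += 1

    checked = supported + unsupported
    return {
        # Share of verifiable citations that actually support their claim.
        "citation_precision": supported / checked if checked else None,
        # Average number of supported citations per report.
        "effective_citations_per_report": (
            supported / len(report_ids) if report_ids else 0.0
        ),
        # Reported separately so dead links do not silently vanish.
        "unverifiable": unverifiable,
    }


# Hypothetical verdicts for two reports.
sample = [("r1", True), ("r1", True), ("r1", False), ("r2", True), ("r2", None)]
print(citation_metrics(sample))
```

Reporting the unverifiable count alongside precision matters in practice: a report whose sources mostly fail to load can otherwise look precise on a handful of checkable citations.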
Calibration error deserves its own line. The BrowseComp authors observed that browsing raises the model's stated confidence while its errors become subtler[2]. An agent that cites sources and sounds certain is more dangerous when wrong, so track confidence-vs-correctness separately.
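One common way to track confidence against correctness is expected calibration error over equal-width confidence bins. The sketch below is a generic ECE computation, not necessarily the exact metric the BrowseComp authors report; it assumes the agent emits a stated confidence in [0, 1] for each short-answer question, and the function name and bin count are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins.

    confidences: stated confidence per answer, each in [0, 1].
    correct: matching booleans, True when the answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one answer")
    if any(c < 0 or c > 1 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's confidence-accuracy gap by its share of answers.
        ece += (len(members) / total) * abs(mean_confidence - accuracy)
    return ece


# An agent that says 90% but is right half the time is off by 0.4.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, False, False]))
```

Computing this per model version, with and without browsing enabled, makes the overconfidence pattern visible as a number rather than an anecdote.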
Offline eval design
Build rubric trees with domain experts, not with a single prompt. The PaperBench precedent shows what rigor looks like at the limit: 20 ICML 2024 papers decomposed into 8,316 individually gradable leaf nodes, with the LLM judge itself audited against a separate JudgeEval set[4]. You do not need 8,316 nodes; you do need the two ideas, hierarchical decomposition into checkable leaves and an audited judge.
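The hierarchical-decomposition idea can be sketched as a weighted tree whose leaves are individually gradable checks and whose internal nodes average their children. This is a minimal illustration under assumed conventions (binary leaf grades, weights normalized among siblings); the `RubricNode` class and the example rubric are hypothetical and do not reproduce PaperBench's actual data format.

```python
from dataclasses import dataclass, field


@dataclass
class RubricNode:
    name: str
    weight: float = 1.0  # importance relative to sibling nodes
    children: list = field(default_factory=list)


def score_tree(node, leaf_grades):
    """Return a score in [0, 1] for the node.

    Leaves take their grade from leaf_grades (1 = pass, 0 = fail);
    internal nodes return the weighted mean of their children.
    """
    if not node.children:
        return float(leaf_grades[node.name])
    total_weight = sum(child.weight for child in node.children)
    if total_weight <= 0:
        raise ValueError(f"children of {node.name!r} need positive total weight")
    return sum(
        child.weight * score_tree(child, leaf_grades) for child in node.children
    ) / total_weight


# Hypothetical rubric for one research query.
rubric = RubricNode("report", children=[
    RubricNode("methodology", weight=2.0, children=[
        RubricNode("states_dataset_used"),
        RubricNode("reports_sample_size"),
    ]),
    RubricNode("every_claim_cited", weight=1.0),
])
grades = {"states_dataset_used": 1, "reports_sample_size": 0, "every_claim_cited": 1}
print(score_tree(rubric, grades))
```

Keeping leaves binary is a design choice: a judge that only answers yes or no per leaf is easier to audit against human labels than one that emits a holistic score.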
The three public benchmarks measure different things, and teams routinely confuse them:
| | BrowseComp | GAIA | DeepResearch Bench |
|---|---|---|---|
| Task shape | Find one hard-to-locate fact | Multi-step assistant questions, 3 levels | Open-ended PhD-level research briefs |
| Output shape | Short answer | Short answer | Long cited report |
| Scoring | Exact-match | Exact-match | RACE rubric plus FACT citation checks |
| Does not measure | Synthesis, writing, citation | Synthesis, long-form quality | Retrieval persistence on adversarially hidden facts |
Run DeepResearch Bench (or a domain-specific clone of its method) as the primary, with BrowseComp-style short-answer sets as the retrieval floor. Use RACE-style reference-based relative scoring: absolute scalar scores on long reports are unstable across judge runs, while scores expressed relative to a gold report are markedly more stable[1].
Online eval design
Three loops. First, sampled claim-level citation audits: pull N reports per week, extract claim-citation pairs, and have a human or a FACT-style pipeline verify support; track precision over time. Second, counterbalanced human preference pairs: when comparing model versions, show reviewers both reports with order randomized per pair, and never let one side be systematically longer without a directness control. Third, cost and latency per report with alerts on drift; deep research runs are expensive enough that a sizeable cost regression is a product incident, not a curiosity.
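The second loop, counterbalanced preference pairs, can be sketched as follows. The function name and task fields are hypothetical. The sketch shuffles a balanced set of left/right assignments so that each version appears on the left about equally often, and it records a word-count ratio per pair so a systematic length gap is visible before reviewers start.

```python
import random


def counterbalance(pairs, seed=0):
    """Build review tasks from (query_id, report_a, report_b) tuples.

    Version B is shown on the left for exactly half of the pairs
    (rounded down), in a seeded random order.
    """
    rng = random.Random(seed)
    n = len(pairs)
    b_first = [True] * (n // 2) + [False] * (n - n // 2)
    rng.shuffle(b_first)

    tasks = []
    for (query_id, report_a, report_b), flip in zip(pairs, b_first):
        if flip:
            sides = [("B", report_b), ("A", report_a)]
        else:
            sides = [("A", report_a), ("B", report_b)]
        words_a = max(len(report_a.split()), 1)
        tasks.append({
            "query_id": query_id,
            "left": sides[0],
            "right": sides[1],
            # A ratio far from 1.0 across many pairs signals a length confound.
            "length_ratio_b_over_a": len(report_b.split()) / words_a,
        })
    return tasks


# Hypothetical report pairs.
pairs = [
    (f"q{i}", "short report text", "a somewhat longer report text here")
    for i in range(4)
]
for task in counterbalance(pairs, seed=1):
    print(task["query_id"], task["left"][0], task["length_ratio_b_over_a"])
```

Reviewers should see only the left and right texts; the version labels stay in the task record for unblinding after the preferences are collected.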
Top tools
- The deep_research_bench repo is Apache-2.0 and ships the RACE and FACT scoring scripts, so you can point them at your own tasks rather than reimplementing the judging[5].
- Inspect AI serves as the harness for running agents against task sets with full trace capture.
- Langfuse or Braintrust handle production report logging and the annotation queues that citation audits need.
- Where the judge itself must browse or run tools to verify claims, the agent-as-judge pattern applies: a judge with tool access grading intermediate artifacts, not just final text[6].
Common pitfalls
- Verbosity and self-preference bias in win-rate judging. Long reports win pairwise comparisons by default, and judges prefer their own model family's prose. Mitigate with counterbalanced ordering, a directness dimension in the rubric, a judge from a different family than the generator, and a reported judge-human agreement number; if you cannot state that number, the win rate is unaudited.
- Treating citation accuracy as factual correctness. FACT verifies that a source supports a claim, not that the source is right; a report can be almost perfectly citation-accurate and still materially wrong.
- Letting short-answer benchmarks stand in for synthesis quality. BrowseComp and GAIA measure retrieval persistence; a model can ace them and still write shallow, badly structured reports.
- Ignoring contamination on GAIA. Its validation answers are public and old enough to be in training data; treat strong GAIA scores from recent models with suspicion[3].
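The judge-human agreement number mentioned above is cheap to compute once both sides have labeled the same pairs. The sketch below assumes each label is simply the preferred report per query; the function name and data are hypothetical, and ties or abstentions would need an explicit policy that this sketch does not cover.

```python
def pairwise_agreement(judge_prefs, human_prefs):
    """Fraction of shared comparisons where judge and human pick the same report.

    Both arguments map a comparison id to the preferred side, e.g. "A" or "B".
    Only ids present in both mappings are counted.
    """
    shared = judge_prefs.keys() & human_prefs.keys()
    if not shared:
        raise ValueError("no overlapping comparisons to evaluate")
    agreements = sum(1 for key in shared if judge_prefs[key] == human_prefs[key])
    return agreements / len(shared)


# Hypothetical preferences on four report pairs.
judge = {"q1": "A", "q2": "B", "q3": "A", "q4": "B"}
human = {"q1": "A", "q2": "A", "q3": "A", "q4": "B"}
print(pairwise_agreement(judge, human))
```

Running the same function on two human annotators gives the human-to-human baseline, which is the number the judge's agreement should be compared against.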
Canonical reading
The DeepResearch Bench paper is the canonical read because it solves the two hard problems of this category in one method: RACE makes long-report quality scoring reproducible via reference-based relative rubrics, and FACT makes citation quality measurable at claim level[1]. The 71.33% human agreement (against a 68.44% human-to-human baseline) and the 96%/92% claim-verification alignment are the evidence that judge-based scoring can be trusted here, provided you keep auditing it.
Day-1 starter
- Pick 20 real research queries from your production logs. Not synthetic prompts; the queries users actually ran.
- Write 5 to 10 weighted rubric criteria per query with domain experts, plus one reference report per query to anchor relative scoring.
- Score with a judge model from a different family than the generator, and hand-verify 3 citations per report. The delta between judge citation scores and your hand checks is your judge's error bar.
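The last step's error bar can be made concrete by comparing judge verdicts with hand checks on the same citations. A minimal sketch, assuming one boolean pair per hand-verified citation (20 queries at 3 citations each gives 60 pairs); the function name, the returned fields, and the sample data are hypothetical.

```python
def judge_vs_hand_check(samples):
    """Compare judge and human verdicts on the same sampled citations.

    samples: list of (judge_supported, hand_supported) booleans.
    """
    if not samples:
        raise ValueError("need at least one hand-checked citation")
    n = len(samples)
    judge_precision = sum(1 for judge, _ in samples if judge) / n
    hand_precision = sum(1 for _, hand in samples if hand) / n
    false_support = sum(1 for judge, hand in samples if judge and not hand)
    disagreements = sum(1 for judge, hand in samples if judge != hand)
    return {
        "judge_precision": judge_precision,
        "hand_precision": hand_precision,
        # Positive delta: the judge is more generous than the human checker.
        "precision_delta": judge_precision - hand_precision,
        # Citations the judge accepted that the human rejected.
        "false_support_rate": false_support / n,
        "disagreement_rate": disagreements / n,
    }


# Hypothetical outcome of 60 hand checks.
samples = (
    [(True, True)] * 50
    + [(True, False)] * 6
    + [(False, False)] * 3
    + [(False, True)] * 1
)
print(judge_vs_hand_check(samples))
```

The false-support rate is the figure to watch: a judge that accepts citations a human rejects inflates citation precision in exactly the direction that hides problems.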
Cross-cutting links