GraphRAG

Evaluating GraphRAG: local vs global query classes, counterbalanced pairwise judging, entity-coverage ceilings, and token-cost accounting.

What "good" looks like

GraphRAG is evaluated by query class, not by a single score: pairwise LLM-judged win rates on comprehensiveness, diversity, and empowerment for global sensemaking questions, and standard retrieval-plus-faithfulness metrics for local factual ones. A system that wins one class and loses the other is the normal case, so any single-number comparison against vector RAG hides the result that matters. The Edge et al. protocol is the reference design: win/lose/tie pairwise judging on the three target criteria, with directness as a control metric and answer order counterbalanced to neutralize position bias. On their two corpora (a roughly 1M-token podcast transcript set and a roughly 1.7M-token news set), GraphRAG's comprehensiveness win rates against vector RAG were 72-83% on podcasts and 72-80% on news; its diversity win rates were 75-82% on podcasts and 62-71% on news^[1].

Scope guard: retrieval metrics themselves (Recall@K, MRR, NDCG) are covered in Retrieval metrics and in Embeddings and retrieval; faithfulness basics live in generation faithfulness. This page owns what changes when retrieval goes through a graph: query-class splits, reference-free pairwise judging, graph-construction ceilings, and index cost.

The load-bearing idea is the query-class split. BenchmarkQED's AutoQ formalizes it as a 2x2 of query classes: DataLocal, ActivityLocal, DataGlobal, and ActivityGlobal. In Microsoft's published comparison, LazyGraphRAG won all 96 same-model comparisons, and every win except one (relevance on DataLocal against vector RAG) was statistically significant^[2]. The Han et al. head-to-head shows why the split cannot be skipped: winners flip by class. GraphRAG in local mode wins multi-hop questions, vector RAG wins single-hop detail questions, and GraphRAG in global mode fails on both detail questions and null queries, hallucinating an answer to a null query when none exists in the corpus^[3].

Secondary metrics

Graph-construction entity coverage. The constructed knowledge graph is a hard recall ceiling: if an answer entity was never extracted, no retriever over the graph can find it. In one verified example, only about 65.8% of answer entities existed in the KG built for HotpotQA, capping achievable recall before any retrieval happened^[3]^[4].
Context tokens per query. Cost is where GraphRAG earns its keep: root-level community summaries needed over 97% fewer context tokens than map-reduce over source text, and low-level summaries 26-33% fewer^[1].
Null-query hallucination rate. The share of unanswerable queries where the system fabricates an answer instead of abstaining; global mode is the offender^[3].
Track indexing token spend and re-index staleness alongside, since the graph is a build artifact that decays as the corpus moves.
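The two per-query metrics above reduce to simple ratios. The following is a minimal sketch, not a reference implementation: the function names and the record fields `answerable` and `abstained` are hypothetical, and it assumes you already have token counts and abstention labels from your own logging.

```python
from typing import Iterable, Mapping


def context_token_reduction(baseline_tokens: int, system_tokens: int) -> float:
    """Fraction of context tokens saved versus the baseline (0.97 means 97% fewer)."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 1.0 - system_tokens / baseline_tokens


def null_query_hallucination_rate(records: Iterable[Mapping[str, bool]]) -> float:
    """Share of unanswerable queries that received an answer instead of an abstention.

    Each record is assumed to carry two hypothetical boolean fields:
    'answerable' (gold label) and 'abstained' (observed system behavior).
    """
    null_records = [r for r in records if not r["answerable"]]
    if not null_records:
        raise ValueError("no null queries in the sample")
    fabricated = sum(1 for r in null_records if not r["abstained"])
    return fabricated / len(null_records)
```

Answerable queries are excluded from the hallucination denominator on purpose, so the rate is not diluted by the questions the system is supposed to answer.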

A note on RAGAS transfer. Faithfulness and response relevancy transfer cleanly to local and extractive queries; run them there as usual^[5]. Context precision and context recall do not transfer to global sensemaking questions, because there is no chunk-level gold context to score against when retrieval returns community summaries rather than source passages^[6]. That gap is why the field shifted to reference-free pairwise judging for the global class.
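One way to encode the transfer rule is a small lookup that decides which metrics apply to each query class. This is a sketch under assumptions: treating DataLocal and ActivityLocal as the "local" half and DataGlobal and ActivityGlobal as the "global" half is this example's reading of the 2x2, the metric names are plain labels rather than identifiers from any library, and `eval_plan` is a hypothetical helper.

```python
LOCAL_CLASSES = {"DataLocal", "ActivityLocal"}
GLOBAL_CLASSES = {"DataGlobal", "ActivityGlobal"}

# Pairwise criteria apply to every class; directness is the control.
PAIRWISE_CRITERIA = ["comprehensiveness", "diversity", "empowerment", "directness"]

# Reference-based metrics only make sense where chunk-level gold context exists.
REFERENCE_METRICS = ["faithfulness", "response relevancy",
                     "context precision", "context recall"]


def eval_plan(query_class: str) -> dict:
    """Return the metric families to run for one query class."""
    if query_class in LOCAL_CLASSES:
        return {"pairwise": list(PAIRWISE_CRITERIA),
                "reference": list(REFERENCE_METRICS)}
    if query_class in GLOBAL_CLASSES:
        # No chunk-level gold context for community summaries: pairwise only.
        return {"pairwise": list(PAIRWISE_CRITERIA), "reference": []}
    raise ValueError(f"unknown query class: {query_class!r}")
```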

Offline eval design

Build the eval as a query class x metric x which-system-wins comparison table. Synthesize queries across the 2x2 (BenchmarkQED's AutoQ automates this against your own corpus), run each candidate system and your vector RAG baseline on the same set, and judge pairwise with counterbalanced answer order^[2]. Report win rate per class per criterion; never pool classes. Include directness as a control: if the "winning" system also scores lower on directness, verbosity is doing some of the winning^[1].
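The judging loop described above can be sketched as follows. This is an illustration, not BenchmarkQED's or Edge et al.'s code: the `Judge` signature is hypothetical (in practice it would wrap an LLM call), and counting a tie as half a win is an assumption of this sketch. Each pair is judged twice, once per presentation order, and the two scores are averaged so a judge that favors a position cannot move the result.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Sequence, Tuple

# Hypothetical judge signature: (query, first_answer, second_answer, criterion)
# -> "first", "second", or "tie".
Judge = Callable[[str, str, str, str], str]


def counterbalanced_score(judge: Judge, query: str, answer_a: str,
                          answer_b: str, criterion: str) -> float:
    """Score for system A on one query: 1.0 win, 0.5 tie, 0.0 loss,
    averaged over both presentation orders."""
    a_first = {"first": 1.0, "tie": 0.5, "second": 0.0}
    b_first = {"first": 0.0, "tie": 0.5, "second": 1.0}
    score_ab = a_first[judge(query, answer_a, answer_b, criterion)]
    score_ba = b_first[judge(query, answer_b, answer_a, criterion)]
    return (score_ab + score_ba) / 2


def win_rates(judge: Judge,
              items: Iterable[Tuple[str, str, str, str]],
              criteria: Sequence[str]) -> Dict[Tuple[str, str], float]:
    """items yields (query_class, query, answer_a, answer_b).

    Returns system A's win rate keyed by (query_class, criterion);
    classes are never pooled.
    """
    scores = defaultdict(list)
    for query_class, query, answer_a, answer_b in items:
        for criterion in criteria:
            scores[(query_class, criterion)].append(
                counterbalanced_score(judge, query, answer_a, answer_b, criterion))
    return {key: sum(vals) / len(vals) for key, vals in scores.items()}
```

A judge that always prefers whichever answer it sees first scores exactly 0.5 under this scheme, which is the point of counterbalancing.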

For evidence-chain retrieval, add MultiHop-RAG, which provides inference, comparison, temporal, and null query types with gold evidence chains^[7]. The null type is the one most teams skip and most need. Separately audit the graph itself: sample gold answers, check whether their entities and relations exist in the constructed KG, and report coverage as its own number. A retrieval fix cannot recover what extraction dropped.
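The graph audit can start as a plain set-membership check. A minimal sketch, assuming gold answer entities and KG node names are available as strings; `normalize` and `entity_coverage` are hypothetical helpers, and matching is exact after normalization, so alias resolution (for example "NYC" versus "New York City") is out of scope here and would undercount coverage.

```python
import unicodedata
from typing import Iterable, List, Tuple


def normalize(name: str) -> str:
    """Unicode-normalize (NFKC), case-fold, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", name).casefold()
    return " ".join(text.split())


def entity_coverage(gold_entities: Iterable[str],
                    kg_nodes: Iterable[str]) -> Tuple[float, List[str]]:
    """Return (coverage, missing): the share of gold answer entities that
    exist as KG nodes, and the sorted normalized names that were never extracted."""
    nodes = {normalize(n) for n in kg_nodes}
    gold = [normalize(g) for g in gold_entities]
    if not gold:
        raise ValueError("gold_entities is empty")
    covered = sum(1 for g in gold if g in nodes)
    missing = sorted({g for g in gold if g not in nodes})
    return covered / len(gold), missing
```

The coverage value is an upper bound on answer-entity recall for any retriever that reads only from the graph, so report it next to, not inside, the end-to-end scores.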

Online eval design

Dashboard by query class, not in aggregate. Classify incoming queries into the 2x2 (a cheap classifier is fine) and track win-rate proxies, abstention rate, and latency per class. Monitor null-query hallucination explicitly: sample queries the system answered confidently and check whether the corpus supports any answer. Track cost per correct answer with indexing amortized in; the graph must pay its build cost back in per-query context savings, and the root-level summary numbers show it can^[1]. Alert on re-index staleness: the gap between corpus updates and the last graph build is silent recall decay.
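The cost and staleness checks above come down to a few lines of arithmetic. This is a sketch under assumptions: tokens stand in for cost, the break-even calculation counts only context-token savings, and all function and parameter names are hypothetical.

```python
import math
from datetime import datetime, timedelta
from typing import Optional


def cost_per_correct_answer(index_tokens: int, query_tokens: int,
                            correct_answers: int) -> float:
    """Total tokens spent (indexing amortized in) per correct answer."""
    if correct_answers <= 0:
        raise ValueError("correct_answers must be positive")
    return (index_tokens + query_tokens) / correct_answers


def break_even_queries(index_tokens: int, baseline_tokens_per_query: int,
                       graph_tokens_per_query: int) -> Optional[int]:
    """Queries needed before per-query context savings repay the index build.

    Returns None when the graph saves nothing per query and never breaks even.
    """
    saving = baseline_tokens_per_query - graph_tokens_per_query
    if saving <= 0:
        return None
    return math.ceil(index_tokens / saving)


def is_stale(last_corpus_update: datetime, last_graph_build: datetime,
             max_lag: timedelta) -> bool:
    """True when the corpus changed more than max_lag after the last graph build."""
    return last_corpus_update - last_graph_build > max_lag
```

Re-indexing resets the break-even counter, so a corpus that changes faster than the break-even query volume arrives may never recover its build cost.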

Top benchmarks

BenchmarkQED, https://www.microsoft.com/en-us/research/blog/benchmarkqed-automated-benchmarking-of-rag-systems/. AutoQ query synthesis across the 2x2, AutoE pairwise judging, AutoD corpus prep^[2].
MultiHop-RAG, https://arxiv.org/abs/2401.15391. Evidence-chain retrieval with inference, comparison, temporal, and null query types^[7].
Edge et al. corpora, https://arxiv.org/abs/2404.16130. The podcast and news corpora plus the pairwise protocol that defined the category^[1].

Tools especially suited

Microsoft GraphRAG as the reference indexing and query pipeline, with local and global modes that map to the query classes.
BenchmarkQED for query synthesis and counterbalanced judging against your own corpus.
RAGAS for the local-query half of the eval, where faithfulness and response relevancy still apply^[5].
Langfuse for production traces with per-class tagging, token accounting per query, and sampled review queues for null-query audits.

Common pitfalls

Verbosity and position bias in win-rate judging. Judges reward longer answers and first-position answers; counterbalance answer order and keep directness as a control metric, as the original protocol does^[1].
Evaluating only one query class. Global-mode strengths on sensemaking mask failures on detail questions and null queries; the Han et al. results show winners flip by class, so a global-only eval will ship a system that hallucinates on unanswerable questions^[3].
Ignoring index cost and staleness. The graph is expensive to build and goes stale as the corpus changes; an eval that scores query quality but not indexing spend and re-index lag is pricing half the system.
Treating extraction as free. Entity-coverage ceilings mean graph-construction quality bounds everything downstream; measure it directly rather than inferring it from end-to-end scores^[3].
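The first pitfall can be measured rather than assumed. A minimal sketch with hypothetical helper names: it takes the two verdicts recorded for each pair under both presentation orders and reports how often the judge picked the same slot both times, then flags the pattern where a comprehensiveness win comes with a directness loss.

```python
from typing import Iterable, Mapping, Tuple


def position_bias_rate(verdict_pairs: Iterable[Tuple[str, str]]) -> float:
    """Share of pairs where the judge chose the same slot in both orders.

    Each item is (verdict_with_A_first, verdict_with_B_first), and each
    verdict is "first", "second", or "tie". Choosing "first" twice (or
    "second" twice) means the slot, not the answer, decided the outcome.
    """
    pairs = list(verdict_pairs)
    if not pairs:
        raise ValueError("no verdict pairs supplied")
    same_slot = sum(1 for v1, v2 in pairs if v1 == v2 and v1 != "tie")
    return same_slot / len(pairs)


def verbosity_suspect(rates: Mapping[str, float]) -> bool:
    """True when a system wins comprehensiveness but loses the directness control."""
    return rates["comprehensiveness"] > 0.5 and rates["directness"] < 0.5
```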

Canonical reading

Edge et al. (2024) is the canonical paper: it defines the local-vs-global distinction, the community-summary architecture, and the counterbalanced pairwise protocol with a directness control that the rest of the field adopted^[1]. Read it for the eval design as much as the system design. Follow with BenchmarkQED for the automated, run-it-on-your-corpus version of the same protocol^[2].

Day-1 starter

Synthesize about 40 queries across the 2x2 classes (DataLocal, ActivityLocal, DataGlobal, ActivityGlobal) against your own corpus, plus a handful of null queries with no corpus-supported answer.
Run counterbalanced pairwise judging of GraphRAG against your existing vector RAG baseline on comprehensiveness, diversity, and empowerment, with directness as the control.
Report win rate per class, context tokens per query, and total indexing tokens. If GraphRAG only wins the global classes, that is the expected result; route by class rather than replacing the baseline.
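Routing by class, as the last step suggests, can be derived directly from the per-class win rates. A sketch under assumptions: the system labels `"graphrag"` and `"vector_rag"`, the 0.5 threshold, and both function names are illustrative choices, and the example win rates are made-up inputs rather than measured results.

```python
from typing import Dict, Mapping


def build_routing_table(graphrag_win_rates: Mapping[str, float],
                        threshold: float = 0.5,
                        default: str = "vector_rag") -> Dict[str, str]:
    """Send a query class to GraphRAG only where its win rate beats the threshold."""
    return {
        query_class: ("graphrag" if rate > threshold else default)
        for query_class, rate in graphrag_win_rates.items()
    }


def route(query_class: str, table: Mapping[str, str],
          default: str = "vector_rag") -> str:
    """Pick a system for one classified query; unseen classes use the baseline."""
    return table.get(query_class, default)


# Illustrative inputs only, not measured results.
example_rates = {"DataLocal": 0.41, "ActivityLocal": 0.48,
                 "DataGlobal": 0.74, "ActivityGlobal": 0.69}
example_table = build_routing_table(example_rates)
```

Falling back to the baseline for unseen classes keeps null and out-of-distribution queries away from global mode until they have been evaluated.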

Cross-cutting links

Retrieval metrics: Recall@K, MRR, and NDCG for the local-query half of the eval.
Groundedness: the faithfulness machinery that still applies to extractive answers.
Long-context RAG: the competing architecture for corpus-scale questions.
Embeddings and retrieval: the vector baseline every GraphRAG comparison needs.
Pairwise comparison and Elo: the judging technique behind the primary metric.
RAG eval with RAGAS: hands-on RAGAS setup for the metrics that do transfer.
