Retriever debugging: the elimination path for RAG recall

When retrieval metrics drop, the cause is one findable failure: chunking, embedding fit, query mismatch, filters, or K. Here is the elimination path.

Retriever debugging is the process of attributing a retrieval-metric drop to one specific, findable failure: chunking, embedding fit, query-document mismatch, over-strict filters, or K set too small. The retriever is almost never "bad" in the abstract; one of those five is the cause, and each has a distinct signature in the metrics.

Start from the metric signature, not the vibe

"Retrieval got worse" is not a diagnosis. The four numbers you already log (Recall@K, Precision@K, MRR, NDCG) carry the diagnosis if you read them together. A drop in Recall@K with an unchanged Precision@K is a coverage problem: the right chunk left the top K. A drop in MRR with an unchanged Recall@K is a ranking problem: the right chunk is still in the set but lower down. Information retrieval (IR) has thirty years of these decompositions; use them instead of guessing ^[1]. The rule: read the pair of metrics that isolates coverage from ranking before you touch a single knob ^[2].
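To make the coverage and ranking split concrete, here is a minimal per-query sketch of three of those numbers. The function names and the toy ids are illustrative assumptions, not part of any particular eval library; `ranked` is assumed to be the retriever's output in order, and `relevant` the labeled gold ids.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Coverage: fraction of the relevant ids that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k slots that hold a relevant id."""
    if k <= 0:
        return 0.0
    return len(set(ranked[:k]) & relevant) / k


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Ranking: 1 / position of the first relevant id, or 0.0 if none is found."""
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


# Hypothetical single-query example: the gold chunk sits at position 3.
ranked = ["c9", "c2", "c317", "c5", "c1"]
relevant = {"c317"}
coverage_at_3 = recall_at_k(ranked, relevant, 3)   # gold is inside the top 3
coverage_at_2 = recall_at_k(ranked, relevant, 2)   # gold is outside the top 2
ranking_score = reciprocal_rank(ranked, relevant)  # depends only on position
```

Averaging `reciprocal_rank` over the query set gives MRR. Recall@K answers "is it in the set", reciprocal rank answers "how high", and reading them side by side is what separates the two failure classes.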

Why is the right chunk not being retrieved?

Work the tree top down. Each leaf is a distinct fix, and the branch you land on tells you which lever from the table below to pull.

flowchart TD
  S["Right chunk missing or mis-ranked"] --> Q1{"Retrieved at any K?"}
  Q1 -->|Never, even at K=100| C1["Chunking or embedding mismatch"]
  Q1 -->|Only at high K| C2["Ranking / similarity weakness"]
  C2 --> F2["Add a reranker; check embedding fit"]
  C1 --> F1["Re-chunk; try a domain embedding model"]
  S --> Q2{"Query length effect?"}
  Q2 -->|Short queries pass, long fail| C3["Query embedding degradation"]
  C3 --> F3["Split or compress the query; test HyDE"]
  S --> Q3{"Test passes, prod fails?"}
  Q3 -->|Yes| C4["Stale index or over-strict metadata filter"]
  C4 --> F4["Reindex; loosen or audit filters"]

The "never retrieved at any K" branch is the one people skip. If the gold chunk does not appear even at K=100, no amount of reranking helps; the chunk is either absent from the index (chunking split the answer, or ingestion dropped it) or the embedding places it nowhere near the query. Confirm which with hard-negative analysis below ^[3].
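The first question in the tree can be answered with a K sweep. The sketch below assumes a hypothetical `retrieve(query, k)` callable that returns chunk ids in ranked order; the K values and the production K of 5 are placeholders to adapt to your setup.

```python
from typing import Callable, Optional, Sequence


def first_k_with_gold(
    retrieve: Callable[[str, int], Sequence[str]],
    query: str,
    gold_id: str,
    ks: Sequence[int] = (5, 10, 20, 50, 100),
) -> Optional[int]:
    """Return the smallest K in `ks` whose result set contains the gold chunk."""
    for k in sorted(ks):
        if gold_id in retrieve(query, k):
            return k
    return None  # never retrieved, even at the largest K tried


def tree_branch(first_k: Optional[int], production_k: int = 5) -> str:
    """Map the sweep result onto the first question of the decision tree."""
    if first_k is None:
        return "chunking or embedding mismatch"
    if first_k > production_k:
        return "ranking / similarity weakness"
    return "retrieved within production K"
```

A `None` result is the branch that rules out a reranker as the fix: there is nothing in the candidate set for it to promote.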

Reading Recall vs MRR divergence

The most informative signal is when the two metrics move in opposite directions. Recall@K up but MRR down means the relevant chunks are present but ranked lower than before. That points at the reranker or the similarity scoring, not at chunking or embedding fit. Chunking and embedding fit change whether the chunk is in the set at all; ranking changes where it sits.

Recall@K	MRR	What broke	Where to look
down	down	fewer relevant chunks reachable	chunking, embedding model, index freshness
unchanged	down	relevant chunks present, ranked low	reranker, similarity scoring
up	down	added recall arrived below the old top hit	reranker weighting, K raised without rerank
up	up	strict improvement	ship it

A run that moves Recall@5 from 0.82 to 0.88 while MRR falls from 0.71 to 0.55 is not an improvement for the generator. A generator that reads the top chunk now gets a worse chunk more often. This is the exact regression the two-layer split is built to catch ^[4].
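The divergence table can be encoded as a lookup so that a CI job prints the reading next to the numbers. This is a sketch under assumptions: the 0.01 tolerance for "unchanged" is an arbitrary placeholder, and combinations the table does not list fall through to a generic message.

```python
def direction(before: float, after: float, tol: float = 0.01) -> str:
    """Classify a metric change as up, down, or unchanged within a tolerance."""
    delta = after - before
    if delta > tol:
        return "up"
    if delta < -tol:
        return "down"
    return "unchanged"


# (Recall@K direction, MRR direction) -> where to look, copied from the table.
SIGNATURES = {
    ("down", "down"): "chunking, embedding model, index freshness",
    ("unchanged", "down"): "reranker, similarity scoring",
    ("up", "down"): "reranker weighting, K raised without rerank",
    ("up", "up"): "ship it",
}


def read_signature(recall_before, recall_after, mrr_before, mrr_after, tol=0.01):
    """Return the table's 'where to look' entry for a pair of metric moves."""
    key = (
        direction(recall_before, recall_after, tol),
        direction(mrr_before, mrr_after, tol),
    )
    return SIGNATURES.get(key, "not covered by the table; inspect per-query results")


# The run described above: Recall@5 0.82 -> 0.88, MRR 0.71 -> 0.55.
verdict = read_signature(0.82, 0.88, 0.71, 0.55)
```

For that run, `verdict` is the "up, down" row: the extra recall arrived below the old top hit.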

The precision-recall lever table

Five levers, and every one of them trades something. Pick the lever from the branch the decision tree landed on, then read what it costs. The intuition is the precision-recall tradeoff at the retrieval stage, laid out in context precision and recall ^[5].

Lever	Raises	Trades off
Increase K	Recall@K, context recall	Precision@K, prompt noise, token cost
Smaller chunks	precision, ranking granularity	recall on cross-chunk answers, more chunks to store
Larger chunks	recall on multi-sentence answers	precision, wasted context window
Better embedding model	Recall and MRR together	latency, cost, re-index effort
Add a reranker	MRR, NDCG, precision at low K	latency, a second model to eval separately

Raising K is the cheapest lever and the first one reached for, which is why it is the most over-used. It buys Recall@K by definition and pays in Precision@K and prompt noise. If MRR is the failing metric, raising K does nothing for it; a reranker does.
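A toy sweep shows the trade. The ranking and gold ids below are made up for illustration, and reciprocal rank is computed over the full ranked list, which is an assumption of this sketch (a truncated MRR@K can behave differently when the first hit lies beyond K).

```python
from typing import Dict, List, Sequence, Set


def sweep_k(ranked: Sequence[str], relevant: Set[str], ks: Sequence[int]) -> List[Dict]:
    """Recall@K and Precision@K at each K, alongside the K-independent reciprocal rank."""
    first_hit = next(
        (pos for pos, doc_id in enumerate(ranked, start=1) if doc_id in relevant),
        None,
    )
    rr = 1.0 / first_hit if first_hit else 0.0
    rows = []
    for k in ks:
        hits = len(set(ranked[:k]) & relevant)
        rows.append(
            {"k": k, "recall": hits / len(relevant), "precision": hits / k, "rr": rr}
        )
    return rows


# Hypothetical ranking: two gold chunks, at positions 2 and 10.
ranked = ["n1", "gold_a", "n2", "n3", "n4", "n5", "n6", "n7", "n8", "gold_b"]
relevant = {"gold_a", "gold_b"}
rows = sweep_k(ranked, relevant, ks=(2, 4, 10))
```

In this example recall climbs from 0.5 to 1.0 as K grows, precision falls from 0.5 to 0.2, and the reciprocal rank stays at 0.5 throughout, because raising K does not move the first relevant hit.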

Hard-negative analysis

When the decision tree says "chunking or embedding mismatch," confirm it in embedding space directly. Take a failing query and its known-relevant document, embed both, and inspect the query's nearest neighbors. If the gold chunk is far from the query while unrelated chunks sit close, the embedding model or the chunking is the culprit, not the LLM and not the ranking. This is the hard-negative check from embeddings and retrieval quality: the chunks that rank near the query but are wrong are the hard negatives, and they tell you what the model confuses.

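# hard_negatives.py
import numpy as np

def neighbor_report(query, gold_id, corpus_ids, embed):
    """Rank every chunk by cosine similarity to the query and locate the gold chunk.

    corpus_ids maps chunk id -> chunk text; embed maps a list of strings to vectors.
    """
    if gold_id not in corpus_ids:
        raise KeyError(f"{gold_id!r} is not in the corpus; check ingestion first")
    ids = list(corpus_ids.keys())
    q = np.asarray(embed([query])[0], dtype=float)
    mat = np.asarray(embed([corpus_ids[i] for i in ids]), dtype=float)
    # Cosine similarity of every chunk against the query.
    sims = mat @ q / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)  # most similar first
    gold_idx = ids.index(gold_id)
    # 1-based rank, so gold_rank 1 means the gold chunk is the top hit.
    gold_rank = int(np.where(order == gold_idx)[0][0]) + 1
    top = [(ids[o], round(float(sims[o]), 3)) for o in order[:5]]
    return {"gold_rank": gold_rank, "gold_sim": round(float(sims[gold_idx]), 3), "top5": top}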

Cosine similarity is the standard vector-space score for this comparison ^[1]. A gold_rank in the hundreds with a low gold_sim confirms an embedding or chunking failure. A gold_rank in the single digits with the gold chunk just below three hard negatives confirms a ranking failure, and a reranker is the fix.

A worked example

Corpus of 500 chunks. Failing query "what is the refund window for enterprise plans." Gold chunk is chunk-317. The neighbor report returns gold_rank 4, gold_sim 0.61, and the top three are chunks about consumer refunds (0.74), enterprise billing cycles (0.69), and refund fraud (0.66). The gold chunk is close (rank 4 of 500 is the top 0.8 percent) but three hard negatives outrank it. Reciprocal rank is the inverse of the position of the first relevant hit:

$$\text{RR} = \frac{1}{\text{rank}_{\text{first relevant}}}$$

Recall@5 counts this as a hit; Recall@3 misses it; the reciprocal rank is $1/4 = 0.25$. Reranking the top 20 with a cross-encoder is the correct move, and it lifts this query's reciprocal rank toward 1.0 without touching the embedding model.
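The arithmetic of the worked example can be checked in a few lines. The similarity scores and the gold id come from the example; the ids given to the three hard negatives are hypothetical labels for the topics named there.

```python
# Neighbor list from the worked example, most similar first.
# The three hard-negative ids are made-up labels; only chunk-317 is from the text.
neighbors = [
    ("consumer-refunds", 0.74),
    ("enterprise-billing-cycles", 0.69),
    ("refund-fraud", 0.66),
    ("chunk-317", 0.61),
]
gold_id = "chunk-317"
corpus_size = 500

# 1-based position of the gold chunk in the ranked list.
gold_rank = [chunk_id for chunk_id, _ in neighbors].index(gold_id) + 1
reciprocal_rank = 1.0 / gold_rank
hit_at_5 = gold_rank <= 5
hit_at_3 = gold_rank <= 3
top_percent = 100.0 * gold_rank / corpus_size

# If a reranker promoted the gold chunk to the top, the reciprocal rank would be:
reranked_reciprocal_rank = 1.0 / 1
```

Here `gold_rank` is 4, `reciprocal_rank` is 0.25, `hit_at_5` is true, `hit_at_3` is false, and `top_percent` is 0.8, matching the figures in the example.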

An elimination checklist

Run it in order. Stop at the first branch that reproduces the failure.

1. Reproduce on one query. Confirm the gold chunk id and pull the retriever's raw top 20 with scores.
2. Is the gold chunk in the index at all? Grep the ingested store by id. If absent, the bug is ingestion or chunking, not retrieval.
3. Retrieve at K=100. Never appears: chunking or embedding (go to step 5). Appears only past your production K: ranking (go to step 6).
4. Check metadata filters. Disable them and re-retrieve. If the chunk returns, an over-strict filter was the cause.
5. Run the hard-negative report. Far gold_sim confirms embedding or chunking; re-chunk or swap the embedding model and re-measure.
6. Add a reranker over the top 20 and re-measure MRR and NDCG. Ranking failures resolve here.
7. Confirm the index is fresh. A prod-only failure with a passing test set is usually a stale index or an environment-specific filter.
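The index, K=100, and filter checks from the list above can be sketched as one triage function. Everything here is an assumption for illustration: `retrieve(query, k, use_filters)` is a hypothetical callable returning ranked chunk ids, the return values are short labels invented for this sketch, and checking filters before falling through to the hard-negative report is an ordering choice of the sketch rather than a rule.

```python
from typing import Callable, Sequence, Set


def triage(
    query: str,
    gold_id: str,
    index_ids: Set[str],
    retrieve: Callable[[str, int, bool], Sequence[str]],
    production_k: int = 5,
    wide_k: int = 100,
) -> str:
    """Walk the index, K=100, and filter checks and name the first branch that fits."""
    # Is the gold chunk in the index at all?
    if gold_id not in index_ids:
        return "ingestion_or_chunking"
    # Sanity check: does the failure reproduce at production K?
    if gold_id in retrieve(query, production_k, True):
        return "not_reproduced"
    # Present at wide K but past production K: a ranking problem.
    if gold_id in retrieve(query, wide_k, True):
        return "ranking"
    # Absent with filters on, present with filters off: over-strict filter.
    if gold_id in retrieve(query, wide_k, False):
        return "filter"
    # Never retrieved: confirm with the hard-negative report.
    return "embedding_or_chunking"
```

The labels are only pointers to the next step: `ranking` leads to the reranker step, and `embedding_or_chunking` leads to the hard-negative report.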

TIP

Keep a running failure log of real queries that underperformed in production. Those are the highest-value additions to your retrieval gold set, because they are the failures your synthetic queries did not predict. Fold each into the labeled set after you fix it, and the regression can never silently return.

CAUTION

Compare per-query results between runs, not just the aggregate. An unchanged average can hide equal numbers of newly-passing and newly-failing queries. A change that fixes ten queries and breaks ten different queries reads as a flat Recall@5 and ships as "no regression," then a user hits one of the ten you broke. Diff the per-query pass set across runs, not the mean ^[4].
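A per-query diff is a few lines. This sketch assumes each run is stored as a mapping from a query id to a pass/fail boolean (for example, whether the gold chunk was in the top 5); the query ids below are made up.

```python
from typing import Dict, List


def diff_pass_sets(before: Dict[str, bool], after: Dict[str, bool]) -> Dict[str, List[str]]:
    """List the queries that flipped between two runs, ignoring queries not in both."""
    shared = before.keys() & after.keys()
    fixed = sorted(q for q in shared if not before[q] and after[q])
    broken = sorted(q for q in shared if before[q] and not after[q])
    return {"fixed": fixed, "broken": broken}


def pass_rate(run: Dict[str, bool]) -> float:
    """Aggregate pass rate, the number that can stay flat while queries flip."""
    return sum(run.values()) / len(run) if run else 0.0


# Hypothetical runs: same aggregate, different queries passing.
run_a = {"q1": True, "q2": True, "q3": False, "q4": True}
run_b = {"q1": True, "q2": False, "q3": True, "q4": True}
flips = diff_pass_sets(run_a, run_b)
```

Both runs have a pass rate of 0.75, yet `flips` reports `q3` as fixed and `q2` as broken. Failing the build on a non-empty `broken` list is one way to keep a flat mean from shipping a regression.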

Where this connects

Retriever debugging feeds the retrieval scorecard in the retrieval metrics hub and the wired-up CI recipe in RAG eval with Ragas. The metrics tell you a regression happened; this elimination path tells you which of the five causes to fix, and the fix is almost always a specific lever, not a rewrite ^[2].
