
Precision@K: signal purity in RAG retrieval

Precision@K is the fraction of the top K retrieved chunks that are relevant. It measures how much of the retrieved context is signal rather than noise, and it is not the same metric as Ragas Context Precision.

Precision@K is the fraction of the top K retrieved chunks that are actually relevant to the query. It measures signal purity: how much of what you handed the generator was worth handing over.

What Precision@K measures

Precision@K asks the question the generator cares about at prompt-build time. Of the K chunks you are about to paste into the context window, how many earn their tokens? A retriever with Precision@10 of 0.30 wastes seven of every ten slots on noise. That noise costs latency, costs money, and dilutes the evidence the model has to reason over. Precision@K is the classical information-retrieval definition, unchanged for decades and standard across the IR literature ^[1].

The metric is a set operation over the top K. It does not read the chunks, does not call a model, and does not care about order within the window. Relevance is a binary label from your ground truth. That makes Precision@K cheap, deterministic, and reproducible, which is exactly what you want in a CI gate ^[2].

The formula

$\text{Precision@K} = \frac{|\{\text{relevant}\} \cap \{\text{retrieved}_{1..K}\}|}{K}$

The numerator counts relevant chunks inside the top K. The denominator is K, always. This is the one thing to hold onto: the numerator is identical to the Recall@K numerator, but the denominator differs. Recall divides by the total count of relevant chunks for the query; precision divides by K. Same hits, different question. Recall asks "did we find them," precision asks "how clean is what we returned."

A worked example

A retriever returns 10 chunks for a query. Ground truth marks 3 of those 10 as relevant. The corpus contains 4 relevant chunks in total.

$\text{Precision@10} = \frac{3}{10} = 0.30$

$\text{Recall@10} = \frac{3}{4} = 0.75$

Precision is 0.30: seven of the ten chunks are noise. Recall is 0.75: the retriever found three of the four relevant chunks that exist. Same numerator (3). Different denominator (K=10 for precision, 4 total relevant for recall). One number without the other hides half the picture. Precision alone would read as a weak retriever; recall alone would read as a strong one. Both are true at once.

precision_at_k in Python

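# retrieval_metrics.py
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunk IDs that are labeled relevant."""
    if k <= 0:
        return 0.0  # no window to score
    # Count distinct relevant IDs in the top k; the denominator is always k.
    return len(set(retrieved[:k]) & relevant) / k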

This mirrors the Recall@K implementation exactly except for the denominator: k instead of len(relevant). Average across the query set to get the reported number. Fix K to your production window; a Precision@K computed at a K you never ship is decoration.
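As a minimal sketch of the averaging step, the block below pairs the function with a recall counterpart and a mean over queries. The `recall_at_k` and `mean_metric_at_k` helpers, the `(retrieved, relevant)` pair layout, and the chunk IDs are illustrative assumptions, not part of any library. The data reuses the worked example: 3 relevant chunks in the top 10, 4 relevant in total.

```python
# Illustrative sketch; helper names and data layout are assumptions.
from statistics import mean


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    if k <= 0:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Same numerator as precision; the denominator is the total relevant count.
    if k <= 0 or not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_metric_at_k(metric, runs: list[tuple[list[str], set[str]]], k: int) -> float:
    # Average a per-query metric over (retrieved, relevant) pairs at one fixed K.
    return mean(metric(retrieved, relevant, k) for retrieved, relevant in runs)


# Worked example: 10 retrieved chunks, 3 of them relevant, 4 relevant overall.
retrieved = [f"c{i}" for i in range(1, 11)]  # c1 .. c10
relevant = {"c1", "c4", "c9", "c42"}         # c42 is never retrieved

p = precision_at_k(retrieved, relevant, 10)
r = recall_at_k(retrieved, relevant, 10)
print(p, r)  # 0.3 0.75
```

Passing the metric as a function keeps K and the query set identical for both numbers, which is what makes the pair comparable.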

Why can't one K maximize both precision and recall?

Because precision and recall trade off against each other as K moves, no single K optimizes both.

Raise K and you admit more chunks. Recall can only rise or hold, because a larger window keeps every hit from the smaller one and may add more. Precision usually falls, because the marginal chunks you admit at lower-ranked positions are less likely to be relevant than the ones above them. Lower K and the reverse happens: precision climbs as you keep only the highest-confidence chunks, recall drops as relevant chunks fall outside the window. The extremes make it obvious. At K=1 precision is often high and recall is capped low; at K equal to corpus size recall is 1.0 and precision falls to the share of the corpus that is relevant, which is typically near zero ^[1].
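The trade-off is easy to see by sweeping K over one ranked list. The sketch below is illustrative only: the `precision_recall_sweep` helper, the ranking, and the labels are made up for the demonstration.

```python
# Illustrative sketch: the ranking and labels below are invented for the demo.
def precision_recall_sweep(retrieved: list[str], relevant: set[str], ks: list[int]):
    rows = []
    for k in ks:
        hits = len(set(retrieved[:k]) & relevant)  # shared numerator
        rows.append((k, hits / k, hits / len(relevant)))
    return rows


retrieved = ["a", "x", "b", "y", "z", "c", "w", "v", "u", "t"]
relevant = {"a", "b", "c", "d"}  # "d" is never retrieved

sweep = precision_recall_sweep(retrieved, relevant, ks=[1, 3, 5, 10])
for k, p, r in sweep:
    print(f"K={k:>2}  precision={p:.2f}  recall={r:.2f}")
# K= 1  precision=1.00  recall=0.25
# K= 3  precision=0.67  recall=0.50
# K= 5  precision=0.40  recall=0.50
# K=10  precision=0.30  recall=0.75
```

Recall never decreases down the table, while precision drifts downward as lower-ranked chunks enter the window.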

The practical consequence: pick K from the generator's appetite, not from the metric. A model grounding tightly on each chunk wants high precision, so favor a small K. A model synthesizing across many chunks tolerates noise for coverage, so favor recall at a larger K ^[3].

TIP

Report Precision@K and Recall@K together, always, at the same K. One number alone cannot tell you whether a retriever is noisy or incomplete. High recall with low precision means the answer is buried in noise; low recall with high precision means the window is clean but the evidence is missing. The fix differs for each, so you need both to know which lever to pull.

How is Precision@K different from Ragas Context Precision?

They share a name and almost nothing else. Classical Precision@K is a set-membership count against binary ground-truth labels. Ragas Context Precision is an LLM judge that scores each retrieved chunk for relevance conditioned on the question (and the reference answer), then weights by rank so relevant chunks near the top score higher ^[4].

Dimension	Classical Precision@K	Ragas Context Precision
Relevance source	binary ground-truth labels	LLM judge per chunk
Conditioned on question	only through the labels (fixed per query)	yes (judged at eval time)
Rank-aware	no (order-agnostic within K)	yes (rank-weighted)
Determinism	fully deterministic	judge variance across runs
Cost	one set intersection	one LLM call per chunk
Ground truth needed	per-chunk relevance labels	reference answer, not per-chunk labels

Use classical Precision@K when you have labeled relevance and want a cheap, stable CI gate. Use Ragas Context Precision when you lack per-chunk labels but have reference answers, and you want an order-sensitive read on whether the good chunks landed near the top ^[5]. They answer different questions; run both when you can. The context precision and recall page covers the Ragas side in full.

MAP and graded relevance

Classical Precision@K is order-agnostic inside the window: three relevant chunks at ranks 1, 2, 3 score the same as the same three at ranks 8, 9, 10. When rank matters, use Mean Average Precision. MAP computes precision at each rank where a relevant chunk appears, averages those per query, then averages across queries. It rewards putting relevant chunks earlier and is the rank-aware generalization of Precision@K for binary relevance ^[1].
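A minimal sketch of average precision and MAP, using the ranks 1-3 versus ranks 8-10 contrast from the paragraph above. The function names and chunk IDs are illustrative assumptions. The sketch normalizes by the total number of relevant chunks for the query, a common IR convention; some implementations divide by the number of relevant chunks actually retrieved instead. It also assumes chunk IDs are unique within a ranking.

```python
# Illustrative sketch; names, IDs, and the normalization choice are assumptions.
def average_precision(retrieved: list[str], relevant: set[str]) -> float:
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / len(relevant)


def mean_average_precision(runs: list[tuple[list[str], set[str]]]) -> float:
    # Mean of per-query average precision over (retrieved, relevant) pairs.
    return sum(average_precision(ret, rel) for ret, rel in runs) / len(runs)


relevant = {"r1", "r2", "r3"}
noise = [f"n{i}" for i in range(7)]
early = ["r1", "r2", "r3"] + noise  # relevant chunks at ranks 1, 2, 3
late = noise + ["r1", "r2", "r3"]   # relevant chunks at ranks 8, 9, 10

ap_early = average_precision(early, relevant)
ap_late = average_precision(late, relevant)
print(f"{ap_early:.3f} {ap_late:.3f}")  # 1.000 0.216
```

Both rankings have Precision@10 of 0.30, yet average precision separates them because each hit is weighted by the precision at its own rank.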

MAP still treats relevance as binary. When relevance is graded (some chunks are more relevant than others) and position matters, move to NDCG, covered on the NDCG page. NDCG applies a graded gain with a logarithmic position discount, which is the right tool when both grade and rank drive generation quality ^[3].
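For orientation, here is a minimal NDCG sketch under stated assumptions: linear gain (the grade itself), a log2 position discount, and an ideal ranking built from all graded chunks for the query. Other variants use an exponential gain of 2^grade - 1. The function names, grades, and chunk IDs are illustrative.

```python
# Illustrative sketch; gain function, names, and grades are assumptions.
import math


def dcg_at_k(gains: list[float], k: int) -> float:
    # Each gain is discounted by log2(rank + 1), with ranks starting at 1.
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(retrieved: list[str], grades: dict[str, float], k: int) -> float:
    gains = [grades.get(chunk_id, 0.0) for chunk_id in retrieved]  # unlabeled = 0
    ideal = sorted(grades.values(), reverse=True)                  # best possible order
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


grades = {"a": 3, "b": 2, "c": 1}  # graded relevance for one query
best_order = ["a", "b", "c", "x"]
worst_order = ["x", "c", "b", "a"]

ndcg_best = ndcg_at_k(best_order, grades, 4)
ndcg_worst = ndcg_at_k(worst_order, grades, 4)
print(ndcg_best > ndcg_worst)  # True
```

Both orderings contain the same three labeled chunks, so a binary, order-agnostic metric cannot tell them apart; NDCG scores the ideal ordering at 1.0 and the reversed one lower.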

Where this fits in debugging

Precision@K is the first number to check when the generator hallucinates or drifts despite decent recall: a clean recall score with poor precision means the model is drowning in irrelevant context. Log Precision@K next to Recall@K per release, per chunking strategy, per embedding model, and follow the retriever debugging workflow when the pair moves in opposite directions. The full metric set lives on the retrieval metrics hub, and the end-to-end wiring is in the RAG eval with Ragas recipe.
