Context precision and recall

The Ragas pair that tells you whether the retriever found the right chunks and whether they were the only chunks worth showing the model.

Context precision and context recall are the two halves of "did the retriever do its job," reframed in the language Ragas uses. They map onto classical IR precision and recall, with two practical wrinkles: the labels can come from a gold answer instead of per-chunk relevance, and the scoring can run without graded human labels. The combination makes the metrics cheap to run at PR time, which is what matters in practice.

What the two metrics measure

Context recall asks: of everything in the gold answer, what fraction is covered by the retrieved context? Decompose the gold into atomic claims, then check whether each claim can be inferred from the retrieved chunks. High recall means the retriever surfaced what the answer needs; low recall means the model would have to invent something ^[1].
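
The claim-level calculation can be sketched in a few lines. This is a minimal illustration, not Ragas's implementation: `claim_recall` and `keyword_judge` are hypothetical names, the judge is a naive substring match standing in for an LLM call, and the second gold claim is invented for the example.

```python
from typing import Callable, Sequence


def claim_recall(
    claims: Sequence[str],
    contexts: Sequence[str],
    is_supported: Callable[[str, Sequence[str]], bool],
) -> float:
    """Fraction of gold-answer claims that the retrieved contexts support."""
    if not claims:
        raise ValueError("gold answer produced no claims to check")
    supported = sum(1 for claim in claims if is_supported(claim, contexts))
    return supported / len(claims)


def keyword_judge(claim: str, contexts: Sequence[str]) -> bool:
    # Hypothetical stand-in for the LLM judge: a claim counts as supported
    # when its text appears (case-insensitively) in any retrieved context.
    needle = claim.lower()
    return any(needle in ctx.lower() for ctx in contexts)


# Two atomic claims from a made-up gold answer; only the first is covered.
gold_claims = ["march 12, 2024", "listed on nasdaq"]
retrieved = ["Acme Corp filed its S-1 prospectus with the SEC on March 12, 2024"]

recall_score = claim_recall(gold_claims, retrieved, keyword_judge)
print(recall_score)
```

Swapping `keyword_judge` for a real entailment check changes the verdicts but not the arithmetic: the score is always supported claims over total claims.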

Context precision asks the dual: of the chunks returned, what fraction were actually relevant, and did the relevant ones rank near the top? Ragas operationalizes this with an LLM judge over each chunk, conditioned on the query (and optionally the gold). High precision means the prompt is not drowning in noise; low precision means the generator wades through unrelated text ^[2].
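
To make the "ranked high" part concrete, here is a small sketch of a rank-weighted precision score computed from per-chunk verdicts. It assumes the judge has already produced binary relevance verdicts in retrieval order, and it follows the average-precision form that the Ragas documentation describes for this metric; the function names are hypothetical and the library's own code may differ in detail.

```python
from typing import Sequence


def precision_at_k(verdicts: Sequence[int], k: int) -> float:
    """Share of the top-k chunks judged relevant (verdicts are 0 or 1)."""
    return sum(verdicts[:k]) / k


def rank_weighted_precision(verdicts: Sequence[int]) -> float:
    """Mean of precision@k taken at each position that holds a relevant chunk."""
    relevant = sum(verdicts)
    if relevant == 0:
        return 0.0  # nothing relevant was retrieved
    total = sum(
        precision_at_k(verdicts, k)
        for k, verdict in enumerate(verdicts, start=1)
        if verdict
    )
    return total / relevant


def relevant_fraction(verdicts: Sequence[int]) -> float:
    """Unranked share of relevant chunks, for comparison."""
    return sum(verdicts) / len(verdicts)


front_loaded = [1, 1, 0]  # relevant chunks first
back_loaded = [0, 1, 1]   # same chunks, worse ordering

for verdicts in (front_loaded, back_loaded):
    print(verdicts, relevant_fraction(verdicts), rank_weighted_precision(verdicts))
```

Both lists have the same unranked fraction of relevant chunks, but the rank-weighted score penalizes the list that pushes them down, which is why a reranker can move this metric without changing K.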

The pair is the precision-recall tradeoff at the retrieval layer. Higher K raises recall and lowers precision. Reranking typically raises both. Better chunking can raise either, depending on whether the misses were "answer not in the index" (recall) or "right chunk near similar but wrong chunks" (precision).
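
The effect of K is easiest to see on the classical, set-based versions of the two metrics. The sketch below uses a made-up ranked relevance list (1 means relevant) and a hypothetical helper, `precision_recall_at_k`, to show recall climbing and precision falling as the window widens.

```python
from typing import Sequence, Tuple


def precision_recall_at_k(
    ranked_relevance: Sequence[int], total_relevant: int, k: int
) -> Tuple[float, float]:
    """Classical precision@k and recall@k over a ranked list of 0/1 labels."""
    hits = sum(ranked_relevance[:k])
    return hits / k, hits / total_relevant


# Illustrative ranking: three relevant chunks spread over eight results.
ranked = [1, 0, 1, 0, 0, 1, 0, 0]
total_relevant = sum(ranked)

curve = {k: precision_recall_at_k(ranked, total_relevant, k) for k in (1, 3, 6, 8)}
for k, (precision, recall) in curve.items():
    print(f"K={k}: precision={precision:.3f} recall={recall:.3f}")
```

Once recall saturates (here at K=6), every additional chunk only dilutes precision, which is the point at which lowering K or reranking pays off.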

Metric	Question it answers	Needs a reference answer?	Classical analogue
Context recall	Of the gold answer, how much is covered by the retrieved context?	Yes	Recall@K
Context precision	Of the retrieved chunks, how many are relevant, ranked high?	No (uses the question)	Precision@K

Why "context" instead of "chunk"

Classical IR uses graded per-document labels. Ragas substitutes an LLM judge over chunk-claim pairs because graded labels are expensive. The substitution costs accuracy: judge noise is the dominant source of measurement error on these scores ^[3]. The practical answer is to use the LLM-judge implementation for PR-time signal and calibrate the judge against a hand-labeled set of fifty to a hundred queries before trusting absolute numbers. Calibration question: does the judge agree with the principal expert above 90 percent on a held-out set? If yes, the score is usable.
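
The calibration check reduces to an agreement rate between judge verdicts and expert labels on the held-out set. A minimal sketch, assuming binary labels and treating raw agreement as the measure; `agreement_rate` and the label lists are hypothetical, and a chance-corrected statistic may be preferable when the labels are imbalanced.

```python
from typing import Sequence


def agreement_rate(judge_labels: Sequence[int], expert_labels: Sequence[int]) -> float:
    """Share of held-out items where the judge and the expert give the same label."""
    if len(judge_labels) != len(expert_labels):
        raise ValueError("label lists must be the same length")
    if not judge_labels:
        raise ValueError("need at least one labeled item")
    matches = sum(j == e for j, e in zip(judge_labels, expert_labels))
    return matches / len(judge_labels)


# Made-up verdicts on ten chunk-claim pairs; the two disagree on one item.
judge = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
expert = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]

rate = agreement_rate(judge, expert)
usable = rate > 0.90  # the 90 percent bar from the calibration question
print(rate, usable)
```

With only ten items a single disagreement decides the outcome, which is one reason to hand-label on the order of fifty to a hundred queries before reading much into the result.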

A runnable example

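# context_pr.py
# Assumes a judge LLM is configured for Ragas (the default setup typically
# reads OPENAI_API_KEY) and a Ragas version that accepts these column names.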
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_recall, context_precision

samples = [
    {
        "user_input": "When did the company file for IPO?",
        "reference": "The company filed for IPO on March 12, 2024.",
        "retrieved_contexts": [
            "Acme Corp filed its S-1 prospectus with the SEC on March 12, 2024 ...",
            "The 2023 annual report covered fiscal year ending December 31, 2023 ...",
            "Acme Corp was founded in 2009 by two engineers from Stanford ...",
        ],
    },
]
result = evaluate(
    dataset=Dataset.from_list(samples),
    metrics=[context_recall, context_precision],
)
print(result.to_pandas()[["context_recall", "context_precision"]])

The sample query has the right chunk in position 1 plus two unrelated chunks. Context recall is high (the IPO date is covered). Only one of the three chunks is genuinely relevant, so a plain relevant-fraction reading gives about one-third; because Ragas's context precision is rank-weighted and the relevant chunk sits in position 1, expect the reported score to land well above that, near 1.0 if the judge rejects the other two chunks. Reading the two numbers together is the diagnostic. Both high: the retriever is dialed in. High recall, low precision: lower K or add a reranker. Low recall, high precision: raise K, revisit chunking, or check the embedding model. Both low: the retrieval architecture is wrong for the corpus shape; the next move is usually hybrid search or a different chunking strategy.
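
The four-way reading can be written down as a small lookup. This is a sketch only: `diagnose` is a hypothetical helper, and the default floors are borrowed from the first-calibration thresholds given later in this page rather than from anything Ragas ships.

```python
def diagnose(
    recall: float,
    precision: float,
    recall_floor: float = 0.85,
    precision_floor: float = 0.5,
) -> str:
    """Map a (context recall, context precision) pair to a suggested next step."""
    recall_ok = recall > recall_floor
    precision_ok = precision > precision_floor
    if recall_ok and precision_ok:
        return "retriever is dialed in"
    if recall_ok:
        return "lower K or add a reranker"
    if precision_ok:
        return "raise K, revisit chunking, or check the embedding model"
    return "try hybrid search or a different chunking strategy"


print(diagnose(recall=0.95, precision=0.30))
print(diagnose(recall=0.40, precision=0.80))
```

Keeping the floors as parameters makes it straightforward to tighten or relax them per product once the judge has been calibrated.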

Which one needs a reference answer?

Context recall requires a reference answer. Context precision can run without one. The practical consequence: precision is cheaper, because it can run on unlabeled production traces while recall needs a labeled set ^[4]. The right move is to compute precision over a sliding window of production traces (live drift signal) and recall on a static labeled set (stable coverage signal).
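
One way to get the sliding-window signal is a fixed-size rolling mean over per-trace precision scores. The sketch below assumes the scores are already computed per trace; `RollingMean`, the window size, and the sample values are all illustrative.

```python
from collections import deque
from typing import Deque


class RollingMean:
    """Mean of the most recent `window` scores, for a live drift signal."""

    def __init__(self, window: int) -> None:
        if window <= 0:
            raise ValueError("window must be positive")
        self._scores: Deque[float] = deque(maxlen=window)

    def add(self, score: float) -> float:
        # deque(maxlen=...) silently drops the oldest score once full.
        self._scores.append(score)
        return sum(self._scores) / len(self._scores)


monitor = RollingMean(window=3)
trace_scores = [0.75, 0.75, 0.5, 0.25, 0.25]  # made-up per-trace precision
readings = [monitor.add(score) for score in trace_scores]
print(readings)
```

A real deployment would use a much larger window and alert on the rolling mean crossing a floor, while the labeled recall set is re-scored on a fixed schedule.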

NOTE

Ragas Context Precision uses the gold answer when available to filter "relevant to the question" from "relevant to the expected answer." If your dataset has gold answers, pass them. If not, the metric still runs but is noisier.

Mapping to classical IR

The rough mapping: context recall is recall computed over the claims in the gold answer rather than over documents, with a generative judge replacing graded per-document labels. Context precision is the analogue of average precision within the top-K window, which becomes mean average precision once averaged over queries. For the classical, LLM-free counterparts, see Recall@K and Precision@K; when either the classical or the Ragas score drops, retriever debugging is the elimination path. Neither Ragas metric is a literal substitute for the IR version, and the scores depend on the judge and its prompts, so do not compare Ragas numbers across versions or implementations without recalibrating ^[5]. Treat them as internal control charts, not leaderboard entries.

Thresholds in practice

For a first calibration, aim for context recall above 0.85 and context precision above 0.5. Below those, the generator either makes things up (low recall) or wades through noise (low precision). Above them, further retrieval tuning produces diminishing returns and the next bottleneck is usually the prompt or the model. Tune up for products with unambiguous gold answers (factoid lookup, structured extraction) and down for products where the gold itself is fuzzy (multi-document synthesis, subjective queries).
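
Since the metrics are meant to run at PR time, the thresholds can be applied as a simple gate over per-sample scores. A minimal sketch, assuming each row is a dict with `context_recall` and `context_precision` keys (as in the columns printed by the earlier example); `gate` and the sample rows are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Sequence


def gate(
    rows: Sequence[Dict[str, float]],
    min_recall: float = 0.85,
    min_precision: float = 0.5,
) -> List[str]:
    """Return a failure message for each metric whose mean is below its floor."""
    avg_recall = mean(row["context_recall"] for row in rows)
    avg_precision = mean(row["context_precision"] for row in rows)
    failures = []
    if avg_recall < min_recall:
        failures.append(f"context_recall {avg_recall:.2f} < {min_recall}")
    if avg_precision < min_precision:
        failures.append(f"context_precision {avg_precision:.2f} < {min_precision}")
    return failures


# Made-up scores for two evaluated samples.
rows = [
    {"context_recall": 1.0, "context_precision": 0.75},
    {"context_recall": 0.5, "context_precision": 0.75},
]
failures = gate(rows)
print(failures or "retrieval gate passed")
```

A CI job could exit non-zero when `failures` is non-empty; raising `min_recall` for factoid products or lowering it for synthesis-heavy ones is then a per-project argument rather than a code change.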