
Answer correctness, relevancy, and completeness

The three generation dimensions faithfulness misses: matching reality, addressing the question, and covering every part of it.

Correctness, answer relevancy, and completeness are the three generation dimensions that faithfulness does not capture: whether the answer matches reality (judged against a reference), whether it addresses the question asked, and whether it covers every required part of that question. Faithfulness only asks whether the answer is entailed by the retrieved context. An answer can be perfectly faithful and still be wrong, off-topic, or half-finished.

The four checks live on different axes, and a RAG system can pass three and fail the fourth. Track them separately or you will misdiagnose every regression.

Correctness

Correctness is agreement with a human-written gold answer, or with a trusted external source. Measure it by comparison to the reference, not by comparison to the retrieved context. Do not use exact string match: paraphrase and formatting kill it on anything longer than a single token. Use an LLM judge scoring the response against the gold answer, or a semantic-similarity threshold, or atomic-fact decomposition against the reference in the FActScore style ^[1]. Ragas ships answer_correctness as a weighted blend of a factual-overlap judge and semantic similarity ^[2].
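A blended score of this kind can be sketched as follows. It assumes a judge has already sorted statements into true positives (in both answer and reference), false positives (in the answer only), and false negatives (in the reference only). The function names, the F1 form, and the 0.75/0.25 weights are assumptions for illustration, not a copy of Ragas internals; check the Ragas documentation for the shipped defaults.

```python
def factual_f1(tp: int, fp: int, fn: int) -> float:
    """F1 over statement counts produced by a judge (hypothetical helper)."""
    denom = tp + 0.5 * (fp + fn)
    return tp / denom if denom else 0.0


def blended_correctness(
    tp: int,
    fp: int,
    fn: int,
    similarity: float,
    w_fact: float = 0.75,  # assumed weight on factual overlap
    w_sim: float = 0.25,   # assumed weight on semantic similarity
) -> float:
    """Weighted mean of factual-overlap F1 and an embedding similarity in [0, 1]."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be in [0, 1]")
    total = w_fact + w_sim
    return (w_fact * factual_f1(tp, fp, fn) + w_sim * similarity) / total


# Illustrative inputs: 2 supported statements, 1 unsupported, 1 missing.
score = blended_correctness(tp=2, fp=1, fn=1, similarity=0.8)
```

Weighting the factual term more heavily keeps a fluent paraphrase of the wrong fact from scoring well on similarity alone.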

The load-bearing point: faithful is not correct. Faithfulness is agreement with the retrieved context. Correctness is agreement with reality. The two diverge whenever the context itself is stale or wrong.

Worked example. Question: "What is the current federal funds rate?" The retrieved chunk is a cached document that says the rate is 4 percent. The real current rate is 5.5 percent. The model answers "The rate is 4 percent." Faithfulness scores 1.0: every claim is entailed by the context. Correctness scores 0.0: the answer contradicts the gold reference. A dashboard that only tracks faithfulness reports a green board while the product ships a wrong number.

Correctness needs a reference, which is the expensive part. Where you have no gold set yet, "Evals without a gold dataset" sequences the reference-building work. When the failure is stale retrieval upstream, "Context precision and recall" tells you whether the retriever even had a chance.

Answer relevancy

Answer relevancy asks whether the response actually addresses the question, independent of whether it is true. An answer can be correct, well-grounded, and entirely beside the point.

The standard technique is reverse-question generation, the method Ragas uses for answer_relevancy ^[3]. Given the response, prompt an LLM to generate N questions that the response would answer. Embed each generated question, embed the original question, and score relevancy as the mean cosine similarity between them.

$$\text{AnswerRelevancy} = \frac{1}{N} \sum_{i=1}^{N} \cos\!\left(E(q_i),\, E(q)\right)$$

Here $q$ is the original question, $q_i$ is the $i$-th question generated from the answer, and $E(\cdot)$ is the embedding function. If the answer is on-topic, the questions it back-generates cluster near the original; if it drifts, they scatter.

Worked example. Question: "How do I reset my password?" Answer: "Our security team enforces a 90-day password rotation policy and requires 12-character minimums." Every claim is true and grounded in the retrieved policy doc, so faithfulness and correctness both pass. Reverse-generate three questions from the answer: "What is the password rotation period?", "What is the minimum password length?", "What does the security team enforce?" None matches "How do I reset my password?" The cosine similarities land around 0.35, 0.30, and 0.28, for a mean near 0.31. The answer is well-grounded and useless.

Answer relevancy is reference-free: it needs only the question and the response, so it runs on unlabeled production traces.

Completeness

Completeness asks whether the answer covers every required part of a multi-part question. Decompose the question into its required sub-parts, then check that the answer addresses each one, the same atomic-decomposition idea FActScore applies to claims ^[1]. A partial answer that nails one clause and silently drops another fails completeness while passing faithfulness and often correctness on the part it did answer.

Worked example. Question: "What is the return window, and does it cover opened items?" This is two required parts. Answer: "You can return items within 30 days of purchase." The return-window part is correct and faithful. The opened-items part is silently dropped. Completeness is 1 of 2 required parts covered, so 0.5. Faithfulness on the sentence produced is 1.0, which is exactly why faithfulness cannot see the missing half.

CAUTION

Completeness is not Context Recall. Completeness measures the answer's coverage of the question (a generation metric). Context Recall measures the retrieved context's coverage of the reference (a retrieval metric). They fail independently: the retriever can surface both parts (high Context Recall) while the generator answers only one (low completeness). Diagnose each at the right layer; "Context precision and recall" covers the retrieval side.

How do these differ from faithfulness and groundedness?

Each metric compares a different pair of objects and catches a different failure. Faithfulness and groundedness both compare the answer to the retrieved context and catch fabrication. Correctness compares the answer to a reference and catches factual error even when the answer is faithful to its context. Relevancy compares the answer to the question and catches off-topic drift. Completeness compares the answer to the question's parts and catches silent omission.

| Metric | Compares answer to | Needs reference? | Failure caught |
| --- | --- | --- | --- |
| Faithfulness | retrieved context | no | fabrication, unsupported claims |
| Groundedness | retrieved context | no | claims not traceable to a source |
| Correctness | gold reference | yes | true-to-context but wrong in reality |
| Answer relevancy | the question | no | grounded answer that never addresses the question |
| Completeness | the question's parts | no (parts derived from the question) | one part answered, another dropped |

The pattern: faithfulness and groundedness look backward at the context; correctness looks outward at reality; relevancy and completeness look forward at the question. No single metric spans all three directions ^[4].

When is a reference required?

Only correctness strictly needs a human-written gold answer. Relevancy is reference-free by construction (reverse-question). Completeness derives its required parts from the question itself, so it needs no separate gold answer, though a gold answer sharpens the part list. Faithfulness and groundedness are reference-free (they judge against the retrieved context) ^[5].

The practical consequence mirrors the retrieval side: the reference-free metrics run on live production traces for a drift signal; correctness runs on a static labeled set for a stable truth signal. Compute relevancy and completeness continuously; compute correctness on a labeled gold set per release.

A minimal implementation

Reverse-question relevancy in a few lines. This mirrors the Ragas answer_relevancy mechanic without the framework.

```python
# answer_relevancy.py
import numpy as np
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-opus-4-7"

def embed(texts: list[str]) -> np.ndarray:
    # Swap in your embedding endpoint; must return L2-normalized rows,
    # one row per input text, in input order.
    raise NotImplementedError("plug in an embedding endpoint")

def gen_questions(answer: str, n: int = 3) -> list[str]:
    prompt = (
        f"Generate {n} distinct questions that the following answer "
        f'fully answers. Return one per line.\n\nAnswer: {answer}'
    )
    msg = client.messages.create(
        model=MODEL, max_tokens=300, temperature=0.3,
        messages=[{"role": "user", "content": prompt}],
    )
    lines = [line.strip() for line in msg.content[0].text.splitlines()]
    return [line for line in lines if line][:n]  # drop blanks, cap at n

def answer_relevancy(question: str, answer: str, n: int = 3) -> float:
    gen = gen_questions(answer, n)
    if not gen:
        return 0.0  # nothing back-generated; avoid np.mean of an empty array
    vecs = embed([question] + gen)  # rows already normalized
    q, qi = vecs[0], vecs[1:]
    return float(np.mean(qi @ q))   # dot of unit vectors = cosine similarity
```
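The dot product above equals cosine similarity only when the embedding rows have unit length. If your endpoint returns raw vectors, a dependency-free fallback along these lines divides by the norms explicitly. The helper names `cosine` and `mean_cosine` are hypothetical, and returning 0.0 for a zero vector or an empty list is an assumption about how you want degenerate inputs scored.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two raw (not necessarily normalized) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for a zero vector
    return dot / (norm_a * norm_b)


def mean_cosine(
    question_vec: Sequence[float],
    generated_vecs: Sequence[Sequence[float]],
) -> float:
    """Mean similarity between the question and each back-generated question."""
    if not generated_vecs:
        return 0.0  # assumed convention when nothing was generated
    sims = [cosine(g, question_vec) for g in generated_vecs]
    return sum(sims) / len(sims)


# Toy 2-d vectors: one generated question parallel to the original, one orthogonal.
example = mean_cosine([2.0, 0.0], [[3.0, 0.0], [0.0, 5.0]])
```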

To score completeness, decompose the question into required parts with one LLM call, then reuse the per-claim verifier from faithfulness to check each part against the answer instead of against the context. The score is the fraction of required parts the answer addresses. For a production setup, wire all four metrics through one library: the cookbook recipe "RAG eval with Ragas" runs correctness, relevancy, faithfulness, and the context pair in a single pass over one dataset.
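The decompose-then-verify procedure for completeness might look like the sketch below. Both LLM calls are injected as plain callables so the scoring logic can be tested without a model. The names `completeness`, `decompose`, and `addresses` are hypothetical, and the two toy functions are hard-coded stand-ins for real judge prompts, written only to reproduce the return-window example.

```python
from typing import Callable


def completeness(
    question: str,
    answer: str,
    decompose: Callable[[str], list[str]],
    addresses: Callable[[str, str], bool],
) -> tuple[float, list[str]]:
    """Return (fraction of required parts addressed, list of missing parts)."""
    parts = decompose(question)
    if not parts:
        raise ValueError("decomposition returned no required parts")
    missing = [part for part in parts if not addresses(part, answer)]
    return (len(parts) - len(missing)) / len(parts), missing


# Toy stand-ins for the two LLM calls (assumptions, not real judges).
def toy_decompose(question: str) -> list[str]:
    return ["return window length", "whether opened items are covered"]


def toy_addresses(part: str, answer: str) -> bool:
    if "opened" in part:
        return "opened" in answer.lower()
    return "30 days" in answer


score, missing = completeness(
    "What is the return window, and does it cover opened items?",
    "You can return items within 30 days of purchase.",
    toy_decompose,
    toy_addresses,
)
```

Returning the missing parts alongside the score keeps the failure inspectable: a reviewer sees which clause was dropped, not just that something was.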

Reading the four together on a dashboard

TIP

Track correctness, relevancy, and completeness as separate series from faithfulness. The isolated movement is the diagnostic. A drop in relevancy alone, with faithfulness and context metrics flat, points at a prompting problem, not retrieval: the generator is answering a nearby question. A drop in correctness with faithfulness pinned at 1.0 points at stale or wrong context, not the generator. A drop in completeness with relevancy flat points at truncation or a prompt that stops early. Collapse them into one "quality" number and every one of these signals disappears.
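The isolated-movement rules above can be encoded as a small release-over-release check. This is a sketch under stated assumptions: the metric keys, the `diagnose` name, and the 0.05 tolerance are all hypothetical, and both dictionaries are expected to carry the same keys.

```python
def diagnose(
    prev: dict[str, float],
    curr: dict[str, float],
    tol: float = 0.05,  # assumed threshold for "a drop"; tune per metric
) -> list[str]:
    """Map isolated metric drops between two releases to a likely cause."""
    dropped = {name for name in prev if prev[name] - curr[name] > tol}

    def held(name: str) -> bool:
        return name not in dropped

    findings = []
    context_flat = held("context_precision") and held("context_recall")
    if "relevancy" in dropped and held("faithfulness") and context_flat:
        findings.append("prompting: generator is answering a nearby question")
    if "correctness" in dropped and curr["faithfulness"] >= 1.0 - tol:
        findings.append("retrieval: stale or wrong context")
    if "completeness" in dropped and held("relevancy"):
        findings.append("generation: truncation or a prompt that stops early")
    return findings


# Illustrative numbers only: correctness falls while faithfulness stays at 1.0.
previous = {
    "faithfulness": 1.0, "correctness": 0.9, "relevancy": 0.85,
    "completeness": 0.8, "context_precision": 0.7, "context_recall": 0.75,
}
current = dict(previous, correctness=0.6)
report = diagnose(previous, current)
```

A single blended quality score would move in this scenario too, but it could not say which of the three causes applies.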

The generation side has four axes and they fail independently. Faithfulness catches fabrication, correctness catches wrongness, relevancy catches drift, completeness catches omission. Log all four per release and per slice; the generation metrics hub lists the full set and where each one belongs in the pipeline ^[2].
