RAG generation metrics: faithfulness to completeness

Generation quality is a vector, not a scalar. Faithfulness, groundedness, correctness, answer relevancy, and completeness each catch a different failure.

RAG generation metrics measure whether the answer, given the retrieved context, is faithful, grounded, correct, relevant, and complete. These are five distinct dimensions that compete with each other, and no single score captures them; an answer can be perfectly faithful to the context and still factually wrong, or word-for-word grounded and still miss half the question.

Once retrieval is doing its job, the generation stage owns a different set of failures than the retrieval stage. This page is the map. It does not restate the mechanics of each metric; the linked chapters do that. It tells you which metric answers which question, why collapsing them into one number destroys the signal, and how to read the five together.

The five dimensions

Split generation quality along the axis of what each metric can catch. Faithfulness and groundedness both ask about the relationship between the answer and the context, and neither needs a reference answer. Correctness, answer relevancy, and completeness compare the answer to an external standard: reality, the question, or a rubric. The distinction between "faithful to the context" and "correct against reality" is the one teams collapse most often and regret ^[1].

Metric	Question it answers	Needs a reference answer?	Failure it catches	Canonical home
Faithfulness	Does the answer avoid contradicting the retrieved context?	No	Hallucination relative to context	faithfulness
Groundedness	Is each claim attributable to a specific span?	No	Unsupported inference	groundedness
Correctness	Is the answer actually right vs reality?	Yes	Factually wrong despite being faithful	answer correctness
Answer relevancy	Does it address the question asked?	No	On-context but off-question	answer correctness
Completeness	Does it cover every required part?	Rubric or reference	Silent omission	answer correctness

Faithfulness versus groundedness

The two look identical until you read the failure column. Faithfulness asks whether the answer contradicts the context. Groundedness asks whether every claim traces to a specific span. An answer can be faithful (nothing it says conflicts with the context) while being ungrounded (it asserts a claim the context neither states nor contradicts, pulled from parametric memory). Both are measured by decomposing the answer into atomic claims and judging each against the context, the FActScore and Ragas pattern ^[1] ^[2]. The difference is the label: faithfulness fails on CONTRADICTED, groundedness fails on anything short of SUPPORTED-WITH-SPAN.
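One way to read the label distinction above is as two pass rules over the same per-claim verdicts. The sketch below is illustrative only: the `ClaimVerdict` type, the `NOT_IN_CONTEXT` label, and the example claims are assumptions made for this example, not the output format of FActScore, Ragas, or any other library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClaimVerdict:
    """Hypothetical judge output for one atomic claim."""
    claim: str
    label: str                  # assumed label set: SUPPORTED, NOT_IN_CONTEXT, CONTRADICTED
    span: Optional[str] = None  # context span the judge cites, if any


def faithfulness(verdicts: list[ClaimVerdict]) -> float:
    """Fraction of claims that do not contradict the context."""
    if not verdicts:
        return 0.0
    ok = sum(1 for v in verdicts if v.label != "CONTRADICTED")
    return ok / len(verdicts)


def groundedness(verdicts: list[ClaimVerdict]) -> float:
    """Fraction of claims that are supported and cite a specific span."""
    if not verdicts:
        return 0.0
    ok = sum(1 for v in verdicts if v.label == "SUPPORTED" and v.span)
    return ok / len(verdicts)


verdicts = [
    ClaimVerdict("The rate limit is 100 requests per minute.", "SUPPORTED", "limit: 100 req/min"),
    ClaimVerdict("The limit resets every minute.", "SUPPORTED", "resets each minute"),
    # Pulled from parametric memory: the context neither states nor contradicts it.
    ClaimVerdict("Enterprise plans get a higher limit.", "NOT_IN_CONTEXT"),
]

print(faithfulness(verdicts))  # nothing contradicts the context
print(groundedness(verdicts))  # only 2 of 3 claims trace to a span
```

The third claim is what separates the two scores: it passes the faithfulness rule because nothing in the context conflicts with it, and fails the groundedness rule because no span supports it.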

Faithfulness versus correctness

This is the pair that costs teams the most. Faithfulness is a relationship between the answer and the retrieved context. Correctness is a relationship between the answer and reality, which means it requires a reference answer ^[3]. The two diverge in both directions. A model can faithfully report stale context and be wrong. A model can ignore correct context, answer from memory, be right, and score low on faithfulness. You need both because each isolates a different root cause: faithfulness points at the generator, correctness points at the corpus.

Why not collapse these into one score?

Because the combination is the diagnosis, and the average erases it. Read the pattern across the five, not the mean.

High faithfulness, low correctness: the context is stale or wrong. The generator is doing its job; the corpus is not.
Low faithfulness: the model is ignoring the retrieved context and answering from parametric memory.
High on everything except relevancy: the answer is faithful and correct but does not address the question. That is a prompting problem, not a retrieval or corpus problem.
Low completeness with everything else high: the model is silently dropping required parts of a multi-part answer.
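The four patterns above can be sketched as a small lookup over the set of low dimensions. This is a rough illustration, not a calibrated rule set: the `diagnose` function, the dictionary keys, and the 0.7 threshold are all assumptions chosen for the example.

```python
DIMENSIONS = ("faithfulness", "groundedness", "correctness", "relevancy", "completeness")


def diagnose(vector: dict[str, float], threshold: float = 0.7) -> list[str]:
    """Map a five-dimension score vector to likely root causes.

    The threshold is an arbitrary illustrative cutoff; pick one from your own data.
    """
    low = {d for d in DIMENSIONS if vector[d] < threshold}
    findings = []
    if "correctness" in low and "faithfulness" not in low:
        findings.append("corpus: context is stale or wrong")
    if "faithfulness" in low:
        findings.append("generator: answering from parametric memory instead of context")
    if low == {"relevancy"}:
        findings.append("prompt: answer does not address the question")
    if low == {"completeness"}:
        findings.append("generator: silently dropping required parts")
    return findings


stale_corpus = {
    "faithfulness": 0.95, "groundedness": 0.90, "correctness": 0.40,
    "relevancy": 0.90, "completeness": 0.90,
}
print(diagnose(stale_corpus))
```

Averaging `stale_corpus` gives a respectable-looking number; the set of low dimensions is what points at the corpus.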

A 0.75 average hides whether every answer is 75 percent faithful or three quarters are perfect and one quarter fabricated. Those two distributions demand opposite fixes: the first is a systematic prompt or model issue, the second is a slice-specific failure you find by cutting the data. The average is the one number that cannot tell them apart. This is the same failure that reference-based summarization scores exhibit, where a single correlation number hides which examples the metric got right ^[4].
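A small sketch of how a mean can hide the distribution, using made-up per-answer faithfulness scores. The two score lists, the `summarize` helper, and the 0.5 failure cutoff are assumptions for illustration only.

```python
from statistics import mean

uniform = [0.75] * 8             # every answer is 75 percent faithful
bimodal = [1.0] * 6 + [0.0] * 2  # most answers perfect, a few fully fabricated


def summarize(scores: list[float], fail_below: float = 0.5) -> dict[str, float]:
    """Report the mean alongside two numbers that expose the shape."""
    return {
        "mean": mean(scores),
        "min": min(scores),
        "share_failing": sum(s < fail_below for s in scores) / len(scores),
    }


print(summarize(uniform))
print(summarize(bimodal))
```

Both lists share the same mean, but only the second has any answers below the failure cutoff. The minimum and the failing share separate the two cases; the mean does not.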

Where each metric sits in the pipeline

Retrieved context flows into the answer, and the five checks fan out from the answer. Two checks (faithfulness, groundedness) fold the context back in; three (correctness, relevancy, completeness) compare against an external standard.

flowchart TD
  Q["Query"] --> R["Retrieved context"]
  R --> A["Generated answer"]
  A --> F["Faithfulness: no contradiction vs context"]
  A --> G["Groundedness: each claim has a span"]
  A --> C["Correctness: right vs reality"]
  A --> V["Relevancy: addresses the question"]
  A --> P["Completeness: covers every part"]
  R -.context.-> F
  R -.context.-> G
  REF[("Reference / rubric")] -.-> C
  REF -.-> P
  Q -.question.-> V

A worked reading of the vector

Take one answer scored on all five. The query is "List the side effects and the recommended dose." The context covers side effects fully and states the dose once. The answer reports three side effects (all in the context), invents a fourth from memory, gets the dose right, and never mentions that the dose is once daily.

Dimension	Score	Why
Faithfulness	0.80	4 of 5 claims supported (three side effects and the dose); the invented side effect is unsupported
Groundedness	0.80	the same 4 of 5 claims trace to a span
Correctness	1.0	the dose and the three real side effects match reality
Answer relevancy	1.0	the answer is on-topic for both parts of the question
Completeness	0.67	2 of 3 required parts covered; the frequency is omitted

The mean is $(0.80 + 0.80 + 1.0 + 1.0 + 0.67) / 5 = 0.854$. That 0.85 tells you nothing actionable. The vector tells you exactly two things: the generator hallucinated one side effect (fix the prompt or add a stricter grounding constraint), and it dropped the dosing frequency (fix completeness, likely a coverage-rubric gap).

The formula the hub shares

Faithfulness, groundedness, and completeness are all supported-fraction metrics over decomposed claims. The shared shape:

$$\text{score} = \frac{\lvert \{\, c \in \text{Claims} : \text{verify}(c) = \text{SUPPORTED} \,\} \rvert}{\lvert \text{Claims} \rvert}$$

For faithfulness and groundedness, Claims are the atomic facts in the answer and verify checks each against the retrieved context. For completeness, Claims are the required points from the rubric or reference and verify checks each against the answer. Correctness and answer relevancy use their own definitions; see the answer correctness chapter.

How do you compute the vector in one pass?

Run the decomposition once, then apply the relevant verifier per metric. The atomic-fact score reused from the faithfulness chapter is the same function that backs groundedness and completeness; only the claim set and the verification target change.

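# generation_vector.py
from typing import Callable


def supported_fraction(claims: list[str], verify: Callable[[str], str]) -> float:
    """Fraction of claims the verifier labels SUPPORTED."""
    if not claims:
        return 0.0  # nothing to verify; scored as 0.0 by convention here
    supported = sum(1 for c in claims if verify(c) == "SUPPORTED")
    return supported / len(claims)


def generation_vector(
    answer_claims: list[str],
    required_points: list[str],
    context: str,
    answer: str,
    verify: Callable[[str, str], str],
) -> dict[str, float]:
    """Score one answer; verify(claim, target) returns a label such as SUPPORTED."""
    return {
        # answer claims judged against retrieved context
        "faithfulness": supported_fraction(
            answer_claims, lambda c: verify(c, context)),
        # required rubric points judged against the answer
        "completeness": supported_fraction(
            required_points, lambda c: verify(c, answer)),
    }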

The point is discipline, not cleverness: one decomposition, several verifiers, five numbers per example, never one average across them.

TIP

Read the vector of metrics per example, not one aggregate number. The aggregate is fine for a release gate once you know the distribution; it is useless for debugging. Log all five per answer and cut by slice (single-source, multi-source, multi-part question) before you touch the mean.
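A minimal sketch of cutting by slice before looking at the overall mean. The row layout, the `slice` field, and the `by_slice` helper are hypothetical; the scores are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def by_slice(rows: list[dict], metrics: list[str]) -> dict[str, dict[str, float]]:
    """Group per-answer score rows by slice and average each metric within the slice."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["slice"]].append(row)
    return {
        name: {m: mean(r[m] for r in members) for m in metrics}
        for name, members in groups.items()
    }


rows = [
    {"slice": "single-source", "faithfulness": 1.0, "completeness": 1.0},
    {"slice": "single-source", "faithfulness": 1.0, "completeness": 1.0},
    {"slice": "multi-part", "faithfulness": 1.0, "completeness": 0.5},
    {"slice": "multi-part", "faithfulness": 1.0, "completeness": 0.5},
]

report = by_slice(rows, ["faithfulness", "completeness"])
for name, scores in report.items():
    print(name, scores)
```

In this toy data the overall completeness mean is 0.75, but the per-slice view shows the whole deficit sits in multi-part questions.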

Tools that compute these

Both major RAG eval libraries, Ragas and DeepEval, ship these dimensions as named metrics, so you rarely implement the verifier yourself for a first pass. Ragas implements faithfulness, answer relevancy, and correctness as LLM-judge metrics with per-claim rationale ^[5]. DeepEval ships the same dimensions under its own metric names with pytest integration. Use the off-the-shelf metric for the first version; swap in a custom decomposer when your domain has unusual claim boundaries. The RAG eval with Ragas recipe wires the vector into a scored dataset end to end.

CAUTION

A judge from the same model family as the generator inflates faithfulness and correctness together, so the vector looks internally consistent while every number is high for the wrong reason. Calibrate each metric against human labels on a held-out set before trusting absolute values, and prefer an out-of-family judge.
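One way to run the calibration step is to compare judge labels with human labels on the held-out set using a chance-corrected agreement statistic. The sketch below uses Cohen's kappa, which is a common choice but not one this page prescribes; the label names and the two label lists are invented for illustration.

```python
from collections import Counter


def cohen_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    # Agreement expected if both raters labeled independently at their own base rates.
    expected = sum(h_counts[label] * j_counts[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


human = ["SUPPORTED", "SUPPORTED", "UNSUPPORTED", "UNSUPPORTED", "SUPPORTED", "UNSUPPORTED"]
judge = ["SUPPORTED", "SUPPORTED", "SUPPORTED", "UNSUPPORTED", "SUPPORTED", "SUPPORTED"]

print(round(cohen_kappa(human, judge), 3))
```

Here the judge agrees with the human on 4 of 6 claims, but it labels almost everything SUPPORTED, so the chance-corrected agreement is much lower than the raw 67 percent suggests. That is the inflation pattern the caution describes.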

For the retrieval half of the pipeline, pair these with context precision and recall and the IR metrics on the retrieval metrics hub. Generation metrics tell you whether the answer used the context well; retrieval metrics tell you whether the context was worth using.
