Positioning
The llama_index.core.evaluation module is LlamaIndex's first-party home for RAG evaluation. It ships faithfulness, relevancy, correctness, semantic similarity, and retrieval-quality metrics, plus utilities for generating synthetic questions from a corpus. LlamaIndex also integrates with UpTrain, DeepEval, Ragas, and Tonic, so teams that want a different metric library can keep the LlamaIndex pipeline and swap evaluators [1].
It fits teams already using LlamaIndex for ingestion, chunking, and retrieval, because the eval module avoids a parallel toolchain. For teams not using LlamaIndex, the integration story is less compelling than picking Ragas or DeepEval directly.
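To make "retrieval-quality metrics" concrete, the following is a minimal plain-Python sketch of two common ones, hit rate and mean reciprocal rank (MRR). It is an illustration under stated assumptions, not LlamaIndex's implementation: the function names and the node IDs are hypothetical.

```python
from typing import Sequence


def hit_rate(retrieved_ids: Sequence[str], expected_ids: Sequence[str]) -> float:
    """Return 1.0 if any expected ID appears among the retrieved IDs, else 0.0."""
    expected = set(expected_ids)
    return 1.0 if any(doc_id in expected for doc_id in retrieved_ids) else 0.0


def reciprocal_rank(retrieved_ids: Sequence[str], expected_ids: Sequence[str]) -> float:
    """Return 1/rank of the first relevant retrieved ID, or 0.0 if none is relevant."""
    expected = set(expected_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in expected:
            return 1.0 / rank
    return 0.0


# Hypothetical (retrieved, expected) node IDs for two queries.
runs = [
    (["n3", "n7", "n1"], ["n7"]),  # relevant node retrieved at rank 2
    (["n4", "n5", "n6"], ["n9"]),  # relevant node never retrieved
]

# Average each per-query metric over the whole eval set.
mean_hit_rate = sum(hit_rate(r, e) for r, e in runs) / len(runs)
mrr = sum(reciprocal_rank(r, e) for r, e in runs) / len(runs)
print(mean_hit_rate, mrr)
```

Hit rate only asks whether a relevant node was retrieved at all, while MRR also rewards ranking it near the top, so the two are usually reported together.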
Strengths
- In-package integration. Datasets, retrievers, query engines, and evaluators share the same abstractions. There is no impedance mismatch between the pipeline under test and the eval harness.
- Synthetic question generation. The RagDatasetGenerator builds query-context-answer triples from a corpus, which is useful for bootstrapping a small eval set before you have human-labeled data.
- Pluggable metric backends. UpTrain, DeepEval, Ragas, and Tonic adapters mean you can pick the metric library that suits your needs without rewriting the pipeline. Ragas's faithfulness, context precision, and context recall are the most commonly used [2].
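The pluggable-backend idea in the list above can be sketched as a small interface that the pipeline code depends on, with metric implementations swapped behind it. Everything in this block is hypothetical and for illustration only: `MetricBackend`, `Score`, `run_eval`, and the two toy token-overlap metrics are not part of LlamaIndex or of any of the listed libraries, and the toy metrics stand in for real LLM-judged ones.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class Score:
    metric: str
    value: float


class MetricBackend(Protocol):
    """Hypothetical interface that every swappable metric backend satisfies."""

    def score(self, query: str, answer: str, contexts: List[str]) -> Score: ...


class ContextOverlapBackend:
    """Toy groundedness proxy: share of answer tokens found in the contexts."""

    def score(self, query: str, answer: str, contexts: List[str]) -> Score:
        context_tokens = set(" ".join(contexts).lower().split())
        answer_tokens = answer.lower().split()
        hits = sum(tok in context_tokens for tok in answer_tokens)
        value = hits / len(answer_tokens) if answer_tokens else 0.0
        return Score("context_overlap", value)


class QueryCoverageBackend:
    """Toy relevancy proxy: share of query tokens repeated in the answer."""

    def score(self, query: str, answer: str, contexts: List[str]) -> Score:
        answer_tokens = set(answer.lower().split())
        query_tokens = query.lower().split()
        hits = sum(tok in answer_tokens for tok in query_tokens)
        value = hits / len(query_tokens) if query_tokens else 0.0
        return Score("query_coverage", value)


def run_eval(backend: MetricBackend, records: List[dict]) -> List[Score]:
    """Score every pipeline record with whichever backend was passed in."""
    return [backend.score(r["query"], r["answer"], r["contexts"]) for r in records]


# One hypothetical pipeline output: the query, the answer, and retrieved contexts.
records = [{
    "query": "what is the capital of france",
    "answer": "paris is the capital",
    "contexts": ["The capital of France is Paris"],
}]
overlap_scores = run_eval(ContextOverlapBackend(), records)
coverage_scores = run_eval(QueryCoverageBackend(), records)
print(overlap_scores[0], coverage_scores[0])
```

The point of the design is that `run_eval` and the records never change when the backend does, which is the property the adapters are meant to provide.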
Limitations
- Tied to LlamaIndex abstractions. If your stack is built on a different orchestration layer (LangChain, the raw OpenAI SDK, or custom code), the integration benefit disappears and you are better off using the metric library directly.
- Less polished UI than hosted hubs. LlamaIndex evals run from notebooks or scripts; there is no first-party dashboard. Pair with Phoenix, Langfuse, or Braintrust for visualization.
- Vendor-published benchmarks. Performance claims from the LlamaIndex project should be treated as vendor positioning.
Best fit
RAG evaluation for teams already building on LlamaIndex, synthetic dataset construction from a corpus when you have no human-labeled set, and pipelines that want to keep the eval logic in the same Python module as the retrieval and generation code.
Getting started
    from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
    from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator
    # In recent releases RagDatasetGenerator lives in the llama_dataset package;
    # adjust this import if your installed version differs.
    from llama_index.core.llama_dataset.generator import RagDatasetGenerator

    # Index a corpus and build a query engine.
    docs = SimpleDirectoryReader("data/").load_data()
    index = VectorStoreIndex.from_documents(docs)
    query_engine = index.as_query_engine()

    # Generate a synthetic eval set from the corpus.
    generator = RagDatasetGenerator.from_documents(docs, num_questions_per_chunk=2)
    dataset = generator.generate_questions_from_nodes()

    # Evaluate the pipeline. With no llm argument, the evaluators use the
    # globally configured LLM as the judge.
    faithfulness = FaithfulnessEvaluator()
    relevancy = RelevancyEvaluator()

    # Each generated example carries the question text in its `query` field.
    for example in dataset.examples:
        response = query_engine.query(example.query)
        print(faithfulness.evaluate_response(response=response))
        print(relevancy.evaluate_response(query=example.query, response=response))
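Each call returns a per-query result object rather than a bare number; in current releases it exposes fields such as passing, score, and feedback. Collect these results and pipe them into a hub or a notebook for visualization.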
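Before handing results to a dashboard, it often helps to reduce them to a pass rate and a mean score. The sketch below assumes each per-query result carries a pass/fail flag and a numeric score; `ResultStub` and `summarize` are hypothetical stand-ins so the example runs without an LLM, not LlamaIndex classes.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class ResultStub:
    """Stand-in for one evaluator result; only the two fields used here are modelled."""
    passing: Optional[bool]
    score: Optional[float]


def summarize(results: List[ResultStub]) -> dict:
    """Aggregate per-query results, skipping entries the judge could not score."""
    judged = [r.passing for r in results if r.passing is not None]
    scored = [r.score for r in results if r.score is not None]
    return {
        "n": len(results),
        "pass_rate": sum(judged) / len(judged) if judged else None,
        "mean_score": mean(scored) if scored else None,
    }


# Hypothetical results for four queries; the last one was not scored.
results = [
    ResultStub(passing=True, score=1.0),
    ResultStub(passing=False, score=0.0),
    ResultStub(passing=True, score=1.0),
    ResultStub(passing=None, score=None),
]
summary = summarize(results)
print(summary)
```

Keeping unscored entries out of the denominator avoids silently counting judge failures as pipeline failures; report `n` alongside the rates so the gap stays visible.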
Pricing notes
The LlamaIndex framework is open source under the MIT license. LlamaCloud is the hosted offering with paid plans; pricing is sales-led for higher tiers. The eval module itself runs locally with no service dependency, so the only direct cost is the per-token bill for the LLM judge used by the faithfulness, relevancy, and correctness scorers.
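A rough way to budget the judge bill is to multiply judge calls by tokens per call and by the provider's per-token price. The sketch below is back-of-the-envelope arithmetic only: the function name, token counts, and prices are placeholder assumptions, and it simplifies to one judge call per evaluator per query, which real evaluators may exceed.

```python
def estimate_judge_cost(
    num_queries: int,
    num_llm_evaluators: int,
    input_tokens_per_call: int,
    output_tokens_per_call: int,
    usd_per_1k_input: float,
    usd_per_1k_output: float,
) -> float:
    """Estimate the LLM-judge cost in USD for one evaluation run."""
    # Simplifying assumption: one judge call per evaluator per query.
    calls = num_queries * num_llm_evaluators
    input_cost = calls * input_tokens_per_call / 1000 * usd_per_1k_input
    output_cost = calls * output_tokens_per_call / 1000 * usd_per_1k_output
    return input_cost + output_cost


# Placeholder numbers: 200 queries, 2 LLM-judged metrics, made-up prices.
cost = estimate_judge_cost(
    num_queries=200,
    num_llm_evaluators=2,
    input_tokens_per_call=1500,
    output_tokens_per_call=100,
    usd_per_1k_input=0.01,
    usd_per_1k_output=0.03,
)
print(f"Estimated judge cost: ${cost:.2f}")
```

Input tokens dominate in this example because each judge prompt includes the retrieved contexts, so chunk size and top-k settings affect evaluation cost as well as answer quality.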
Alternatives
Consider Ragas for RAG-specific metrics with no LlamaIndex coupling. Consider DeepEval for pytest-style ergonomics. Consider TruLens for feedback functions that compose. For tracing and observability, pair with Arize Phoenix or Langfuse.
Citations and last verified
Verified 2026-05-29 against the LlamaIndex evaluation module guide.