RAG Evals
A RAG system fails when retrieval misses the relevant chunk, when generation ignores the relevant chunk it was given, or when generation invents facts the chunks do not support. Each of those failures has a different fix. A retrieval miss demands a different chunking strategy, embedding model, or reranker. A generation that ignores good context demands a prompt change. A faithfulness failure, where the answer asserts facts the chunks do not support, demands a tighter judge and possibly atomic-fact decomposition. You cannot prescribe the right fix until you can attribute the failure to a layer.
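One way to make that attribution concrete is a small triage function over labeled eval records. The sketch below is illustrative only: the `EvalRecord` fields, the `attribute_failure` name, and the label strings are hypothetical, and it assumes each question already has gold chunk ids plus two per-answer labels (correct, and supported by the retrieved chunks) from a human or a judge.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    # All field names here are assumptions made for this sketch.
    retrieved_ids: list[str]  # chunk ids returned by the retriever, in rank order
    relevant_ids: set[str]    # gold chunk ids labeled for the question
    answer_correct: bool      # the answer matches the reference answer
    answer_supported: bool    # every claim in the answer is backed by the retrieved chunks


def attribute_failure(rec: EvalRecord) -> str:
    """Assign a failed example to the layer that most plausibly caused it."""
    if rec.answer_correct and rec.answer_supported:
        return "ok"
    # No gold chunk reached the generator: the problem is upstream of generation.
    if not rec.relevant_ids & set(rec.retrieved_ids):
        return "retrieval_miss"
    # A gold chunk was supplied, but the answer asserts unsupported facts.
    if not rec.answer_supported:
        return "unfaithful_generation"
    # A gold chunk was supplied and nothing was invented, yet the answer is wrong.
    return "context_ignored"


if __name__ == "__main__":
    rec = EvalRecord(["c1", "c2"], {"c9"}, answer_correct=False, answer_supported=False)
    print(attribute_failure(rec))
```

Checking retrieval first is a deliberate ordering in this sketch: when no gold chunk was retrieved, the example says little about the generator, so it is counted against retrieval regardless of what the answer looks like.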
That is the spine of every page in this section. Evaluate retrieval as its own thing with information-retrieval metrics; evaluate generation as its own thing with faithfulness metrics that operate on the supplied context; only then combine them into end-to-end scores. The Ragas paper and the production cookbooks built around it formalize the same split, and the practical literature has converged on this two-layer pattern.
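As a rough illustration of the split, the sketch below scores the two layers separately before anything is combined. The function names are hypothetical, and the faithfulness score is a simplified stand-in (supported claims divided by total claims, with the claim list and the `is_supported` judge supplied by the caller), not the API of Ragas or any other library.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Retrieval layer: fraction of gold chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved[:k])) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """Retrieval layer: 1 / rank of the first gold chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def faithfulness(claims: list[str], context: str,
                 is_supported: Callable[[str, str], bool]) -> float:
    """Generation layer: share of answer claims the supplied context supports.

    `is_supported` stands in for a judge (an LLM or NLI model in practice).
    Returning 0.0 for an empty claim list is an arbitrary convention of this sketch.
    """
    if not claims:
        return 0.0
    return sum(is_supported(claim, context) for claim in claims) / len(claims)


if __name__ == "__main__":
    retrieved = ["a", "b", "c"]
    print(recall_at_k(retrieved, {"b", "z"}, k=2))
    print(reciprocal_rank(retrieved, {"b"}))
    # Toy judge for demonstration only: exact substring match against the context.
    toy_judge = lambda claim, ctx: claim in ctx
    print(faithfulness(["Paris is the capital", "Population is 2 million"],
                       "Paris is the capital of France.", toy_judge))
```

Note that `faithfulness` never sees the gold chunks or the reference answer: it is computed against the context the generator was actually given, which is what keeps the generation score independent of retrieval quality.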
Chapters: