Embeddings and retrieval quality

Embed queries and documents, then retrieve the top-K most relevant documents per query. Evaluate with NDCG@10 plus Recall@K on a domain test set; MTEB rankings do not reliably predict domain rankings.

What "good" looks like

A retrieval eval measures whether the embedding model puts the right documents near the top of the ranking for a query, scored against a labeled set from your own corpus. Public benchmark rank does not transfer, so a domain test set is required, not optional.

The metrics to report:

NDCG@10 (normalized discounted cumulative gain at rank 10): the standard when rank order matters.
Recall@K: the score that matters when downstream stages can re-rank, in which case getting the right docs in the top-K is what the embedding model owes you.
MRR (mean reciprocal rank): useful when only the first hit matters (single-document question answering, for instance).
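
The three metrics above can be sketched in a few lines. This is a minimal illustration that assumes binary relevance labels (a document is either gold or not) and a ranked list without duplicate IDs; the function names are illustrative, not taken from any particular library.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG: DCG of this ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    # The ideal ranking places every relevant document first.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical single query: gold docs d1 and d2 land at ranks 2 and 4.
ranked = ["d3", "d1", "d9", "d2"]
gold = {"d1", "d2"}
scores = {
    "recall@10": recall_at_k(ranked, gold),
    "rr": reciprocal_rank(ranked, gold),
    "ndcg@10": ndcg_at_k(ranked, gold),
}
```

MRR is the mean of `reciprocal_rank` over all queries. In this example Recall@10 is perfect while NDCG@10 is not, which is the gap between "the right docs are in the top-K" and "the right docs are at the top".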

The targets on a documentation corpus for RAG:

NDCG@10 in the 0.7+ range on a labeled query set is competitive.
Below 0.5 is product-blocking.
Always pair the absolute number with a comparison to the prior embedding model, since what counts as a good score depends on the corpus and use case.

On benchmark selection. MTEB is the canonical embedding benchmark and is genuinely useful for model selection at the macro level^[1]. The non-negotiable line is that MTEB rankings do not reliably predict domain rankings.

flowchart TD
    Q["Query set from your corpus"] --> EMB["Embedding model"]
    EMB --> TOPK["Retrieve top-K"]
    GOLD["10 gold docs per query, hard negatives included"] --> SCORE
    TOPK --> SCORE["Score NDCG@10, Recall@10, MRR per query"]
    SCORE --> PERQ{"Per-query regression vs prior model?"}
    PERQ -->|regression| FIX["Isolate the failing query slice, fix"]
    PERQ -->|clean| GATE{"NDCG@10 against targets"}
    GATE -->|below 0.5| BLOCK["Product-blocking"]
    GATE -->|0.7+| SHIP["Ship"]

Secondary metrics

Metric	What it catches	How to score
Latency P95	Slow retrieval per query	P95 latency per query
Dimension versus recall	Cost of embedding size on quality	Trade-off across dense models with different dimensions
Dense-versus-hybrid uplift	Gain from combining dense retrieval with BM25	Uplift of hybrid over dense alone
Drift over time	The same query degrading as documents are added or removed	Re-score as the corpus changes
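
Two of the rows above reduce to simple arithmetic. The sketch below assumes a nearest-rank definition of P95 (other percentile conventions give slightly different values) and reports hybrid uplift both as an absolute and as a relative gain; the function names and sample numbers are illustrative only.

```python
def p95(latencies_ms):
    """Nearest-rank 95th percentile of per-query latencies."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n, which avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]

def uplift(hybrid_score, dense_score):
    """Absolute and relative gain of hybrid retrieval over dense alone."""
    absolute = hybrid_score - dense_score
    relative = absolute / dense_score if dense_score else float("inf")
    return absolute, relative

# Hypothetical numbers: 100 latency samples and two NDCG@10 scores.
latency_p95 = p95(list(range(1, 101)))
absolute_gain, relative_gain = uplift(hybrid_score=0.66, dense_score=0.60)
```

Reporting both forms of uplift helps when comparing corpora, because the same absolute gain means more on a low dense baseline than on a high one.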

Offline eval design

Start with breadth. Run BEIR via MTEB for breadth across retrieval task types^[2].

Always add a domain test set. Use 50 queries paired with 10 gold docs each from your actual corpus. The labeled list for each query should also include hard negatives (documents that look similar to the gold but are wrong), marked as non-relevant. Without hard negatives, you measure recall on easy cases that all models pass.
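
One possible shape for such a test set, with a sanity check before it is used for scoring. This is a sketch under assumptions: hard negatives are stored next to the gold IDs and labeled as non-relevant, and the `LabeledQuery` and `validate` names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class LabeledQuery:
    query_id: str
    text: str
    gold_ids: set  # documents judged relevant
    hard_negative_ids: set = field(default_factory=set)  # similar but wrong

def validate(test_set, corpus_ids, gold_per_query=10):
    """Return a list of problems; an empty list means the set is usable."""
    problems = []
    for q in test_set:
        if len(q.gold_ids) != gold_per_query:
            problems.append(
                f"{q.query_id}: expected {gold_per_query} gold docs, "
                f"got {len(q.gold_ids)}"
            )
        if not q.hard_negative_ids:
            problems.append(f"{q.query_id}: no hard negatives")
        if q.gold_ids & q.hard_negative_ids:
            problems.append(f"{q.query_id}: doc labeled both gold and negative")
        missing = (q.gold_ids | q.hard_negative_ids) - corpus_ids
        if missing:
            problems.append(f"{q.query_id}: not in corpus: {sorted(missing)}")
    return problems
```

Running the check whenever the corpus changes catches gold documents that were deleted or renamed, which would otherwise show up as a silent drop in recall.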

Score per query, not just in aggregate. Score NDCG@10, Recall@10, and MRR per query. Report per-query regressions, not just averages: per-query analysis is what surfaces the failure modes that aggregate scores hide.
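
A per-query regression report can be as small as a dictionary diff. The sketch below assumes each model's scores are stored as a mapping from query ID to a single metric (NDCG@10, say); the `tolerance` parameter and the sample scores are illustrative.

```python
def per_query_regressions(prior, candidate, tolerance=0.0):
    """List queries where the candidate model scores worse than the prior.

    prior, candidate: dict mapping query_id -> metric score.
    Returns (query_id, prior_score, candidate_score, delta), worst first.
    """
    if prior.keys() != candidate.keys():
        raise ValueError("both models must be scored on the same queries")
    regressions = []
    for query_id, old in prior.items():
        delta = candidate[query_id] - old
        if delta < -tolerance:
            regressions.append((query_id, old, candidate[query_id], delta))
    return sorted(regressions, key=lambda row: row[3])

def mean(scores):
    return sum(scores.values()) / len(scores)

# Hypothetical scores: the average improves while q3 gets much worse.
prior_scores = {"q1": 0.80, "q2": 0.60, "q3": 0.90}
candidate_scores = {"q1": 0.95, "q2": 0.90, "q3": 0.60}
report = per_query_regressions(prior_scores, candidate_scores)
```

Here the candidate's mean is higher than the prior's, yet `report` still flags `q3`. An aggregate-only comparison would miss it.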

Online eval design

Live signals to track:

CTR (click-through rate) at position
Dwell time per result
No-click rate
Rephrase rate (user retries a different query without clicking anything)

The single best signal. The rephrase-after-no-click rate is the most useful online signal: it identifies query distributions where the embedding model is failing without requiring labels.
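
A rough way to compute this signal from a search log. The sketch assumes events arrive as `(session_id, query, clicked)` tuples in time order, and it defines the rate as no-click searches followed by a different query in the same session, divided by all searches. Both the log shape and the choice of denominator are assumptions to adapt to your own logging.

```python
from collections import defaultdict

def rephrase_after_no_click(events):
    """Return (rate, abandoned_queries) from time-ordered search events."""
    by_session = defaultdict(list)
    for session_id, query, clicked in events:
        by_session[session_id].append((query, clicked))

    total = 0
    abandoned = []
    for searches in by_session.values():
        total += len(searches)
        # Compare each search with the next one in the same session.
        for (query, clicked), (next_query, _) in zip(searches, searches[1:]):
            changed = next_query.strip().lower() != query.strip().lower()
            if not clicked and changed:
                abandoned.append(query)
    rate = len(abandoned) / total if total else 0.0
    return rate, abandoned

# Hypothetical log: only the first search in s1 is rephrased without a click.
log = [
    ("s1", "reset api key", False),
    ("s1", "rotate api token", True),
    ("s2", "webhook retries", True),
    ("s3", "sso setup", False),
]
rate, abandoned_queries = rephrase_after_no_click(log)
```

The list of abandoned queries is the useful part: clustering it shows which query distributions the embedding model is failing on.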

The calibration set. Alert on recall against a periodically labeled "must-find" set: a fixed list of queries with known-correct documents, scored monthly. The must-find set is the long-running calibration that catches drift the live metrics do not.
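
The must-find check can be sketched as follows, assuming the retriever is exposed as a callable `retrieve(query, k)` that returns ranked document IDs. The `max_drop` alert threshold is a hypothetical parameter, not a recommended value.

```python
def must_find_recall(must_find, retrieve, k=10):
    """Recall over a fixed must-find set, plus the misses per query.

    must_find: dict mapping query -> set of known-correct doc IDs.
    retrieve: callable (query, k) -> ranked list of doc IDs.
    """
    found = 0
    total = 0
    misses = {}
    for query, expected in must_find.items():
        top_k = set(retrieve(query, k))
        missing = expected - top_k
        found += len(expected) - len(missing)
        total += len(expected)
        if missing:
            misses[query] = missing
    recall = found / total if total else 0.0
    return recall, misses

def should_alert(current_recall, baseline_recall, max_drop=0.05):
    """Alert when recall has fallen more than max_drop below the baseline."""
    return baseline_recall - current_recall > max_drop

# Hypothetical retriever backed by a fixed lookup table.
fake_index = {"q1": ["a", "x", "b"], "q2": ["y", "z"]}
recall, missed = must_find_recall(
    {"q1": {"a", "b"}, "q2": {"c"}},
    retrieve=lambda query, k: fake_index[query][:k],
)
```

Because the query list is fixed, a change in `recall` between runs reflects a change in the corpus or the model rather than a change in traffic.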

Top benchmarks

MTEB, https://github.com/embeddings-benchmark/mteb. Broad cross-task embedding evaluation; the standard for model selection^[1].
BEIR, https://arxiv.org/abs/2104.08663. The 18-dataset heterogeneous retrieval benchmark MTEB extends^[2].
Your domain test set. 50 queries, 10 gold docs each, hard negatives included. The eval that predicts production quality.

Tools especially suited

MTEB harness: parity with public scores and cross-model comparison.
SimCSE: fine-tuning baselines on contrastive learning^[3].
Ragas: retrieval-specific metrics in the RAG context.
Inspect AI: sandboxed reproducible runs across embedding-model swaps.

Common pitfalls

Pitfall	Why it bites	Fix
Picking by MTEB score alone	The single most common mistake; MTEB rankings correlate weakly with domain rankings	Pair every public number with a domain set
No hard-negative eval	NDCG@10 is biased upward because the model only has to beat random retrieval, not similar-but-wrong retrieval	Include hard negatives, labeled as non-relevant, alongside the gold docs
Forgetting that reranker eval is separate	In a two-stage system (embedding then cross-encoder reranker), the two stages have independent failure modes	Isolate each stage in the eval
Skipping the "must-find" set	Without a labeled query list re-scored monthly, drift detection is impossible	Maintain a must-find set scored monthly^[4]
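
For the two-stage pitfall, one way to isolate the stages is to attribute each failed query to the stage that lost the gold document. The sketch assumes you can log both the first-stage candidate list and the final reranked list per query; the label names are illustrative.

```python
from collections import Counter

def attribute_failure(gold_ids, candidate_ids, reranked_ids, k=10):
    """Classify one query as ok, retriever_miss, or reranker_miss."""
    if not gold_ids & set(candidate_ids):
        # The embedding stage never surfaced a gold doc.
        return "retriever_miss"
    if not gold_ids & set(reranked_ids[:k]):
        # A gold doc was available but the reranker pushed it out of the top k.
        return "reranker_miss"
    return "ok"

def failure_breakdown(rows, k=10):
    """rows: iterable of (gold_ids, candidate_ids, reranked_ids) per query."""
    return Counter(attribute_failure(g, c, r, k) for g, c, r in rows)

# Hypothetical logs for three queries, scored at k=2.
rows = [
    ({"d1"}, ["d1", "d2", "d3"], ["d1", "d3", "d2"]),
    ({"d4"}, ["d5", "d6"], ["d6", "d5"]),
    ({"d7"}, ["d8", "d7", "d9"], ["d8", "d9", "d7"]),
]
breakdown = failure_breakdown(rows, k=2)
```

A breakdown dominated by `retriever_miss` points at the embedding model or the candidate depth, while `reranker_miss` points at the cross-encoder.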

Canonical reading

The BEIR paper (Thakur et al., 2021) is the canonical reference because it operationalizes the heterogeneous retrieval evaluation that MTEB then generalized to embedding models more broadly^[2]. The headline finding (rankings vary substantially across task types) is the operational thesis that justifies always pairing public benchmarks with domain sets. The MTEB repo is the practical follow-on^[1].

Day-1 starter

50 queries paired with 10 gold docs each from your corpus. Recall@10 and NDCG@10 per query. Per-query regression analysis on every embedding-model swap.
BEIR-style hard negatives. Without them, the headline scores overstate quality.
Wire to CI. Track per-query regressions, not just averages. Aggregate scores hide failure modes that are critical for the 5% of queries they affect.

Cross-cutting links

Retrieval metrics: Recall@K, MRR, NDCG, hit-rate: the primary metrics in depth.
Eval RAG in two layers: why retrieval and generation get separate scorecards.
Embedding similarity for evals: embeddings as a measurement instrument, not just a retrieval component.
Ragas: context precision and recall scorers on top of your retriever.
Synthetic query generation for golden sets: generating the 50-query gold set without weeks of labeling.
RAG eval with Ragas: the retrieval scorecard wired into CI.