What "good" looks like
The primary metric is NDCG@10 (normalized discounted cumulative gain at rank 10), reported alongside Recall@K and MRR (mean reciprocal rank) on domain-appropriate splits. NDCG@10 is the standard for retrieval where rank order matters. Recall@K matters when downstream stages can re-rank: getting the right docs into the top K is then what the embedding model owes you. MRR is useful when only the first hit matters (single-document question answering, for instance).
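A minimal sketch of the three metrics for a single query, assuming document ids are strings and relevance labels are a dict of graded gains. The function names are illustrative, not taken from any particular library; MRR is the mean of `reciprocal_rank` over all queries.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (rank 1 is index 0)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_ids, relevance, k=10):
    """relevance maps doc id -> graded gain; ids missing from it count as 0."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant docs that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant doc, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Hypothetical query: two gold docs, only one retrieved, at rank 2.
ranked = ["d3", "d1", "d7"]
relevance = {"d1": 1.0, "d2": 1.0}
scores = {
    "ndcg@10": ndcg_at_k(ranked, relevance),
    "recall@10": recall_at_k(ranked, relevance.keys()),
    "rr": reciprocal_rank(ranked, relevance.keys()),
}
```

With binary labels every gain is 1.0, so NDCG reduces to rewarding gold docs for appearing early in the list.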
MTEB is the canonical embedding benchmark and is genuinely useful for model selection at the macro level[1]. The non-negotiable line is that MTEB rankings do not reliably predict domain rankings. A domain test set is required, not optional.
The "good" target depends on the corpus and the use case. For RAG over a documentation corpus, NDCG@10 in the 0.7+ range on a labeled query set is competitive; below 0.5 is product-blocking. Always pair the absolute number with a comparison to the prior embedding model, scored on the same query set.
Secondary metrics
- Latency: P95 per query.
- Embedding dimension versus recall: the trade-off that informs the choice between dense models with different dimensions.
- Dense-versus-hybrid uplift: the gain from combining dense retrieval with BM25.
- Drift over time as the corpus changes: the same query can degrade in retrieval quality when documents are added or removed.
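To make the dense-versus-hybrid uplift concrete, here is a small sketch that fuses a dense ranking with a BM25 ranking and compares recall before and after. The fusion method is an assumption: the text does not say how the two are combined, so this uses reciprocal rank fusion (RRF) with the conventional constant k=60. The rankings and gold labels are made up for illustration.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant docs that appear in the top k."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

# Hypothetical rankings for one query from the two retrievers.
dense = ["a", "b", "c", "d"]
bm25 = ["e", "a", "f", "b"]
gold = {"a", "e"}

hybrid = rrf_fuse([dense, bm25])
dense_recall = recall_at_k(dense, gold, k=3)
hybrid_recall = recall_at_k(hybrid, gold, k=3)
uplift = hybrid_recall - dense_recall
```

Averaging `uplift` over the labeled query set gives the headline number; the per-query values show which query types benefit from the lexical signal.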
Offline eval design
Use BEIR via MTEB for breadth across retrieval task types[2]. Always add a domain test set: 50 queries, each paired with 10 gold docs from your actual corpus. The labeled set for each query should also include hard negatives (documents that look similar to the gold docs but are wrong), marked as non-relevant; without hard negatives, you measure recall on easy cases that all models pass.
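One cheap way to propose hard-negative candidates is to look for non-gold documents that overlap lexically with a gold document. The sketch below assumes token-level Jaccard similarity as the "looks similar" signal, which is an illustrative choice rather than a method the text prescribes. The candidates still need a human check, since an unlabeled document that scores high may turn out to be relevant.

```python
def tokens(text):
    """Lowercased whitespace tokens as a set."""
    return set(text.lower().split())

def jaccard(a, b):
    """Size of the intersection over size of the union of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mine_hard_negatives(gold_text, corpus, gold_ids, n=5):
    """Return up to n non-gold doc ids most lexically similar to gold_text."""
    gold_tokens = tokens(gold_text)
    scored = [
        (jaccard(gold_tokens, tokens(text)), doc_id)
        for doc_id, text in corpus.items()
        if doc_id not in gold_ids
    ]
    # Most similar first; ties broken by doc id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Drop docs with no overlap at all: they are easy negatives.
    return [doc_id for score, doc_id in scored[:n] if score > 0]

# Hypothetical four-document corpus with one gold doc.
corpus = {
    "g1": "reset your account password",
    "n1": "reset your account email",
    "n2": "billing invoice history",
    "n3": "delete your account",
}
candidates = mine_hard_negatives(corpus["g1"], corpus, gold_ids={"g1"})
```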
Score NDCG@10, Recall@10, and MRR per query. Report per-query regressions, not just averages; per-query analysis is what surfaces the failure modes that aggregate scores hide.
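A sketch of per-query regression reporting, assuming each model's scores are stored as a dict from query id to NDCG@10. The query ids and numbers are invented; the point is that the candidate model can win on average while still breaking an individual query.

```python
def per_query_regressions(baseline, candidate, tolerance=0.0):
    """Return (query_id, baseline, candidate, delta) for queries that got worse, worst first."""
    rows = []
    for qid in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[qid] - baseline[qid]
        if delta < -tolerance:
            rows.append((qid, baseline[qid], candidate[qid], delta))
    rows.sort(key=lambda row: row[3])
    return rows

def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0

# Hypothetical NDCG@10 per query for the prior and the new embedding model.
baseline = {"q1": 0.8, "q2": 0.6, "q3": 0.9}
candidate = {"q1": 1.0, "q2": 0.9, "q3": 0.5}

average_improved = mean(candidate.values()) > mean(baseline.values())
regressions = per_query_regressions(baseline, candidate)
```

Here the average goes up, yet `q3` drops sharply. An averages-only report would hide that case.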
Online eval design
Track CTR (click-through rate) by position, dwell time per result, no-click rate, and rephrase rate (the user retries with a different query without clicking anything). The most useful single online signal is the rephrase-after-no-click rate: it identifies query distributions where the embedding model is failing without requiring labels. Alert on recall against a periodically labeled "must-find" set: a fixed list of queries with known-correct documents, scored monthly. The must-find set is the long-running calibration that catches drift the live metrics do not.
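A sketch of computing the rephrase-after-no-click rate from a search log. The event schema (`session_id`, `query`, `clicked`) and the exact definition are assumptions: a query counts as a rephrase-after-no-click when it received no click and the next query in the same session has different text.

```python
def rephrase_after_no_click_rate(events):
    """events: dicts with session_id, query, clicked, in time order within each session."""
    by_session = {}
    for event in events:
        by_session.setdefault(event["session_id"], []).append(event)

    total = 0
    rephrased = 0
    for session_events in by_session.values():
        # Pair each query with the one that follows it (None for the last).
        followers = session_events[1:] + [None]
        for current, following in zip(session_events, followers):
            total += 1
            if (not current["clicked"]
                    and following is not None
                    and following["query"] != current["query"]):
                rephrased += 1
    return rephrased / total if total else 0.0

# Hypothetical log: one rephrase after a no-click, one satisfied query,
# and one abandoned query with no follow-up.
log = [
    {"session_id": "s1", "query": "reset pw", "clicked": False},
    {"session_id": "s1", "query": "password reset steps", "clicked": True},
    {"session_id": "s2", "query": "billing", "clicked": True},
    {"session_id": "s3", "query": "api limits", "clicked": False},
]
rate = rephrase_after_no_click_rate(log)
```

Grouping the rephrased queries by topic or intent, rather than reporting only the overall rate, is what points at the query distributions that are failing.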
Top benchmarks
- The MTEB harness, for parity with public scores and for cross-model comparison.
- SimCSE, for fine-tuning baselines on contrastive learning[3].
- Ragas, for retrieval-specific metrics in the RAG context.
- Inspect AI, for sandboxed reproducible runs across embedding-model swaps.
Common pitfalls
- Picking by MTEB score alone. The single most common mistake; MTEB rankings correlate weakly with domain rankings. Pair every public number with a domain set.
- No hard-negative eval. Without hard negatives, NDCG@10 is biased upward because the model only has to beat random retrieval, not similar-but-wrong retrieval.
- Forgetting that reranker eval is separate. If you ship a two-stage retrieval system (embedding then cross-encoder reranker), the two stages have independent failure modes and the eval should isolate each.
- Skipping the "must-find" set. Without a labeled query list re-scored monthly, drift detection is impossible[4].
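For the two-stage case, one way to isolate the stages is to attribute each missed gold doc either to the retriever (never in the candidate list) or to the reranker (in the candidates but pushed out of the final top K). The sketch below assumes you log both the first-stage candidate list and the reranked list per query; the function name and cutoffs are illustrative.

```python
def stage_report(candidates, reranked, gold, k_retrieve=100, k_final=10):
    """Separate first-stage misses from reranker misses for one query."""
    retrieved = set(candidates[:k_retrieve])
    reachable = gold & retrieved  # gold docs the reranker had a chance to surface
    surfaced = gold & set(reranked[:k_final])
    return {
        "retriever_recall": len(reachable) / len(gold) if gold else 0.0,
        # None when the retriever found nothing: the reranker cannot be judged.
        "reranker_recall_given_retrieved": (
            len(surfaced) / len(reachable) if reachable else None
        ),
        "lost_by_retriever": sorted(gold - retrieved),
        "lost_by_reranker": sorted(reachable - surfaced),
    }

# Hypothetical query: one gold doc never retrieved, one retrieved but
# ranked below the final cutoff by the reranker.
report = stage_report(
    candidates=["a", "b", "c", "d"],
    reranked=["c", "b", "a", "d"],
    gold={"a", "e"},
    k_retrieve=4,
    k_final=2,
)
```

Swapping the embedding model should move `retriever_recall` and `lost_by_retriever`; a change in `lost_by_reranker` with the same candidates points at the reranker instead.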
Canonical reading
The BEIR paper (Thakur et al., 2021) is the canonical reference because it operationalizes the heterogeneous retrieval evaluation that MTEB then generalized to embedding models more broadly[2]. The headline finding (rankings vary substantially across task types) is the operational thesis that justifies always pairing public benchmarks with domain sets. The MTEB repo is the practical follow-on[1].
Day-1 starter
- 50 queries paired with 10 gold docs each from your corpus. Recall@10 and NDCG@10 per query. Per-query regression analysis on every embedding-model swap.
- BEIR-style hard negatives. Without them, the headline scores overstate quality.
- Wire to CI. Track per-query regressions, not just averages. Aggregate scores hide the failure modes that critically affect 5% of queries.
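The CI step in the checklist above could be sketched as a pass/fail gate over per-query scores. Both thresholds are assumptions chosen for illustration: a query counts as regressed if its score drops by more than `max_drop`, and the build fails if more than `max_regressed_fraction` of queries regress.

```python
def regression_gate(baseline, candidate, max_drop=0.05, max_regressed_fraction=0.05):
    """Return (passed, regressed_query_ids) for an embedding-model swap."""
    shared = sorted(baseline.keys() & candidate.keys())
    regressed = [q for q in shared if baseline[q] - candidate[q] > max_drop]
    fraction = len(regressed) / len(shared) if shared else 0.0
    return fraction <= max_regressed_fraction, regressed

# Hypothetical per-query NDCG@10: 20 queries, two of which collapse
# under the candidate model.
baseline = {f"q{i}": 0.8 for i in range(20)}
candidate = dict(baseline, q3=0.2, q7=0.1)

passed, regressed = regression_gate(baseline, candidate)
```

In a CI job, the script would print the `regressed` ids and exit non-zero when `passed` is false, so the failing queries show up in the build log rather than only in a dashboard average.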