Embedding similarity for evals

Cosine over sentence embeddings as an eval signal. When it works (retrieval, dedup, semantic regression). When it fails (open-ended quality, ranking systems).

Embedding similarity scores two pieces of text by the cosine of their embedding vectors. The embedder is a sentence-level encoder (SimCSE, BGE, E5, OpenAI text-embedding-3, Cohere embed-v3). For real text pairs the score typically lands in a narrow band, roughly [0.4, 1.0] depending on the embedder, so it is only useful on jobs where that band still separates the cases you care about.
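As a minimal sketch of the score itself, assuming the two embeddings are already available as NumPy arrays (the function name `cosine` is illustrative and not taken from any library):

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return float(np.dot(u, v) / denom)
```

When the embeddings are normalized to unit length up front, the denominator is 1 and the cosine reduces to a plain dot product.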

When embedding similarity works

Three eval jobs map cleanly to cosine over sentence embeddings.

Retrieval recall in RAG. Given a query, a ranked list of retrieved chunks, and a labeled gold passage, the retriever orders chunks by cosine to the query; recall@K is the fraction of queries whose gold passage lands in the top K. The metric is the ranking metric (recall@K, MRR, NDCG), not the cosine. The cosine is the underlying score. The retrieval metrics chapter covers the catalog.
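To make the split between "underlying score" and "ranking metric" concrete, here is a sketch of mean reciprocal rank computed from a query-by-chunk cosine matrix. The names `mean_reciprocal_rank`, `sims`, and `gold_idx` are illustrative assumptions, and each query is assumed to have exactly one gold chunk.

```python
import numpy as np

def mean_reciprocal_rank(sims: np.ndarray, gold_idx: np.ndarray) -> float:
    """sims[i, j]: cosine of query i and chunk j; gold_idx[i]: column of the gold chunk."""
    gold_scores = sims[np.arange(len(gold_idx)), gold_idx]
    # Rank of the gold chunk = 1 + number of chunks scoring strictly higher.
    # Ties are therefore resolved in favor of the gold chunk (an optimistic choice).
    ranks = 1 + (sims > gold_scores[:, None]).sum(axis=1)
    return float((1.0 / ranks).mean())
```

The cosine values only determine the ordering; two embedders with very different score ranges can produce the same MRR.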

Output deduplication. When sampling many candidates from the same prompt, near-duplicates share cosine above 0.95. Cluster on that threshold; report distinct-N. Useful for synthetic-data pipelines where lack of diversity is the failure mode.
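A minimal sketch of threshold clustering, assuming candidate embeddings are already computed. It uses a greedy first-representative scheme, which is one simple option rather than a prescribed algorithm; `distinct_clusters` is a hypothetical name, and the 0.95 default should be replaced by a threshold calibrated on your embedder.

```python
import numpy as np

def distinct_clusters(embeddings: np.ndarray, threshold: float = 0.95) -> list[list[int]]:
    """Greedy near-duplicate clustering: each row joins the closest earlier
    cluster representative if their cosine is at least `threshold`."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    reps: list[int] = []            # row index of each cluster's first member
    clusters: list[list[int]] = []  # member row indices per cluster
    for i, vec in enumerate(unit):
        if reps:
            sims = unit[reps] @ vec
            best = int(np.argmax(sims))
            if sims[best] >= threshold:
                clusters[best].append(i)
                continue
        reps.append(i)
        clusters.append([i])
    return clusters
```

The number of clusters returned, `len(distinct_clusters(emb))`, is the diversity figure to report; the result depends on input order, which is usually acceptable for a diversity check.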

Semantic regression detection. Compute cosine between the current run's output and the previous run's output on the same input. A drop below 0.7 flags a likely behavioral change worth a human look. A triage signal, not a quality score.
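A sketch of that triage step, assuming the previous and current outputs for the same inputs have been embedded row by row with the same embedder. `flag_regressions` is an illustrative name, and 0.7 is the example cutoff from the text, not a universal value.

```python
import numpy as np

def flag_regressions(prev_emb: np.ndarray, curr_emb: np.ndarray, threshold: float = 0.7) -> np.ndarray:
    """Return the row indices whose current output fell below the cosine cutoff."""
    prev_unit = prev_emb / np.linalg.norm(prev_emb, axis=1, keepdims=True)
    curr_unit = curr_emb / np.linalg.norm(curr_emb, axis=1, keepdims=True)
    # Row-wise dot product: cosine between the two runs' outputs for the same input.
    cosines = np.einsum("ij,ij->i", prev_unit, curr_unit)
    return np.flatnonzero(cosines < threshold)
```

The returned indices are a queue for human review, in keeping with the triage role described above.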

In all three jobs, the cosine value is interpreted against a calibrated threshold on the same embedder, not as an absolute quality number.

When embedding similarity fails

The seductive case that does not work: scoring generation quality by cosine to a reference. Yan documents the failure mode plainly. Embedder cosines compress the range so almost every reasonable output reads as 0.85 to 0.95 against a reasonable reference ¹. The metric loses discriminative power exactly where it would be most useful (separating a good summary from a great one).

Modern embedders are trained on contrastive objectives that pull semantically related text together. They are designed not to discriminate at the fine-grained level; that is the feature that makes them good retrievers and bad quality scorers. SimCSE and its descendants frame this through two properties: alignment (paraphrases land close together) and uniformity (embeddings spread out over the sphere). Training for alignment leaves on-topic, within-domain pairs bunched in a narrow high-cosine band, which is the compression seen above ².

The same compression undermines system ranking. If two summarization systems sit at 0.88 and 0.89 mean cosine, the ranking is within the noise floor.
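One way to check whether a gap like 0.88 versus 0.89 clears the noise floor is a paired bootstrap over per-example cosines. The sketch below is an assumption about how one might test it, not a procedure from the cited sources; `bootstrap_diff_ci` and its parameters are illustrative names.

```python
import numpy as np

def bootstrap_diff_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(scores_b) - mean(scores_a),
    where both systems were scored on the same examples in the same order."""
    diffs = np.asarray(scores_b, dtype=float) - np.asarray(scores_a, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample examples with replacement and recompute the mean difference each time.
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
    boot_means = diffs[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```

If the interval contains zero, the data do not support ranking one system above the other on mean cosine.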

Picking the embedder

MTEB is the operational reference for which embedder to use ³. Pick the top three on the sub-task that matches the eval job (retrieval for RAG recall, STS for semantic regression, clustering for deduplication) and benchmark them on a held-out slice of real product traffic.

Job	MTEB sub-task	Pick from
RAG retrieval	Retrieval, Reranking	top 3 on the relevant language slice
Output deduplication	STS, Clustering	top 3 on STS
Semantic regression	STS	top 3 on STS in the right language
Quality scoring	(nothing maps cleanly)	not the right job for embeddings

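The hosted options (OpenAI text-embedding-3, Cohere embed-v3) come as stable managed endpoints with pay-per-token pricing; open models (BGE, E5, Nomic) can be self-hosted on a GPU. Maximum input length varies widely, from 512 tokens on many BERT-derived models to 8k or more on long-context ones, and cross-language performance varies enormously, so check both against your own data before committing.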

Computing cosine at scale

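import numpy as np
from sentence_transformers import SentenceTransformer

def cosine_matrix(texts_a: list[str], texts_b: list[str], model: SentenceTransformer) -> np.ndarray:
    # With unit-normalized embeddings, the dot product equals the cosine.
    a = model.encode(texts_a, normalize_embeddings=True, batch_size=64, convert_to_numpy=True)
    b = model.encode(texts_b, normalize_embeddings=True, batch_size=64, convert_to_numpy=True)
    return a @ b.T  # shape: (len(texts_a), len(texts_b))

def recall_at_k(sims: np.ndarray, gold_idx: np.ndarray, k: int) -> float:
    # sims[i, j] scores candidate j for query i; gold_idx[i] is the gold candidate's column.
    top_k = np.argsort(-sims, axis=1)[:, :k]
    # A query counts as a hit when its gold column appears among its top-k columns.
    return float((top_k == gold_idx[:, None]).any(axis=1).mean())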

For production-scale retrieval, the dot product runs inside a vector index (FAISS, hnswlib, ScaNN). Eval-time mechanics are the same; production adds an ANN approximation that needs its own recall calibration against an exact-search baseline.
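The ANN calibration can be sketched without committing to a particular index library. The function below assumes `ann_ids` holds whatever neighbor ids the index returned for each query (a hypothetical input), and compares them against exact top-k from the full cosine matrix; `ann_recall_at_k` is an illustrative name.

```python
import numpy as np

def ann_recall_at_k(exact_sims: np.ndarray, ann_ids: np.ndarray, k: int) -> float:
    """Fraction of the exact top-k neighbors that the ANN index also returned."""
    exact_ids = np.argsort(-exact_sims, axis=1)[:, :k]
    # Per query, count how many exact neighbors appear in the ANN result.
    hits = [len(set(e) & set(a[:k])) for e, a in zip(exact_ids.tolist(), ann_ids.tolist())]
    return float(np.sum(hits) / (k * len(hits)))
```

A value well below 1.0 means the eval-time recall numbers overstate what the production index delivers, and the index parameters need tuning before the two are comparable.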

Calibrate the threshold

The number "0.85" has no universal meaning. It is the threshold below which one embedder reads two texts as different on one distribution. Calibrate: compute cosine on one hundred pairs of (current output, previous output) hand-labeled as identical, similar, or different, and plot the distribution per label. Fix the false-positive rate you can tolerate, pick the threshold that meets it, and read off the true-positive rate you get at that threshold. Recheck quarterly because embedder versions drift.
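A sketch of the calibration step, under two assumptions that are not in the text: the three labels are collapsed to a binary "different" versus "not different", and a pair is flagged when its cosine falls strictly below the cutoff. `calibrate_threshold` and `max_fpr` are illustrative names, and both classes are assumed to be present in the labeled set.

```python
import numpy as np

def calibrate_threshold(cosines, is_different, max_fpr=0.05):
    """Pick the largest cutoff t such that flagging `cosine < t` keeps FPR <= max_fpr.
    Returns (threshold, true-positive rate at that threshold)."""
    cosines = np.asarray(cosines, dtype=float)
    is_different = np.asarray(is_different, dtype=bool)
    best_t, best_tpr = None, 0.0
    # Candidate cutoffs: every observed cosine, plus one just above the maximum.
    candidates = np.append(np.unique(cosines), np.nextafter(cosines.max(), np.inf))
    for t in candidates:
        flagged = cosines < t
        fpr = flagged[~is_different].mean()  # unchanged pairs wrongly flagged
        tpr = flagged[is_different].mean()   # changed pairs correctly flagged
        if fpr <= max_fpr and tpr >= best_tpr:
            best_t, best_tpr = float(t), float(tpr)
    return best_t, best_tpr
```

With only one hundred labeled pairs the estimated rates are coarse, so the returned threshold is a starting point to revisit as more labels accumulate.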

Three honest uses

Where cosine over embeddings earns its keep:

As the retrieval signal in a RAG eval. Cosine ranks the retrieved chunks; metrics such as RAGAS context-precision and context-recall then score that ranking against a reference, in their default form with an LLM judge rather than raw cosine ⁴. The retrieval metrics chapter covers the catalog.

As a fast pre-filter before an LLM judge. When a judge call costs cents and the candidate set is large, cosine triage can cut roughly ninety percent of the trivially different candidates before the judge sees them.
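A sketch of the pre-filter, assuming candidates are compared against a single reference embedding and that `judge` is a hypothetical callable wrapping the LLM judge (its signature here is an assumption). `min_cosine` should come from calibration, not from a default.

```python
from typing import Callable, Optional, Sequence
import numpy as np

def judge_with_prefilter(
    candidates: Sequence[str],
    cand_emb: np.ndarray,
    ref_emb: np.ndarray,
    judge: Callable[[str], float],
    min_cosine: float,
) -> list[Optional[float]]:
    """Call the judge only on candidates whose cosine to the reference clears the cutoff."""
    cand_unit = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref_unit = ref_emb / np.linalg.norm(ref_emb)
    cosines = cand_unit @ ref_unit
    # None marks a candidate that was filtered out before the expensive judge call.
    return [judge(text) if cos >= min_cosine else None for text, cos in zip(candidates, cosines)]
```

The judge's verdict remains the metric; the cosine only decides which candidates are worth paying for.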

As a regression sentinel on synthetic-data pipelines. When a prompt or retrieval change drifts the output distribution, the cosine of the centroid against a held-out reference shifts. The signal triggers a human look, not an automatic decision.
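The centroid comparison can be sketched as follows, assuming the current batch and the held-out reference set were embedded with the same embedder. `centroid_cosine` is an illustrative name, and the alert level for the resulting value is something to calibrate per pipeline.

```python
import numpy as np

def centroid_cosine(current_emb: np.ndarray, reference_emb: np.ndarray) -> float:
    """Cosine between the mean embedding of the current batch and of the reference set."""
    cur = current_emb.mean(axis=0)
    ref = reference_emb.mean(axis=0)
    return float(cur @ ref / (np.linalg.norm(cur) * np.linalg.norm(ref)))
```

Tracking this value across runs gives a single number per batch; a sustained drop relative to its own history is the cue for the human look.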

In all three, cosine is a triage signal feeding a more rigorous metric. It is not the metric.

What it is not

Embedding similarity is not a hallucination detector (the embedder cannot tell a confident lie from a confident truth). It is not a faithfulness score (atomic-fact decomposition is the right pattern; see generation faithfulness). It is not a quality score (LLM-as-Judge is the right pattern). It is a fast, calibrated, distribution-bound signal of semantic proximity, and used on the right job it earns its place in the pipeline ⁵ ⁶.

¹ Eugene Yan, "Abstractive Summarization Eval."
² Gao et al., SimCSE, EMNLP 2021.
³ MTEB benchmark.
⁴ Es et al., RAGAS, 2023.
⁵ Eugene Yan, "LLM Evaluators."
⁶ Thakur et al., BEIR, NeurIPS 2021.