Non-English RAG eval

What to measure when your corpus and your users do not share a language with most of the literature.

Most public RAG literature evaluates English-only systems. Most production RAG systems do not serve English-only users. The gap is where teams discover, post-launch, that the embedding model collapses synonyms in Hindi, that the faithfulness judge slips into English when reasoning about Vietnamese, and that the answer reads as a translation rather than as native prose. This page lays out the dimensions you have to evaluate separately when the language is not English.

What changes outside English

Three things change. Embedding-model behavior varies across languages, sometimes drastically. Tokenization shifts: a word that costs one token in English can cost three or four in Devanagari or Chinese, changing both cost and effective context. Judge behavior shifts: a judge calibrated against English faithfulness labels does not generalize to other languages, and may default to scoring the English back-translation of the output rather than the output itself. Each shift is a separate failure surface, and a single end-to-end pass rate hides all three.

Retrieval: embedding and lexical

MTEB tracks per-language retrieval scores for most production embedding models, and the per-language deltas are large. A model averaging 60 on English may land at 40 on the target language; a multilingual-from-scratch model may invert the ranking ^[1]. Score your candidate embedding models on a labeled retrieval set in the target language, not on the English benchmark. The labeled set can be small (a hundred queries with gold chunks) and still distinguish a model that handles the language from one that does not.

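# embed_compare.py
# Compare candidate embedding models by Recall@K on a labeled target-language set.
from sentence_transformers import SentenceTransformer
import numpy as np

CANDIDATES = [
    "intfloat/multilingual-e5-large",
    "BAAI/bge-m3",
    "Alibaba-NLP/gte-multilingual-base",
]

def recall_at_k(model_name, queries, chunks, k=5):
    """queries: [{"query": str, "gold_chunk_id": ...}]; chunks: [{"id": ..., "text": str}]."""
    # Some checkpoints (for example gte-multilingual-base) ship custom modeling
    # code and need trust_remote_code=True; review that code before enabling it.
    model = SentenceTransformer(model_name, trust_remote_code=True)
    # Note: some models (for example the E5 family) expect "query: " and
    # "passage: " prefixes; follow each model card for a fair comparison.
    chunk_embs = model.encode([c["text"] for c in chunks], normalize_embeddings=True)
    query_embs = model.encode([q["query"] for q in queries], normalize_embeddings=True)
    # Map chunk id -> row index once instead of scanning the id list per query.
    row_of = {c["id"]: i for i, c in enumerate(chunks)}
    # Embeddings are L2-normalized, so the dot product is cosine similarity.
    sims = query_embs @ chunk_embs.T
    hits = 0
    for q, row in zip(queries, sims):
        top = np.argsort(-row)[:k]  # indices of the k most similar chunks
        hits += int(row_of[q["gold_chunk_id"]] in top)
    return hits / len(queries)

def compare(queries, chunks, k=5):
    # Call with your labeled target-language queries and chunks.
    for name in CANDIDATES:
        print(f"{name}: Recall@{k} = {recall_at_k(name, queries, chunks, k):.3f}")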

Same queries, same chunks, three multilingual models, one number per model. The winner on the target language is the one you ship, regardless of where it ranks on English-only leaderboards.

Lexical retrieval (BM25, hybrid) deserves a second look. Languages with rich morphology (Hungarian, Turkish, Russian) interact badly with naive tokenization; a language-aware analyzer (stemming, lemmatization, compound splitting) can recover several points of Recall@K ^[2].
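To make the analyzer effect concrete, here is a minimal sketch of BM25 with a pluggable analyzer. The suffix list in `toy_stemming_analyzer` is a hypothetical stand-in written for this example and is not a real Turkish stemmer; a production system would use the language analyzer that ships with its search engine. The three documents are also invented for illustration.

```python
import math
from collections import Counter

def whitespace_analyzer(text):
    # Naive baseline: lowercase and split on whitespace.
    return text.lower().split()

# Hypothetical toy suffix list, longest first. Illustration only.
TOY_SUFFIXES = ("lerde", "larda", "ler", "lar", "de", "da")

def toy_stemming_analyzer(text):
    tokens = []
    for tok in text.lower().split():
        for suf in TOY_SUFFIXES:
            # Strip at most one suffix and keep a stem of at least 2 characters.
            if tok.endswith(suf) and len(tok) - len(suf) >= 2:
                tok = tok[: -len(suf)]
                break
        tokens.append(tok)
    return tokens

def bm25_scores(query, docs, analyzer, k1=1.5, b=0.75):
    tokenized = [analyzer(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        s = 0.0
        for term in analyzer(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            s += idf * tf[term] * (k1 + 1) / norm
        scores.append(s)
    return scores

docs = ["evlerde ışık var", "araba yolda", "ev büyük"]
naive = bm25_scores("ev", docs, whitespace_analyzer)
stemmed = bm25_scores("ev", docs, toy_stemming_analyzer)
```

With whitespace tokenization the query "ev" misses the first document because it only contains the inflected form "evlerde"; with the toy analyzer both forms reduce to the same term and the document becomes retrievable.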

Generation: faithfulness and answer relevance

Faithfulness in a non-English language is the same atomic-fact check, run in the target language. The pitfall is using an LLM judge that translates the output to English before scoring. The translation step adds noise, and the resulting score reflects translation quality as much as faithfulness. Pin the judge prompt to the target language and check, on a labeled subset, that the judge agrees with a native-speaker reviewer on more than 90 percent of items ^[3]. Calibration is per-language: a judge hitting 92 percent on French may sit at 78 percent on Indonesian.
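A minimal sketch of the per-language calibration check, assuming each labeled item records a boolean faithfulness verdict from the judge and from a native-speaker reviewer. The record fields (`lang`, `judge`, `human`) and the sample data are hypothetical names chosen for this example.

```python
from collections import defaultdict

def agreement_by_language(records, threshold=0.90):
    """records: iterable of dicts with 'lang', 'judge', 'human' (bool verdicts)."""
    totals = defaultdict(int)
    matches = defaultdict(int)
    for r in records:
        totals[r["lang"]] += 1
        matches[r["lang"]] += int(r["judge"] == r["human"])
    report = {}
    for lang, n in totals.items():
        rate = matches[lang] / n
        # "Calibrated" here means raw agreement strictly above the threshold.
        report[lang] = {"n": n, "agreement": rate, "calibrated": rate > threshold}
    return report

# Invented sample: 20 labeled items per language.
sample = (
    [{"lang": "fr", "judge": True, "human": True}] * 19
    + [{"lang": "fr", "judge": True, "human": False}] * 1
    + [{"lang": "id", "judge": True, "human": True}] * 15
    + [{"lang": "id", "judge": False, "human": True}] * 5
)
report = agreement_by_language(sample)
```

Raw agreement is the simplest statistic; when verdicts are heavily skewed toward one class, a chance-corrected measure may be a better fit.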

Answer relevance has a related failure mode: the model produces an answer in a different language than the question. Track the per-query language match. A few-percent rate of off-language responses is often invisible to an English-speaking dev team and a major complaint in the target market.
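One rough way to track language match without extra dependencies is to compare the dominant Unicode script of the question and the answer. This is only a script-level proxy and an assumption of this sketch: it separates Hindi from English, but it cannot tell Vietnamese from English because both use Latin script, so a trained language identifier should replace `dominant_script` for such pairs.

```python
import unicodedata
from collections import Counter

def dominant_script(text):
    """Most common script among alphabetic characters, e.g. 'LATIN' or 'DEVANAGARI'."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # Unicode character names start with the script name.
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None

def language_match_rate(pairs):
    """pairs: list of (question, answer) strings."""
    scored = [(dominant_script(q), dominant_script(a)) for q, a in pairs]
    # Skip questions with no alphabetic characters; they carry no script signal.
    scored = [(q, a) for q, a in scored if q is not None]
    if not scored:
        return 0.0
    return sum(q == a for q, a in scored) / len(scored)

pairs = [
    ("भारत की राजधानी क्या है?", "भारत की राजधानी नई दिल्ली है।"),
    ("भारत की राजधानी क्या है?", "The capital of India is New Delhi."),
    ("What is the capital of India?", "New Delhi."),
]
rate = language_match_rate(pairs)
```

The second pair is the off-language case: a Hindi question answered in English.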

Tokenization and cost

A Hindi sentence may cost three or four times as many tokens as the equivalent English sentence on most tokenizers. At that ratio, a 500-token chunk that holds 150 useful English concepts holds only about 40 to 50 in Hindi. Retune chunking per language, often by counting in characters or approximate concepts rather than tokens. Report per-language costs in the scorecard: a retrieval architecture that looks affordable on English benchmarks may be three times the cost in the target language.
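A sketch of character-based chunk sizing, assuming you can pass in a `count_tokens` callable that wraps your real tokenizer. The `byte_proxy_tokens` function below is a crude stand-in invented for this example (one token per four UTF-8 bytes) so the snippet runs without dependencies; it is not a real tokenizer, and the budgets it produces are illustrative only.

```python
import math

def char_budget(sample_text, count_tokens, token_budget=500):
    """Characters that fit in token_budget, estimated from a language sample."""
    if not sample_text:
        raise ValueError("sample_text must be non-empty")
    tokens_per_char = count_tokens(sample_text) / len(sample_text)
    return int(token_budget / tokens_per_char)

def chunk_by_chars(text, max_chars):
    """Greedy whitespace-aligned chunks of at most max_chars characters.

    A single word longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

def byte_proxy_tokens(s):
    # Stand-in for a real tokenizer: assumes roughly 4 UTF-8 bytes per token.
    return math.ceil(len(s.encode("utf-8")) / 4)

en_budget = char_budget("The quick brown fox jumps over the lazy dog.", byte_proxy_tokens)
hi_budget = char_budget("तेज़ भूरी लोमड़ी आलसी कुत्ते के ऊपर कूदती है।", byte_proxy_tokens)
pieces = chunk_by_chars("a bb ccc dddd", 5)
```

Under this proxy the Hindi character budget comes out well below the English one for the same token budget, which is the effect to measure with the production tokenizer before fixing chunk sizes.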

Translation as a layer

For some products the right answer is to translate user queries to English, retrieve against an English corpus, and translate the answer back. WMT findings track machine translation quality per language pair; high-resource pairs are largely solved, many low-resource pairs are not ^[4]. The trap is silent: a 5 percent translation error rate per direction compounds to an error floor of roughly 10 percent end to end, before any retrieval or generation error. Evaluate translation as its own layer.
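The compounding arithmetic can be checked directly. This sketch assumes the two translation directions fail independently, which is a simplification; `compounded_error` is a helper name introduced here for illustration.

```python
def compounded_error(*stage_error_rates):
    """Probability that at least one stage fails, assuming independent errors."""
    ok = 1.0
    for e in stage_error_rates:
        ok *= 1.0 - e
    return 1.0 - ok

# Query translation into English, then answer translation back out.
floor = compounded_error(0.05, 0.05)
```

Under independence the floor is 1 - 0.95 × 0.95 = 0.0975, which is where the figure of about 10 percent comes from. Any retrieval or generation error rate can be passed as a further argument to see how the stages compound.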

CAUTION

Translating the eval set to English to reuse English judges produces misleading numbers. The eval has to run in the user's language, with judges calibrated in that language, or scores will not predict user-visible quality.

The minimum non-English scorecard has six numbers per language: Recall@K, faithfulness, answer relevance, language-match rate, per-query cost, and judge-against-human agreement. All six get reported per release alongside the English numbers. A regression in any one is a release blocker ^[5].
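One way to enforce the release-blocker rule is a small gate that compares the six numbers per language against the previous release. The metric keys, the `tolerance` parameter, and the sample values below are assumptions of this sketch, not a prescribed schema; in practice the tolerance would be set from the measured noise of each metric.

```python
# True means higher is better; cost is the one metric where lower is better.
METRICS_HIGHER_IS_BETTER = {
    "recall_at_k": True,
    "faithfulness": True,
    "answer_relevance": True,
    "language_match_rate": True,
    "cost_per_query": False,
    "judge_human_agreement": True,
}

def find_regressions(baseline, candidate, tolerance=0.0):
    """Both args: {lang: {metric: value}}. Returns [(lang, metric, old, new)]."""
    regressions = []
    for lang, old_metrics in baseline.items():
        new_metrics = candidate.get(lang, {})
        for metric, higher_is_better in METRICS_HIGHER_IS_BETTER.items():
            old = old_metrics[metric]
            new = new_metrics.get(metric)
            if new is None:
                # A metric that stopped being reported counts as a blocker.
                regressions.append((lang, metric, old, None))
                continue
            delta = new - old if higher_is_better else old - new
            if delta < -tolerance:
                regressions.append((lang, metric, old, new))
    return regressions

# Invented numbers for a single language.
baseline = {"hi": {"recall_at_k": 0.82, "faithfulness": 0.91, "answer_relevance": 0.88,
                   "language_match_rate": 0.99, "cost_per_query": 0.012,
                   "judge_human_agreement": 0.93}}
candidate = {"hi": {**baseline["hi"], "faithfulness": 0.89}}
blockers = find_regressions(baseline, candidate)
```

An empty list means the release passes; here the drop in Hindi faithfulness is the only blocker.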

When the system moves from answering questions to taking actions, the scorecard has to cover whole trajectories rather than single responses; that is the territory of Agentic Evals.