Contamination detection

N-gram matches, canary strings, perplexity tests, MMLU-CF rephrase patterns, held-out sets. How to tell whether a benchmark score reflects training-set memorization.

Contamination is the case where a model has seen the eval data during training. The headline score becomes a measure of memorization, not capability. The discipline has three layers: prove contamination when you suspect it, mitigate by using held-out or rephrased variants, and report honestly when you cannot prove either way.

This page is the compressed version; the datasets section covers construction, versioning, and contamination at chapter length.

Why contamination matters

A benchmark score is a claim about generalization. Contamination breaks the claim. A model that memorized MMLU during pretraining produces high numbers without telling you anything about novel reasoning ¹. Anthropic states the policy plainly: contamination "undermines comparisons" across organizations.

Known-contaminated benchmarks. MMLU is widely treated as contaminated, and MMLU-Redux's re-annotation work shows that several subsets also contain erroneous items ². HumanEval has near-verbatim solutions on the open web; HumanEval+ adds tests that catch memorized-but-wrong solutions ³. GSM8K and MATH have published solutions, so leaderboard scores on them are likely inflated. LiveCodeBench addresses contamination by stamping each problem with its release date and reporting per-cutoff scores ⁴.

The default position in 2026 is to treat any pre-2024 widely-cited benchmark as contaminated unless the model card explicitly demonstrates otherwise.

Five detection signals

Each has limits; the strongest evidence is concordance across two or more.

N-gram match. Compute 13-gram or 20-gram overlap between the eval set and the training corpus. Direct overlap above a few percent is hard evidence. Limit: most labs do not publish training corpora, so this check is runnable only for open-data models (Pythia, OLMo, some Llama subsets).
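A minimal sketch of the overlap check, assuming whitespace tokenization and a corpus small enough to hold its n-grams in memory. The function names are illustrative; a real corpus would need a streaming index or a Bloom filter.

```python
from typing import Iterable, Set, Tuple

def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """Return the set of n-grams over lowercased whitespace tokens."""
    tokens = text.lower().split()
    # Texts shorter than n tokens yield an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated_fraction(eval_items: Iterable[str],
                          corpus_docs: Iterable[str],
                          n: int = 13) -> float:
    """Fraction of eval items that share at least one n-gram with the corpus."""
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    items = list(eval_items)
    if not items:
        return 0.0
    hits = sum(1 for item in items if ngrams(item, n) & corpus_grams)
    return hits / len(items)
```

The result is the share of eval items with any overlapping n-gram, which is the quantity the "few percent" rule of thumb refers to.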

Canary strings. The benchmark author plants a unique string. If a probe finds the string in the model's outputs under a "complete this sequence" prompt, the model has seen the eval set ¹.

Perplexity tests. Memorized eval items have lower perplexity than novel items from the same distribution. Compute perplexity on the eval and on a held-out reference set with matched distribution; a gap is contamination evidence.
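A sketch of the gap computation, assuming you can already obtain per-token natural-log probabilities for each item from the model under test. How those log-probabilities are fetched is outside this sketch, and the function names are illustrative.

```python
import math
from statistics import mean
from typing import List, Sequence

def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of one item from its per-token natural-log probabilities."""
    return math.exp(-mean(token_logprobs))

def log_perplexity_gap(eval_items: List[Sequence[float]],
                       reference_items: List[Sequence[float]]) -> float:
    """Mean log-perplexity of the reference set minus that of the eval set.

    A positive gap means the eval items look more familiar to the model
    than matched reference items do.
    """
    eval_lp = mean([-mean(lp) for lp in eval_items])
    ref_lp = mean([-mean(lp) for lp in reference_items])
    return ref_lp - eval_lp
```

Working in log-perplexity keeps a handful of very high-perplexity items from dominating the average. The gap is only meaningful when the reference set really does match the eval distribution.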

Output-distribution tests. For multiple-choice items, the model puts more probability mass on the correct option for memorized items than its calibration on comparable unseen items would predict.
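One possible way to operationalize this, as a sketch: count how often the model puts near-total probability on the correct option, and compare that rate between the eval set and a baseline set of comparable items. The 0.99 threshold and the item layout are assumptions of this example, not values from the literature.

```python
from typing import List, Sequence, Tuple

# (option probabilities, index of the correct option); hypothetical layout
Item = Tuple[Sequence[float], int]

def peaked_correct_rate(items: List[Item], threshold: float = 0.99) -> float:
    """Fraction of items where the correct option gets at least `threshold` mass."""
    if not items:
        return 0.0
    peaked = sum(1 for probs, correct in items if probs[correct] >= threshold)
    return peaked / len(items)

def concentration_excess(eval_items: List[Item],
                         baseline_items: List[Item],
                         threshold: float = 0.99) -> float:
    """Peaked-correct rate on the eval set minus the rate on the baseline set."""
    return (peaked_correct_rate(eval_items, threshold)
            - peaked_correct_rate(baseline_items, threshold))
```

A clearly positive excess is a signal worth cross-checking against another test; on its own it can also reflect easier items.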

Held-out variant comparison. Rephrase or freshly collect a variant, then compare scores. A drop of more than 10 points on the variant relative to the headline suggests the original is contaminated, provided the variant is of matched difficulty. MMLU-CF applies this corrective to MMLU, and MMLU-Redux ² re-annotates a subset of the original items.

The MMLU-CF rephrase pattern

The simplest mitigation a team can run.

1. Sample 100 items from the public benchmark.
2. Use a frontier model to rephrase: change wording, reorder, paraphrase the options.
3. Have the principal expert verify each rephrasing preserves the original answer.
4. Score the candidate model on the rephrased subset.
5. Compare to the score on the matched original subset.

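A drop of 10 points or more on the rephrased version suggests the original is contaminated. With only 100 items the sampling error on each score is several points, so treat small drops as noise.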
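Because the original and rephrased subsets are matched item by item, the drop can be reported with a paired bootstrap interval. The sketch below assumes per-item correctness flags in the same order for both subsets; the function name and the 2,000-resample default are illustrative choices.

```python
import random
from typing import List, Tuple

def paired_drop(original: List[bool], rephrased: List[bool],
                n_boot: int = 2000, seed: int = 0) -> Tuple[float, float, float]:
    """Return (drop, ci_low, ci_high) in percentage points.

    drop is original accuracy minus rephrased accuracy; the interval is a
    95% percentile bootstrap over the paired per-item differences.
    """
    if len(original) != len(rephrased) or not original:
        raise ValueError("need two non-empty, equally sized result lists")
    diffs = [int(o) - int(r) for o, r in zip(original, rephrased)]
    n = len(diffs)
    drop = 100.0 * sum(diffs) / n
    rng = random.Random(seed)  # fixed seed so reruns give the same interval
    boots = sorted(100.0 * sum(rng.choices(diffs, k=n)) / n
                   for _ in range(n_boot))
    tail = int(0.025 * n_boot)
    return drop, boots[tail], boots[n_boot - 1 - tail]
```

If the lower bound of the interval sits well above zero, the drop is unlikely to be sampling noise. On 100 items the interval will be wide, which is a reason to sample more items when the result is borderline.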

Canary insertion for private benchmarks

When building an internal benchmark, plant canaries before sharing it anywhere external.

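import hashlib

def make_canary(secret: str, item_id: str) -> str:
    """Derive a per-item canary from a private secret and the item id."""
    h = hashlib.sha256(f"{secret}:{item_id}".encode()).hexdigest()[:16]
    return f"CANARY_{h}_END"

def insert_canaries(items, secret):
    """Return copies of the items with the canary appended to each context."""
    return [{**item, "context": f"{item['context']} ({make_canary(secret, item['id'])})"}
            for item in items]

def probe_for_canary(completion, secret, item_id):
    """True if a model completion reproduces the canary planted in this item."""
    return make_canary(secret, item_id) in completion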

The probe prompts the model with the text that precedes the canary and checks whether the completion contains the planted string. A hit is evidence the model has seen the benchmark; a miss does not prove that it has not.

Held-out sets as the only real defense

Detection is reactive. The proactive defense is a private held-out set the model has never been able to see ⁵. Rules:

The set lives in a private repo. Access is logged. The set is never copied into a notebook, a slack message, a screenshot, or a publicly accessible chat.

Evaluations happen inside a sandboxed runner that returns only aggregate scores, not per-item completions. This prevents accidental exfiltration.

The set is rotated annually. Old items are retired and published; new items are collected and locked.
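The aggregate-only rule can be enforced at the interface of the runner. This is a minimal sketch, assuming items are dicts with hypothetical "prompt" and "answer" fields, exact-match scoring, and a model exposed as a plain callable; a real runner would add isolation and access logging around it.

```python
from typing import Callable, Dict, List

def run_private_eval(items: List[Dict[str, str]],
                     model: Callable[[str], str]) -> Dict[str, float]:
    """Score a model on private items and return aggregate metrics only.

    Prompts, answers, and completions never leave this function.
    """
    correct = 0
    for item in items:
        completion = model(item["prompt"])
        correct += int(completion.strip() == item["answer"].strip())
    n = len(items)
    return {"n_items": float(n), "accuracy": correct / n if n else 0.0}
```

Keeping per-item data out of the return value means a caller cannot paste an item into a notebook or chat by accident.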

What MixEval and LiveCodeBench do

Two recent benchmarks are explicit about contamination. MixEval mixes multiple existing benchmarks, weighting them to match real user query distributions, and uses dynamic question selection ⁶. LiveCodeBench publishes each problem with its release date; a model trained before cutoff X is evaluated only on problems released after X, which rules out training-set exposure by construction as long as the stated cutoff is accurate ⁴. The shared pattern is a structural change to the benchmark that makes contamination hard, not a check applied after the fact.
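The date-stamping idea reduces to a filter. The sketch below illustrates the principle and is not LiveCodeBench's own code; the "release_date" field name is an assumption of this example.

```python
from datetime import date
from typing import Any, Dict, List

def post_cutoff_problems(problems: List[Dict[str, Any]],
                         training_cutoff: date) -> List[Dict[str, Any]]:
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p["release_date"] > training_cutoff]
```

The filter is only as reliable as the cutoff date the model developer reports.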

A practical detection pipeline

For a team reporting a defensible public-benchmark number:

1. Take the public benchmark headline. Note it as the "potentially contaminated baseline."
2. Generate an MMLU-CF-style rephrased subset. Score the model on it.
3. If the rephrased score is within 5 points of the headline, contamination is probably small. Report both.
4. If the rephrased score is between 5 and 10 points below, the result is inconclusive. Report both and check a second detection signal.
5. If the rephrased score is more than 10 points below, the headline is likely contaminated. Report the rephrased number as primary and the headline as baseline-for-comparison only.

Reporting template: "Model X scored 78% on MMLU (potentially contaminated baseline) and 65% on a 100-item rephrased subset constructed in May 2026; we report the rephrased number as the capability signal."
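The decision rule and the reporting line can be sketched in a few lines. The 5-point and 10-point thresholds come from the pipeline above; the band between them is treated as inconclusive here, and the function names are illustrative.

```python
def contamination_verdict(headline: float, rephrased: float) -> str:
    """Classify the gap (in points) between headline and rephrased scores."""
    gap = headline - rephrased
    if gap > 10:
        return "likely contaminated"
    if gap <= 5:
        return "probably small"
    return "inconclusive"  # 5 to 10 points: look for a second signal

def report_line(model: str, benchmark: str,
                headline: float, rephrased: float) -> str:
    """Format a one-sentence report following the template above."""
    verdict = contamination_verdict(headline, rephrased)
    base = (f"{model} scored {headline:.0f}% on {benchmark} "
            f"(potentially contaminated baseline) "
            f"and {rephrased:.0f}% on a rephrased subset")
    if verdict == "likely contaminated":
        return base + "; we report the rephrased number as the capability signal."
    return base + f"; contamination verdict: {verdict}; we report both."
```

Calling `report_line("Model X", "MMLU", 78, 65)` follows the contaminated branch, since the gap is 13 points.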

Saturation and contamination, both at once

Most pre-2024 widely-cited benchmarks are saturated and contaminated. HumanEval, MATH, MMLU, MGSM, BBH, HellaSwag, DROP, and NIAH show 90%+ from frontier models, and most show contamination evidence ³ ². The corrective for both is the same: switch to non-saturated, contamination-resistant alternatives. MMLU-Pro, GPQA Diamond, LiveCodeBench, FrontierMath, and ARC-AGI-2 are the 2026-era replacements. The chapter "Foundations: why LLM evals are hard" covers the saturation question.

What good looks like

A team reporting public-benchmark scores responsibly ships four things: the headline number labeled as potentially contaminated; a rephrased-subset number labeled as the capability signal; a canary-planted private held-out set used for internal release decisions and never published; and a model-card statement on contamination methodology with the date of the rephrased construction. The cookbook at /cookbook/contamination-detection walks through the runnable version.

¹ Jacovi et al., contamination survey, 2023.
² Gema et al., MMLU-Redux, 2024.
³ Liu et al., HumanEval+, 2023.
⁴ Jain et al., LiveCodeBench, 2024.
⁵ Anthropic, "Challenges in Evaluating AI Systems."
⁶ Ni et al., MixEval, 2024.