Public benchmarks contaminate. Any text published before a model's training cutoff has a non-zero chance of sitting in the training corpus, and the score you read on it is then a lower bound on overfitting rather than an upper bound on capability 1. The two practical defenses are date-windowing (use only problems published after the model cutoff) and overlap detection (MinHash or n-gram match against the published mirror of the benchmark). This recipe will walk both for a code-generation eval, using the LiveCodeBench pattern as the reference 2.
Planned outline
- Pick a date-windowing scheme. For code-gen, problems published after the model's training cutoff. For reasoning benchmarks, the equivalent is "problems written after the cutoff" if you can identify authorship dates.
- Build an inventory of cutoffs. A small table of
model -> cutoff_date you maintain; the recipe ships a starter table.
- Filter the test set. Drop problems before the relevant cutoff for each model under evaluation.
- Overlap check via MinHash. For each problem, compute a MinHash of the problem statement and check Jaccard similarity against a hash table of common-crawl-style mirrors of the benchmark.
- Per-model overlap rate. Report (problems before cutoff, problems after cutoff, average overlap with mirror) per model.
- Score with and without contaminated subset. Read the score on the full set; read the score on the date-windowed subset; report both. If they differ materially, the headline is the windowed number 2.
What this will skip in v1
Embedding-similarity contamination detection (more complex; better for a follow-up). Memorization probes (canary string injection during training; only applicable if you train your own models). Contamination on multimodal benchmarks (open problem).
For the background frame, see are we done with MMLU for what contamination plus benchmark noise looks like in practice 3. The Jacovi et al. paper is the canonical reading on detection methodology 1. LiveCodeBench is the most-followed worked example of date-windowing in production 2.