Robustness testing

Format sweeps, order sweeps, temperature sweeps, CheckList invariance and directional tests. How to surface the variance hiding behind a single accuracy number.

A single accuracy number, measured on a single prompt format with a single option order at a single temperature, says nothing about how stable it is. The model that scored 78% might score 60% on a paraphrased prompt, 72% with the options permuted, or 81% with stochastic sampling. Robustness testing surfaces that range so the headline number carries the uncertainty it actually has.

For the chapter-length treatment, including the adversarial side, see safety and red-teaming.

Format sensitivity is the big one

Sclar et al. showed that meaning-preserving format changes can move LLaMA-2-13B accuracy by up to seventy-six points on the same task ¹. The change is not a different prompt; it is the same prompt with a different separator, a different option label format, a different spacing. The model behaves like a different system.

The effect persists at scale. Larger models are less sensitive but can still show ten-to-twenty-point swings. Format performance correlates only weakly across models, so a format that helps one model (say, Llama) can hurt another (say, Mistral).

Anthropic's "Challenges in Evaluating AI Systems" notes the related case: changing (A) (B) (C) to (1) (2) (3) shifts MMLU scores by roughly five points without changing question content ².

FormatSpread's fix: report a range, not a point. Sample several plausible format variants, score each, report min, median, max.

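import statistics

def format_sweep(model_call, prompt_template, variants, examples, grade, n_samples=200):
    """Score one fixed sample of examples under every format variant."""
    # examples (a list) and grade(response, example) are supplied by the caller.
    # The same examples are reused for every variant, so score gaps reflect format alone.
    batch = examples[:n_samples]
    scores = {}
    for name, render in variants.items():
        correct = 0
        for example in batch:
            response = model_call(render(prompt_template, example))
            if grade(response, example):
                correct += 1
        scores[name] = correct / len(batch)
    vals = sorted(scores.values())
    return {"min": vals[0], "median": statistics.median(vals), "max": vals[-1], "per_variant": scores}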

The variant set typically has four to eight members: bulleted versus numbered lists, (A) (B) versus 1. 2. option labels, Markdown headers versus plain text, with and without a system message, and with and without an explicit "Answer:" sentinel.
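A minimal sketch of what such a variant set might look like. It assumes each example is a dict with hypothetical `question` and `options` keys and that the template uses `{question}` and `{options}` placeholders; `make_renderer` and the variant names are illustrative, not taken from FormatSpread.

```python
from typing import Callable, Dict, List

Example = Dict[str, object]  # assumed shape: {"question": str, "options": List[str]}


def make_renderer(labels: List[str], label_fmt: str, sep: str,
                  sentinel: str = "") -> Callable[[str, Example], str]:
    """Build a render(prompt_template, example) function for one format variant."""
    def render(prompt_template: str, example: Example) -> str:
        options = sep.join(
            label_fmt.format(label=label, text=text)
            for label, text in zip(labels, example["options"])
        )
        prompt = prompt_template.format(question=example["question"], options=options)
        return prompt + sentinel
    return render


LETTERS = ["A", "B", "C", "D"]
DIGITS = ["1", "2", "3", "4"]

# Same content in every variant; only labels, separators, and the sentinel change.
variants = {
    "paren_letters": make_renderer(LETTERS, "({label}) {text}", "\n"),
    "dot_digits": make_renderer(DIGITS, "{label}. {text}", "\n"),
    "paren_letters_inline": make_renderer(LETTERS, "({label}) {text}", "  "),
    "paren_letters_answer": make_renderer(LETTERS, "({label}) {text}", "\n", "\nAnswer:"),
}

template = "{question}\n{options}"
example = {"question": "2 + 2 = ?", "options": ["3", "4", "5", "6"]}
rendered = {name: render(template, example) for name, render in variants.items()}
```

Each renderer keeps the question and option text fixed, so any score difference between variants is attributable to format. The grader has to accept each variant's label scheme (a letter in one, a digit in another), otherwise the sweep measures grader brittleness instead.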

Order sensitivity

For multi-option tasks (multiple-choice, pairwise preferences, multiple-document retrieval), order is a second axis of variance. The standard mitigation is to permute and average. For pairwise judging, swap the order and treat the pair as a single observation. For multiple-choice with few enough options (4 options is 24 permutations, tractable), run all permutations and report the per-permutation distribution.
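The two mitigations above can be sketched as follows. The callables `answer_index` and `judge` are hypothetical stand-ins for a model call, and counting a disagreement between the two orders as a tie is one possible convention, assumed here for illustration.

```python
from itertools import permutations
from typing import Callable, Dict, List, Sequence


def order_sweep(answer_index: Callable[[str, Sequence[str]], int],
                question: str, options: Sequence[str], correct: str) -> Dict[str, float]:
    """Ask the same question under every option order; report the share answered correctly."""
    outcomes: List[bool] = []
    for perm in permutations(options):
        picked = answer_index(question, perm)  # index into the permuted option list
        outcomes.append(perm[picked] == correct)
    return {"n_orders": len(outcomes), "accuracy": sum(outcomes) / len(outcomes)}


def swapped_pair_verdict(judge: Callable[[str, str], str], a: str, b: str) -> str:
    """Pairwise judging: run both orders and count the pair as one observation."""
    first_run = judge(a, b)   # judge returns "first" or "second"
    second_run = judge(b, a)
    winner_1 = "a" if first_run == "first" else "b"
    winner_2 = "b" if second_run == "first" else "a"
    # If the two orders disagree, the verdict follows position, not content.
    return winner_1 if winner_1 == winner_2 else "tie"


# Toy stand-ins: a model that always picks the first option,
# and a judge that always prefers whichever answer is shown first.
always_first = lambda question, opts: 0
position_biased_judge = lambda x, y: "first"

result = order_sweep(always_first, "Capital of France?",
                     ["Paris", "Rome", "Oslo", "Bern"], "Paris")
verdict = swapped_pair_verdict(position_biased_judge, "answer A", "answer B")
```

The position-only model is correct in exactly the orders where the right answer comes first, which a single fixed order would report as either 0% or 100% on this item.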

For few-shot prompts, the order of the examples matters substantially. Yan documents the standard pattern: test multiple orderings, or use a chain-of-thought (CoT) reformulation that is less order-sensitive ³. Some reproduction recipes for MMLU-style benchmarks report a majority vote across permutations ⁴.

Temperature sweeps

For deterministic tasks (judging, classification, schema-filling), temperature is usually set to zero and the remaining variance comes from sources such as GPU-level nondeterminism. For creative tasks (brainstorming, summarization with reasoning), temperature is non-zero and the score depends on sampling.

Three patterns are common.

Pass@k. Sample k times; report whether any of the k passed. Standard on code benchmarks. Rises with k, so it measures what the model can do given retries rather than how reliably it does it.

Pass^k. Sample k times; report whether all k passed. The consistency metric. Used on agent benchmarks (tau-bench is canonical). Decays with k, roughly as p^k for a per-sample pass rate p, so it exposes one-shot luck.

Self-consistency at temperature T. Sample n times at temperature T; take the majority vote. Reduces individual-sample variance at n times the inference cost.
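A minimal sketch of self-consistency, assuming a hypothetical `sample(prompt, temperature)` callable that returns one final answer string per call. The scripted toy model below stands in for a real model so the example is deterministic.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, List, Tuple


def self_consistency(sample: Callable[[str, float], str], prompt: str,
                     n: int = 5, temperature: float = 0.7) -> Tuple[str, float]:
    """Sample n answers at the given temperature; return the majority answer and its vote share."""
    answers: List[str] = [sample(prompt, temperature) for _ in range(n)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n


# Toy stand-in for a model: replays a fixed script of answers (hypothetical).
scripted = cycle(["42", "41", "42", "42", "43"])

def toy_model(prompt: str, temperature: float) -> str:
    return next(scripted)


answer, share = self_consistency(toy_model, "What is 6 * 7?", n=5)
```

The vote share is worth logging next to the answer: a majority won by three votes out of five is a weaker signal than one won by five out of five, even though both produce the same final answer.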

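Reporting just pass@1 at a single temperature is the failure mode. A model at 95% pass@1 can fall as low as about 81% pass^4 and about 66% pass^8 (0.95^4 and 0.95^8, the case where every task shares the same per-sample pass rate); the consistency number tells the reader which one matters in production.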
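Both metrics can be estimated from the same n samples per task using the combinatorial estimators commonly used for them. The sketch below assumes n samples per task with c passes; the per-task counts are hypothetical and only illustrate how the two numbers diverge.

```python
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples passes) from n samples with c passes."""
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k samples pass) from n samples with c passes."""
    return comb(c, k) / comb(n, k)


def mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)


# Hypothetical per-task pass counts out of n = 8 samples each.
n = 8
passes_per_task = [8, 8, 7, 6, 4]

for k in (1, 2, 4):
    at_k = mean([pass_at_k(n, c, k) for c in passes_per_task])
    hat_k = mean([pass_hat_k(n, c, k) for c in passes_per_task])
    print(f"k={k}: pass@k={at_k:.3f}  pass^k={hat_k:.3f}")
```

At k = 1 the two estimators coincide (both equal c / n). As k grows, pass@k climbs toward 1 while pass^k shrinks, and the gap between them is the inconsistency a single pass@1 figure hides.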

CheckList: INV, MFT, DIR

Ribeiro et al.'s CheckList is the canonical framework ⁵.

Minimum Functionality Test (MFT). The model must pass a basic test of the capability ("if I ask a sentiment classifier I love this product, does it return positive").

Invariance Test (INV). The model's output should not change under a meaning-preserving input perturbation ("if I swap excellent for great in a positive review, does the classifier still return positive").

Directional Expectation Test (DIR). The model's output should change in a known direction ("if I add not to a positive review, does sentiment flip negative").

Practitioners building CheckLists for commercial sentiment-analysis systems found that extensively tested production systems still had critical failures on simple INV and DIR tests. A passing benchmark score does not imply robustness.

Test type	Property	Example
MFT	minimum capability	"I love this product" is classified positive
INV	invariance under paraphrase	"X is great" and "X is excellent" get the same label
DIR	known-direction change	adding "not" flips polarity

For LLM judges, INV tests check that meaning-preserving rubric edits do not change the verdict. For generation systems, DIR tests check that adding "in formal English" produces a formal output. IFEval is essentially a structured DIR suite for instruction-following ⁶.
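The three test types can be sketched as small harness functions. This is an illustration under stated assumptions, not the CheckList library's API: the `Classifier` type, the function names, and the keyword-based toy classifier are all hypothetical.

```python
from typing import Callable, List, Tuple

Classifier = Callable[[str], str]  # assumed to return "positive" or "negative"


def run_mft(clf: Classifier, cases: List[Tuple[str, str]]) -> float:
    """MFT: fraction of simple labelled cases the classifier gets right."""
    return sum(clf(text) == label for text, label in cases) / len(cases)


def run_inv(clf: Classifier, pairs: List[Tuple[str, str]]) -> float:
    """INV: fraction of meaning-preserving pairs where the prediction is unchanged."""
    return sum(clf(a) == clf(b) for a, b in pairs) / len(pairs)


def run_dir(clf: Classifier, pairs: List[Tuple[str, str]], expected: str) -> float:
    """DIR: fraction of pairs where the perturbed input moves the prediction to `expected`."""
    return sum(clf(a) != expected and clf(b) == expected for a, b in pairs) / len(pairs)


def keyword_classifier(text: str) -> str:
    """Toy classifier (hypothetical): keyword match with no negation handling."""
    return "positive" if any(w in text.lower() for w in ("love", "great")) else "negative"


mft = run_mft(keyword_classifier, [("I love this product", "positive"),
                                   ("I hate this product", "negative")])
inv = run_inv(keyword_classifier, [("The food was great", "The food was excellent")])
dir_ = run_dir(keyword_classifier, [("The food was great", "The food was not great")],
               expected="negative")
```

The toy classifier passes every MFT case yet fails both the INV pair (it does not know "excellent") and the DIR pair (it ignores "not"), which is the pattern the section describes: a clean capability score alongside broken robustness.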

A practical reporting template

For a release-gate claim:

Point estimate at the canonical format, order, temperature.
Wilson 95% interval at that setting.
Format range: min and max across four to eight variants.
Order range: min and max across two to four variants.
Temperature range: pass@1 and pass^k at the production temperature.

The headline triplet most teams ship is (median across formats, min across formats, max across formats).
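A sketch of how the first three lines of the template might be computed, using the standard Wilson score interval for a binomial proportion. The function names, the report layout, and all numbers are hypothetical and for illustration only.

```python
from math import sqrt
from statistics import median
from typing import Dict, Tuple


def wilson_interval(correct: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def release_report(correct: int, n: int, per_format: Dict[str, float]) -> Dict[str, object]:
    """Combine the canonical-setting estimate with the spread across format variants."""
    lo, hi = wilson_interval(correct, n)
    vals = list(per_format.values())
    return {
        "point": correct / n,
        "wilson_95": (round(lo, 3), round(hi, 3)),
        "format_min_median_max": (min(vals), median(vals), max(vals)),
    }


# Hypothetical numbers: 156 of 200 correct at the canonical setting, four format variants.
report = release_report(
    156, 200,
    {"paren_letters": 0.78, "dot_digits": 0.71, "inline": 0.74, "answer_sentinel": 0.80},
)
```

The Wilson interval only captures sampling noise at one fixed setting; the format range captures a different source of variance, which is why the template lists both instead of folding them into a single error bar.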

What not to do

Cherry-picking the best format. A team runs ten formats, reports the best, omits the others. Sclar's data shows this can inflate the headline by twenty points or more ¹. Corrective: pre-register the format set.

Reporting one temperature. A team reports pass@1 at temperature 0.7, mentions consistency informally, never quantifies pass^k. Corrective: publish both.

Skipping INV. A team has a high MFT score and never tests paraphrase tolerance. Corrective: at least three INV tests per major capability, derived from real production paraphrase pairs.

What good looks like

A robustness-aware program pre-registers four-to-eight format variants per task, pre-registers order variants for any multi-option task, reports pass^k at production temperature alongside pass@1, and maintains INV and DIR suites for the top three failure categories from error analysis. The format-sweep code runs in CI on every release candidate; min, median, max are the three numbers on the release-gate dashboard.

¹ Sclar et al., format sensitivity, ICLR 2024.
² Anthropic, "Challenges in Evaluating AI Systems."
³ Eugene Yan, "LLM Evaluators."
⁴ Hendrycks et al., MMLU, ICLR 2021.
⁵ Ribeiro et al., CheckList, ACL 2020.
⁶ Zhou et al., IFEval, 2023.