A single accuracy number, measured with one prompt format, one option order, and one temperature, is a claim without error bars. The model that scored 78% might score 60% on a paraphrased prompt, 72% on a permuted option order, or 81% with stochastic sampling. Robustness testing surfaces the range so the headline number carries the uncertainty it actually has.
Sclar et al. showed that meaning-preserving format changes can move LLaMA-2-13B accuracy by up to seventy-six points on the same task [1]. The change is not a different prompt; it is the same prompt with a different separator, a different option label format, or different spacing. The model behaves like a different system.
The result persists at scale. Larger models are less sensitive but still show ten-to-twenty-point swings. The correlation between models' format-sensitivity profiles is weak, so a format that helps Llama can hurt Mistral.
Anthropic's "Challenges in Evaluating AI Systems" notes the related case: changing (A) (B) (C) to (1) (2) (3) shifts MMLU scores by roughly five points without changing question content [2].
FormatSpread's fix: report a range, not a point. Sample several plausible format variants, score each, report min, median, max.
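```python
import random
from statistics import median

def format_sweep(model_call, prompt_template, variants, examples, grade, n_samples=200, seed=0):
    """Score every format variant on the same sample and summarize the spread."""
    # `examples` is the evaluation pool; `grade(response, example)` returns True when correct.
    sample = random.Random(seed).sample(examples, min(n_samples, len(examples)))
    scores = {}
    for name, render in variants.items():
        correct = 0
        for example in sample:  # same examples for every variant
            response = model_call(render(prompt_template, example))
            if grade(response, example):
                correct += 1
        scores[name] = correct / len(sample)
    vals = sorted(scores.values())
    # statistics.median averages the two middle values when the count is even.
    return {"min": vals[0], "median": median(vals), "max": vals[-1], "per_variant": scores}
```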
The variant set is typically four to eight: bullet-list versus numbered, (A) (B) versus 1. 2., Markdown headers versus plain text, with and without a system message, with and without explicit "Answer:" sentinel.
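A variant set of this kind can be written as a dictionary of render functions. The sketch below is illustrative only: it assumes each example is a dict with `question` and `options` keys and that a render function takes `(prompt_template, example)`, as in `format_sweep`. The variant names, the `_render` helper, and the template placeholders are hypothetical.

```python
import string

def _render(template, example, label_fn, sep="\n"):
    # Build the option block with the chosen label style and separator.
    options = sep.join(
        f"{label_fn(i)} {text}" for i, text in enumerate(example["options"])
    )
    return template.format(question=example["question"], options=options)

# Hypothetical variant set: each entry changes only the surface format, never the content.
VARIANTS = {
    "paren_letters": lambda t, ex: _render(t, ex, lambda i: f"({string.ascii_uppercase[i]})"),
    "paren_numbers": lambda t, ex: _render(t, ex, lambda i: f"({i + 1})"),
    "dotted_numbers": lambda t, ex: _render(t, ex, lambda i: f"{i + 1}."),
    "bracket_letters": lambda t, ex: _render(t, ex, lambda i: f"[{string.ascii_uppercase[i]}]"),
}

template = "Question: {question}\n{options}\nAnswer:"
example = {"question": "2 + 2 = ?", "options": ["3", "4"]}
rendered = {name: render(template, example) for name, render in VARIANTS.items()}
print(rendered["paren_letters"])
```

Keeping every variant behind the same render signature means a new format is one more dictionary entry, and the sweep code does not change.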
Order sensitivity
For multi-option tasks (multiple-choice, pairwise preferences, multiple-document retrieval), order is a second axis of variance. The standard mitigation is to permute and average. For pairwise judging, swap the order and treat the pair as a single observation. For multiple-choice with few enough options (4 options is 24 permutations, tractable), run all permutations and report the per-permutation distribution.
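Both mitigations are short to sketch. The code below assumes two hypothetical callables: `answer_fn(question, ordered_options)`, which returns the index of the chosen option, and `judge_fn(first, second)`, which returns `"first"` or `"second"`. Recording an order-dependent pairwise verdict as a tie is one possible convention, not the only one.

```python
from itertools import permutations

def permutation_accuracy(answer_fn, question, options, correct_option):
    """Fraction of option orderings on which answer_fn picks the correct option."""
    orderings = list(permutations(options))  # 4 distinct options -> 24 orderings
    hits = 0
    for ordered in orderings:
        choice = answer_fn(question, list(ordered))
        hits += ordered[choice] == correct_option
    return hits / len(orderings)

def pairwise_verdict(judge_fn, a, b):
    """Judge a pair in both orders and collapse the result to one observation."""
    forward = a if judge_fn(a, b) == "first" else b
    backward = b if judge_fn(b, a) == "first" else a
    # Agreement across both orders is a win; disagreement is recorded as a tie.
    return forward if forward == backward else "tie"

# Toy stand-ins: a model that always picks slot 0, a judge that always prefers slot one.
always_first = lambda question, ordered: 0
print(permutation_accuracy(always_first, "q", ["w", "x", "y", "z"], "x"))  # 0.25
print(pairwise_verdict(lambda first, second: "first", "A", "B"))  # tie
```

A purely position-biased model scores exactly chance under full permutation, which is the signal the per-permutation distribution is meant to expose.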
For few-shot prompts, example order matters substantially. Yan documents the standard pattern: test multiple orderings or use a CoT reformulation that is less order-sensitive [3]. MMLU-style benchmarks report majority-vote across permutations on canonical reproduction recipes [4].
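Testing multiple few-shot orderings can be sketched as a seeded shuffle. This is a minimal illustration, assuming a hypothetical `score_fn(ordered_shots)` that builds the prompt from the shots in the given order and returns an accuracy; the toy scorer and its numbers are made up.

```python
import random

def fewshot_order_spread(score_fn, shots, n_orderings=4, seed=0):
    """Score the same few-shot examples under several random orderings."""
    rng = random.Random(seed)  # fixed seed so the orderings are reproducible
    scores = []
    for _ in range(n_orderings):
        ordered = shots[:]
        rng.shuffle(ordered)
        scores.append(score_fn(ordered))
    return {"min": min(scores), "max": max(scores), "scores": scores}

# Toy scorer: pretends accuracy is higher when the shot labeled "hard" comes last.
toy_score = lambda ordered: 0.8 if ordered[-1] == "hard" else 0.7
result = fewshot_order_spread(toy_score, ["easy", "medium", "hard"], n_orderings=6)
print(result["min"], result["max"])
```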
Temperature sweeps
For deterministic tasks (judging, classification, schema-filling), temperature is set to zero and the remaining variance comes from GPU-level nondeterminism. For creative tasks (brainstorming, summarization with reasoning), temperature is non-zero and the score depends on sampling.
Three patterns are common.
Pass@k. Sample k times; report whether any of the k passed. Standard on code benchmarks. Rises with k, so it measures what the model can do given retries, not how reliably it does it.
Pass^k. Sample k times; report whether all k passed. The consistency metric. Used on agent benchmarks (tau-bench is canonical). Falls as k grows, roughly as p^k for a per-trial success rate p, so it exposes one-shot luck.
Self-consistency at temperature T. Sample n times at temperature T; take the majority vote. Reduces individual-sample variance at n times the inference cost.
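Reporting just pass@1 at a single temperature is the failure mode. A model at 95% pass@1 falls to about 81% pass^4 and 66% pass^8 if trials are independent with the same success rate on every task (0.95^4 ≈ 0.81, 0.95^8 ≈ 0.66); reporting both tells the reader how the system behaves when every attempt has to succeed.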
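All three patterns can be computed from one batch of n samples per task. The sketch below uses the combinatorial estimator commonly used for pass@k on code benchmarks and the analogous estimator for pass^k; both are per-task values that a benchmark would average over tasks. The function names and the sample counts are illustrative.

```python
from collections import Counter
from math import comb

def pass_at_k(n, c, k):
    """Estimate P(at least one of k samples passes) from n samples with c passes."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any k samples include a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n, c, k):
    """Estimate P(all k samples pass) from n samples with c passes."""
    return comb(c, k) / comb(n, k)

def majority_vote(answers):
    """Self-consistency: return the most common answer among the samples."""
    return Counter(answers).most_common(1)[0][0]

# One task, 8 samples, 6 of which passed.
print(pass_at_k(8, 6, 4))   # 1.0
print(pass_hat_k(8, 6, 4))  # 15/70, about 0.214
print(majority_vote(["B", "A", "B", "C", "B"]))  # B
```

On the same eight samples the two metrics give 1.0 and about 0.21, which is why a report that quotes only one of them hides most of the picture.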
CheckList: MFT, INV, DIR
Ribeiro et al.'s CheckList is the canonical framework for behavioral testing of NLP models [5]. It defines three test types.
Minimum Functionality Test (MFT). The model must pass a basic test of the capability ("if I ask a sentiment classifier 'I love this product', does it return positive?").
Invariance Test (INV). The model's output should not change under a meaning-preserving input perturbation ("if I swap 'excellent' for 'great' in a positive review, does the classifier still return positive?").
Directional Expectation Test (DIR). The model's output should change in a known direction ("if I add 'not' to a positive review, does sentiment flip negative?").
Practitioners building CheckLists for commercial sentiment-analysis systems found extensively-tested production systems still had critical failures on simple INV and DIR tests. A passing benchmark score does not imply robustness.
| Test type | Property | Example |
|---|---|---|
| MFT | minimum capability | "is X positive": positive |
| INV | invariance under paraphrase | "X is great" = "X is excellent" |
| DIR | known-direction change | adding "not" flips polarity |
For LLM judges, INV tests check that meaning-preserving rubric edits do not change the verdict. For generation systems, DIR tests check that adding "in formal English" produces a formal output. IFEval is essentially a structured DIR suite for instruction-following [6].
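The three test types reduce to small assertion loops. The sketch below is not the CheckList library's API; it is a hand-rolled illustration in which `toy_sentiment` is a hypothetical keyword classifier standing in for the system under test, and each runner returns the list of failing cases.

```python
def toy_sentiment(text):
    """Hypothetical stand-in classifier: keyword lookup with naive negation handling."""
    words = text.lower().split()
    positive = any(w in {"great", "excellent", "love"} for w in words)
    if "not" in words:
        positive = not positive
    return "positive" if positive else "negative"

def run_mft(model, cases):
    """MFT: each (input, expected_label) pair must match exactly."""
    return [text for text, expected in cases if model(text) != expected]

def run_inv(model, pairs):
    """INV: the label must be identical for both members of each paraphrase pair."""
    return [(a, b) for a, b in pairs if model(a) != model(b)]

def run_dir(model, pairs, expected_after):
    """DIR: the perturbed input must produce the expected label."""
    return [(a, b) for a, b in pairs if model(b) != expected_after]

failures = {
    "MFT": run_mft(toy_sentiment, [("I love this product", "positive")]),
    "INV": run_inv(toy_sentiment, [("The food is great", "The food is excellent"),
                                   ("The food is great", "The food is superb")]),
    "DIR": run_dir(toy_sentiment, [("The food is great", "The food is not great")], "negative"),
}
print(failures)
```

The toy classifier passes its MFT and DIR cases but fails the second INV pair, because "superb" is outside its vocabulary: a small instance of a system that looks fine on basic tests and breaks under paraphrase.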
A practical reporting template
For a release-gate claim:
- Point estimate at the canonical format, order, temperature.
- Wilson 95% interval at that setting.
- Format range: min and max across four to eight variants.
- Order range: min and max across two to four variants.
- Sampling consistency: pass@1 and pass^k at the production temperature.
The headline triplet most teams ship is (median across formats, min across formats, max across formats).
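The Wilson interval in the template has a closed form, so the whole report fits in a few lines. This is a sketch under stated assumptions: the field names in `release_report` are illustrative, and every number in the usage example is made up.

```python
from math import sqrt

def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a binomial proportion; z=1.96 gives roughly 95%."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def release_report(correct, n, format_scores, order_scores, pass_1, pass_hat_k):
    """Assemble the five reporting items into one record (field names are illustrative)."""
    return {
        "point_estimate": correct / n,
        "wilson_95": wilson_interval(correct, n),
        "format_range": (min(format_scores), max(format_scores)),
        "order_range": (min(order_scores), max(order_scores)),
        "pass_1": pass_1,
        "pass_hat_k": pass_hat_k,
    }

report = release_report(
    correct=156, n=200,
    format_scores=[0.61, 0.72, 0.78, 0.79],
    order_scores=[0.74, 0.78],
    pass_1=0.78, pass_hat_k=0.52,
)
print(report["wilson_95"])  # roughly (0.72, 0.83)
```

In this made-up example the format range (0.61 to 0.79) is wider than the Wilson interval, which is the usual reason to report both: sampling error alone understates the uncertainty.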
What not to do
Cherry-picking the best format. A team runs ten formats, reports the best, omits the others. Sclar's data shows this can inflate the headline by twenty points or more [1]. Corrective: pre-register the format set.
Reporting one temperature. A team reports pass@1 at temperature 0.7, mentions consistency informally, never quantifies pass^k. Corrective: publish both.
Skipping INV. A team has a high MFT score and never tests paraphrase tolerance. Corrective: at least three INV tests per major capability, derived from real production paraphrase pairs.
What good looks like
A robustness-aware program pre-registers four-to-eight format variants per task, pre-registers order variants for any multi-option task, reports pass^k at production temperature alongside pass@1, and maintains INV and DIR suites for the top three failure categories from error analysis. The format-sweep code runs in CI on every release candidate; min, median, max are the three numbers on the release-gate dashboard.
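A CI check over the sweep summary might look like the sketch below. It assumes a summary dict with `min`, `median`, and `max` keys, as returned by a format sweep; `release_gate` and its two thresholds are hypothetical, and each team would set its own values.

```python
def release_gate(sweep, min_floor, max_spread):
    """Return (passed, reasons) for a format-sweep summary."""
    reasons = []
    if sweep["min"] < min_floor:
        reasons.append(f"min {sweep['min']:.2f} below floor {min_floor:.2f}")
    spread = sweep["max"] - sweep["min"]
    if spread > max_spread:
        reasons.append(f"spread {spread:.2f} exceeds {max_spread:.2f}")
    return (not reasons, reasons)

# Hypothetical release candidate: worst-case format falls below the floor.
passed, reasons = release_gate({"min": 0.61, "median": 0.74, "max": 0.79}, 0.65, 0.15)
print(passed, reasons)
```

Gating on the minimum and the spread, not the median alone, keeps a release from passing on the strength of its best-case format.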