Native multimodal

Inputs cross modalities: image-plus-text Q&A, voice to text to image, screen plus instruction. The eval scores task accuracy plus cross-modal grounding, so single-modality fallback cannot pass unnoticed.

What "good" looks like

A native multimodal eval scores two things: task accuracy on interleaved benchmarks, and whether the model actually used information from each modality. The second axis is the one aggregate accuracy hides. A model can score well on a benchmark and still ignore image layout in your product by falling back to OCR plus text-only reasoning.

The targets to track:

Accuracy on MMMU and its higher-difficulty sibling MMMU-Pro. MMMU-Pro removes text-only-solvable items, and leading models scored 16.8% to 26.9% lower on it than on MMMU^[1].
Cross-modal grounding fidelity: for each output, is the model using information from each modality, or has it fallen back to OCR plus text-only reasoning?

Why grounding matters. Scoring well on MMMU while ignoring image layout in your real product is a failure mode, and the eval has to catch it.

The composite. Benchmark accuracy plus a domain-specific LLM-as-a-judge (LaaJ) check that asks, for each output, "which inputs informed this answer?" If the rubric can be satisfied without the image, the question is poorly designed. If the model's answer ignores the image, the model is leaking: it has fallen back to a single modality.

flowchart TD
    IN["Interleaved input: image + text, audio, screen"] --> ACC["Benchmark accuracy: MMMU, MMMU-Pro"]
    IN --> LAAJ["Grounding LaaJ: which inputs informed this answer?"]
    ACC --> GATE{"Accurate AND every modality used?"}
    LAAJ --> GATE
    LAAJ --> LEAK["No image use?"]
    LEAK -->|"model ignores image"| FIX["Model leaking: OCR + text-only fallback"]
    LEAK -->|"rubric passes without image"| DESIGN["Question poorly designed: rewrite"]
    GATE -->|yes| SHIP["Ship"]
    GATE -->|no| FIX
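
The decision logic in the flowchart can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: `JudgedExample`, `Verdict`, and the field names are hypothetical, and it assumes the grounding judge has already reported which modalities informed the answer and whether the rubric passes with the image withheld.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    SHIP = "ship"
    FIX_MODEL = "model leaking or inaccurate"
    REWRITE_QUESTION = "question poorly designed: rewrite"


@dataclass(frozen=True)
class JudgedExample:
    correct: bool                      # task-accuracy result
    modalities_provided: frozenset     # e.g. frozenset({"image", "text"})
    modalities_used: frozenset         # judge's answer to "which inputs informed this answer?"
    rubric_passes_without_image: bool  # judge result with the image withheld


def gate(ex: JudgedExample) -> Verdict:
    # A question that can be answered without the image tests nothing cross-modal.
    if "image" in ex.modalities_provided and ex.rubric_passes_without_image:
        return Verdict.REWRITE_QUESTION
    # Ship only when the answer is accurate AND every provided modality was used.
    if ex.correct and ex.modalities_provided <= ex.modalities_used:
        return Verdict.SHIP
    return Verdict.FIX_MODEL
```

Checking question design first keeps a badly designed item from being counted against the model.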

Secondary metrics

Metric	What it catches	How to score
Modality leakage	Model reads OCR text but ignores layout, charts, or spatial relations	LaaJ check on which modality informed the output
Hallucinated objects or text-in-image	Model "sees" something not present	LaaJ against the actual image contents
Refusal-on-image false positives	Model refuses a benign image because it triggered a safety filter	Refusal rate on a benign-image set
Per-skill accuracy	Uneven capability the aggregate hides	Benchmark splits; MM-Vet defines 6 core capabilities and 16 integrations of them, and the breakdown is more useful than the aggregate
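
A per-skill breakdown is cheap to compute once each example carries capability tags. The sketch below assumes results arrive as `(skills, correct)` pairs; the tag names `ocr` and `spatial` are illustrative, not taken from any benchmark's schema.

```python
from collections import defaultdict


def per_skill_accuracy(results):
    """results: iterable of (skills, correct); skills is a list of capability tags."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for skills, correct in results:
        for skill in skills:
            totals[skill] += 1
            hits[skill] += int(correct)
    return {skill: hits[skill] / totals[skill] for skill in totals}


# Hypothetical results: strong on OCR, weak on spatial reasoning.
results = [
    (["ocr"], True),
    (["ocr"], True),
    (["ocr", "spatial"], True),
    (["spatial"], False),
]
aggregate = sum(correct for _, correct in results) / len(results)
by_skill = per_skill_accuracy(results)
```

Here the aggregate is 0.75 while `spatial` sits at 0.5, which is the gap the aggregate hides.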

Offline eval design

Start with public breadth. Use MMMU, MathVista, and VQAv2 for coverage, and MMMU-Pro when you need a harder discriminator^[1]. MMBench's CircularEval asks each multiple-choice question once per rotation of the options and counts it correct only if every rotation is answered correctly; this reduces position bias and is worth running where available^[2].
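
The idea behind CircularEval can be sketched as follows. This is an illustration of the technique, not MMBench's own harness: `ask` is a hypothetical callable that takes a question and an option list and returns the index the model chose.

```python
def circular_eval(question, options, answer_index, ask):
    """Return True only if the model picks the right option under every rotation."""
    n = len(options)
    for shift in range(n):
        # Rotate the options left by `shift` positions.
        rotated = list(options[shift:]) + list(options[:shift])
        # The correct option moves to this index after the rotation.
        expected = (answer_index - shift) % n
        if ask(question, rotated) != expected:
            return False
    return True


options = ["Paris", "Rome", "Oslo", "Bern"]


def knows_answer(question, opts):
    return opts.index("Paris")  # tracks the content, wherever it lands


def always_first(question, opts):
    return 0  # pure position bias


passes = circular_eval("Capital of France?", options, 0, knows_answer)
fails = circular_eval("Capital of France?", options, 0, always_first)
```

A model that always picks the first option gets the unrotated question right but fails as soon as the options shift.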

Then build the domain set. On top of the public sets, build a domain-specific interleaved set: 100 paired examples per modality combination you ship (screenshot to action, audio to JSON, chart to caption). Score with a LaaJ rubric that explicitly asks which modality informed each output. This catches single-modality fallback that aggregate accuracy hides.
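
A coverage check keeps the domain set honest about the 100-per-combination target. This sketch assumes each example is a dict with a `combo` field; the field name and combination labels are hypothetical.

```python
from collections import Counter

MIN_PER_COMBO = 100  # target from the text above


def coverage_gaps(examples, shipped_combos, minimum=MIN_PER_COMBO):
    """Return {combo: number of examples still missing} for under-covered combos."""
    counts = Counter(ex["combo"] for ex in examples)
    return {
        combo: minimum - counts[combo]
        for combo in shipped_combos
        if counts[combo] < minimum
    }


examples = (
    [{"combo": "screenshot->action"}] * 100
    + [{"combo": "audio->json"}] * 40
)
shipped = ["screenshot->action", "audio->json", "chart->caption"]
gaps = coverage_gaps(examples, shipped)
```

With this data, `gaps` reports 60 missing examples for `audio->json` and 100 for `chart->caption`.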

Add an adversarial probe. Vibe-Eval from Reka is a useful small-N adversarial set for high-difficulty image reasoning^[3].

Online eval design

Sample by modality combination, not by query in aggregate. The failure rates differ across (text+image), (text+audio), and (image-only) inputs.
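
Stratified sampling by modality combination might look like the sketch below. It assumes each logged request carries a `modalities` list; that field name and the fixed per-combination quota are assumptions for illustration.

```python
import random
from collections import defaultdict


def sample_by_combo(traffic, per_combo, seed=0):
    """Draw up to `per_combo` requests from each modality combination."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    buckets = defaultdict(list)
    for request in traffic:
        key = tuple(sorted(request["modalities"]))
        buckets[key].append(request)
    return {
        key: rng.sample(items, min(per_combo, len(items)))
        for key, items in buckets.items()
    }


traffic = (
    [{"id": i, "modalities": ["text", "image"]} for i in range(1000)]
    + [{"id": 1000 + i, "modalities": ["image"]} for i in range(10)]
)
sample = sample_by_combo(traffic, per_combo=5)
```

Uniform sampling over this traffic would draw about one image-only request per hundred; the stratified sample draws five from each combination.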

Run a grounding LaaJ on samples with the prompt: "did the response use information from each input modality?"
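
One way to make that prompt machine-gradable is to ask the judge for a JSON object with one boolean per modality. The helper names and the JSON shape below are assumptions, and the call to the judge model is deliberately left out.

```python
import json


def build_grounding_prompt(modalities, response):
    """Assemble the judge prompt; the wording and JSON shape are illustrative."""
    keys = ", ".join(f'"{m}"' for m in modalities)
    return (
        "Did the response use information from each input modality?\n"
        f"Reply with a JSON object whose keys are {keys} "
        "and whose values are true or false.\n\n"
        f"Response to grade:\n{response}"
    )


def parse_grounding_verdict(raw, modalities):
    """Return (per-modality usage, whether every modality was used)."""
    data = json.loads(raw)
    # A missing or non-boolean value counts as "not used", so the check fails closed.
    used = {m: data.get(m) is True for m in modalities}
    return used, all(used.values())
```

Failing closed on a malformed judge reply means a sloppy judge shows up as a grounding failure to investigate instead of a silent pass.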

Track refusal rate stratified by modality. Many vision safety filters trigger on benign content.

Alert on two signatures:

Per-combination accuracy drops.
The OCR-fallback signature: response cites text from the image but ignores diagrams or layout.
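
Both signatures reduce to simple checks over judged samples. In this sketch the 0.05 drop threshold and the `used_image_text` / `used_image_layout` fields are assumptions; tune the threshold and take the fields from whatever your grounding judge emits.

```python
def accuracy_drop_alerts(baseline, current, max_drop=0.05):
    """Return combinations whose accuracy fell by more than `max_drop`."""
    return sorted(
        combo
        for combo in baseline
        if combo in current and baseline[combo] - current[combo] > max_drop
    )


def ocr_fallback_rate(verdicts):
    """Share of samples that cite image text but ignore diagrams or layout."""
    if not verdicts:
        return 0.0
    flagged = [
        v for v in verdicts
        if v["used_image_text"] and not v["used_image_layout"]
    ]
    return len(flagged) / len(verdicts)


baseline = {"text+image": 0.80, "text+audio": 0.90}
current = {"text+image": 0.70, "text+audio": 0.88}
alerts = accuracy_drop_alerts(baseline, current)
```

With these numbers only `text+image` alerts: it fell ten points, while `text+audio` fell two.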

Top benchmarks

MMMU, https://arxiv.org/abs/2311.16502. 11,500 college-level questions across 30 subjects; the standard reference for college-level multimodal reasoning.
MMMU-Pro, https://arxiv.org/abs/2409.02813. Text-only-solvable items removed; the harder discriminator.
BLINK, https://arxiv.org/abs/2404.12390. 3,807 multiple-choice questions drawn from 14 classical computer-vision tasks; humans scored 95.7% and GPT-4V 51.26% in the original paper^[4].

Tools especially suited

lmms-eval: cross-model benchmark runs at scale.
MMBench harness: CircularEval with rotating multiple-choice options.
Inspect AI: sandboxed reproducible runs with image logging.
Anthropic Console Evaluate: side-by-side prompt iteration on a fixed image set.
Hugging Face leaderboards: model selection at the macro level, but the domain set is what matters for the decision.

Common pitfalls

Pitfall	Why it bites	Fix
Single-modality fallback	The most common failure mode and the hardest to catch with aggregate accuracy	Run the "which modality informed this answer?" LaaJ check on every output
No fine-grained skill breakdown	An aggregate hides uneven capability	Report MM-Vet's 6 capability splits and MMMU's per-subject scores alongside the aggregate
Reporting on saturated benchmarks	VQAv2 is largely saturated and MMMU is becoming so for leading models	Report MMMU-Pro and BLINK, where current discrimination happens
Skipping the domain set	Public benchmarks correlate weakly with whether your product works on your customer's screenshots	Pair every public number with a domain golden set^[5]

Canonical reading

MMMU is the most useful single read because it remains the standard reference for college-level multimodal capability and its construction is the template most domain extensions follow^[6].

Read MMMU-Pro alongside it for the harder follow-up that addressed many of the concerns about text-only shortcuts in the original^[1].

BLINK is the right second read for evaluating visual perception specifically, separate from visual question answering^[4].

Day-1 starter

100 paired examples per modality combination you ship in production. The domain set is non-negotiable.
A LaaJ rubric with an explicit "which inputs informed this answer?" field, run in your CI.
Track refusal rate stratified by modality. Many over-refusals are modality-specific and only visible when you split.

Cross-cutting links

Dimensional coverage: features, scenarios, personas: structuring the per-modality-combination test matrix.
Dimensional sampling: sampling production traffic by modality combination instead of in aggregate.
Rubric design: writing the "which inputs informed this answer?" rubric so it grades reliably.
Inspect AI: multimodal inputs in a reproducible harness.
LLM-judge calibration loop: calibrating the grounding judge before trusting it.