What "good" looks like
The "right" metric for native multimodal models is task accuracy on interleaved benchmarks plus cross-modal grounding fidelity. The first part is measured by accuracy on MMMU and its higher-difficulty sibling MMMU-Pro, where text-only-solvable items were removed and leading models scored 16.8% to 26.9% lower than on MMMU[1]. The second part is whether the model actually uses information from each modality in its output, or whether it has fallen back to OCR plus text-only reasoning. A model that scores well on MMMU but ignores image layout in your real product has failed, and the eval has to catch it.
The composite picture is benchmark accuracy plus a domain-specific LLM-as-a-judge (LaaJ) check that asks, for each output, "which inputs informed this answer?" If the rubric can be satisfied without the image, the question is poorly designed; if the rubric needs the image and the model's answer ignores it, the model is leaking (falling back to a single modality).
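The two-way diagnosis above can be sketched as a small decision function. This is a minimal illustration, not a prescribed implementation: the `JudgeRecord` fields and the `Verdict` names are hypothetical, and it assumes the judge has already reported whether the rubric is satisfiable without the image and which modalities the answer drew on.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    OK = "ok"
    QUESTION_POORLY_DESIGNED = "rubric_satisfiable_without_image"
    MODEL_IGNORED_IMAGE = "model_ignored_image"


@dataclass(frozen=True)
class JudgeRecord:
    # Hypothetical fields, filled in by the LLM judge for one model output.
    rubric_satisfiable_without_image: bool
    modalities_used: frozenset  # e.g. frozenset({"text", "image"})


def grounding_verdict(record: JudgeRecord) -> Verdict:
    """Classify one judged output following the two checks in the text."""
    # If the rubric never needed the image, the item cannot test grounding.
    if record.rubric_satisfiable_without_image:
        return Verdict.QUESTION_POORLY_DESIGNED
    # The rubric needs the image, but the answer did not draw on it.
    if "image" not in record.modalities_used:
        return Verdict.MODEL_IGNORED_IMAGE
    return Verdict.OK
```

Checking the question before the model keeps badly designed items from being counted as model failures.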
Secondary metrics
- Modality leakage: the model reads OCR text in the image but ignores layout, charts, or spatial relations.
- Hallucinated objects or text-in-image: the model "sees" something not present.
- Refusal-on-image false positives: the model refuses a benign image because it triggered a safety filter.
- Per-skill accuracy where the benchmark supports it. MM-Vet defines 6 core capabilities and 16 integrations of them, and the breakdown is more useful than the aggregate.
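A per-capability breakdown can be computed from tagged results in a few lines. The sketch below assumes each result carries a set of capability tags and a score in [0, 1]. The tag names are illustrative, and the function is not MM-Vet's own scoring code.

```python
from collections import defaultdict


def per_capability_score(results):
    """Average score per capability tag.

    `results` is an iterable of (capabilities, score) pairs, where
    `capabilities` is a set of tags and `score` is in [0, 1]. An item
    that needs several capabilities counts toward each of them.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for capabilities, score in results:
        for cap in capabilities:
            totals[cap] += score
            counts[cap] += 1
    return {cap: totals[cap] / counts[cap] for cap in counts}


# Illustrative data: the model handles OCR but fails anything spatial.
results = [
    ({"ocr"}, 1.0),
    ({"ocr", "spatial"}, 0.0),
    ({"spatial"}, 0.0),
]
breakdown = per_capability_score(results)
```

On this toy input the aggregate score is one in three, which hides that every spatial item failed. The breakdown shows it directly.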
Offline eval design
Run MMMU, MathVista, and VQAv2 for breadth, and add MMMU-Pro when you need a harder discriminator[1]. MMBench's CircularEval (rotating multiple-choice options) reduces position bias and is worth running where available[2]. On top of the public sets, build a domain-specific interleaved set: 100 paired examples per modality combination you ship (screenshot to action, audio to JSON, chart to caption). Score with a LaaJ rubric that explicitly asks which modality informed each output; this catches single-modality fallback that aggregate accuracy hides. Vibe-Eval from Reka is a useful small-N adversarial set for high-difficulty image reasoning[3].
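The option-rotation idea behind CircularEval can be sketched as follows. This is a simplified illustration, not the MMBench harness. The `predict` callable is a hypothetical stand-in for a model call that returns the index of the chosen option, and an item counts as correct only if the model picks the right option under every rotation.

```python
from typing import Callable, Sequence


def circular_eval(
    question: str,
    options: Sequence[str],
    answer_index: int,
    predict: Callable[[str, Sequence[str]], int],
) -> bool:
    """Return True only if `predict` is right under every rotation."""
    n = len(options)
    for shift in range(n):
        # Rotate left by `shift`: rotated[i] == options[(i + shift) % n].
        rotated = list(options[shift:]) + list(options[:shift])
        # The correct option therefore moves to this position.
        expected = (answer_index - shift) % n
        if predict(question, rotated) != expected:
            return False
    return True


# Hypothetical predictors: one position-biased, one content-aware.
def always_first(question, options):
    return 0


def finds_paris(question, options):
    return list(options).index("Paris")


opts = ["Paris", "Rome", "Oslo", "Bern"]
biased_passes = circular_eval("Capital of France?", opts, 0, always_first)
aware_passes = circular_eval("Capital of France?", opts, 0, finds_paris)
```

A predictor that always picks the first option is right on the unrotated item but fails as soon as the answer moves, which is the position bias the rotation is meant to expose.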
Online eval design
Sample by modality combination, not by query in aggregate. The failure rates differ across (text+image), (text+audio), and (image-only) inputs. Run a LaaJ on samples with the prompt: "did the response use information from each input modality?" Track refusal rate stratified by modality; many vision safety filters trigger on benign content. Alert on per-combination accuracy drops and on the OCR-fallback signature (response cites text from the image but ignores diagrams or layout).
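A minimal sketch of per-combination sampling and refusal tracking follows. It assumes production logs are dictionaries with a `modalities` list and a boolean `refused` flag. Both field names are hypothetical, as is the `+`-joined combination key.

```python
import random
from collections import defaultdict


def combo_key(modalities):
    """Order-independent label, e.g. ["text", "image"] -> "image+text"."""
    return "+".join(sorted(modalities))


def group_by_combo(logs):
    groups = defaultdict(list)
    for row in logs:
        groups[combo_key(row["modalities"])].append(row)
    return groups


def stratified_sample(logs, per_combo, seed=0):
    """Draw up to `per_combo` rows from each modality combination."""
    rng = random.Random(seed)  # seeded so the judged sample is reproducible
    return {
        combo: rng.sample(rows, min(per_combo, len(rows)))
        for combo, rows in group_by_combo(logs).items()
    }


def refusal_rate_by_combo(logs):
    """Fraction of refused responses within each modality combination."""
    return {
        combo: sum(row["refused"] for row in rows) / len(rows)
        for combo, rows in group_by_combo(logs).items()
    }


# Illustrative logs, not real traffic.
logs = [
    {"modalities": ["text", "image"], "refused": True},
    {"modalities": ["image", "text"], "refused": False},
    {"modalities": ["text", "image"], "refused": False},
    {"modalities": ["text", "audio"], "refused": False},
    {"modalities": ["text", "audio"], "refused": False},
]
rates = refusal_rate_by_combo(logs)
sample = stratified_sample(logs, per_combo=2)
```

In this toy data the overall refusal rate is one in five, while the text+image stratum refuses one in three. An aggregate view hides that gap.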
Top tools
- lmms-eval for cross-model benchmark runs at scale.
- The MMBench harness for CircularEval.
- Inspect AI for sandboxed reproducible runs with image logging.
- Anthropic Console Evaluate for side-by-side prompt iteration on a fixed image set.
- Hugging Face leaderboards for model selection at the macro level, though the domain set is what matters for the decision.
Common pitfalls
- Single-modality fallback. The most common failure mode and the hardest to catch with aggregate accuracy. The "which modality informed this answer?" check exists for it.
- No fine-grained skill breakdown. MM-Vet's 6 capability splits are more useful than an aggregate, and so are the per-skill scores on MMMU.
- Reporting on saturated benchmarks. VQAv2 is largely saturated and MMMU is becoming so for leading models; the harder MMMU-Pro and BLINK benchmarks are where current discrimination happens.
- Skipping the domain set. Public benchmarks correlate weakly with whether your product works on your customer's screenshots; pair every public number with a domain golden set[5].
Canonical reading
MMMU is the most useful single read because it remains the standard reference for college-level multimodal capability and its construction is the template most domain extensions follow[6]. Read MMMU-Pro alongside it as the harder follow-up that addressed many of the original concerns about text-only-solvable items[1]. BLINK is the right second read for evaluating visual perception specifically, separate from visual question answering[4].
Day-1 starter
- 100 paired examples per modality combination you ship in production. The domain set is non-negotiable.
- A LaaJ rubric with an explicit "which inputs informed this answer?" field; bind it to your CI.
- Track refusal rate stratified by modality. Many over-refusals are modality-specific and only visible when you split.
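One way to bind the starter list to CI is a gate that fails the build when the domain set is too thin or the judged fallback rate is too high. The sketch below is an assumption-laden illustration: the function name, the input dictionaries, and the 5% fallback threshold are all hypothetical. Only the 100-example minimum comes from the list above.

```python
def ci_gate(example_counts, fallback_rates,
            min_examples=100, max_fallback_rate=0.05):
    """Return a list of failure messages; an empty list means the gate passes.

    example_counts: domain-set size per shipped modality combination.
    fallback_rates: share of judged outputs flagged as single-modality
                    fallback, per combination (threshold is an assumption).
    """
    failures = []
    for combo, count in sorted(example_counts.items()):
        if count < min_examples:
            failures.append(
                f"{combo}: {count} domain examples, need {min_examples}"
            )
    for combo, rate in sorted(fallback_rates.items()):
        if rate > max_fallback_rate:
            failures.append(
                f"{combo}: fallback rate {rate:.2f} exceeds {max_fallback_rate:.2f}"
            )
    return failures


# Illustrative numbers only.
failures = ci_gate(
    example_counts={"screenshot->action": 100, "audio->json": 40},
    fallback_rates={"screenshot->action": 0.02, "audio->json": 0.10},
)
```

A CI job could then exit non-zero whenever the list is non-empty and print each message as the reason.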