What "good" looks like
A multi-agent system is evaluated on two axes at once: whether the team hit its milestones, and where the collaboration broke when it did not. The primary metric is a milestone-based task score with partial credit, paired with a failure-mode profile of sampled traces; either one alone is uninterpretable. MultiAgentBench (MARBLE) is the reference for the first axis: it scores explicit sub-goal KPIs alongside collaboration quality, communication efficiency, and protocol compliance, so a team that finishes most of its milestones gets credit for what it finished, not a zero[1]. Two of its findings are worth carrying into your own harness: graph topology performs best in the research scenarios, and cognitive planning improves milestone achievement by about 3%[1]. Topology and planning strategy are eval variables, not fixed architecture decisions.
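A minimal sketch of milestone scoring with partial credit, to make the idea concrete. This is not MARBLE's actual scoring formula; the `Milestone` class, the `weight` field, and the milestone names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Milestone:
    name: str
    achieved: bool
    weight: float = 1.0  # hypothetical weighting; equal weights by default


def milestone_score(milestones: list[Milestone]) -> float:
    """Weighted fraction of milestones achieved, in [0, 1]."""
    total = sum(m.weight for m in milestones)
    if total <= 0:
        raise ValueError("milestones must carry positive total weight")
    return sum(m.weight for m in milestones if m.achieved) / total


# Illustrative run: three of four milestones reached.
run = [
    Milestone("literature_search", True),
    Milestone("draft_proposal", True),
    Milestone("peer_review", True),
    Milestone("final_revision", False),
]
print(milestone_score(run))  # 0.75
```

A binary pass/fail harness would score this run as 0; the milestone score keeps the signal that most of the work was completed.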
This page covers what is specific to N agents sharing a task: credit assignment, communication failure, orchestrator quality, and cost normalization. Pass^k, deterministic state checks, and single-agent error attribution are covered in Long-horizon planning; do not re-derive them here.
Secondary metrics
- Cost-adjusted success against a strong single-agent baseline at equal token budget. Multi-agent systems often match single-agent quality at a multiple of the cost, and quality-only reporting hides it.
- Communication rounds per milestone. A rising round count with flat milestone completion is the signature of agents talking past each other.
- Per-agent error attribution. Which agent introduced the error that sank the run; there is no standard automatic metric for this, so trace annotation carries the load. Agent-as-judge gives step-level verdicts on sampled trajectories at reasonable cost and is the current best tool for scaling that annotation[2].
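The first two metrics can be sketched as simple ratios. This is one possible normalization, not a standard definition: `RunStats`, `success_per_mtok`, and `rounds_per_milestone` are hypothetical names, and the numbers are invented placeholders rather than measured results.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    successes: int          # tasks that passed
    tasks: int              # tasks attempted
    total_tokens: int       # tokens spent across all agents
    comm_rounds: int = 0    # inter-agent message rounds
    milestones_done: int = 0


def success_per_mtok(stats: RunStats) -> float:
    """Successful tasks per million tokens spent (assumed normalization)."""
    return stats.successes / (stats.total_tokens / 1_000_000)


def rounds_per_milestone(stats: RunStats) -> float:
    """Communication rounds spent per completed milestone."""
    if stats.milestones_done == 0:
        return float("inf")  # all talk, no progress
    return stats.comm_rounds / stats.milestones_done


# Placeholder numbers for illustration only.
multi = RunStats(successes=14, tasks=20, total_tokens=4_000_000,
                 comm_rounds=120, milestones_done=60)
single = RunStats(successes=13, tasks=20, total_tokens=1_000_000)

print(success_per_mtok(multi))      # 3.5
print(success_per_mtok(single))     # 13.0
print(rounds_per_milestone(multi))  # 2.0
```

In this made-up example the multi-agent system wins on raw pass rate (14 vs 13 of 20) and loses clearly once cost is in the denominator, which is the pattern quality-only reporting hides.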
Offline eval design
Start from the MAST taxonomy: 14 failure modes in 3 categories (specification and system design, inter-agent misalignment, task verification), built from more than 1,600 annotated traces across 7 frameworks, with human agreement at kappa 0.88[3]. The authors ship an open LLM-as-judge annotator pipeline and the dataset, so you can profile your own traces without building a rubric from scratch[4]. Be honest about credit assignment: no standard automatic metric exists, and MAST-style trace annotation is the current best practice, not a stopgap.
For architecture, evaluate the orchestrator separately from the workers. Magentic-One is the reference design: an Orchestrator maintains a task ledger and a progress ledger over WebSurfer, FileSurfer, Coder, and ComputerTerminal workers, and the system is statistically competitive with strong single-agent systems on GAIA, AssistantBench, and WebArena[5] [6]. Score the orchestrator on task decomposition and re-planning; score workers on step execution. A failed run where the plan was sound but a worker botched a step is a different bug from a sound execution of a bad plan.
Compare topologies (star, chain, tree, graph) per scenario rather than picking one globally; MARBLE's scenario-dependent results say the best wiring changes with the task[1]. And every offline table needs an ablation row: same task, single strong agent, same budget. If that row wins, the multi-agent architecture is the regression.
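A per-scenario comparison with the ablation row included might look like the sketch below. The scenario names, configuration labels, and scores are placeholders invented for illustration, not MARBLE results; `best_config_per_scenario` is a hypothetical helper.

```python
from collections import defaultdict


def best_config_per_scenario(scores: dict[tuple[str, str], float]) -> dict[str, str]:
    """Map each scenario to its highest-scoring configuration."""
    by_scenario: dict[str, dict[str, float]] = defaultdict(dict)
    for (scenario, config), score in scores.items():
        by_scenario[scenario][config] = score
    return {s: max(cfgs, key=cfgs.get) for s, cfgs in by_scenario.items()}


# Placeholder milestone scores at equal token budget (invented values).
scores = {
    ("research", "star"): 0.55,
    ("research", "graph"): 0.62,
    ("research", "single_agent"): 0.58,  # ablation row
    ("coding", "chain"): 0.40,
    ("coding", "graph"): 0.44,
    ("coding", "single_agent"): 0.47,    # ablation row
}

best = best_config_per_scenario(scores)
# Scenarios where the single-agent ablation row wins.
regressions = [s for s, cfg in best.items() if cfg == "single_agent"]
print(best)
print(regressions)
```

Treating `single_agent` as just another configuration keeps the ablation row in every table by construction instead of relying on someone remembering to add it.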
Online eval design
Log full traces with per-agent spans so every message, tool call, and hand-off is attributable to the agent that produced it. Dashboards should slice by agent: per-agent error attribution, token spend per milestone, and communication-round counts per milestone. Alert on retry loops (the same agent pair exchanging near-identical messages) and on cost-per-milestone drift, which is where multi-agent budgets quietly explode. Observability frameworks for stage-attributable scoring of agentic workflows exist and are worth adopting rather than rebuilding[7]. Run periodic MAST annotation over a trace sample using the open annotator; the category tally over time is your system-design health metric[4].
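A retry-loop alert can be approximated with a string-similarity check over consecutive messages sent in the same direction between two agents. The sketch below uses the standard library's `difflib.SequenceMatcher`; the message tuple format, the `threshold` and `min_repeats` defaults, and the agent names are assumptions, not values taken from any framework.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def find_retry_loops(messages, threshold=0.9, min_repeats=3):
    """Flag (sender, receiver) pairs with a streak of near-identical messages.

    messages: ordered list of (sender, receiver, text) tuples (assumed format).
    Returns {(sender, receiver): longest_streak} for streaks >= min_repeats.
    """
    by_pair = defaultdict(list)
    for sender, receiver, text in messages:
        by_pair[(sender, receiver)].append(text)

    flagged = {}
    for pair, texts in by_pair.items():
        streak = longest = 1
        for prev, cur in zip(texts, texts[1:]):
            if SequenceMatcher(None, prev, cur).ratio() >= threshold:
                streak += 1
                longest = max(longest, streak)
            else:
                streak = 1
        if longest >= min_repeats:
            flagged[pair] = longest
    return flagged


# Hypothetical trace excerpt: the planner keeps re-sending the same request.
msgs = [
    ("planner", "coder", "Please fix the failing test in parser.py"),
    ("coder", "planner", "Patched the tokenizer; rerunning tests."),
    ("planner", "coder", "Please fix the failing test in parser.py."),
    ("coder", "planner", "Changed the grammar file instead."),
    ("planner", "coder", "Please fix the failing test in parser.py!"),
]
print(find_retry_loops(msgs))  # {('planner', 'coder'): 3}
```

Character-level similarity is a cheap first filter; paraphrased repeats would slip past it, so an embedding-based comparison is a plausible follow-up if this alert proves too quiet.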
Top benchmarks
- MARBLE for milestone-scored scenarios across topologies.
- The MAST annotator for failure-mode profiling of your own traces; it is the fastest path to a taxonomy-grounded diagnosis[4].
- Langfuse for production trace logging with per-agent spans and annotation queues.
- Inspect AI for harness-level control when you need isolated, repeatable runs of the same scenario.
- Agent-as-judge pipelines for step-level verdicts on samples where deterministic checks do not reach[2].
Common pitfalls
- Reading end-task pass rate as model quality. The MAST result cuts the other way: most failures are system-design and coordination problems, not model problems[3]. Swapping in a stronger model to fix a specification bug is an expensive way to change nothing.
- Trusting the checker agent. Verification weakness is its own MAST category; a wrong answer confidently approved by a weak checker scores as success in a naive harness, so the verifier needs its own eval.
- Comparing runs without isolation. Agent side-effects contaminate reruns; without isolated environments plus repetition, two runs of the same task are not comparable and your deltas are noise.
- Reporting quality without cost. A multi-agent system that matches a single agent at several times the token spend is a worse system; the equal-budget ablation row is the honest comparison.
Canonical reading
MAST (Cemri et al., 2025) is the canonical read because it is the first systematic answer to "why do multi-agent systems fail": 14 failure modes from 1,600+ traces across 7 frameworks, annotated at kappa 0.88, with an open annotator you can run today[3]. Its core finding, that most failures trace to system design and inter-agent coordination rather than model capability, reframes the whole eval problem: profile the organization, not just the employees.
Day-1 starter
- Log 20 full multi-agent traces end to end, with every message and tool call attributed to the agent that produced it.
- Annotate those 20 traces with the MAST taxonomy using the open annotator and tally the three categories; the tally is your first diagnosis[4].
- Run the same 20 tasks through your strongest single agent and compare pass rate and total tokens. If the single agent wins the cost-adjusted comparison, fix the architecture before the prompts.
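The category tally from the second step can be sketched as follows. The three category names come from the MAST taxonomy as described above; the annotation format (a list of `(trace_id, category)` pairs) is an assumption and not the open annotator's actual output schema, and the sample annotations are invented.

```python
from collections import Counter

MAST_CATEGORIES = (
    "specification and system design",
    "inter-agent misalignment",
    "task verification",
)


def tally_categories(annotations):
    """Count failure annotations per MAST category.

    annotations: iterable of (trace_id, category) pairs (assumed format).
    Categories with no failures are reported as 0 rather than omitted.
    """
    counts = Counter({c: 0 for c in MAST_CATEGORIES})
    for trace_id, category in annotations:
        if category not in MAST_CATEGORIES:
            raise ValueError(f"unknown category on {trace_id}: {category!r}")
        counts[category] += 1
    return dict(counts)


# Invented annotations for illustration; a trace may carry several failures.
annotations = [
    ("trace-01", "specification and system design"),
    ("trace-01", "task verification"),
    ("trace-02", "inter-agent misalignment"),
    ("trace-03", "specification and system design"),
]
print(tally_categories(annotations))
```

Rejecting unknown labels instead of silently counting them keeps the tally comparable across annotation runs.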
Cross-cutting links