
Agentic eval papers

Trajectory, outcome, and the harder benchmarks that came after AgentBench. WebArena, SWE-bench, GAIA, OSWorld, tau-bench, MLE-bench, AgentDojo, SWE-Lancer, BrowseComp, Cybench.

Twelve papers tracking how agent evaluation has hardened year over year. AgentBench, WebArena, SWE-bench, and GAIA established the four corners: multi-environment, browser, code, and general assistant. AgentBoard reframed scoring as progress rate over subgoals. OSWorld and tau-bench moved the harness onto real desktop environments and multi-turn conversations with simulated users. MLE-bench, SWE-Lancer, and BrowseComp pushed task realism further. AgentDojo and Cybench turned the lens onto security. ScienceAgentBench and SWE-bench Multimodal (in the methodology theme) extend the same instincts to new domains.
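
To make the trajectory-versus-outcome distinction concrete, here is a minimal sketch of a progress-rate style score next to a plain success check. It is a simplified illustration and not AgentBoard's actual implementation: the `State` type, the subgoal predicates, and the function names are all hypothetical, and it assumes each subgoal can be checked against a single state snapshot.

```python
from typing import Callable, Sequence

# Hypothetical: a state is a plain dict snapshot of the environment.
State = dict
Subgoal = Callable[[State], bool]


def progress_rate(trajectory: Sequence[State], subgoals: Sequence[Subgoal]) -> float:
    """Best fraction of subgoals satisfied at any single step of the trajectory."""
    if not subgoals:
        raise ValueError("at least one subgoal is required")
    best = 0.0
    for state in trajectory:
        satisfied = sum(1 for check in subgoals if check(state))
        best = max(best, satisfied / len(subgoals))
    return best


def success(trajectory: Sequence[State], subgoals: Sequence[Subgoal]) -> bool:
    """Outcome-only score: every subgoal must hold in the final state."""
    return bool(trajectory) and all(check(trajectory[-1]) for check in subgoals)


# Illustrative (made-up) shopping task with three subgoals.
demo_subgoals = [
    lambda s: s.get("logged_in", False),
    lambda s: s.get("item_in_cart", False),
    lambda s: s.get("order_placed", False),
]
demo_trajectory = [
    {"logged_in": True},
    {"logged_in": True, "item_in_cart": True},
]

demo_progress = progress_rate(demo_trajectory, demo_subgoals)
demo_success = success(demo_trajectory, demo_subgoals)
```

In this made-up run the agent logs in and fills the cart but never places the order, so the outcome score is a flat failure while the progress score still credits two of the three subgoals. That difference is the kind of signal a trajectory-level metric is meant to surface.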