Twelve papers that track how agent evaluation has hardened year over year. AgentBench, WebArena, SWE-bench, and GAIA established the four corners: multi-environment, browser, code, and general assistant. AgentBoard reframed scoring as a progress rate over subgoals rather than a single pass/fail outcome. OSWorld moved the harness onto real desktop environments, and tau-bench put a simulated user in the loop. MLE-bench, SWE-Lancer, and BrowseComp pushed task realism further. AgentDojo and Cybench turned the lens onto security. ScienceAgentBench and SWE-bench Multimodal (both listed under the methodology theme) extend the same instincts to new domains.
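
To make the progress-rate idea concrete, here is a minimal sketch of how a subgoal-based score differs from binary success. It is an illustration under simplifying assumptions, not AgentBoard's actual implementation: the function names, the set-based representation of subgoals, and the example task are all hypothetical.

```python
from typing import Sequence, Set


def progress_rate(achieved_by_step: Sequence[Set[str]], subgoals: Set[str]) -> float:
    """Highest fraction of subgoals satisfied at any step of the trajectory.

    Assumption: each step reports the set of subgoals satisfied so far.
    """
    if not subgoals:
        raise ValueError("a task needs at least one subgoal")
    best = 0.0
    for achieved in achieved_by_step:
        # Count only subgoals that belong to this task.
        best = max(best, len(achieved & subgoals) / len(subgoals))
    return best


def success(achieved_by_step: Sequence[Set[str]], subgoals: Set[str]) -> bool:
    """Binary outcome: every subgoal is satisfied at the final step."""
    if not achieved_by_step:
        return False
    return subgoals <= achieved_by_step[-1]


# Hypothetical task with four subgoals; the agent stalls after three.
subgoals = {"open_drawer", "take_key", "unlock_door", "exit_room"}
trajectory = [
    {"open_drawer"},
    {"open_drawer", "take_key"},
    {"open_drawer", "take_key", "unlock_door"},
]

print(progress_rate(trajectory, subgoals))  # 0.75
print(success(trajectory, subgoals))        # False
```

A pass/fail metric scores this run the same as one that never opened the drawer, while the progress rate separates the two.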