Twelve papers covering eval methodology and production observability. SPADE is the assertion-mining pattern that ties prompt versions to invariants. RewardBench, MixEval, and WildBench supply the leaderboard and cost-correlation results to read before adopting any meta-benchmark. IFEval, InfoBench, and FollowBench form the instruction-following trilogy, contributing atomic verifiable constraints, the Decomposed Requirements Following Ratio (DRFR), and stacked difficulty, respectively. The 2025 work (the Yehudai survey, IBM's observability paper, and the data-centric position paper) provides the references for the production end of the pipeline. ScienceAgentBench and SWE-bench Multimodal are included here because their findings concern eval methodology as much as agent capability.
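To make the instruction-following ideas mentioned above concrete, here is a minimal sketch of atomic verifiable constraints scored with a DRFR-style ratio (requirements satisfied divided by requirements checked, pooled over all responses). The check functions, their names, and the sample responses are illustrative assumptions, not code or data from IFEval or InfoBench; InfoBench in particular obtains its per-requirement yes/no judgments from annotators or a judge model rather than from string checks like these.

```python
from typing import Callable

# Hypothetical atomic checks (assumed for illustration): each one is
# deterministic and can be verified without a judge model.
def max_words(limit: int) -> Callable[[str], bool]:
    return lambda text: len(text.split()) <= limit

def contains_keyword(word: str) -> Callable[[str], bool]:
    return lambda text: word.lower() in text.lower()

def ends_with(suffix: str) -> Callable[[str], bool]:
    return lambda text: text.rstrip().endswith(suffix)

def requirement_following_ratio(results: list[list[bool]]) -> float:
    """Satisfied requirements / total requirements, pooled over all responses."""
    total = sum(len(row) for row in results)
    if total == 0:
        raise ValueError("no requirements to score")
    satisfied = sum(sum(row) for row in results)
    return satisfied / total

# One instruction decomposed into three requirements (made-up example).
checks = [max_words(10), contains_keyword("latency"), ends_with(".")]

# Two made-up model responses to that instruction.
responses = [
    "Latency dropped after the cache fix.",
    "We saw no change in throughput this week",
]

# One row of pass/fail results per response, one column per requirement.
results = [[check(resp) for check in checks] for resp in responses]
score = requirement_following_ratio(results)
print(results, score)
```

Pooling over requirements rather than averaging per-response pass rates means a response with many requirements weighs more heavily; whether that is desirable depends on how the instructions were decomposed.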