AI research engineers and research scientists own the methodology side of evals. Their evals have to survive peer review, defend against contamination, support statistical claims, and contribute to a system card that frontier customers and regulators will read. This track is research-grade: it cites papers heavily, includes paper PDF and GitHub links, ships ablation tables, and assumes ML PhD-equivalent context. The nine steps cover the methodology arc from the current state of the field through publishing your own eval dataset responsibly.
The first step locates the field. Steps two through five are the methodology tools you ship with each paper. The remaining steps cover the public-facing side: system cards, interpretability evidence, frontier capability thresholds, and dataset release.
The track
- Agent eval survey (2025 onward) (16 min). The Yehudai et al. survey paper is the current map of agent eval [1]. Where the field is, where the gaps are, what is solved and what is open.
- Benchmark contamination: detection and mitigation (16 min). Date-windowing, n-gram overlap, MinHash, memorization probes. Jacovi et al. is the canonical methodology reference [2].
- Statistical methods for LLM comparisons (14 min). Wilson intervals, paired differences, clustered standard errors when questions fall into groups, power analysis [3]. The five things Anthropic's error-bars paper asks every eval report to do.
- Red-team and adversarial eval primer (14 min). HarmBench's three-step pipeline (test cases, completions, classifier scoring) [4]. GCG and the automated jailbreak literature. Ganguli et al.'s original red-team paper [5].
- Reproducibility checklist for AI papers (14 min). What every eval result should ship: seeds, configs, scripts, dataset content hashes, model version, prompt revisions. The minimum any peer reviewer should accept.
- Writing a system card (14 min). The frontier-lab system card template. What goes in (capability evals, refusal rates, red-team results); what gets cut (proprietary training details, model size).
- Interpretability tools for eval (14 min). Where mechanistic interpretability meets behavioral eval. What is actually shipped vs what is research-only. The CoT-faithfulness work as a worked example.
- Frontier capability eval thresholds (16 min). Anthropic's Responsible Scaling Policy [6] and OpenAI's Preparedness Framework. Capability eval design at the frontier; the threshold-cliff vocabulary.
- Publishing eval datasets responsibly (17 min). Canary strings to detect future contamination, license framing (research vs commercial), distribution norms (Hugging Face, GitHub, request-gated). The RULER and HarmBench releases are the right templates [7], [4].
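As a taste of the statistics step, here is a minimal sketch of two of the tools it names: a Wilson score interval for a single accuracy number, and a paired-difference interval for comparing two models on the same questions. The function names are hypothetical, and the paired interval uses a plain normal approximation that assumes questions are independent; it does not implement the clustered standard errors the step also covers.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def paired_difference(scores_a: list[float], scores_b: list[float],
                      z: float = 1.96) -> tuple[float, float, float]:
    """Mean per-question difference (A - B) with a normal-approximation interval.

    Assumes both lists score the same questions in the same order and that
    questions are independent (no clustering).
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    se = math.sqrt(var / n)
    return mean, mean - z * se, mean + z * se


if __name__ == "__main__":
    print(wilson_interval(50, 100))
    print(paired_difference([1, 1, 0, 1], [1, 0, 0, 0]))
```

Pairing matters because the per-question differences usually vary less than either model's raw scores, so the same data supports a tighter interval than comparing two independent accuracy numbers.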
What comes after the track
Once you have read the nine pages, the natural next moves are: pick the contamination detection method that matches your benchmark; write the reproducibility appendix for your next paper; draft a system card for an internal capability eval; pick a dataset release date and a license. The agentic-evals chapter and the statistics chapter are the deep-reading entry points.
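For a sense of what the simplest contamination detection method looks like, the following sketch measures word-level n-gram overlap between a benchmark item and a candidate training document. It is illustrative only: the function names, the whitespace tokenizer, and the default `n=8` are assumptions, not settings taken from any of the cited papers.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Set of lowercase, whitespace-tokenized word n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also appear in the document.

    Returns 0.0 when the item is shorter than n tokens.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)


if __name__ == "__main__":
    item = "the quick brown fox jumps over the lazy dog"
    doc = "He said the quick brown fox jumps high"
    print(overlap_fraction(item, doc, n=3))
```

Exact n-gram matching misses paraphrased or reformatted copies, which is one reason to pair it with approximate methods such as MinHash and with memorization probes.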
TIP
Step 5 (reproducibility) is the cheapest credibility win. A paper that ships seeds, configs, and a one-line replication command outranks a paper with stronger claims that does not.
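A minimal sketch of what such a shipped record might look like: a manifest that pins the dataset by content hash and records the seed, model version, and prompt revision next to the config. The field names and the `build_manifest` helper are hypothetical, chosen for illustration rather than taken from any published checklist.

```python
import hashlib
import json


def content_hash(records: list[dict]) -> str:
    """SHA-256 over a canonical JSON serialization of each record, in order."""
    h = hashlib.sha256()
    for record in records:
        # sort_keys makes the hash independent of dict key order
        line = json.dumps(record, sort_keys=True, ensure_ascii=False)
        h.update(line.encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()


def build_manifest(records: list[dict], seed: int, model_version: str,
                   prompt_revision: str, config: dict) -> dict:
    """Collect what a reader needs in order to rerun the eval (hypothetical schema)."""
    return {
        "dataset_sha256": content_hash(records),
        "num_records": len(records),
        "seed": seed,
        "model_version": model_version,
        "prompt_revision": prompt_revision,
        "config": config,
    }


if __name__ == "__main__":
    data = [{"id": 1, "question": "2+2?", "answer": "4"}]
    manifest = build_manifest(data, seed=0, model_version="example-model-v1",
                              prompt_revision="r3", config={"temperature": 0.0})
    print(json.dumps(manifest, indent=2))
```

Hashing content rather than file names means a silently edited dataset produces a different manifest, which is what a replicator needs to detect.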