Start Here for AI Research Engineers and Research Scientists

Nine steps, from the agent-eval survey through publishing eval datasets responsibly. Citation-heavy, with ablation tables and reproducibility checklists.

AI research engineers and research scientists own the methodology side of evals. Their evals have to survive peer review, defend against contamination, support statistical claims, and contribute to a system card that frontier customers and regulators will read. This track is research-grade: it cites papers heavily, includes paper PDF and GitHub links, ships ablation tables, and assumes ML PhD-equivalent context. The nine steps cover the methodology arc from the current state of the field through publishing your own eval dataset responsibly.

The first three steps locate the field. Steps four through seven are the methodology tools you ship with each paper, including the system card. The last two face outward: frontier capability thresholds and the dataset release.

The track

1. Agent eval survey (2025 onward) (16 min). The Yehudai et al. survey paper is the current map of agent eval¹. Where the field is, where the gaps are, what is solved and what is open.
2. Benchmark contamination: detection in practice (30 min). The runnable recipe: date-window the test set against each model's training cutoff, then catch verbatim and near-duplicate leakage with an n-gram checker you can swap your own benchmark into. Jacovi et al. is the canonical methodology reference².
3. Statistical methods for LLM comparisons (10 min). Wilson intervals, bootstrap, paired tests, clustered standard errors when questions group, power analysis, multiple comparisons³. The reference page for what Anthropic's error-bars paper asks every eval report to do.
4. Adversarial eval with HarmBench (10 min). HarmBench's three-step pipeline (test cases, completions, classifier scoring)⁴, a minimum-viable run, and how to read safety leaderboards honestly. Ganguli's original red-team paper is the manual-effort baseline the automated pipeline replaces⁵.
5. Reproducibility checklist for AI papers (14 min). Content-hashed datasets pinned to every result, lineage for where each example came from, and the replay discipline: re-run a model from six months ago against the exact dataset version it was evaluated on. The minimum any peer reviewer should accept.
6. Writing a system card (14 min). The frontier-lab system card template. What goes in (capability evals, refusal rates, red-team results); what gets cut (proprietary training details, model size).
7. Reward hacking and evaluation-aware models (14 min). The hint-perturbation harness adapted from the CoT-faithfulness work, sandbagging and alignment faking as documented eval-time behaviours, and why the judge is part of the attack surface.
8. Frontier capability eval thresholds (16 min). Anthropic's Responsible Scaling Policy⁶ and OpenAI's Preparedness Framework. Capability eval design at the frontier; the threshold-cliff vocabulary.
9. Publishing eval datasets responsibly (12 min). Canary insertion so future contamination of your benchmark is detectable, held-out sets as the only real defense, and the release patterns MixEval and LiveCodeBench use to survive in the open. The RULER and HarmBench releases are the right templates⁷ ⁴.
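
The contamination recipe in the second step (date-window first, then n-gram overlap) can be sketched in a few lines. This is a minimal illustration, not the checker that page ships and not a method taken from Jacovi et al.: the function names, the whitespace tokenization, and the 8-gram window are all assumptions chosen for brevity.

```python
from datetime import date


def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Lowercased whitespace-token n-grams of a string (assumed tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also appear in the corpus document."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        # Items shorter than n tokens produce no n-grams; report no overlap.
        return 0.0
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)


def is_after_cutoff(item_published: date, training_cutoff: date) -> bool:
    """Date-window check: True when the item postdates the model's training cutoff."""
    return item_published > training_cutoff
```

If the stated training cutoff is accurate, items that pass the date check are the lower-risk ones, so the overlap pass matters most for items published before the cutoff. Exact n-gram matching catches verbatim leakage only; near-duplicates need fuzzier matching, and the overlap threshold that counts as contaminated is a per-benchmark judgment call.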

What comes after the track

Once you have read the nine pages, the natural next moves are: pick the contamination detection method that matches your benchmark; write the reproducibility appendix for your next paper; draft a system card for an internal capability eval; pick a dataset release date and a license. The agentic-evals chapter and the statistics chapter are the deep-reading entry points. For RAG work, the retrieval metrics and generation metrics catalogs are the per-metric references.
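
As a taste of what the statistics material covers, here is a minimal sketch of the Wilson score interval for a single accuracy number. The function name and the default z = 1.96 (a two-sided 95% interval) are assumptions for illustration. It treats questions as independent, so it does not cover the clustered or paired cases the statistics step also lists.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (e.g. eval accuracy)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    # The interval is centered on a value pulled toward 0.5, not on p itself.
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width
```

For 80 correct out of 100 questions, `wilson_interval(80, 100)` gives roughly 0.71 to 0.87, which is wide enough to swallow many leaderboard gaps.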

TIP

Step 5 (reproducibility) is the cheapest credibility win. A paper that ships seeds, configs, and a one-line replication command outranks a paper with stronger claims that does not.
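
One way to make this concrete is to pin a content hash of the dataset next to the seed and config in every result record. The sketch below is an assumption about how that could look, not the checklist's own tooling: `dataset_hash`, `run_manifest`, and the manifest field names are hypothetical.

```python
import hashlib
import json


def dataset_hash(examples: list[dict]) -> str:
    """SHA-256 over a canonical JSON line per example; sensitive to example order."""
    digest = hashlib.sha256()
    for example in examples:
        # sort_keys makes the hash independent of dict key order within an example.
        line = json.dumps(example, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def run_manifest(examples: list[dict], model_id: str, seed: int, config: dict) -> dict:
    """Record to store alongside a result so the run can be replayed later."""
    return {
        "dataset_sha256": dataset_hash(examples),
        "model_id": model_id,
        "seed": seed,
        "config": config,
    }
```

Storing the manifest with each reported number lets a later replay confirm it is running against the same dataset version before comparing scores.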

¹ Yehudai et al., "Survey on Evaluation of LLM-based Agents." arXiv 2503.16416.
² Jacovi et al., "Stop Uploading Test Data in Plain Text." arXiv 2305.10160.
³ Miller, "Adding Error Bars to Evals." arXiv 2411.00640.
⁴ Mazeika et al., "HarmBench." arXiv 2402.04249.
⁵ Ganguli et al., "Red Teaming Language Models." arXiv 2209.07858.
⁶ Anthropic Responsible Scaling Policy.
⁷ Hsieh et al., "RULER." arXiv 2404.06654.