Safety eval is the one place where the headline number lies the loudest. Attack success rate (ASR) on a red-team set tells you little unless you read it next to the false-refusal rate on benign-but-sensitive prompts. A model that refuses everything scores zero on ASR and is also unusable in production. HarmBench is the canonical adversarial set; XSTest is the canonical over-refusal companion. This recipe will run both and report the two numbers side by side [1, 2].
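To make the two quantities concrete, here is one way to write them down. The notation is an assumption introduced for illustration, not taken from either benchmark: A is the adversarial prompt set, B is the benign-but-sensitive set, h(p) is 1 when the classifier labels the response to p harmful, and r(p) is 1 when the response to p is judged a refusal. Whether a partial refusal counts as r(p) = 1 is a choice you have to fix up front.

```latex
\mathrm{ASR} = \frac{1}{|A|} \sum_{p \in A} h(p),
\qquad
\mathrm{FRR} = \frac{1}{|B|} \sum_{p \in B} r(p)
```

Under these definitions the refuse-everything model has h(p) = 0 on all of A and r(p) = 1 on all of B, so ASR = 0 and FRR = 1: a perfect score on one axis and the worst possible score on the other.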
Planned outline
- Install HarmBench. Clone the repo and pull the classifier weights (Llama-2-13B and Mistral-7B variants).
- Pick a subset. The full HarmBench text set is ~400 behaviors, about half of them in the standard category. For a smoke run, take 100 standard behaviors and 100 prompts from XSTest.
- Run the model. The repo's generate.py hits your endpoint. Logs go to a JSONL.
- Score with the HarmBench classifier. The classifier outputs a binary "is harmful" label per response; ASR is the fraction of responses labeled harmful.
- Run XSTest. Same model, same harness; score with the XSTest evaluator (typically GPT-4 as judge on the over-refusal rubric).
- Joint report. Two columns: ASR (lower is better) and false-refusal rate on XSTest (lower is better). A good safety release moves both down, not just one [1].
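The subset and joint-report steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the HarmBench or XSTest tooling: the function names and the record fields `harmful` and `refused` are hypothetical, and it assumes each scored response has already been written as one JSON object per line, with the XSTest file restricted to the prompts that should be answered.

```python
import json
import random
from pathlib import Path


def load_jsonl(path):
    """Read one JSON object per line, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def sample_subset(records, n, seed=0):
    """Draw a seeded sample so the smoke-run subset is reproducible."""
    if n >= len(records):
        return list(records)
    return random.Random(seed).sample(records, n)


def rate(records, field):
    """Fraction of records whose boolean `field` is true."""
    if not records:
        raise ValueError("no records to score")
    return sum(bool(r[field]) for r in records) / len(records)


def joint_report(harmbench_records, xstest_records):
    """Put ASR and false-refusal rate side by side.

    `harmful` and `refused` are assumed field names (hypothetical),
    holding the classifier label and the judge's refusal label.
    """
    return {
        "asr": rate(harmbench_records, "harmful"),
        "false_refusal_rate": rate(xstest_records, "refused"),
        "n_harmbench": len(harmbench_records),
        "n_xstest": len(xstest_records),
    }


def format_report(report):
    """Render the two-column report as plain text."""
    return (
        f"ASR (lower is better):           {report['asr']:.1%} "
        f"(n={report['n_harmbench']})\n"
        f"False-refusal (lower is better): {report['false_refusal_rate']:.1%} "
        f"(n={report['n_xstest']})"
    )
```

A run might call `sample_subset(load_jsonl(...), 100)` on each prompt file before generation, then pass the two scored files to `joint_report` and print `format_report(...)`. Reporting the sample sizes next to each rate is deliberate: on 100 prompts a single flipped label moves the rate by a full percentage point, so small differences between releases should not be over-read.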
What this will skip in v1
GCG-style automated jailbreaks (a separate exercise; the adversarial attacks chapter will cover them) [3]. Persona-based attacks. Multimodal red-team prompts. Continuous adversarial training loops.
For background, the HarmBench walkthrough chapter covers the conceptual frame. The MLCommons AILuminate practice set is the companion piece for industry-grade taxonomies [4].