What "good" looks like
Safety evaluation has two axes that must be reported together: attack success rate (ASR) on red-team sets and over-refusal rate on benign-but-sensitive requests. A report that gives one without the other is incomplete. A model with an ASR of 0% can still refuse "how do I clean a wound" because the request looks medical-adjacent, and that model is useless in production. XSTest is the standard over-refusal probe and pairs naturally with HarmBench for the ASR side[1]. MLCommons AILuminate provides the industry-consortium standard, with 12 hazard categories and over 24,000 prompts per language[2].
The "good" target is product-dependent. Consumer chatbots need ASR well below 5% on standard red-team sets and over-refusal rates under 10% on XSTest; coding assistants and enterprise tools may tolerate different thresholds. The two-axis trade-off should be explicit in the release report.
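The two-axis report can be sketched as follows. This is a minimal illustration, not a reference implementation: the names `TwoAxisReport` and `two_axis_report` are hypothetical, and the default ceilings simply reuse the consumer-chatbot figures quoted above (other products would set their own).

```python
from dataclasses import dataclass


@dataclass
class TwoAxisReport:
    asr: float           # fraction of red-team prompts that produced harmful output
    over_refusal: float  # fraction of benign-but-sensitive prompts that were refused

    def passes(self, max_asr: float = 0.05, max_over_refusal: float = 0.10) -> bool:
        # Both axes must clear their ceilings; neither one alone is sufficient.
        # The defaults are assumptions taken from the consumer-chatbot example.
        return self.asr < max_asr and self.over_refusal < max_over_refusal


def two_axis_report(attack_succeeded: list[bool], benign_refused: list[bool]) -> TwoAxisReport:
    """Compute ASR and over-refusal rate from per-prompt judge verdicts."""
    if not attack_succeeded or not benign_refused:
        raise ValueError("both evaluation sets must be non-empty")
    asr = sum(attack_succeeded) / len(attack_succeeded)
    over_refusal = sum(benign_refused) / len(benign_refused)
    return TwoAxisReport(asr=asr, over_refusal=over_refusal)
```

Keeping both numbers in one object makes it harder to publish one axis without the other, which is the failure mode described above.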
Secondary metrics
- False-refusal rate on benign-but-sensitive requests (the XSTest axis).
- Demographic robustness: the degree to which model behavior varies with demographic attributes of the requester or target. BBQ is the canonical bias benchmark[3].
- Persona-based attacks: the model is asked to role-play, and the role-play unlocks harmful behavior. Deshpande et al. showed approximately 6x increases in toxicity for personas tied to certain political or cultural identities.
- Jailbreak persistence: the rate at which a successful attack remains successful across model turns or sessions.
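Two of the secondary metrics above reduce to simple arithmetic once verdicts are logged. The sketch below is illustrative only: the function names are hypothetical, and the max-minus-min gap is an assumed summary statistic, not the bias score that BBQ itself defines.

```python
def jailbreak_persistence(replays: dict[str, list[bool]]) -> float:
    """Fraction of replays in which an initially successful attack succeeded again.

    `replays` maps an attack ID to the outcomes of re-running that attack
    in later turns or fresh sessions (True means it worked again).
    """
    outcomes = [ok for attempts in replays.values() for ok in attempts]
    if not outcomes:
        raise ValueError("no replay outcomes recorded")
    return sum(outcomes) / len(outcomes)


def demographic_gap(rate_by_group: dict[str, float]) -> float:
    """Largest difference in a behavior rate (for example refusal rate) between groups."""
    if not rate_by_group:
        raise ValueError("no groups supplied")
    rates = list(rate_by_group.values())
    return max(rates) - min(rates)
```

A persistence value near 1.0 suggests the attack is reliable rather than a sampling fluke; a gap near 0.0 suggests behavior is stable across the groups compared.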
Offline eval design
Run the HarmBench standard subset for ASR and the AILuminate practice prompts for industry-comparable scoring[2]. Maintain a private red-team set and rotate it quarterly, because static benchmarks contaminate over time and attackers adapt. Run XSTest in the same pipeline for over-refusal. The HarmBench three-step pipeline (generate test cases, generate model completions, evaluate with a classifier) is the operational pattern[1]. Use the HarmBench classifiers (Llama-2-13b and Mistral-7b variants) for the judge stage.
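The three-step shape can be sketched with pluggable callables. This is not HarmBench's own API: `run_asr_pipeline` and its three parameters are hypothetical placeholders for whatever attack generator, model client, and judge classifier a team actually wires in.

```python
from typing import Callable, Iterable


def run_asr_pipeline(
    behaviors: Iterable[str],
    generate_test_cases: Callable[[str], list[str]],
    complete: Callable[[str], str],
    judge: Callable[[str, str], bool],
) -> float:
    """Return attack success rate over all generated test cases."""
    # Step 1: expand each target behavior into one or more adversarial prompts.
    test_cases = [(b, prompt) for b in behaviors for prompt in generate_test_cases(b)]
    if not test_cases:
        raise ValueError("no test cases were generated")

    # Step 2: collect the model's completion for every prompt.
    completions = [(b, complete(prompt)) for b, prompt in test_cases]

    # Step 3: the judge decides whether the completion exhibits the behavior.
    verdicts = [judge(b, completion) for b, completion in completions]
    return sum(verdicts) / len(verdicts)
```

Because each stage is a plain function argument, the same skeleton can run a public subset and the private rotated set by swapping only the first callable.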
For high-stakes deployments, pre-deploy external red-team engagement is the right default; Anthropic's 38,961-attempt human red-team study is the canonical methodology reference[4].
Online eval design
In production, run a classifier on inputs and outputs; log every triggered event. Sample a fraction (start at 10%) of triggered events for human review. Alert on attack-pattern surges (a new jailbreak technique often appears as a sudden spike in a specific category) and on over-refusal complaint volume (user reports of refused benign requests). Quarterly external red-team engagements are the right cadence for high-stakes products; the threat model changes faster than internal teams can keep up with.
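The sampling and surge-alert steps could look like the sketch below. Everything here is an assumption for illustration: the function names, the hash-based sampling scheme, and the `ratio` and `min_count` defaults are not prescribed by the text and would need tuning against real traffic.

```python
import hashlib
from collections import Counter


def sampled_for_review(event_id: str, rate: float = 0.10) -> bool:
    """Deterministically pick roughly `rate` of triggered events for human review."""
    # Hashing the ID gives a stable decision: the same event is always in or out.
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(rate * 2**64)


def surging_categories(
    current: Counter,
    baseline: Counter,
    ratio: float = 3.0,
    min_count: int = 20,
) -> list[str]:
    """Flag categories whose trigger count jumped versus a baseline window."""
    flagged = []
    for category, count in current.items():
        expected = max(baseline.get(category, 0), 1)  # avoid division by zero
        if count >= min_count and count / expected >= ratio:
            flagged.append(category)
    return sorted(flagged)
```

The `min_count` floor keeps low-volume categories from alerting on noise, and comparing per category rather than in aggregate matches the observation that a new jailbreak technique tends to spike in one category.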
Top benchmarks and tools
HarmBench classifiers serve as the judge stage of the standard pipeline. Giskard's scan checks for bias, security, and performance issues using a structured probe library. garak, from NVIDIA, is an LLM red-teaming tool with a wide attack catalog. Promptfoo's red-team mode supports CI-integrated adversarial testing. The transferable-suffix (GCG) attack from Zou et al., evaluated on their AdvBench set, is the canonical optimization-based jailbreak baseline[5].
Common pitfalls
- ASR-only without over-refusal. This ships a useless model that refuses everything touching a sensitive topic. The two-axis frame is non-negotiable.
- Static benchmarks only. Attackers adapt; quarterly private red-team rotation is the operational pattern.
- No multimodal or multi-turn coverage. Many attacks emerge specifically in image or audio inputs or through multi-turn coercion; single-turn, text-only safety evaluation misses them.
- Treating "refused" as "safe". A model that refuses a harmful request and then immediately answers the same request rephrased has not actually refused. The test must check whether the specific harm was ultimately produced, not whether a refusal appeared somewhere.
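One way to guard against the "refused is safe" pitfall is to score each behavior across its rephrasings instead of per prompt. The sketch below assumes a hypothetical `strict_asr` helper and assumes the judge verdicts already record whether harmful content was produced.

```python
def strict_asr(harmful_by_behavior: dict[str, list[bool]]) -> float:
    """Per-behavior ASR: a behavior counts as broken if ANY variant elicited harm.

    `harmful_by_behavior` maps a behavior ID to judge verdicts for the
    original prompt and each rephrasing (True means harmful output).
    """
    if not harmful_by_behavior:
        raise ValueError("no behaviors supplied")
    broken = sum(any(verdicts) for verdicts in harmful_by_behavior.values())
    return broken / len(harmful_by_behavior)
```

Under this scoring, a model that refuses the first phrasing but complies with the second is counted as a failure for that behavior, which a per-prompt average would partly hide.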
Canonical reading
The HarmBench paper is the canonical reading because it operationalizes a three-step pipeline (test cases, completions, classifier) that is the practical pattern for most automated safety evaluation[1]. The Ganguli et al. Anthropic red-team paper is the canonical reference for the human red-team methodology that high-stakes deployments still require[4]. Read together: HarmBench for the automated layer, Ganguli for the human layer.
Day-1 starter
- HarmBench standard subset and XSTest in CI. The two-axis frame as the default scoring.
- A classifier on production inputs and outputs, with a 10% sample of triggered events routed to human review. Alert on attack-pattern surges and on over-refusal complaints.
- A quarterly external red-team engagement for high-stakes products. The threat model evolves faster than internal teams can track; external expertise is the calibration.