Safety, refusal, and jailbreak resistance

Refuse harmful requests; resist adversarial prompts; don't over-refuse. Attack success rate plus over-refusal rate, never one alone.

What "good" looks like

A safety eval scores two axes at once: attack success rate (ASR) on red-team sets, and over-refusal rate on benign-but-sensitive requests. Reporting either alone is broken.

The concrete failure: a model reaches ASR of 0% by refusing "how do I clean a wound" because it looks medical-adjacent. Useless in production.

The "good" target is product-dependent:

Consumer chatbots: ASR well below 5% on standard red-team sets, over-refusal under 10% on XSTest.
Coding assistants and enterprise tools: may tolerate different thresholds.
The two-axis trade-off should be explicit in the release report.

XSTest is the standard over-refusal probe and pairs naturally with HarmBench for the ASR side^[1]. MLCommons AILuminate provides the industry-consortium standard with 12 hazard categories and over 24,000 prompts per language^[2].

flowchart TD
    REQ["Incoming request"] --> CLS{"Harmful or benign-sensitive?"}
    CLS -->|harmful| ASR["ASR check: HarmBench red-team set"]
    CLS -->|benign-sensitive| OVR["Over-refusal check: XSTest"]
    ASR --> G1{"ASR below 5%?"}
    OVR --> G2{"Over-refusal below 10%?"}
    G1 -->|no| FIX1["Harden refusals"]
    G2 -->|no| FIX2["Loosen over-broad refusals"]
    G1 -->|yes| GATE{"Both axes pass?"}
    G2 -->|yes| GATE
    GATE -->|yes| SHIP["Ship"]

Secondary metrics

Metric	What it catches	How to score
False-refusal rate	Benign-sensitive requests wrongly refused	The XSTest axis
Demographic robustness	Behavior varying with demographic attributes of requester or target	BBQ, the canonical bias benchmark^[3]
Persona-based attacks	Role-play that opens the door to harmful behavior	Deshpande et al. showed approximately 6x increases in toxicity for personas tied to certain political or cultural identities
Jailbreak persistence	A successful attack surviving across turns	Rate at which the attack remains successful over model turns or sessions

Offline eval design

Core sets. Run the HarmBench standard subset for ASR and AILuminate practice prompts for industry-comparable scoring^[2]. Keep XSTest in the pipeline for over-refusal.

Rotate the private set. Add a private red-team set rotated quarterly; static benchmarks contaminate over time and attackers adapt.

The operational pattern. The HarmBench three-step pipeline is the pattern to copy: generate test cases, model completions, classifier eval^[1]. Use the HarmBench classifiers (Llama-2-13b and Mistral-7b variants) for the judge stage.

High-stakes deployments. Pre-deploy external red-team engagement is the right default; Anthropic's 38,961-attempt human red-team study is the canonical methodology reference^[4].

Online eval design

Classify and log. In production, run a classifier on inputs and outputs and log every triggered event. Sample a fraction (start at 10%) for review.

Alert on two signals:

Attack-pattern surges: a new jailbreak technique often appears as a sudden spike in one category.
Over-refusal complaint volume: user reports of refused benign requests.

Cadence. Quarterly external red-team engagements are the right cadence for high-stakes products; the threat model changes faster than internal teams can keep up with.

Top benchmarks

HarmBench, https://github.com/centerforaisafety/HarmBench. 18 attack methods across 33 LLMs; the academic standard for ASR measurement.
MLCommons AILuminate, https://mlcommons.org/benchmarks/ailuminate/. 12 hazard categories, over 24,000 prompts per language; the industry consortium standard.
XSTest for over-refusal. Pair with HarmBench for the two-axis scoring.

Tools especially suited

HarmBench classifiers: the judge stage of the standard pipeline.
Giskard scan: bias, security, and performance issues with a structured probe library.
garak (NVIDIA): LLM red-team with a wide attack catalog.
Promptfoo red-team mode: CI-integrated adversarial testing.
AdvBench: transferable-suffix attacks from Zou et al., the canonical optimization-based jailbreak baseline^[5].

Common pitfalls

Pitfall	Why it bites	Fix
ASR-only without over-refusal	Ships a useless model that refuses everything touching a sensitive topic	Keep the two-axis frame; it is non-negotiable
Static benchmarks only	Attackers adapt and static sets contaminate	Quarterly private red-team rotation
No multimodal or multi-turn coverage	Many attacks emerge in image or audio inputs or across multi-turn coercion, which single-turn text-only eval misses	Cover image, audio, and multi-turn attacks
Treating "refused" as "safe"	A model that refuses then answers the same request rephrased has not actually refused	Make the harm test specific to the harm being measured

Canonical reading

The HarmBench paper is the canonical reading: it operationalizes a three-step pipeline (test cases, completions, classifier) that is the practical pattern for most automated safety evaluation^[1].

The Ganguli et al. Anthropic red-team paper is the canonical reference for the human red-team methodology that high-stakes deployments still require^[4]. Read together: HarmBench for the automated layer, Ganguli for the human layer.

Day-1 starter

HarmBench standard subset and XSTest in CI. The two-axis frame as the default scoring.
An output classifier on every production response, with a 10% sample routed to human review. Alert on attack-pattern surges and on over-refusal complaints.
A quarterly external red-team engagement for high-stakes products. The threat model evolves faster than internal teams can track; external expertise is the calibration.

Cross-cutting links

HarmBench in practice: the chapter-length walkthrough behind the primary metric.
Over-refusal and the balance question: the second axis, in depth.
Robustness testing: perturbation and adversarial-input methodology.
Promptfoo: red-team probes runnable in CI.
Giskard: automated safety and bias scans.
HarmBench red-team run: the day-1 starter; standard subset plus the over-refusal check.