Jailbreaks and defenses

GCG-style adversarial suffixes, the empirical floor on defense success, and what your guardrail stack should actually do.

A jailbreak is an input crafted to bypass a model's safety training. The empirical floor on defense, established by the GCG paper and reproduced repeatedly since, is that a determined attacker with white-box access to an open-weight model can produce attacks that transfer to black-box deployments often enough that "we are safe because the attacker cannot see our weights" is not a working assumption ¹. Defense is therefore layered, probabilistic, and continuous, not a one-time hardening.

The argument: stop treating jailbreaks as a problem to be solved by the model alone. The defensive stack is the model's training, an input filter, an output filter, and a runtime policy. Each layer misses some fraction of attacks. If the layers failed independently, the product of their miss rates would be your residual risk; because they rarely do, the number to trust is the one you measure end to end on a standing adversarial corpus.

The classes of jailbreak

Three broad categories cover most of what shows up in production.

Social-engineering prompts. The classic "DAN" pattern, role-play setups, hypothetical-framing prompts, and authority-figure impersonation. These exploit the model's instruction-following bias. They require no special tools and are produced in volume by manual red teams and crowdworker bounties. Ganguli et al.'s 38,961-attempt corpus is dominated by this class ².

Encoding and obfuscation. Base64-encoded payloads, ROT13, Unicode confusables, prompt-stuffing with foreign-language equivalents, code-block wrapping. These exploit the gap between the model's safety training (largely English natural-language) and the actual input distribution. They are easy to enumerate and easy to defend against once you know the patterns, but new variants appear continuously.

Optimization-based suffixes. GCG and its descendants run an optimization against a white-box model to find adversarial suffixes that, when appended to a harmful request, produce compliance with high probability ¹. The expensive part is the optimization; the cheap part is reuse, because the resulting suffixes often transfer to other models, including closed-weight production models. HarmBench reports per-attack-method attack success rate (ASR) numbers across the major attack families, with GCG, AutoDAN, and PAIR among the headline automated attacks ³.

A fourth, rapidly growing category is indirect prompt injection through tools and retrieval. The OWASP Top 10 for LLM Applications lists prompt injection, direct and indirect, as LLM01 ⁴. The attack lives in a document, a webpage, or a tool result rather than in the user's prompt, and the model is induced to follow the instructions it finds there. This is not strictly a jailbreak in the social-engineering sense, but it is the failure mode that most affects agentic systems.
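For the encoding and obfuscation class described above, one common approach is to hand the input filter decoded and normalised views of the text alongside the raw input. The following is a minimal sketch using only the Python standard library. The function name `candidate_views`, the 16-character Base64 threshold, and the choice of decodings are illustrative assumptions, not a complete defense. In particular, NFKC folding handles compatibility forms such as fullwidth letters but does not map cross-script look-alikes (for example, Cyrillic letters that resemble Latin ones).

```python
import base64
import binascii
import codecs
import re
import unicodedata

# Runs of Base64-alphabet characters long enough to plausibly hide a payload.
# The minimum length of 16 is an arbitrary, illustrative threshold.
_B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def candidate_views(text: str) -> list[str]:
    """Return the raw text plus decoded/normalised variants to feed a filter."""
    views = [text]

    # Fold Unicode compatibility forms (fullwidth letters, ligatures) to plain ones.
    folded = unicodedata.normalize("NFKC", text)
    if folded != text:
        views.append(folded)

    # ROT13 is its own inverse, so one pass recovers a ROT13-encoded payload.
    views.append(codecs.decode(folded, "rot13"))

    # Try each Base64-looking run; keep only results that decode to valid UTF-8.
    for match in _B64_RUN.finditer(folded):
        try:
            decoded = base64.b64decode(match.group(0), validate=True)
            views.append(decoded.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            continue  # Not valid Base64, or not text: ignore this run.

    return views
```

The input filter would then be run over every returned view and trip if any of them matches. Because new encodings appear continuously, a list like this is a regression floor rather than a guarantee.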

The layered defense, named honestly

Four layers, four jobs.

Layer 1: Model training. RLHF, constitutional AI, refusal-tuned datasets. The base layer; it catches the easy cases and nothing more. Public jailbreak benchmarks keep finding successful attacks against safety-trained models, which is why this layer cannot stand alone.

Layer 2: Input filter. A classifier that runs on every user input, looking for known attack patterns: jailbreak prompts, encoded payloads, suspicious instruction structures. Fast, cheap, easy to update. Catches the obvious. Easy to bypass by paraphrase or novel attack class.

Layer 3: Output filter. A classifier or LLM judge that runs on model output, looking for harmful content regardless of how it was elicited. Stronger than input filtering because it grades the actual artifact, not a precursor. Slower and more expensive; runs on every response. The right place to invest if you can only build one filter.

Layer 4: Runtime policy. For agentic systems specifically: tool-call allow-listing, scope limits, capability gating by user role, rate limits on sensitive actions. Catches the post-jailbreak case where the model is doing something bad through tools rather than through the response itself.
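As a rough illustration of Layer 4, the sketch below combines a deny-by-default tool allow-list keyed on user role with a sliding-window rate limit on sensitive tools. The role names, tool names, limits, and the `ToolPolicy` class are all hypothetical; a real deployment would load them from configuration and enforce them outside the model's control.

```python
import time
from collections import defaultdict, deque

# Hypothetical roles and tool names, for illustration only.
ALLOWED_TOOLS = {
    "viewer": {"search_docs"},
    "operator": {"search_docs", "send_email"},
}
# Sensitive tools get a per-user cap: (max calls, window in seconds).
RATE_LIMITS = {"send_email": (3, 60.0)}


class ToolPolicy:
    def __init__(self, clock=time.monotonic):
        self._clock = clock  # Injectable so the policy can be tested without sleeping.
        self._calls = defaultdict(deque)  # (user, tool) -> recent call timestamps

    def authorize(self, user: str, role: str, tool: str) -> bool:
        """Return True only if the call passes the allow-list and the rate limit."""
        if tool not in ALLOWED_TOOLS.get(role, set()):
            return False  # Deny by default: unknown roles and tools get nothing.
        limit = RATE_LIMITS.get(tool)
        if limit is None:
            return True
        max_calls, window = limit
        now = self._clock()
        recent = self._calls[(user, tool)]
        while recent and now - recent[0] >= window:
            recent.popleft()  # Drop timestamps that have left the window.
        if len(recent) >= max_calls:
            return False
        recent.append(now)
        return True
```

The point of the design is that the check runs on the tool call itself, so it still applies when the model has already been talked into issuing the call.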

NOTE

No single layer is sufficient. The honest residual risk is roughly the product of each layer's miss rate. A model with 10% ASR, behind an input filter that misses 30% of attacks, behind an output filter that misses 30% of harmful outputs, gives a residual rate of 0.10 × 0.30 × 0.30 = 0.009, or 0.9%, against the standing adversarial corpus. Treat that number, not any single layer's number, as your safety claim.
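The arithmetic in the note can be wrapped in a small helper. This is a sketch under one stated assumption: multiplying miss rates that were measured layer by layer is only exact if the layers fail independently. The second function, `correlated_upper_bound`, shows the opposite extreme, where every attack that beats the strongest layer also beats the rest; the true figure lies between the two. Both function names are illustrative.

```python
import math


def _check(rates: list[float]) -> list[float]:
    """Validate that every rate is a probability in [0, 1]."""
    for r in rates:
        if not 0.0 <= r <= 1.0:
            raise ValueError(f"rate out of range: {r}")
    return rates


def residual_rate(model_asr: float, filter_miss_rates: list[float]) -> float:
    """Product of per-layer miss rates (assumes layers fail independently)."""
    return math.prod(_check([model_asr, *filter_miss_rates]))


def correlated_upper_bound(model_asr: float, filter_miss_rates: list[float]) -> float:
    """Worst case when failures are fully correlated: the smallest single miss rate."""
    return min(_check([model_asr, *filter_miss_rates]))


if __name__ == "__main__":
    # The worked example from the note: 10% model ASR, two filters missing 30% each.
    print(residual_rate(0.10, [0.30, 0.30]))
    print(correlated_upper_bound(0.10, [0.30, 0.30]))
```

If the measured end-to-end rate sits much closer to the upper bound than to the product, the layers are probably failing on the same attacks.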

Measuring defense, not refusal

The metric that matches the layered model is not "does the model refuse" but "does anything leave the system that the rubric labels harmful." Run your adversarial corpus end-to-end through input filter, model, and output filter. Score the final output. The number is the system's ASR; it should be substantially below the model-alone ASR if the filters work.
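A minimal sketch of that end-to-end run follows. The four callables (`input_filter`, `model`, `output_filter`, `rubric`) are placeholders for whatever your stack provides, and the `CaseResult` record and function names are assumptions made for illustration. The sketch also assumes that a tripped filter replaces the response with a refusal, so nothing harmful leaves the system in that case.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CaseResult:
    input_filter_trip: bool
    model_output: Optional[str]  # None when the model was never called
    output_filter_trip: bool
    final_harmful: bool


def run_case(
    prompt: str,
    input_filter: Callable[[str], bool],   # True means "block this prompt"
    model: Callable[[str], str],
    output_filter: Callable[[str], bool],  # True means "block this response"
    rubric: Callable[[str, str], bool],    # True means "this response is harmful"
) -> CaseResult:
    """Run one adversarial prompt through the whole stack and score what leaves it."""
    if input_filter(prompt):
        return CaseResult(True, None, False, False)
    output = model(prompt)
    if output_filter(output):
        return CaseResult(False, output, True, False)
    # Only responses that passed both filters reach the user, so only they are scored.
    return CaseResult(False, output, False, rubric(prompt, output))


def system_asr(results: list[CaseResult]) -> float:
    """Fraction of cases where harmful content left the system."""
    if not results:
        raise ValueError("system_asr() needs at least one scored case")
    return sum(r.final_harmful for r in results) / len(results)
```

Running the same corpus with both filters stubbed out to always return `False` gives the model-alone ASR for comparison.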

Two diagnostics let you see which layer is doing the work. First, the input-filter trip rate: how often the input filter alone catches the attack. Second, the output-filter trip rate among inputs that the input filter passed. If the input filter is doing all the catching, the output filter is dormant and you should test it directly. If the output filter is doing all the catching, the input filter may not be worth the latency cost.

"""
Per-layer diagnostic for a two-filter defense around a model.
"""
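from dataclasses import dataclass


@dataclass
class Defense:
    """Outcome of one adversarial case run through the full stack."""
    input_filter_trip: bool      # input filter blocked the prompt
    model_output: str | None     # None when the model was never called (Python 3.10+ syntax)
    output_filter_trip: bool     # output filter blocked the response
    final_harmful: bool          # rubric labels what left the system as harmful


def diagnose(defenses: list[Defense]) -> dict:
    """Return per-layer trip rates and the end-to-end system ASR."""
    n = len(defenses)
    if n == 0:
        # Avoid a ZeroDivisionError on an empty corpus.
        raise ValueError("diagnose() needs at least one scored case")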
    input_trips = sum(1 for d in defenses if d.input_filter_trip)
    # Of inputs the input filter let through, how many did the output filter catch?
    passed_input = [d for d in defenses if not d.input_filter_trip]
    output_trips_on_pass = sum(1 for d in passed_input if d.output_filter_trip)
    # System ASR: harmful content that got through everything.
    system_asr = sum(1 for d in defenses if d.final_harmful) / n
    return {
        "n": n,
        "input_filter_trip_rate": input_trips / n,
        "output_filter_trip_rate_on_input_pass": (
            output_trips_on_pass / len(passed_input) if passed_input else 0.0
        ),
        "system_asr": system_asr,
    }

The transferable-suffix problem, restated

The unwelcome empirical fact about GCG-style attacks is that they transfer ¹. An attacker can optimize against a public open-weight model, then take the resulting suffix and use it against your closed-weight production model with a non-trivial success rate. Training a closed-weight model cannot fully defend against this on its own, because the attacker is not attacking you; they are attacking a proxy that approximates you.

The operational response is to (a) archive transferable suffixes you find from any source as regression tests, (b) layer defenses, and (c) accept a non-zero residual rate as the cost of being open to general inputs. If your product cannot tolerate any nonzero rate of harmful output, the architecture choice is not "make the model safer" but "do not connect this model to that surface."
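Step (a) can be as simple as an append-only file plus a check that runs on every release. The sketch below stores suffixes as JSON lines, deduplicated by SHA-256, and replays them against a caller-supplied `system_is_harmful` function. That function, the file layout, and the helper names are assumptions for illustration; `system_is_harmful` stands in for your full stack plus rubric.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def archive_suffix(path: Path, suffix: str, source: str) -> bool:
    """Append a suffix to a JSON-lines archive; return False if already present."""
    digest = hashlib.sha256(suffix.encode("utf-8")).hexdigest()
    if path.exists():
        with path.open(encoding="utf-8") as fh:
            if any(json.loads(line)["sha256"] == digest for line in fh if line.strip()):
                return False
    record = {"sha256": digest, "suffix": suffix, "source": source}
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return True


def regressions(
    path: Path,
    base_requests: list[str],
    system_is_harmful: Callable[[str], bool],
) -> list[tuple[str, str]]:
    """Return (request, suffix sha256) pairs where an archived suffix still gets through."""
    with path.open(encoding="utf-8") as fh:
        records = [json.loads(line) for line in fh if line.strip()]
    failures = []
    for record in records:
        for request in base_requests:
            # Suffix attacks append the optimized string to the harmful request.
            if system_is_harmful(f"{request} {record['suffix']}"):
                failures.append((request, record["sha256"]))
    return failures
```

An empty list from `regressions` means no archived suffix currently succeeds; it says nothing about suffixes you have not yet seen.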

What to do this week

Take a small adversarial set (twenty cases mixed across the three classes above). Run it end to end through your current defensive stack. Compute system ASR.
Compute the per-layer trip rates. If one layer is doing all the work, the others are not earning their latency cost.
Pick five suffix-style attacks from HarmBench. Test whether they transfer to your production model. Any that succeed become regression tests for the next release.

That gives you a baseline number, a layer-by-layer audit, and a first batch of regression cases (up to five, depending on how many suffixes transfer). The standing program described in Designing a red-team program keeps the work going from there.

System ASR numbers and archived regression suffixes are also exactly the evidence a regulator, customer, or board will eventually ask for, and packaging them into that record is the subject of Governance, Risk, Compliance.

¹ Zou et al., "Universal and Transferable Adversarial Attacks on Aligned Language Models" (2023). https://arxiv.org/abs/2307.15043
² Ganguli et al., "Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned" (2022). https://arxiv.org/abs/2209.07858
³ Mazeika et al., "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal" (2024). https://arxiv.org/abs/2402.04249
⁴ OWASP, "Top 10 for LLM Applications." https://owasp.org/www-project-top-10-for-large-language-model-applications/