Safety and Adversarial: Protecting and Attacking Models
Intuition: models can be “tricked” too
Section titled “Intuition: models can be “tricked” too”LLM safety has two sides: models may “say the wrong thing” (hallucinations, bias, harmful content), and they may be tricked by malicious inputs into doing things they should not (jailbreaks, prompt injection). Like a gatekeeper who must both identify bad actors and resist deceptive tactics.
Engineering view: layered defense and continuous monitoring
Section titled “Engineering view: layered defense and continuous monitoring”Engineering defenses are typically layered:
- Input layer: Filter sensitive prompts, detect known attack patterns, restrict input length and format.
- Model layer: Alignment training (RLHF, Constitutional AI), refusal strategies, output classifiers.
- Output layer: Post-processing filters, watermarks, fact-checking, citation verification.
- System layer: Sandboxed execution, least privilege, audit logs, rate limiting.
Common attacks:
- Jailbreak: Bypass safety restrictions through roleplay, encoding tricks, or logic traps.
- Prompt injection: Embed malicious instructions in untrusted input (web pages, emails) to hijack model behavior.
- Data extraction: Craft prompts to extract private information from training data.
There is no perfect defense. The key is defense in depth and continuous red teaming.
Research view: safety is generalizable refusal
Section titled “Research view: safety is generalizable refusal”At the research level, the core safety question is: can models learn “generalizable refusal”—not just rejecting attacks seen during training, but also defending against unseen variants? Current evidence suggests that adversarial attacks often transfer: jailbreaks discovered on one model frequently work on others.
Frontier directions include: automated red teaming (using models to attack models), provable defense bounds, mechanistic interpretability to locate harmful behavior circuits, and formal methods to verify output constraints in critical systems.
References
- wei2023-jailbroken
Systematically classifies jailbreak methods (out-of-distribution, goal conflict) and explains why RLHF struggles to eradicate them. Reference material for jailbreak research "taxonomy".
- perez2022-redteaming
DeepMind uses one LLM to automatically generate attack prompts for red-teaming another LLM, engineering red-teaming. Safety/jailbreak research since then shifted from "manual prompt search" to automated paradigm.
- zou2023-universal-attack
Uses GCG algorithm to find gibberish suffix that breaks through aligned LLaMA-2/Vicuna, with attacks transferring across multiple closed-source models. Shocked entire security community, making "alignment fragility" mainstream topic.
- bai2022-hh
Anthropic's early RLHF paper, HH-RLHF dataset since then became "MNIST" of open-source alignment research. Earliest systematic work understanding helpful vs harmless tension.