Safety and Adversarial: Protecting and Attacking Models

Intuition: models can be “tricked” too

LLM safety has two sides: models may “say the wrong thing” (hallucinations, bias, harmful content), and they may be tricked by malicious inputs into doing things they should not (jailbreaks, prompt injection). Like a gatekeeper who must both identify bad actors and resist deceptive tactics.

Engineering view: layered defense and continuous monitoring

Engineering defenses are typically layered:

Input layer: Filter sensitive prompts, detect known attack patterns, restrict input length and format.
Model layer: Alignment training (RLHF, Constitutional AI), refusal strategies, output classifiers.
Output layer: Post-processing filters, watermarks, fact-checking, citation verification.
System layer: Sandboxed execution, least privilege, audit logs, rate limiting.

Common attacks:

Jailbreak: Bypass safety restrictions through roleplay, encoding tricks, or logic traps.
Prompt injection: Embed malicious instructions in untrusted input (web pages, emails) to hijack model behavior.
Data extraction: Craft prompts to extract private information from training data.

There is no perfect defense. The key is defense in depth and continuous red teaming.

Research view: safety is generalizable refusal

At the research level, the core safety question is: can models learn “generalizable refusal”—not just rejecting attacks seen during training, but also defending against unseen variants? Current evidence suggests that adversarial attacks often transfer: jailbreaks discovered on one model frequently work on others.

Frontier directions include: automated red teaming (using models to attack models), provable defense bounds, mechanistic interpretability to locate harmful behavior circuits, and formal methods to verify output constraints in critical systems.

🔬 Open Research Questions

Key questions and research directions in this area:

Is there a theoretical equilibrium point in the "cat-and-mouse game" between adversarial attacks (e.g., GCG) and model alignment?

Related: zou2023 universal , wei2023 jailbroken
How should risk assessment frameworks for indirect prompt injection in real-world applications be constructed?

Related: greshake2023 notwhat
How can privacy risks from training data extraction attacks be quantified? Is differential privacy training the only solution?

Related: carlini2021 extracting

References

Jailbroken: How Does LLM Safety Training Fail? — Alexander Wei et al. (2023)
Systematically classifies jailbreak methods (out-of-distribution, goal conflict) and explains why RLHF struggles to eradicate them. Reference material for jailbreak research "taxonomy".
Red Teaming Language Models with Language Models — Ethan Perez et al. (2022)
DeepMind uses one LLM to automatically generate attack prompts for red-teaming another LLM, engineering red-teaming. Safety/jailbreak research since then shifted from "manual prompt search" to automated paradigm.
Universal and Transferable Adversarial Attacks on Aligned Language Models — Andy Zou et al. (2023)
Uses GCG algorithm to find gibberish suffix that breaks through aligned LLaMA-2/Vicuna, with attacks transferring across multiple closed-source models. Shocked entire security community, making "alignment fragility" mainstream topic.
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback — Yuntao Bai et al. (2022)
Anthropic's early RLHF paper, HH-RLHF dataset since then became "MNIST" of open-source alignment research. Earliest systematic work understanding helpful vs harmless tension.
Extracting Training Data from Large Language Models — Nicholas Carlini et al. (2021)
Demonstrates the feasibility of extracting training data fragments from language models like GPT-2. Through carefully designed decoding strategies, hundreds of verbatim memorized training examples can be recovered, revealing privacy risks in large language models.
Not What You Have Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection — Kai Greshake et al. (2023)
Reveals indirect prompt injection attacks: adversaries control external data processed by LLM applications (web pages, emails, documents) to inject malicious instructions and hijack application behavior. Demonstrates attacks on Bing Chat, GitHub Copilot, and other real applications.

Safety and Adversarial: Protecting and Attacking Models

Intuition: models can be “tricked” too

Engineering view: layered defense and continuous monitoring

Research view: safety is generalizable refusal

🔬 Open Research Questions

Related Reading

Fine-Tuning and Alignment: Making Models Follow Instructions

Evaluation and Benchmarks: Judging Model Quality

Prompt Engineering: The Art of Talking to Models

RAG and Retrieval Augmentation: Giving Models External Memory

References