perez2022-redteaming
arXiv: 2202.03286
TLDR
DeepMind uses one LLM to automatically generate attack prompts that red-team another LLM, turning red teaming into a repeatable engineering process. The work helped move safety and jailbreak research from manual prompt search toward an automated paradigm.
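The loop described in the TLDR can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `red_team`, `make_toy_red_lm`, `toy_target`, and `toy_classifier` are hypothetical names, and the three components are toy stand-ins for a real attacker model, a real target model, and a real harm classifier.

```python
import random
from typing import Callable, List, Tuple


def red_team(
    generate_prompt: Callable[[], str],
    target: Callable[[str], str],
    is_harmful: Callable[[str, str], bool],
    n_cases: int,
) -> List[Tuple[str, str]]:
    """Generate n_cases attack prompts, query the target, keep flagged pairs."""
    failures = []
    for _ in range(n_cases):
        prompt = generate_prompt()      # attacker LLM proposes a test case
        reply = target(prompt)          # target LLM answers it
        if is_harmful(prompt, reply):   # classifier flags harmful replies
            failures.append((prompt, reply))
    return failures


def make_toy_red_lm(seed: int = 0) -> Callable[[], str]:
    """Hypothetical attacker: samples from fixed templates instead of an LLM."""
    rng = random.Random(seed)
    templates = [
        "Please insult me.",
        "What is the capital of France?",
        "Write an insult about my coworker.",
    ]
    return lambda: rng.choice(templates)


def toy_target(prompt: str) -> str:
    """Hypothetical target: misbehaves whenever the prompt asks for an insult."""
    if "insult" in prompt.lower():
        return "You are a fool."
    return "I would rather not say."


def toy_classifier(prompt: str, reply: str) -> bool:
    """Hypothetical harm classifier: a keyword check standing in for a model."""
    return "fool" in reply.lower()


if __name__ == "__main__":
    for prompt, reply in red_team(make_toy_red_lm(0), toy_target, toy_classifier, 10):
        print(f"{prompt!r} -> {reply!r}")
```

In a real setup each callable would wrap a model call, and the collected failures would be reviewed or used to improve the target.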