perez2022-redteaming

arXiv: 2202.03286

TLDR(中文)

DeepMind 用一个 LLM 自动产生攻击 prompt 来红队另一个 LLM,把红队工程化。安全/越狱研究从此从"人工搜 prompt"走向自动化范式。

TLDR (English)

DeepMind uses one LLM to automatically generate attack prompts that red-team another LLM, turning red-teaming into an engineering process. Safety and jailbreak research has since shifted from manual prompt search toward an automated paradigm.
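
The loop summarized above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the three callables (`generate_attack`, `target_reply`, `is_harmful`) are hypothetical stand-ins for the red-team LM, the target LM, and some harm detector, and the detector step is an assumption added here so the loop has a way to decide which replies count as failures.

```python
from typing import Callable, List, Tuple


def red_team(
    generate_attack: Callable[[int], str],  # hypothetical: red-team LM, returns the i-th attack prompt
    target_reply: Callable[[str], str],     # hypothetical: target LM under test
    is_harmful: Callable[[str], bool],      # hypothetical: harm detector (assumed, not from the TLDR)
    n_prompts: int,
) -> List[Tuple[str, str]]:
    """Return the (prompt, reply) pairs where the target produced a flagged reply."""
    failures = []
    for i in range(n_prompts):
        prompt = generate_attack(i)
        reply = target_reply(prompt)
        if is_harmful(reply):
            failures.append((prompt, reply))
    return failures


# Toy stand-ins so the sketch runs without any model.
def toy_attacker(i: int) -> str:
    return f"question {i}"


def toy_target(prompt: str) -> str:
    # Pretend the target misbehaves on prompts ending in 3 or 7.
    return "BAD " + prompt if prompt.endswith(("3", "7")) else "ok"


def toy_classifier(reply: str) -> bool:
    return reply.startswith("BAD")


found = red_team(toy_attacker, toy_target, toy_classifier, 10)
```

Passing the models in as plain callables keeps the loop independent of any particular model API; swapping in real models only requires wrapping them to match these signatures.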