Skip to content

zou2023-universal-attack

arXiv: 2307.15043

TLDR (English)

Uses GCG algorithm to find gibberish suffix that breaks through aligned LLaMA-2/Vicuna, with attacks transferring across multiple closed-source models. Shocked entire security community, making "alignment fragility" mainstream topic.

TLDR(中文)

用 GCG 算法找到一段乱码后缀,能把对齐过的 LLaMA-2/Vicuna 全打穿,且攻击在多个闭源模型间迁移。震撼整个安全社区,让"对齐脆弱性"成为主流话题。