Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
arXiv: 2209.07858
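TLDR
Studies red teaming of language models across three model sizes (2.7B, 13B, and 52B parameters) and four model types: a plain language model, a model prompted to be helpful, honest, and harmless, a model with rejection sampling, and a model trained with reinforcement learning from human feedback (RLHF). RLHF models become increasingly difficult to red team as they scale, while the other model types show a roughly flat trend with scale. The paper also releases a dataset of 38,961 red team attacks and documents its instructions, processes, and statistical methodology as guidance for future red teaming efforts.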