Constitutional AI: Harmlessness from AI Feedback

Authors: Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, Jared Kaplan (2022)

arXiv: 2212.08073

Domains

Alignment, Safety

TLDR (English)

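Anthropic's Constitutional AI (CAI) trains a harmless assistant in two phases, both guided by a set of explicit "constitution" principles. In the supervised phase (SL-CAI), the model critiques and revises its own responses against those principles and is then fine-tuned on the revised responses. In the reinforcement learning phase (RLAIF), AI-generated preference labels replace the human feedback used in RLHF. This reduces reliance on human annotation and is the core alignment technique behind the Claude model family.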
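The two phases described above can be sketched as a minimal Python example. This is an illustration under stated assumptions, not the paper's implementation: `Model` is a hypothetical stand-in for any text-in, text-out language model, the two principles in `CONSTITUTION` are made up for illustration rather than quoted from the paper, and the prompt templates are invented. The preference step returns a hard A/B label for simplicity, and the sketch stops short of the fine-tuning and RL steps themselves.

```python
import random
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-in for a language model: prompt text in, completion text out.
Model = Callable[[str], str]

# Illustrative principles only; these are NOT quoted from the paper's constitution.
CONSTITUTION: List[str] = [
    "Choose the response that is least harmful.",
    "Choose the response that is most honest and helpful.",
]


def critique_and_revise(
    model: Model,
    prompt: str,
    principles: List[str],
    rounds: int = 2,
    rng: Optional[random.Random] = None,
) -> Tuple[str, str]:
    """SL-CAI-style data generation: draft, then repeatedly critique and revise.

    Returns a (prompt, final_revision) pair that could serve as one
    supervised fine-tuning example.
    """
    rng = rng or random.Random(0)
    response = model(prompt)  # initial draft
    for _ in range(rounds):
        principle = rng.choice(principles)  # sample one principle per round
        critique = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response against this principle: {principle}"
        )
        response = model(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return prompt, response


def ai_preference_label(
    model: Model, prompt: str, response_a: str, response_b: str, principle: str
) -> int:
    """RLAIF-style labeling: ask the model which response better fits a principle.

    Returns 0 if the model prefers A, otherwise 1 (a simplifying assumption).
    """
    answer = model(
        f"Prompt: {prompt}\n(A) {response_a}\n(B) {response_b}\n"
        f"Principle: {principle}\n"
        "Which response better follows the principle? Answer A or B."
    )
    return 0 if answer.strip().upper().startswith("A") else 1
```

In a full pipeline, the pairs returned by `critique_and_revise` would be used for supervised fine-tuning, and the labels from `ai_preference_label` would train a preference model that supplies the reward signal for RL.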

