Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Authors: Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn (2023)

arXiv: 2305.18290

TLDR

DPO (Direct Preference Optimization) shows that the two-stage RLHF pipeline, fitting a reward model and then optimizing against it with RL, can be collapsed into a single supervised learning step. It trains the language model parameters directly on preference data with a classification-style loss, and under the Bradley-Terry preference model the optimum of that loss coincides with the optimal policy of the KL-constrained RLHF objective. Because it is simple and efficient, DPO has become a widely used alternative to RLHF in alignment research and the open-source community.
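
To make the "single supervised step" concrete, here is a minimal sketch of the per-example DPO loss from the paper, -log σ(β[(log π(y_w|x) - log π_ref(y_w|x)) - (log π(y_l|x) - log π_ref(y_l|x))]), where y_w is the preferred response and y_l the rejected one. The function and argument names are illustrative assumptions, not taken from any library, and the inputs are assumed to be summed token log-probabilities of each response that have already been computed under the trained policy and the frozen reference model.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative default; beta scales the implicit KL penalty
) -> float:
    """DPO loss for one preference pair (hypothetical helper).

    Each argument is the summed log-probability of a full response
    given the prompt, under either the policy or the reference model.
    """
    # Implicit rewards are beta-scaled log-ratios against the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # Binary classification loss: prefer the chosen response.
    return -log_sigmoid(margin)


# When the policy equals the reference, the margin is zero and the loss is log 2.
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Raising the chosen response's likelihood relative to the rejected one lowers the loss.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

In practice this would be computed over batches with an autodiff framework so that gradients flow into the policy's log-probabilities; the scalar version only illustrates that no separate reward model or RL loop appears in the objective.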