A General Theoretical Paradigm to Understand Learning from Human Preferences

Authors: Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos (2023)

arXiv: 2310.12036

Field

Alignment

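TLDR

Unifies RLHF and DPO as special cases of a general objective, Ψ-PO. Shows that DPO, which relies on the Bradley-Terry (BT) assumption to substitute pointwise rewards for pairwise preferences, can overfit when preferences are deterministic or nearly so, because its KL regularization then becomes ineffective. Proposes IPO, the Ψ-PO special case with Ψ set to the identity, whose loss is more robust. A theoretical must-read for understanding why DPO does not always work; see also KTO and SimPO.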
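The contrast between the DPO and IPO losses can be sketched numerically. The snippet below is a minimal illustration, not the authors' code: it assumes per-pair losses computed from summed sequence log-probabilities, and the function names (`log_ratio_margin`, `dpo_loss`, `ipo_loss`) are hypothetical. With h the log-likelihood-ratio margin between the preferred and dispreferred response, the DPO loss is -log σ(βh), which keeps decreasing as h grows, while the IPO loss (h - 1/(2τ))² is minimized at the finite margin h = 1/(2τ).

```python
import math


def log_ratio_margin(logp_w, logp_l, ref_logp_w, ref_logp_l):
    """h = log[pi(y_w)/pi_ref(y_w)] - log[pi(y_l)/pi_ref(y_l)].

    Inputs are summed sequence log-probabilities (an assumption of this sketch).
    """
    return (logp_w - ref_logp_w) - (logp_l - ref_logp_l)


def dpo_loss(h, beta):
    """DPO per-pair loss: -log(sigmoid(beta * h)), computed stably."""
    x = beta * h
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def ipo_loss(h, tau):
    """IPO per-pair loss: squared distance of h from the target 1 / (2 * tau)."""
    return (h - 1.0 / (2.0 * tau)) ** 2


if __name__ == "__main__":
    beta = tau = 0.1
    for h in (0.0, 5.0, 10.0, 20.0):
        print(f"h={h:5.1f}  dpo={dpo_loss(h, beta):.4f}  ipo={ipo_loss(h, tau):.4f}")
```

In this sketch the DPO loss never reaches a minimum at any finite margin, so a policy can keep pushing h upward on a pair that is always labeled the same way; the IPO loss instead penalizes margins beyond 1/(2τ), which is one way to read the robustness claim.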
