跳转到内容

Deep Reinforcement Learning from Human Preferences

作者: Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei (2017)

arXiv: 1706.03741

TLDR(中文)

RLHF(人类反馈强化学习)的奠基论文。作者展示了通过人类对比偏好来训练奖励模型, 再用该奖励模型指导强化学习,可以让 agent 学会难以用奖励函数显式描述的复杂行为。 这个框架后来被 InstructGPT/ChatGPT 直接采用。

TLDR (English)

The foundational RLHF paper. The authors show that training a reward model from human pairwise preferences, then using it to guide reinforcement learning, enables agents to learn complex behaviors that are difficult to specify with explicit reward functions. This framework was directly adopted by InstructGPT/ChatGPT.