Deep Reinforcement Learning from Human Preferences

Authors: Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei (2017)

Field

Alignment

TLDR

The foundational paper for RLHF (reinforcement learning from human feedback). The authors show that training a reward model on human pairwise preferences, then using that reward model to guide reinforcement learning, lets agents learn complex behaviors that are difficult to specify with an explicit reward function. This framework was later adopted by InstructGPT/ChatGPT.
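As a rough illustration of the reward-model step summarized above, the sketch below scores a pair of trajectory segments and computes a cross-entropy loss against a human preference label, using a Bradley-Terry style comparison of summed predicted rewards. The function names, the plain-Python lists of per-step rewards, and the label convention (1.0 means the first segment was preferred, 0.0 the second, 0.5 no preference) are assumptions made for this example, not code from the paper.

```python
import math


def preference_probability(rewards_a, rewards_b):
    """Probability that segment A is preferred over segment B.

    Each argument is a list of per-step rewards predicted by a
    (hypothetical) reward model. The probability is a logistic
    function of the difference in summed rewards.
    """
    diff = sum(rewards_a) - sum(rewards_b)
    # Numerically stable logistic: avoid exp() of a large positive number.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    exp_diff = math.exp(diff)
    return exp_diff / (1.0 + exp_diff)


def preference_loss(rewards_a, rewards_b, label):
    """Cross-entropy between the predicted preference and the human label.

    label: 1.0 if the human preferred A, 0.0 if B, 0.5 if no preference.
    """
    eps = 1e-12  # guards against log(0)
    p_a = preference_probability(rewards_a, rewards_b)
    return -(label * math.log(p_a + eps)
             + (1.0 - label) * math.log(1.0 - p_a + eps))


if __name__ == "__main__":
    segment_a = [0.5, 1.0, 0.5]  # predicted rewards, sum = 2.0
    segment_b = [0.0, 0.0, 0.0]  # predicted rewards, sum = 0.0
    print(preference_probability(segment_a, segment_b))
    print(preference_loss(segment_a, segment_b, label=1.0))
```

In a full system the per-step rewards would come from a trainable network, and this loss would be minimized over many human-labeled pairs before the learned reward is handed to the reinforcement learning algorithm.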

Appears in these articles

微调与对齐：让模型听指令、守规矩
Fine-Tuning and Alignment: Making Models Follow Instructions

Co-cited

These papers appear in the same article as this paper
