Training language models to follow instructions with human feedback

Authors: Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe (2022)

arXiv: 2203.02155

Domains

Alignment

TLDR (English)

InstructGPT introduces the three-stage RLHF pipeline: supervised fine-tuning (SFT), then reward model training, then reinforcement learning with PPO. The pipeline shifts language models from predicting the next token to following human intent. It is the direct blueprint for ChatGPT and established the dominant approach to LLM alignment.
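
The reward-model and PPO stages named above can be made slightly more concrete with a minimal Python sketch. It assumes a pairwise ranking loss for the reward model and a per-sample KL penalty against the SFT policy during the PPO stage, both common formulations in RLHF work. The function names, the scalar inputs, and the default value of beta are illustrative assumptions and do not come from this summary.

```python
import math


def pairwise_reward_loss(r_preferred: float, r_rejected: float) -> float:
    """Ranking loss for one human comparison: -log(sigmoid(r_preferred - r_rejected)).

    r_preferred and r_rejected are the reward model's scalar scores for the
    response the labeler preferred and the one they rejected.
    """
    diff = r_preferred - r_rejected
    # -log(sigmoid(d)) == log(1 + exp(-d)); branch to avoid overflow in exp().
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def kl_penalized_reward(
    reward: float,
    logp_policy: float,
    logp_sft: float,
    beta: float = 0.1,  # hypothetical coefficient, chosen only for illustration
) -> float:
    """Reward passed to the RL step: the reward-model score minus a penalty
    for drifting away from the SFT policy on this sampled response."""
    return reward - beta * (logp_policy - logp_sft)


if __name__ == "__main__":
    # The loss is small when the preferred response scores higher, large otherwise.
    print(pairwise_reward_loss(2.0, 0.0))
    print(pairwise_reward_loss(0.0, 2.0))
    # No penalty when the policy and the SFT model assign the same log-probability.
    print(kl_penalized_reward(1.0, logp_policy=-3.0, logp_sft=-3.0))
```

In this sketch the SFT stage is taken as given: it supplies `logp_sft`, the reference log-probability that keeps the PPO-trained policy from drifting too far while it maximizes the learned reward.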

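TLDR (translated from Chinese)

The InstructGPT paper proposes the three-stage RLHF training method (SFT → reward model → PPO reinforcement learning), which turns language models from "predicting the next word" into "answering questions according to human intent." It is the direct predecessor of ChatGPT and pioneered the mainstream approach to alignment.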
