Training language models to follow instructions with human feedback
arXiv: 2203.02155
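TLDR (Chinese, translated)
The InstructGPT paper proposes a three-stage RLHF training method (SFT → reward model → PPO reinforcement learning) that shifts language models from "predicting the next word" to "answering as humans intend." It is the direct predecessor of ChatGPT and pioneered what became the mainstream approach to alignment.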
TLDR (English)
InstructGPT introduces the three-stage RLHF pipeline (SFT, i.e. supervised fine-tuning → reward model → PPO) that shifts language models from "predicting the next token" to "following human intent." It is the direct blueprint for ChatGPT and established the dominant approach to LLM alignment.
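
To make stages two and three of the pipeline concrete, here is a minimal sketch of the two scalar quantities they revolve around: the pairwise preference loss used to train the reward model, and the KL-penalized reward that the PPO stage maximizes. The function names, the plain-float inputs, and the `beta` value in the usage lines are illustrative assumptions, not the paper's code; a real setup would compute these over batches of model outputs and add the paper's pretraining-mix term.

```python
import math


def pairwise_rm_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Reward-model loss for one human comparison.

    Computes -log(sigmoid(r_chosen - r_rejected)), which is small when the
    reward model scores the human-preferred response higher.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == softplus(-m); written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_shaped_reward(rm_score: float, logp_policy: float,
                     logp_sft: float, beta: float) -> float:
    """Reward seen by the RL stage for one sampled response.

    The reward-model score is reduced by beta * log(pi_RL / pi_SFT), which
    discourages the policy from drifting far from the SFT model.
    """
    return rm_score - beta * (logp_policy - logp_sft)


# Hypothetical numbers, for illustration only.
loss_agree = pairwise_rm_loss(reward_chosen=2.0, reward_rejected=0.0)
loss_disagree = pairwise_rm_loss(reward_chosen=0.0, reward_rejected=2.0)
shaped = kl_shaped_reward(rm_score=1.0, logp_policy=-1.0, logp_sft=-2.0, beta=0.1)
```

The loss is lower when the reward model agrees with the human ranking (`loss_agree < loss_disagree`), and the shaped reward drops below the raw reward-model score whenever the policy assigns a response more probability than the SFT model does.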