Training language models to follow instructions with human feedback
arXiv: 2203.02155
TLDR (English)
InstructGPT fine-tunes GPT-3 with a three-stage RLHF pipeline (supervised fine-tuning (SFT) → reward model → PPO) that shifts the model's objective from "predict the next token" to "follow human intent." This pipeline is the direct blueprint for ChatGPT and became the dominant approach to LLM alignment.
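TLDR (Chinese, translated)
The InstructGPT paper proposes a three-stage RLHF training method (SFT → reward model → PPO reinforcement learning) that turns a language model from "predicting the next word" into "answering questions according to human intent." It is the direct predecessor of ChatGPT and pioneered the mainstream route for alignment techniques.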
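The last two stages of the pipeline summarized above can be illustrated with a small sketch. It shows the pairwise ranking loss commonly used to train a reward model from human comparisons, and the KL-penalized reward typically optimized by PPO. The function names, the scalar inputs, and the value of `beta` are assumptions made for illustration; the sketch omits batching, the averaging over all comparison pairs per prompt, and the pretraining-mix term, so it is not the paper's implementation.

```python
import math


def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Reward-model loss for one comparison: -log(sigmoid(r_chosen - r_rejected)).

    r_chosen / r_rejected are the scalar scores the reward model assigns to
    the response a labeler preferred and the one they did not.
    """
    diff = r_chosen - r_rejected
    # -log(sigmoid(diff)) == log(1 + exp(-diff)); branch to avoid overflow.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def kl_penalized_reward(
    rm_score: float,
    logp_policy: float,
    logp_sft: float,
    beta: float = 0.1,  # illustrative value, not taken from the source
) -> float:
    """Reward passed to PPO: reward-model score minus a KL-style penalty.

    The penalty discourages the policy from drifting far from the SFT model:
    rm_score - beta * (log pi_policy(y|x) - log pi_sft(y|x)).
    """
    return rm_score - beta * (logp_policy - logp_sft)


if __name__ == "__main__":
    # The loss is small when the preferred response scores higher.
    print(pairwise_reward_loss(2.0, 0.0))
    print(pairwise_reward_loss(0.0, 2.0))
    # The reward shrinks as the policy assigns more probability than SFT did.
    print(kl_penalized_reward(1.0, logp_policy=-1.0, logp_sft=-2.0))
```

The two functions are kept as plain scalar arithmetic so the relationship between the stages is visible: stage 2 learns the scores, and stage 3 maximizes them while staying close to the stage 1 model.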