bai2022-hh
arXiv: 2204.05862
TLDR (English)
Anthropic's early RLHF paper. Its HH-RLHF dataset has since become the "MNIST" of open-source alignment research. It is the earliest systematic work on understanding the tension between helpfulness and harmlessness.
TLDR (中文)
Anthropic 早期 RLHF 论文,HH-RLHF 数据集自此成为开源对齐研究的"MNIST"。是理解 helpful vs harmless 张力的最早系统化工作。