Let's Verify Step by Step
arXiv: 2305.20050
Domains
TLDR (English)
Proposes process supervision: rewarding not just the final correct answer but also the correctness of each reasoning step. By training a verifier to evaluate each step, significantly outperforms outcome supervision (which only rewards the final result) on mathematical reasoning tasks.
TLDR(中文)
提出过程监督(Process Supervision)方法:不仅奖励最终正确答案,还奖励每一步推理的正确性。通过训练一个验证器来评估每个推理步骤,在数学推理任务上显著优于仅奖励最终结果的结果监督(Outcome Supervision)。
Related Papers
Other papers in the same domain