DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Authors: DeepSeek-AI (2025)

Domains

Reasoning, Alignment

TLDR (English)

DeepSeek-R1 shows that o1-like chain-of-thought reasoning can emerge from reinforcement learning alone, without a supervised fine-tuning warm-up, using GRPO (Group Relative Policy Optimization) rather than PPO. Its weights and training details are openly released, and it matches OpenAI o1 on multiple reasoning benchmarks, making it one of the most significant open-source LLM results of 2025.
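
The core idea behind GRPO, scoring each sampled answer relative to the other answers drawn for the same prompt instead of relying on a learned value model as PPO does, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `group_relative_advantages`, the `eps` term, the use of the population standard deviation, and the 0/1 reward values are all assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar reward per sampled answer to a single prompt.
    `eps` (an assumption of this sketch) avoids division by zero when every
    answer in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled answers:
# reward 1.0 for a correct final answer, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Answers that beat the group average receive a positive advantage and the rest a negative one, so the policy update needs no separate critic network.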

Appears in These Articles

Code Generation: How Models Write Programs
Prompt Engineering: The Art of Talking to Models
