Paper Library
95 carefully selected LLM papers, each with a bilingual TLDR.
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-R1 shows that o1-like chain-of-thought reasoning can emerge from reinforcement learning: the R1-Zero variant is trained with pure RL and no supervised fine-tuning warmup, while R1 itself adds a small cold-start SFT stage before RL. Both use GRPO instead of PPO. Released with open weights and a detailed training report, R1 matches OpenAI o1 on multiple reasoning benchmarks and is one of the most significant open-source LLM results of 2025.
- Model Context Protocol (MCP)
The Model Context Protocol (MCP) is an open standard proposed by Anthropic that defines how LLM applications communicate with external tools, data sources, and services. Through unified "resources/tools/prompts" interfaces, any MCP-compatible server can connect to any MCP-compatible client application, aiming to be the "USB standard" for AI tool use.
- The Llama 3 Herd of Models
Meta's Llama 3 technical report covering models from 8B to 405B parameters. Details data processing (15T tokens, multilingual), architecture improvements (GQA, extended RoPE), the post-training pipeline (SFT, rejection sampling, DPO), and experiments on integrating multimodal extensions. Llama 3 405B is one of the most capable open-weight LLMs available.
- Mixtral of Experts
Mixtral 8x7B is the first widely open-sourced MoE language model: 8 expert networks, each token routes to 2, so ~13B parameters are activated with 47B total. At inference cost similar to a 13B dense model, it matches or surpasses LLaMA 2 70B, proving MoE viability for open-source models.
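The "each token routes to 2 of 8 experts" idea can be sketched in a few lines. This is a minimal illustration, not Mixtral's implementation: `top2_route`, `moe_layer`, and the toy scalar experts are hypothetical names, and a real layer would compute the router logits from the token's hidden state with a learned linear map.

```python
import math

def top2_route(router_logits):
    """Pick the two highest-scoring experts and softmax-normalize their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:2]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(x, experts, router_logits):
    """Weighted sum of the two selected experts' outputs; other experts never run."""
    return sum(weight * experts[i](x) for i, weight in top2_route(router_logits))

# Toy setup (assumption): 8 scalar "experts", expert k multiplies its input by k.
experts = [lambda x, k=k: k * x for k in range(8)]
logits = [0.0, 0.0, 0.0, 2.0, 0.0, 2.0, 0.0, 0.0]  # experts 3 and 5 tie for the top
output = moe_layer(2.0, experts, logits)
```

Only two expert functions are evaluated per token, which is why the active parameter count is far below the total.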
- OpenAI o1 System Card
OpenAI o1's system card reveals the approach of training "slow thinking" models via large-scale reinforcement learning: the model performs extended internal reasoning chains before answering, dramatically outperforming GPT-4 on math competitions and coding. This marks a paradigm shift from "fast thinking" to "slow thinking" LLMs.
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
GQA (Grouped Query Attention) is a middle ground between MHA and MQA: query heads are divided into groups, and each group shares a single key/value head, significantly reducing KV cache memory while maintaining near-MHA quality. LLaMA 2 70B, LLaMA 3, Mistral, and other major models use GQA.
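The KV cache saving is easy to check with arithmetic. The sketch below assumes a hypothetical configuration (32 layers, 32 query heads, head dimension 128, 4096 cached tokens, 2 bytes per value); none of these numbers come from the paper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """KV cache size for one sequence; the factor 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

def kv_head_for_query(q_head, n_q_heads, n_kv_heads):
    """Index of the shared KV head that a given query head reads from."""
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

# Hypothetical model shape, for illustration only.
cfg = dict(n_layers=32, head_dim=128, seq_len=4096)
mha = kv_cache_bytes(n_kv_heads=32, **cfg)  # one KV head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **cfg)   # 4 query heads share each KV head
mqa = kv_cache_bytes(n_kv_heads=1, **cfg)   # all query heads share one KV head
```

With these assumed shapes, GQA with 8 KV heads needs a quarter of the MHA cache, and MQA needs 1/32 of it.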
- GPT-4 Technical Report
An industry report rather than a full paper, but the first to explicitly present "predictable scaling" as a product delivery commitment and to systematically disclose safety and red-teaming processes. A turning point from LLM as "research demo" to LLM as "infrastructure".
- GPT-4V(ision) System Card
First production-grade multimodal LLM safety/capability disclosure document. Unifies "image + text" into ChatGPT, key step before GPT-4o's end-to-end audio/image/video.
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model
DPO (Direct Preference Optimization) shows that the reward model + RL two-step in RLHF can be collapsed into a single supervised learning step: it directly optimizes language model parameters on preference data, and under the Bradley-Terry preference model its optimum coincides with the optimal policy of KL-regularized RLHF. DPO has become the dominant RLHF alternative in alignment research and the open-source community.
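The single supervised step is a logistic loss on log-probability ratios. The sketch below computes the per-pair DPO loss from summed sequence log-probabilities; the function name, argument names, and the example numbers are illustrative assumptions, and `beta=0.1` is only a commonly used default.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)) for one pair."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_ratio - rejected_ratio)
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: the loss is log(2), the starting point.
at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy raised the chosen response and lowered the rejected one: loss drops.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

No reward model or sampling loop appears anywhere: the frozen reference model's log-probabilities play the role of the KL anchor.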
- Alpaca: A Strong, Replicable Instruction-Following Model
Uses 52K self-instruct examples + LLaMA 7B to replicate GPT-3.5 style responses for under $600. Launched the open-source instruction tuning wave, the starting point of 2023's "llama wars".
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Tree of Thoughts (ToT) models problem solving as tree search: LLMs generate multiple "thought steps" as tree nodes, score them with an evaluator, and search with BFS/DFS. On tasks requiring complex planning (e.g., Game of 24), ToT massively outperforms CoT and is a precursor to o1-style slow thinking.
- Constitutional AI: Harmlessness from AI Feedback
Anthropic's Constitutional AI (CAI): use a set of explicit "constitution" principles to let the model self-critique and revise (SL-CAI phase), then use AI feedback instead of human feedback for RLHF (RLAIF phase). This reduces reliance on human annotation and is the core alignment technique behind the Claude model family.
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
FlashAttention uses IO-aware tiled computation to reduce attention memory from O(N²) to O(N) while still computing exact attention (no approximation), achieving 2-4x speedup. It fundamentally changed what's feasible for long-context training and is now an indispensable optimization in modern LLM training and inference.
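The reason tiling stays exact is the online-softmax recurrence: a running maximum and running denominator let each key/value block be folded in without ever holding the full score row. The sketch below shows that recurrence for a single query in plain Python. It illustrates the numerics only, not the GPU memory-hierarchy scheduling that gives FlashAttention its speed, and all function names are hypothetical.

```python
import math

def _scaled_scores(q, keys):
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]

def naive_attention(q, keys, values):
    """Reference: materialize every score, then softmax and average."""
    scores = _scaled_scores(q, keys)
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

def tiled_attention(q, keys, values, block_size=2):
    """Same result, visiting keys/values one block at a time."""
    dim = len(values[0])
    running_max, denom, acc = float("-inf"), 0.0, [0.0] * dim
    for start in range(0, len(keys), block_size):
        scores = _scaled_scores(q, keys[start:start + block_size])
        v_blk = values[start:start + block_size]
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)  # 0.0 on the first block
        weights = [math.exp(s - new_max) for s in scores]
        denom = denom * rescale + sum(weights)
        acc = [a * rescale + sum(w * v[d] for w, v in zip(weights, v_blk))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / denom for a in acc]
```

Per query, the tiled version keeps only `running_max`, `denom`, and `acc`, which is where the O(N) memory comes from.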
- Training Compute-Optimal Large Language Models
Proposes the Chinchilla scaling laws: given a fixed compute budget, model parameters and training tokens should scale equally (challenging the prior belief that parameters matter more). Chinchilla 70B outperformed Gopher 280B, redefining optimal LLM training strategy.
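A back-of-the-envelope version of the equal-scaling rule can be written down directly. The sketch below relies on two common approximations that are assumptions here rather than exact statements from the paper: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and a compute-optimal ratio of roughly 20 tokens per parameter.

```python
import math

def compute_optimal(flops, tokens_per_param=20.0):
    """Split a FLOP budget into (parameters, tokens) under C = 6*N*D and D = ratio*N."""
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Budget chosen to land on the Chinchilla setting (70B parameters, 1.4T tokens).
params, tokens = compute_optimal(5.88e23)

# Quadrupling compute doubles both parameters and tokens: they scale equally.
params_4x, tokens_4x = compute_optimal(4 * 5.88e23)
```

Under these assumptions both quantities grow with the square root of compute, which is the sense in which parameters and tokens scale in equal proportion.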
- Training language models to follow instructions with human feedback
InstructGPT introduces the three-stage RLHF pipeline (SFT → reward model → PPO) that transforms language models from "predict next token" to "follow human intent." This is the direct blueprint for ChatGPT and established the dominant approach to LLM alignment.
- Self-Consistency Improves Chain of Thought Reasoning in Language Models
Self-Consistency is a key improvement to CoT: instead of greedily decoding a single reasoning chain, sample multiple diverse reasoning paths and take the most frequent answer (majority vote). This simple trick improves accuracy by roughly 4-18 percentage points across arithmetic and commonsense reasoning benchmarks (for example, +17.9 on GSM8K).
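The voting step is small enough to show in full. In this sketch, `sample_fn` is a hypothetical callable that runs one sampled chain of thought (temperature above zero) and returns only the extracted final answer; the stub below stands in for a real model.

```python
from collections import Counter

def self_consistent_answer(sample_fn, question, n_samples=5):
    """Sample several reasoning paths and return the majority answer with its vote share."""
    answers = [sample_fn(question) for _ in range(n_samples)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n_samples

# Stub sampler (assumption): replays canned final answers in place of an LLM.
canned = iter(["18", "26", "18", "18", "20"])
answer, share = self_consistent_answer(lambda q: next(canned), "toy question")
```

The vote share doubles as a rough confidence signal: a low share suggests the sampled chains disagree.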
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Introduces chain-of-thought prompting: including worked intermediate reasoning steps in few-shot exemplars dramatically improves LLM performance on math, symbolic, and commonsense reasoning tasks, with gains appearing mainly in sufficiently large models. With only eight exemplars, a 540B model reached state-of-the-art accuracy on GSM8K at the time.
- ReAct: Synergizing Reasoning and Acting in Language Models
ReAct interleaves reasoning and acting: the LLM thinks (Thought), executes a tool call (Action), observes the result (Observation), and repeats the cycle. This is the prototype for modern AI agents, directly influencing LangChain, AutoGPT, and similar frameworks.
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Switch Transformer is the first architecture to scale Transformers to trillion parameters in practice. Using a simplified Mixture-of-Experts (MoE) design that routes each token to a single expert, it activates only a small fraction of parameters per token ("sparse activation") and outperforms dense models at the same compute. Mixtral uses a related sparse MoE design, and GPT-4 is widely rumored to as well.
- LoRA: Low-Rank Adaptation of Large Language Models
LoRA freezes the pretrained weights and trains only a pair of low-rank matrices whose product is added to each adapted weight matrix (rank r much smaller than the original dimensions), reducing trainable parameters by up to 10,000x. This makes fine-tuning large models on consumer GPUs feasible and has become the dominant parameter-efficient fine-tuning (PEFT) method.
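A tiny pure-Python sketch of the low-rank update follows. The 4096×4096 shape, the rank of 8, and the helper names are illustrative assumptions. The per-matrix reduction shown here (256x) is much smaller than the 10,000x headline figure, which refers to a whole-model configuration.

```python
def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x); W stays frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))  # B is d_out x r, A is r x d_in
    return [b + (alpha / r) * u for b, u in zip(base, update)]

def trainable_params(d_out, d_in, r):
    """Parameter counts for full fine-tuning vs. LoRA on a single weight matrix."""
    return d_out * d_in, r * (d_out + d_in)

full, lora = trainable_params(4096, 4096, r=8)  # assumed shape, for illustration

# B starts at zero, so the adapted layer initially matches the frozen layer.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]  # rank r = 1
y_init = lora_forward(W, A, [[0.0], [0.0]], [1.0, 2.0], alpha=1.0, r=1)
y_tuned = lora_forward(W, A, [[1.0], [2.0]], [1.0, 2.0], alpha=1.0, r=1)
```

After training, the product B·A can be merged into W, so inference costs the same as the original layer.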
- RoFormer: Enhanced Transformer with Rotary Position Embedding
RoPE (Rotary Position Embedding) is the position encoding scheme used in most major LLMs today (LLaMA, Mistral, Qwen, etc.). By applying position-dependent rotations to queries and keys in the attention computation, it elegantly encodes relative positions and, combined with interpolation methods such as YaRN, extends to longer context lengths more gracefully than learned absolute position embeddings.
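The relative-position property can be verified numerically: after rotation, the query-key dot product depends only on the distance between the two positions. The sketch below uses the interleaved-pair convention and the usual base of 10000; some implementations instead pair the first and second halves of the vector, and the function names are hypothetical.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair (vec[i], vec[i+1]) by a position-dependent angle."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# Both position pairs are 2 apart, so the attention scores agree.
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 12), rope(k, 10))
```

Because no position vector is added to the embeddings, position information enters only through these rotations inside attention.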
- Language Models are Few-Shot Learners
OpenAI's GPT-3 paper demonstrates that a 175B parameter language model can perform diverse tasks through few-shot in-context learning without fine-tuning. It established the paradigm that scale unlocks emergent capabilities and launched the era of prompt engineering.
- Scaling Laws for Neural Language Models
OpenAI's scaling laws paper finds that language model performance (cross-entropy loss) follows power laws with model parameters, dataset size, and compute. This enables predicting large-scale training results from small experiments and provided the theoretical basis for the LLM scale-up race, directly leading to GPT-3.
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
RAG (Retrieval-Augmented Generation) combines pretrained LMs with information retrieval: for each query, retrieve relevant documents from a knowledge base, then generate answers with the documents in context. This addresses LLM knowledge staleness and hallucination, and is now a core architecture in enterprise AI applications.
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
T5 unifies all NLP tasks into a "text-to-text" format (e.g., classification also outputs label text rather than class IDs) and systematically explores how dataset, architecture, pretraining objectives, and scale affect transfer learning. This unified paradigm became a key inspiration for instruction tuning and instruction-following models.
- Language Models are Unsupervised Multitask Learners
GPT-2 shows that a 1.5B parameter language model trained only on unlabeled web text can perform various language tasks zero-shot without fine-tuning. This challenged the convention that NLP tasks require task-specific training, and GPT-2 famously became the first high-profile AI model to get a "staged release" due to misuse concerns.
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT uses masked language modeling (MLM) and next sentence prediction to pretrain a bidirectional Transformer on large text corpora, then fine-tunes for downstream tasks. It set a new state of the art on 11 NLP tasks at once, establishing the "pretrain+finetune" paradigm that dominates modern NLP.
- Deep contextualized word representations
ELMo introduced contextualized word embeddings: the same word has different vector representations in different contexts (e.g., "bank" in financial vs. riverbank contexts). Using bidirectional LSTMs, ELMo set new SOTA on multiple NLP tasks and laid the conceptual foundation for BERT and subsequent pretrained models.
- Improving Language Understanding by Generative Pre-Training (GPT-1)
OpenAI's first proposal of decoder-only + autoregressive pre-training + task fine-tuning, establishing the foundation for GPT-2/3/4 series. Less popular than BERT initially, but proven to be the winning direction years later.
- Deep Reinforcement Learning from Human Preferences
The foundational RLHF paper. The authors show that training a reward model from human pairwise preferences, then using it to guide reinforcement learning, enables agents to learn complex behaviors that are difficult to specify with explicit reward functions. This framework was directly adopted by InstructGPT/ChatGPT.
- Attention Is All You Need
The foundational paper that introduced the Transformer architecture. The authors replaced RNNs and CNNs entirely with attention mechanisms, proposing multi-head self-attention and positional encoding. It dramatically outperformed prior models on machine translation. Every major LLM today is built on this architecture.
- Neural Machine Translation by Jointly Learning to Align and Translate
The seminal attention mechanism paper (pre-Transformer). The authors found that seq2seq's fixed-length bottleneck vector limited translation quality, and proposed letting the decoder dynamically attend to all encoder hidden states when generating each word. This idea directly evolved into Transformer self-attention.
- Sequence to Sequence Learning with Neural Networks
The foundational seq2seq (encoder-decoder) architecture paper. Using two LSTMs in a compress-then-generate structure, it enabled neural networks to perform variable-length sequence-to-sequence transformations for the first time, achieving breakthroughs in machine translation and directly inspiring the Transformer's encoder-decoder design.
- Efficient Estimation of Word Representations in Vector Space
Word2Vec popularized word embeddings: training shallow neural networks on large text corpora so that semantically similar words cluster in vector space. The famous "king - man + woman ≈ queen" analogy demonstrated its power, laying the foundation for embedding layers in all subsequent language models.
- abdin2024-phi3
3.8B Phi-3-mini approaches GPT-3.5 on multiple benchmarks; continues to validate "high-quality synthesis + curriculum data" Phi recipe. Representative work for edge/local large models.
- ai2024-yi
Full-stack technical report from Kai-Fu Lee's 01.AI, emphasizing "small but strong + data quality". Yi-34B was long a first-tier open-source Chinese-English LLM and an early representative of 200K-context open-source models.
- azar2023-ipo
Unifies RLHF/DPO with Ψ-PO framework, points out DPO overfits under BT assumption; proposes more robust IPO loss. Theoretical must-read for understanding "why DPO doesn't always work"; see also KTO, SimPO.
- alayrac2022-flamingo
Uses Perceiver Resampler to connect image features to frozen LLM for few-shot visual QA. Ancestor of mainstream "plug-in multimodal" approach (LLaVA, IDEFICS, etc.).
- bai2022-hh
Anthropic's early RLHF paper; its HH-RLHF dataset has since become the "MNIST" of open-source alignment research. Among the earliest systematic studies of the helpful vs harmless tension.
- bai2023-qwen
Alibaba Qwen's first complete technical report, covering 1.8B-72B full range, emphasizing Chinese-English bilingual + tokenizer friendliness. Representative foundation for Chinese open-source LLM series; subsequent Qwen2/2.5 are 2024-2025 open-source SOTA.
- borgeaud2022-retro
DeepMind introduces chunked retrieval during pre-training, making 7B model match 175B GPT-3. Proves retrieval isn't just RAG inference trick, but another possible pre-training paradigm.
- chen2021-humaneval
Proposes the Codex model + HumanEval benchmark (164 programming problems). HumanEval remains the routine health-check metric for coding models today; this paper is also the root of GitHub Copilot.
- chen2023-longlora
Uses shifted sparse attention + LoRA to extend 7B model to 100K context with just one 8xA100 machine. Engineering benchmark for long-context fine-tuning; see also YaRN, PoSE.
- chen2023-spec-sampling
DeepMind's concurrent independent proposal of speculative sampling, theoretically proving acceleration while preserving sampling distribution. Together with Leviathan sets the direction; see also later Medusa, EAGLE.
- chowdhery2022-palm
Google's 540B model demonstrating "emergent" behaviors from larger scale (multi-step reasoning, joke explanation), first large-scale use of Pathways system. An independent engineering path after GPT-3.
- clark2020-electra
Uses replaced token detection instead of MLM, allowing small models to achieve BERT-large level performance. Representative work on "pre-training objective determines sample efficiency".
- dao2023-flashattention2
Uses better parallelism and work partitioning across thread blocks and warps to roughly double FlashAttention performance. Today inference engines and training frameworks such as vLLM, SGLang, and Megatron build on FA-2.
- deepseek2024-v2
Introduces Multi-head Latent Attention (MLA), which cuts the KV cache by 93.3% relative to DeepSeek 67B, letting a 236B MoE (21B activated) undercut same-tier closed-source models on inference price. MLA is the core source of V3/R1 inference cost-effectiveness.
- deepseek2024-v3
671B parameters (37B activated) MoE, 14.8T token training; first large-scale production LLM to run FP8 training + Multi-Token Prediction, compressing training cost to $5.6M. Shook entire industry.
- dettmers2022-llmint8
Reveals "emergent outliers" in large model activations and proposes a mixed-precision decomposition that keeps outlier dimensions in 16-bit. Core work behind the bitsandbytes library, roughly halving the inference memory of 175B-scale models without degrading accuracy.
- dettmers2023-qlora
4-bit NF4 + LoRA + paged optimizers enable fine-tuning a 65B model on a single 48GB GPU. Widely used across the open-source community for fine-tuning LLaMA-2/3 and Qwen.
- du2021-glam
A 1.2T parameter MoE that matches or exceeds GPT-3 quality with about 1/3 of the training energy and half the inference FLOPs, an early representative of MoE "cost-effectiveness wins". Mixtral/DeepSeek-V2/V3 are its spiritual descendants.
- frantar2022-gptq
First to quantize a 175B model to 3-4 bits in a few GPU hours with almost no accuracy loss, enough to run it on a single high-memory GPU. Lowered the LLM inference hardware barrier and helped popularize "run open-source LLMs locally" on consumer GPUs for smaller models.
- gao2022-hyde
Makes the LLM "pretend" to generate an answer first, then uses its embedding to retrieve real documents. Zero-shot with strong generalization; one of the most reused retrieval enhancement tricks of the RAG era.
- gemini2023-team
Google's multimodal model family (Ultra/Pro/Nano), representative of "natively multimodal" narrative. 1.5 series later pushes context to 1M-10M tokens, benchmark for long-context industrial deployment.
- gunasekar2023-phi1
Microsoft trains a 1.3B model on about 7B tokens of "textbook-quality" data (filtered web code plus synthetic textbooks and exercises), approaching GPT-3.5 on HumanEval. Takes the "data quality >> data scale" story to the extreme, launching the Phi series.
- hendrycks2020-mmlu
57 subjects with 14K exam questions, since then "grinding MMLU" became de facto standard for measuring LLM general capability. Still first-line metric in model cards even in 2025; see also later MMLU-Pro.
- howard2018-ulmfit
First paper to explicitly propose the "universal language model pre-training → task fine-tuning" pipeline, with key tricks like discriminative LR and slanted triangular schedule. Together with ELMo, represents "the last mile before BERT".
- jiang2023-mistral7b
Uses GQA + sliding window attention to make 7B model outperform LLaMA-2 13B; first enters scene with "Apache 2.0 + direct weight release" stance. Leads European open-source LLM force.
- jimenez2024-swebench
Uses 12 real Python repos with 2294 issues to evaluate code models' "end-to-end bug solving" capability. Overnight became coding agent industry standard benchmark; almost every coding agent paper reports SWE-bench scores.
- kalchbrenner2016-bytenet
Uses dilated convolutions for seq2seq, liberating sequence modeling from "must use RNN sequential computation". Together with ConvS2S, represents the strongest attempt at parallel sequence modeling before Transformer.
- karpukhin2020-dpr
Dual-tower BERT + in-batch negatives trains the first widely adopted dense retriever, clearly beating BM25 on open-domain QA retrieval. Today's vector search (FAISS, pgvector) engineering paradigm solidified here.
- kim2014-textcnn
Uses CNN with pre-trained word vectors for text classification, proving "pre-trained embedding + simple architecture" beats hand-crafted features. An early sign of pre-training paradigm entering NLP.
- kojima2022-zeroshot-cot
A single phrase, "Let's think step by step", boosts MultiArith accuracy from ~17% to ~78%. CoT capability is inherent in models and merely triggered by prompts, a discovery that shocked the entire community.
- kwon2023-vllm
Brings the OS concept of paged virtual memory to the KV cache, nearly eliminating memory waste from fragmentation and raising throughput 2-4x. vLLM thereby became a de facto standard open-source inference engine; compute foundation for the MCP/Agent era.
- lee2023-rlaif
Google systematically proves RLAIF can match RLHF on various tasks, providing engineering evidence for "AI feedback replacing human" as scalable alignment solution.
- leviathan2023-spec-decoding
Uses a small draft model to propose multiple tokens that the large model verifies in one pass, achieving a 2-3x speedup with an identical output distribution. Now a standard technique in major inference engines (vLLM, TensorRT-LLM).
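Why verification preserves the large model's distribution comes down to one accept/reject rule per drafted token. The sketch below handles a single position: `p` is the target model's distribution, `q` the draft model's, and `token` was sampled from `q`. The function names and toy distributions are assumptions for illustration.

```python
import random

def residual_distribution(p, q):
    """Normalized max(0, p - q): where to resample from after a rejection."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    return [r / total for r in raw]

def verify_draft_token(token, p, q, rng=random):
    """Accept the drafted token with probability min(1, p/q), else resample."""
    if rng.random() < min(1.0, p[token] / q[token]):
        return token
    weights = residual_distribution(p, q)
    return rng.choices(range(len(p)), weights=weights, k=1)[0]

def empirical_output(p, q, n_trials, seed=0):
    """Draft from q, verify against p, and tally the final tokens."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    for _ in range(n_trials):
        drafted = rng.choices(range(len(q)), weights=q, k=1)[0]
        counts[verify_draft_token(drafted, p, q, rng)] += 1
    return [c / n_trials for c in counts]

target = [0.6, 0.3, 0.1]  # toy target-model distribution (assumption)
draft = [0.2, 0.5, 0.3]   # toy draft-model distribution (assumption)
observed = empirical_output(target, draft, n_trials=20000)
```

The tallied frequencies track `target` rather than `draft`, which is the sense in which the speedup is lossless.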
- liang2022-helm
Stanford CRFM systematically evaluates 30+ LLMs × multidimensional metrics (accuracy, robustness, fairness, efficiency...), establishing "evaluation science". Representative work against "only looking at average scores".
- lin2023-awq
Discovers "few critical weights correspond to large activations", applies per-channel scaling by importance. More robust and faster than GPTQ at 4-bit, one of mainstream INT4 deployment solutions today.
- liu2019-roberta
Uses more data, longer training, removes NSP to prove BERT was far from fully trained. Important not just for stronger model, but for first clearly demonstrating that "training recipe" itself is a core research question.
- liu2023-llava
CLIP vision encoder + LLaMA + GPT-4 synthesized multimodal instruction data creates first open-source GPT-4V style model with minimal compute. Starting point for open-source multimodal ecosystem (LLaVA-1.5/1.6, Qwen-VL, InternVL).
- luong2015-attention
Systematically compares global vs local attention and different scoring functions (dot/general/concat). The most commonly cited engineering reference when explaining "how attention scores are computed".
- mikolov2013-skipgram-negsampling
The NeurIPS version of word2vec, introducing Negative Sampling, Hierarchical Softmax, and phrase-level vectors. Influenced all subsequent embedding training objectives including GloVe, fastText, and modern LLM embedding layers.
- peng2023-yarn
Applies NTK-aware interpolation + temperature correction on RoPE, extending context to 64K-128K with minimal training. Most open-source models today use YaRN or variants for length extension.
- perez2022-redteaming
DeepMind uses one LLM to automatically generate attack prompts for red-teaming another LLM, turning red-teaming into an engineering process. Safety/jailbreak research has since shifted from "manual prompt search" to an automated paradigm.
- press2021-alibi
Converts position information into linear bias on attention, enabling extrapolation to several times training length with zero parameters. Representative early long-context solution, competing with RoPE as two alternative approaches.
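The linear bias is simple to construct. The sketch below builds the per-head slopes using the geometric sequence described for power-of-two head counts, together with a causal bias matrix that is added to the attention scores before softmax. The function names are hypothetical.

```python
def alibi_slopes(n_heads):
    """Geometric slope sequence, e.g. 1/2, 1/4, ..., 1/256 for 8 heads."""
    return [2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)]

def alibi_bias(seq_len, slope):
    """bias[i][j] = -slope * (i - j) for visible keys j <= i, -inf for future keys."""
    return [[-slope * (i - j) if j <= i else float("-inf")
             for j in range(seq_len)]
            for i in range(seq_len)]

slopes = alibi_slopes(8)
bias = alibi_bias(3, slopes[0])  # first head, slope 0.5
```

The penalty grows linearly with distance and has no learned parameters, so the same rule applies unchanged to sequence lengths never seen in training.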
- qwen2024-qwen25
18T token pre-training, 0.5B-72B full suite + specialized Coder/Math sub-families; one of most stable open-source Chinese-English LLMs of 2024-2025. Long-term top in Hugging Face downloads/fine-tuning.
- radford2021-clip
Uses 400M image-text pairs for contrastive learning to obtain a universal vision encoder and a matching text encoder. CLIP encoders remain a building block of many multimodal systems (DALL·E, Stable Diffusion, LLaVA) today.
- schick2023-toolformer
Makes the model generate "API-calling tokens" itself and evaluate their usefulness through self-supervision. Foundational paper for the function-calling/tool-use training paradigm and a precursor to function calling in production LLM APIs.
- shah2024-flashattention3
Leverages the H100's asynchronous TMA and FP8 support to push attention close to 1.2 PFLOPs/s while maintaining numerical precision. Key dependency for long-context + FP8 training on the Hopper architecture.
- shazeer2019-mqa
Proposes Multi-Query Attention: all heads share the same K/V, reducing KV cache usage to 1/h. All modern KV cache optimization and long-context inference stories start from this 5-page paper.
- shinn2023-reflexion
Makes agent do natural language "post-mortem" after failure, injecting reflection into next round's prompt. "Gradient-free self-improvement" approach widely reused in coding agents, SWE-agent.
- snell2024-test-time-compute
Systematically presents scaling laws for "spending more compute at inference time": with fixed budget, adding inference-time search to small models often more cost-effective than training larger models. Theoretical foundation for o1/R1 era.
- stiennon2020-summarize
One of OpenAI's first large-scale applications of RLHF to language models (summarization), showing that RLHF-trained summaries are consistently preferred by humans over SFT baselines and reference summaries. Direct predecessor to InstructGPT.
- touvron2023-llama
Meta implements "small but precise + massive tokens" Chinchilla recipe and opens weights. LLaMA 1 directly catalyzed open-source LLM explosion (Alpaca/Vicuna/Mistral/Qwen all benefited).
- touvron2023-llama2
First commercially licensed high-quality open-source chat model, publicly shares RLHF recipe (PPO + GAtt). Directly advances open-source ecosystem to "near ChatGPT experience" stage.
- wang2022-self-instruct
Uses GPT-3 to generate instruction-output data and fine-tunes GPT-3 on it (self-distillation). Stanford Alpaca builds directly on this, opening the synthetic data era of "use large models to generate data for training small models".
- wei2023-jailbroken
Systematically classifies jailbreak methods by two failure modes (competing objectives and mismatched generalization) and explains why RLHF struggles to eradicate them. Reference "taxonomy" for jailbreak research.
- xiao2022-smoothquant
Moves activation outliers to weights through equivalent mathematical transformation, making INT8 inference feasible. Key engineering discovery enabling GPU FP8/INT8 deployment.
- yang2019-xlnet
Proposes Permutation LM to merge benefits of AR and AE, combined with Transformer-XL for long sequences. Shows "pre-training objective" is still an open question, most imaginative alternative after BERT.
- yang2024-sweagent
Proposes ACI (Agent-Computer Interface) concept, emphasizing "what tools/interface agent uses ≥ what model used". GPT-4 + good ACI improves SWE-bench 6x, establishing coding agent engineering methodology.
- zeng2022-glm130b
Tsinghua + Zhipu's open Chinese-English bilingual 130B model, earliest representative technical report of Chinese LLM industrialization. Subsequent ChatGLM-6B/9B pushed open-source Chinese dialogue to mass adoption.
- zhou2022-least-to-most
"Break hard problems into easy ones, solve sequentially" is another reasoning paradigm parallel to CoT, especially effective for compositional generalization. Together with CoT/ToT forms the trio of "how to guide LLM step-by-step thinking".
- zheng2023-mtbench
Proposes GPT-4-as-judge + human preference crowdsourcing (Chatbot Arena) for evaluating dialogue capability. MT-Bench and Arena Elo remain the community's de facto dual standards for comparing models' "dialogue capability" today.
- zou2023-universal-attack
Uses GCG algorithm to find gibberish suffix that breaks through aligned LLaMA-2/Vicuna, with attacks transferring across multiple closed-source models. Shocked entire security community, making "alignment fragility" mainstream topic.