Paper Library


DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning — DeepSeek-AI (2025)

Reasoning, Alignment

Prerequisites: Let's Verify Step by Step

DeepSeek-R1 shows that o1-like chain-of-thought reasoning can emerge from large-scale reinforcement learning, using GRPO instead of PPO: the R1-Zero variant is trained with RL alone (no supervised fine-tuning warmup), while R1 itself adds a small cold-start SFT stage before RL. The weights and training details are openly released, and the model matches OpenAI o1 on multiple reasoning benchmarks, making it one of the most significant open-source LLM results of 2025.
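
A minimal sketch of the group-relative advantage that gives GRPO its name, assuming the commonly described form in which each sampled response's reward is normalized by the mean and standard deviation of its own group. The function name and the epsilon term are illustrative, not taken from the DeepSeek code.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one prompt, scored 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline comes from the group itself, this form needs no separate value network, which is the main practical difference from PPO.
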
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone — Microsoft (2024)

Pretraining

Prerequisites: Textbooks Are All You Need

The 3.8B-parameter Phi-3-mini approaches GPT-3.5 on multiple benchmarks and continues to validate the Phi recipe of high-quality synthetic data plus a data curriculum. Representative work for edge/local language models.
Yi: Open Foundation Models by 01.AI — 01.AI et al. (2024)

Pretraining

Prerequisites: LLaMA: Open and Efficient Foundation Language Models

Full-stack technical report from Kai-Fu Lee's 01.AI, emphasizing "small but strong" models and data quality. Yi-34B was long a first-tier open-source Chinese-English LLM and an early representative of open-source models with 200K context.
Model Context Protocol (MCP) — Anthropic (2024)

Applications

Prerequisites: Toolformer: Language Models Can Teach Themselves to Use Tools

The Model Context Protocol (MCP) is an open standard proposed by Anthropic that standardizes how LLM applications communicate with external tools, data sources, and services. Through unified "resources/tools/prompts" interfaces, any MCP-compatible tool can connect to any MCP-compatible model, aiming to be the "USB standard" for AI tool use.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model — DeepSeek-AI (2024)

Architecture, Inference

Prerequisites: Mixtral of Experts

Introduces Multi-head Latent Attention (MLA), which cuts the KV cache by 93.3% relative to DeepSeek 67B, so that inference for the 236B MoE model is priced far below same-tier closed-source models. MLA is the core source of V3/R1 inference cost-effectiveness.
DeepSeek-V3 Technical Report — DeepSeek-AI (2024)

Architecture, Mixture of Experts

Prerequisites: DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

A 671B-parameter MoE (37B activated per token) trained on 14.8T tokens; among the first large-scale production LLMs to combine FP8 training with Multi-Token Prediction, with a reported training cost of about $5.6M. Shook the entire industry.
The Llama 3 Herd of Models — Meta AI (2024)

Pretraining

Prerequisites: Llama 2: Open Foundation and Fine-Tuned Chat Models

Meta's Llama 3 technical report covering models from 8B to 405B parameters. Details data processing (15T tokens, multilingual), architecture improvements (GQA, extended RoPE), the post-training pipeline (SFT + rejection sampling + DPO), and the integration of multimodal extensions. Llama 3 405B is one of the most capable open-source LLMs available.
KTO: Model Alignment as Prospect Theoretic Optimization — Kawin Ethayarajh et al. (2024)

Alignment

Prerequisites: Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Proposes KTO (Kahneman-Tversky Optimization), which aligns models using only binary feedback (good/bad) without requiring paired preference data like DPO. Introduces prospect theory into alignment optimization, proving that knowing whether a single output is desirable is sufficient to learn human preferences.
Mixtral of Experts — Albert Q. Jiang et al. (2024)

Mixture of Experts, Architecture

Prerequisites: Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

Mixtral 8x7B is the first widely open-sourced MoE language model: 8 expert networks, each token routes to 2, so ~13B parameters are activated with 47B total. At inference cost similar to a 13B dense model, it matches or surpasses LLaMA 2 70B, proving MoE viability for open-source models.
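
The "each token routes to 2 of 8 experts" idea can be sketched as below. This is a toy scalar version under stated assumptions: the experts are stand-in functions, the router logits are given rather than computed by a learned gate, and the helper names are hypothetical.

```python
import math

def top2_route(router_logits):
    """Pick the two highest-scoring experts and softmax over just those two."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:2]
    exps = [math.exp(router_logits[i] - router_logits[chosen[0]]) for i in chosen]
    total = sum(exps)
    return {i: e / total for i, e in zip(chosen, exps)}

def moe_layer(x, experts, router_logits):
    """Weighted sum of the outputs of the two selected experts only."""
    weights = top2_route(router_logits)
    return sum(w * experts[i](x) for i, w in weights.items())

# Eight toy "experts": expert i simply scales its input by (i + 1).
experts = [lambda x, k=i + 1: k * x for i in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]
y = moe_layer(1.0, experts, logits)
```

Only two of the eight expert functions are ever called per token, which is why the active parameter count is far below the total.
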
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? — Carlos E. Jimenez et al. (2024)

Evaluation, Applications

Prerequisites: Evaluating Large Language Models Trained on Code

Uses 2,294 issues from 12 real Python repositories to evaluate code models' end-to-end bug-fixing capability. It quickly became the standard benchmark for coding agents; almost every coding agent paper reports SWE-bench scores.
OpenAI o1 System Card — OpenAI (2024)

Reasoning, Safety

Prerequisites: GPT-4 Technical Report

OpenAI o1's system card reveals the approach of training "slow thinking" models via large-scale reinforcement learning: the model performs extended internal reasoning chains before answering, dramatically outperforming GPT-4 on math competitions and coding. This marks a paradigm shift from "fast thinking" to "slow thinking" LLMs.
Qwen2.5 Technical Report — Qwen et al. (2024)

Pretraining

Prerequisites: Qwen Technical Report

Pre-trained on 18T tokens, with a full 0.5B-72B suite plus specialized Coder/Math sub-families; one of the most stable open-source Chinese-English LLMs of 2024-2025 and long among the top models on Hugging Face by downloads and fine-tunes.
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision — Jay Shah et al. (2024)

Inference

Prerequisites: FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Leverages the H100's asynchronous TMA and FP8 support to push attention to about 1.2 PFLOPS while maintaining numerical accuracy. A key dependency for long-context and FP8 training on the Hopper architecture.
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters — Charlie Snell et al. (2024)

Reasoning, Inference

Systematically studies how performance scales when "spending more compute at inference time": with a fixed budget, adding inference-time search to a small model is often more cost-effective than training a larger model. A theoretical foundation for the o1/R1 era.
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering — John Yang et al. (2024)

Applications

Prerequisites: SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Proposes the Agent-Computer Interface (ACI) concept, emphasizing that the tools and interface an agent uses matter at least as much as the underlying model. GPT-4 with a well-designed ACI improves SWE-bench results 6x, establishing an engineering methodology for coding agents.
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints — Joshua Ainslie et al. (2023)

Inference, Architecture

Prerequisites: Fast Transformer Decoding: One Write-Head is All You Need

GQA (Grouped Query Attention) is a middle ground between MHA and MQA: grouping KV heads so multiple query heads share the same KV, significantly reducing KV cache memory while maintaining near-MHA quality. LLaMA 2/3, Mistral, and other major models all use GQA.
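
A small sketch of how query heads share KV heads, and of the resulting KV cache saving. The head counts, layer count, and sequence length are hypothetical values chosen for illustration, and the helper names are not from any library.

```python
def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Map a query head to the KV head that its group shares."""
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size

def kv_cache_bytes(num_kv_heads, head_dim, seq_len, num_layers, bytes_per_value=2):
    """Keys plus values for every layer and position (hence the factor 2)."""
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * bytes_per_value

# Hypothetical configuration: 32 query heads sharing 8 KV heads.
mapping = [kv_head_for_query_head(h, 32, 8) for h in range(32)]
mha = kv_cache_bytes(num_kv_heads=32, head_dim=128, seq_len=4096, num_layers=32)
gqa = kv_cache_bytes(num_kv_heads=8, head_dim=128, seq_len=4096, num_layers=32)
```

Setting the number of KV heads equal to the number of query heads recovers MHA, and setting it to 1 recovers MQA.
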
A General Theoretical Paradigm to Understand Learning from Human Preferences — Mohammad Gheshlaghi Azar et al. (2023)

Alignment

Prerequisites: Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Unifies RLHF and DPO under the Ψ-PO framework, points out that DPO can overfit under the Bradley-Terry (BT) assumption, and proposes the more robust IPO loss. A theoretical must-read for understanding "why DPO doesn't always work"; see also KTO, SimPO.
Qwen Technical Report — Jinze Bai et al. (2023)

Pretraining

Prerequisites: LLaMA: Open and Efficient Foundation Language Models

Alibaba Qwen's first complete technical report, covering 1.8B-72B full range, emphasizing Chinese-English bilingual + tokenizer friendliness. Representative foundation for Chinese open-source LLM series; subsequent Qwen2/2.5 are 2024-2025 open-source SOTA.
LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models — Yukang Chen et al. (2023)

Long Context, Alignment

Prerequisites: LoRA: Low-Rank Adaptation of Large Language Models

Uses shifted sparse attention + LoRA to extend a 7B model to 100K context on a single 8xA100 machine. An engineering benchmark for long-context fine-tuning; see also YaRN, PoSE.
Accelerating Large Language Model Decoding with Speculative Sampling — Charlie Chen et al. (2023)

Inference

Prerequisites: Fast Inference from Transformers via Speculative Decoding

DeepMind's concurrent, independent proposal of speculative sampling, proving that the acceleration preserves the target model's sampling distribution. Together with Leviathan et al., it set the direction; see also the later Medusa and EAGLE.
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning — Tri Dao (2023)

Inference

Prerequisites: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Uses more aggressive warp-level parallelism and work partitioning to roughly double FlashAttention performance. Today the vLLM and SGLang inference engines and Megatron training backends have all upgraded to FA-2.
QLoRA: Efficient Finetuning of Quantized LLMs — Tim Dettmers et al. (2023)

Alignment, Inference

Prerequisites: LoRA: Low-Rank Adaptation of Large Language Models, LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

4-bit NF4 quantization + LoRA + paged optimizers enable SFT of a 65B model on a single 48GB GPU. Much of the open-source community's fine-tuning of LLaMA-2/3 and Qwen uses this approach.
Gemini: A Family of Highly Capable Multimodal Models — Gemini Team et al. (2023)

Pretraining, Multimodal

Prerequisites: Learning Transferable Visual Models From Natural Language Supervision

Google's multimodal model family (Ultra/Pro/Nano), representative of the "natively multimodal" narrative. The later 1.5 series pushes context to 1M-10M tokens, a benchmark for long-context industrial deployment.
Not What You Have Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection — Kai Greshake et al. (2023)

Safety

Prerequisites: Jailbroken: How Does LLM Safety Training Fail?

Reveals indirect prompt injection attacks: adversaries control external data processed by LLM applications (web pages, emails, documents) to inject malicious instructions and hijack application behavior. Demonstrates attacks on Bing Chat, GitHub Copilot, and other real applications.
Textbooks Are All You Need — Suriya Gunasekar et al. (2023)

Pretraining

Prerequisites: Training Compute-Optimal Large Language Models

Microsoft trains a 1.3B model (phi-1) on about 7B tokens of "textbook-quality" filtered and synthetic data, approaching GPT-3.5 on HumanEval. Takes the "data quality >> data scale" story to the extreme and launches the Phi series.
Stop Uploading Test Data in Plain Text: New Protocols for Dataset Release — Alon Jacovi et al. (2023)

Evaluation

Prerequisites: Measuring Massive Multitask Language Understanding

Proposes systematic methods for detecting and preventing benchmark data contamination. By analyzing anomalous performance patterns on contaminated data (such as verbatim memorization of test sets), it reliably detects whether pretraining data contains publicly available test sets. Calls for releasing encrypted or delayed-public test sets.
Mistral 7B — Albert Q. Jiang et al. (2023)

Pretraining

Prerequisites: LLaMA: Open and Efficient Foundation Language Models

Uses GQA + sliding window attention to make a 7B model outperform LLaMA-2 13B; entered the scene with an "Apache 2.0 + direct weight release" stance. Leads the European open-source LLM effort.
Needle in a Haystack — Pressure Testing LLMs — Greg Kamradt (2023)

Long Context, Evaluation

Prerequisites: YaRN: Efficient Context Window Extension of Large Language Models

Proposes the Needle-in-a-Haystack test: inserting a key fact at random positions in a long document and testing whether the model can locate it when answering questions. Became the de facto standard for evaluating factual retrieval in long-context models, revealing the "lost in the middle" problem in most models.
Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon et al. (2023)

Inference

Prerequisites: Fast Transformer Decoding: One Write-Head is All You Need

Brings the operating-system concept of paged memory to the KV cache, nearly eliminating memory wasted through fragmentation and over-reservation and improving throughput 2-4x. vLLM thereby became the de facto standard open-source inference engine and a serving foundation for the MCP/agent era.
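
A toy sketch of the paging idea, assuming a per-sequence block table that maps logical token positions to fixed-size physical blocks. The class and method names are hypothetical and do not mirror vLLM's actual code.

```python
class BlockTable:
    """Toy paged KV cache: logical token positions map to fixed-size physical blocks."""

    def __init__(self, num_physical_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_physical_blocks))
        self.tables = {}   # sequence id -> list of physical block ids
        self.lengths = {}  # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve space for one more token, allocating a block only when needed."""
        length = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if length % self.block_size == 0:       # last block is full (or none yet)
            table.append(self.free.pop())
        self.lengths[seq_id] = length + 1
        block = table[length // self.block_size]
        return block, length % self.block_size  # physical slot for this token

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = BlockTable(num_physical_blocks=4, block_size=16)
slots = [cache.append_token("a") for _ in range(17)]
```

Memory is committed one block at a time, so a sequence never reserves more than one partially filled block beyond what it actually uses.
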
Fast Inference from Transformers via Speculative Decoding — Yaniv Leviathan et al. (2023)

Inference

Prerequisites: Attention Is All You Need

Uses a small draft model to propose multiple tokens, which the large model verifies in one pass, achieving a 2-3x speedup without changing the output distribution. A standard technique in inference engines such as vLLM and TensorRT-LLM today.
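
The draft-then-verify loop can be sketched as follows. This is a simplified greedy-decoding variant rather than the paper's sampling-based acceptance rule, and `draft_next` and `target_next` are hypothetical callables that map a token list to the next token.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One draft-then-verify round under greedy decoding.

    A real engine scores all k positions in one batched forward pass of the
    target model; here that pass is emulated with a loop.
    """
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # 2. The target model checks each proposed position.
    accepted = []
    for tok in proposal:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)

    # 3. All k accepted: the same target pass also yields one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted

# Toy models over integers: the target counts up by 1; the draft agrees
# except that it gets stuck once the last token is 3.
target = lambda toks: toks[-1] + 1
draft = lambda toks: toks[-1] + 1 if toks[-1] != 3 else 3
out = speculative_step([0], draft, target, k=4)
```

Every emitted token is one the target model would have produced on its own, so in this greedy setting the speedup comes without changing the output.
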
RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback — Harrison Lee et al. (2023)

Alignment

Prerequisites: Training language models to follow instructions with human feedback

Google systematically shows that RLAIF can match RLHF on a range of tasks, providing engineering evidence for AI feedback as a scalable replacement for human feedback in alignment.
Let's Verify Step by Step — Hunter Lightman et al. (2023)

Reasoning, Evaluation

Prerequisites: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Proposes process supervision: rewarding not just the final correct answer but also the correctness of each reasoning step. By training a verifier to evaluate each step, significantly outperforms outcome supervision (which only rewards the final result) on mathematical reasoning tasks.
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration — Ji Lin et al. (2023)

Inference

Prerequisites: LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

Discovers "few critical weights correspond to large activations", applies per-channel scaling by importance. More robust and faster than GPTQ at 4-bit, one of mainstream INT4 deployment solutions today.
H2O: Heavy-Hitter Oracle for Accurate KV Cache Compression — Zichang Liu et al. (2023)

Inference

Prerequisites: Fast Transformer Decoding: One Write-Head is All You Need

Discovers Heavy Hitters in KV Cache: a small set of tokens contributes most attention weights. H2O preserves these heavy-hitter KV pairs, maintaining near-lossless performance with only 20-30% of the original KV cache.
Visual Instruction Tuning — Haotian Liu et al. (2023)

Multimodal

Prerequisites: Flamingo: a Visual Language Model for Few-Shot Learning

CLIP vision encoder + LLaMA + GPT-4 synthesized multimodal instruction data creates first open-source GPT-4V style model with minimal compute. Starting point for open-source multimodal ecosystem (LLaVA-1.5/1.6, Qwen-VL, InternVL).
GPT-4V(ision) System Card — OpenAI (2023)

Multimodal

Prerequisites: Learning Transferable Visual Models From Natural Language Supervision

First production-grade multimodal LLM safety/capability disclosure document. Unifies "image + text" into ChatGPT, key step before GPT-4o's end-to-end audio/image/video.
GPT-4 Technical Report — OpenAI (2023)

Pretraining

Prerequisites: Language Models are Few-Shot Learners

Industry report rather than full paper, but first to explicitly use "predictable scaling" as product delivery commitment and systematically disclose safety/red-team processes. Turning point from LLM as "research demo" to "infrastructure".
YaRN: Efficient Context Window Extension of Large Language Models — Bowen Peng et al. (2023)

Long Context

Prerequisites: RoFormer: Enhanced Transformer with Rotary Position Embedding

Applies NTK-aware interpolation + temperature correction on RoPE, extending context to 64K-128K with minimal training. Most open-source models today use YaRN or variants for length extension.
Direct Preference Optimization: Your Language Model is Secretly a Reward Model — Rafael Rafailov et al. (2023)

Alignment

Prerequisites: Training language models to follow instructions with human feedback

DPO (Direct Preference Optimization) shows that the reward model + RL two-step in RLHF can be collapsed into a single supervised learning step: directly optimizing language model parameters on preference data, mathematically equivalent to the optimal RLHF policy. DPO has become the dominant RLHF alternative in alignment research and open-source community.
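
As a rough illustration of the single supervised step described above, the per-pair DPO objective can be sketched as follows. The inputs are assumed to be summed token log-probabilities of the chosen and rejected responses under the policy and a frozen reference model; the function name and the `beta=0.1` default are illustrative.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probabilities."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)), guarded against overflow for very negative margins
    return math.log1p(math.exp(-margin)) if margin > -30 else -margin
```

The loss falls below log 2 once the policy prefers the chosen response more strongly than the reference model does, and rises above it in the opposite case.
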
Toolformer: Language Models Can Teach Themselves to Use Tools — Timo Schick et al. (2023)

Applications

Prerequisites: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Makes model generate "API-calling tokens" itself and evaluate usefulness through self-supervision. Foundational paper for function-calling/tool-use training paradigm, directly influencing GPT-4 function calling design.
Reflexion: Language Agents with Verbal Reinforcement Learning — Noah Shinn et al. (2023)

Applications

Prerequisites: ReAct: Synergizing Reasoning and Acting in Language Models

Makes agent do natural language "post-mortem" after failure, injecting reflection into next round's prompt. "Gradient-free self-improvement" approach widely reused in coding agents, SWE-agent.
Alpaca: A Strong, Replicable Instruction-Following Model — Rohan Taori et al. (2023)

Alignment

Prerequisites: Training language models to follow instructions with human feedback

Uses 52K self-instruct examples + LLaMA 7B to replicate GPT-3.5-style responses for under $600. Launched the open-source instruction-tuning wave and was the starting point of 2023's "llama wars".
LLaMA: Open and Efficient Foundation Language Models — Hugo Touvron et al. (2023)

Pretraining

Prerequisites: Language Models are Unsupervised Multitask Learners

Meta implements "small but precise + massive tokens" Chinchilla recipe and opens weights. LLaMA 1 directly catalyzed open-source LLM explosion (Alpaca/Vicuna/Mistral/Qwen all benefited).
Llama 2: Open Foundation and Fine-Tuned Chat Models — Meta AI (2023)

Pretraining, Alignment

Prerequisites: LLaMA: Open and Efficient Foundation Language Models

First commercially licensed high-quality open-source chat model, publicly shares RLHF recipe (PPO + GAtt). Directly advances open-source ecosystem to "near ChatGPT experience" stage.
Large Language Models are not Fair Evaluators — Peiyi Wang et al. (2023)

Evaluation

Prerequisites: Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Systematically evaluates bias issues in LLM-as-a-Judge methods: position bias (preferring the first response), length bias (preferring longer responses), and self-enhancement bias (preferring self-generated content). Proposes mitigation methods such as position-swapped evaluation and reference-based scoring.
Jailbroken: How Does LLM Safety Training Fail? — Alexander Wei et al. (2023)

Safety

Prerequisites: Red Teaming Language Models with Language Models

Systematically classifies jailbreaks by two failure modes of safety training (competing objectives and mismatched generalization) and explains why RLHF struggles to eradicate them. A reference taxonomy for jailbreak research.
Efficient Guided Generation for Large Language Models — Brandon T. Willard et al. (2023)

Inference, Applications

Prerequisites: Attention Is All You Need

Proposes efficient constrained decoding that enforces JSON Schema, regular expressions, or context-free grammars during generation. Converts syntax constraints into finite-state automata, guaranteeing correct output format with minimal latency overhead.
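
A minimal sketch of automaton-guided decoding, assuming a tiny hand-written DFA for the pattern `[0-9]+` and a vocabulary of single-character tokens. The vocabulary, state names, and scorer are all hypothetical; a real implementation would compile the automaton from a regex or schema and precompute the allowed tokens per state.

```python
VOCAB = ["0", "1", "7", "a", "{", "<eos>"]
TRANSITIONS = {
    ("start", "digit"): "number",
    ("number", "digit"): "number",
    ("number", "<eos>"): "done",
}

def symbol_class(token):
    return "digit" if token.isdigit() else token

def allowed_tokens(state):
    """Tokens whose transition out of `state` exists in the automaton."""
    return [t for t in VOCAB if (state, symbol_class(t)) in TRANSITIONS]

def constrained_greedy_decode(score_fn, max_steps=8):
    """Greedy decoding that only ever considers automaton-legal tokens."""
    state, output = "start", []
    for _ in range(max_steps):
        legal = allowed_tokens(state)
        token = max(legal, key=lambda t: score_fn(output, t))
        state = TRANSITIONS[(state, symbol_class(token))]
        if state == "done":
            break
        output.append(token)
    return "".join(output)

# Toy scorer that prefers letters and braces, which the mask forbids.
SCORES = {"a": 5.0, "{": 4.0, "7": 3.0, "<eos>": 2.5, "1": 1.0, "0": 0.5}

def toy_score(prefix, token):
    penalty = 0 if token == "<eos>" else len(prefix)
    return SCORES[token] - penalty

text = constrained_greedy_decode(toy_score)
```

The model's favorite tokens are never emitted because they are masked out before the argmax, so the output is guaranteed to match the pattern.
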
Efficient Streaming Language Models with Attention Sinks — Guangxuan Xiao et al. (2023)

Inference, Long Context

Prerequisites: RoFormer: Enhanced Transformer with Rotary Position Embedding

Discovers the Attention Sink phenomenon: in autoregressive generation, models consistently attend to a few initial tokens. StreamingLLM leverages this to handle infinite-length input streams without recomputation while maintaining stable performance.
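
The cache policy this suggests can be sketched as: always keep the first few "sink" positions plus a sliding window of recent positions. The function name and default sizes below are illustrative assumptions, and the sketch ignores how positions are re-indexed inside the cache.

```python
def streaming_keep_indices(cache_len, num_sinks=4, window=1020):
    """Positions retained in the KV cache: the first few sink tokens plus
    the most recent `window` tokens. Everything in between is evicted."""
    if cache_len <= num_sinks + window:
        return list(range(cache_len))
    sinks = list(range(num_sinks))
    recent = list(range(cache_len - window, cache_len))
    return sinks + recent

kept = streaming_keep_indices(cache_len=10, num_sinks=2, window=3)
```

The cache size stays bounded at `num_sinks + window` entries no matter how long the stream runs.
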
Tree of Thoughts: Deliberate Problem Solving with Large Language Models — Shunyu Yao et al. (2023)

Reasoning, Applications

Prerequisites: ReAct: Synergizing Reasoning and Acting in Language Models

Tree of Thoughts (ToT) models problem solving as tree search: LLMs generate multiple "thought steps" as tree nodes, score them with an evaluator, and search with BFS/DFS. On tasks requiring complex planning (e.g., Game of 24), ToT massively outperforms CoT and is a precursor to o1-style slow thinking.
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena — Lianmin Zheng et al. (2023)

Evaluation

Prerequisites: Holistic Evaluation of Language Models

Proposes GPT-4-as-judge + human preference crowdsourcing (Chatbot Arena) for evaluating dialogue capability. MT-Bench and Arena Elo remain the community's de facto dual standards for comparing models' dialogue capability today.
Universal and Transferable Adversarial Attacks on Aligned Language Models — Andy Zou et al. (2023)

Safety

Prerequisites: Jailbroken: How Does LLM Safety Training Fail?

Uses the GCG algorithm to find gibberish suffixes that break aligned LLaMA-2/Vicuna models, with attacks transferring to multiple closed-source models. Shocked the security community and made "alignment fragility" a mainstream topic.
Flamingo: a Visual Language Model for Few-Shot Learning — Jean-Baptiste Alayrac et al. (2022)

Multimodal

Prerequisites: Learning Transferable Visual Models From Natural Language Supervision

Uses Perceiver Resampler to connect image features to frozen LLM for few-shot visual QA. Ancestor of mainstream "plug-in multimodal" approach (LLaVA, IDEFICS, etc.).
Constitutional AI: Harmlessness from AI Feedback — Yuntao Bai et al. (2022)

Alignment, Safety

Prerequisites: Training language models to follow instructions with human feedback

Anthropic's Constitutional AI (CAI): use a set of explicit "constitution" principles to let the model self-critique and revise (SL-CAI phase), then use AI feedback instead of human feedback for RLHF (RLAIF phase). This reduces reliance on human annotation and is the core alignment technique behind the Claude model family.
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback — Yuntao Bai et al. (2022)

Alignment, Safety

Prerequisites: Deep Reinforcement Learning from Human Preferences

Anthropic's early RLHF paper; its HH-RLHF dataset has since become the "MNIST" of open-source alignment research. The earliest systematic study of the tension between helpfulness and harmlessness.
Improving language models by retrieving from trillions of tokens — Sebastian Borgeaud et al. (2022)

Applications

Prerequisites: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

DeepMind introduces chunked retrieval during pre-training (RETRO), letting a 7.5B model match the 175B GPT-3. Shows that retrieval is not just a RAG inference trick but another possible pre-training paradigm.
PaLM: Scaling Language Modeling with Pathways — Aakanksha Chowdhery et al. (2022)

Pretraining

Prerequisites: Language Models are Few-Shot Learners

Google's 540B parameter PaLM model trained on the Pathways system. The paper details training stability techniques, data mixture strategies, and observations of emergent capabilities, serving as an important reference for large-model pretraining engineering.
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Tri Dao et al. (2022)

Inference

Prerequisites: Attention Is All You Need

FlashAttention uses IO-aware tiled computation to reduce attention memory from O(N²) to O(N) without losing precision, achieving 2-4x speedup. It fundamentally changed what's feasible for long-context training and is now an indispensable optimization in modern LLM training and inference.
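
The key trick that makes tiling exact is the online softmax: a running maximum and normalizer are updated block by block, so the full row of attention scores is never stored. Below is a pure-Python sketch for a single query; it illustrates the arithmetic only and says nothing about the GPU memory hierarchy the real kernel exploits. Function names are illustrative.

```python
import math

def attention_row_tiled(q, keys, values, block_size=2):
    """Attention output for one query, visiting keys/values block by block.

    Keeps only a running max `m`, running normalizer `l`, and running
    weighted sum `acc`, so the full row of scores is never materialized.
    """
    d = len(q)
    m, l = float("-inf"), 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in k_blk]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)   # rescale what was accumulated so far
        l *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            p = math.exp(s - m_new)
            l += p
            acc = [a + p * vj for a, vj in zip(acc, v)]
        m = m_new
    return [a / l for a in acc]

def attention_row_reference(q, keys, values):
    """Plain softmax attention over the whole row, for comparison."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    mx = max(scores)
    w = [math.exp(s - mx) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, values)) / z
            for j in range(len(values[0]))]
```

Both functions compute the same result up to floating-point rounding; the tiled one simply needs O(block size) working memory per query instead of O(N).
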
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale — Tim Dettmers et al. (2022)

Inference

Reveals "emergent outliers" in large model activations and proposes mixed-precision solution. Core work behind bitsandbytes library, first enabling 175B models to fit in 8 A100s.
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers — Elias Frantar et al. (2022)

Inference

Prerequisites: LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

First to achieve "4-bit quantization of 175B model on single GPU with almost no accuracy loss". Lowered LLM inference hardware barrier from 8xA100 to single consumer GPU, popularizing "run open-source LLMs locally".
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned — Deep Ganguli et al. (2022)

Safety, Evaluation

Prerequisites: Red Teaming Language Models with Language Models

Systematically studies red teaming methods for language models, finding that harmful output rates may decrease with scale, but models become better at circumventing human-written safety rules. Proposes best practices for scaled red teaming.
Precise Zero-Shot Dense Retrieval without Relevance Labels — Luyu Gao et al. (2022)

Applications

Prerequisites: Dense Passage Retrieval for Open-Domain Question Answering

Has the LLM first "pretend" to answer the query by generating a hypothetical document, then uses that document's embedding to retrieve real documents. Zero-shot with strong generalization; one of the most reused retrieval-enhancement tricks of the RAG era.
Training Compute-Optimal Large Language Models — Jordan Hoffmann et al. (2022)

Pretraining

Prerequisites: Scaling Laws for Neural Language Models

Proposes the Chinchilla scaling laws: given a fixed compute budget, model parameters and training tokens should scale equally (challenging the prior belief that parameters matter more). Chinchilla 70B outperformed Gopher 280B, redefining optimal LLM training strategy.
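
A back-of-the-envelope sketch of what "scale equally" implies, under two widely used approximations that are assumptions here rather than quotes from the paper: training compute C ≈ 6·N·D FLOPs, and roughly 20 training tokens per parameter at the optimum.

```python
import math

def compute_optimal_split(flops_budget, tokens_per_param=20.0):
    """Split a FLOP budget into parameters N and tokens D.

    Assumes C ~= 6 * N * D and D ~= tokens_per_param * N, so
    N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Budget corresponding to a 70B-parameter model trained on 1.4T tokens.
budget = 6 * 70e9 * 1.4e12
n, d = compute_optimal_split(budget)
```

Under these assumptions, quadrupling the compute budget doubles both the parameter count and the token count.
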
Large Language Models are Zero-Shot Reasoners — Takeshi Kojima et al. (2022)

Reasoning

Prerequisites: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

A single phrase, "Let's think step by step", boosts accuracy on the MultiArith math benchmark from ~17% to ~78%. CoT capability is inherent in models and can be triggered by prompts alone, a discovery that shocked the community.
Deduplicating Training Data Makes Language Models Better — Katherine Lee et al. (2022)

Pretraining

Prerequisites: Scaling Laws for Neural Language Models

Systematically demonstrates that deduplicating training data significantly improves language model performance and reduces memorization. By removing near-duplicate and exact-duplicate examples from C4 and RealNews, models perform better on downstream tasks and are far less likely to emit training data verbatim.
Holistic Evaluation of Language Models — Percy Liang et al. (2022)

Evaluation

Prerequisites: Measuring Massive Multitask Language Understanding

Stanford CRFM systematically evaluates 30+ LLMs × multidimensional metrics (accuracy, robustness, fairness, efficiency...), establishing "evaluation science". Representative work against "only looking at average scores".
Training language models to follow instructions with human feedback — Long Ouyang et al. (2022)

Alignment

Prerequisites: Deep Reinforcement Learning from Human Preferences, Learning to summarize from human feedback

InstructGPT introduces the three-stage RLHF pipeline (SFT → reward model → PPO) that transforms language models from "predict next token" to "follow human intent." This is the direct blueprint for ChatGPT and established the dominant approach to LLM alignment.
Red Teaming Language Models with Language Models — Ethan Perez et al. (2022)

Safety, Evaluation

Prerequisites: Language Models are Few-Shot Learners

DeepMind uses one LLM to automatically generate attack prompts for red-teaming another LLM, turning red-teaming into an engineering process. Safety/jailbreak research has since shifted from manual prompt search to an automated paradigm.
Self-Consistency Improves Chain of Thought Reasoning in Language Models — Xuezhi Wang et al. (2022)

Reasoning

Prerequisites: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Self-Consistency is a key improvement to CoT: instead of greedy decoding a single reasoning chain, sample multiple diverse reasoning paths and take the most frequent answer (majority vote). This simple trick improves accuracy by 10-20 percentage points on multiple reasoning benchmarks.
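
The sample-then-vote procedure can be sketched in a few lines. `sample_fn` is a hypothetical callable standing in for one stochastic (temperature > 0) model generation that returns the reasoning text and the parsed final answer.

```python
from collections import Counter

def self_consistent_answer(sample_fn, question, num_samples=10):
    """Sample several reasoning paths and return the most frequent final answer."""
    answers = [sample_fn(question)[1] for _ in range(num_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / num_samples

# Canned generations standing in for five sampled reasoning chains.
canned = iter([("...", "18"), ("...", "18"), ("...", "26"), ("...", "18"), ("...", "26")])
answer, agreement = self_consistent_answer(lambda q: next(canned), "toy question", num_samples=5)
```

The agreement ratio is a useful by-product: low agreement signals that the sampled chains disagree and the answer deserves less trust.
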
Self-Instruct: Aligning Language Models with Self-Generated Instructions — Yizhong Wang et al. (2022)

Alignment

Prerequisites: Alpaca: A Strong, Replicable Instruction-Following Model

Uses GPT-3 to generate instruction-output data and distill to itself. Stanford Alpaca/Vicuna both based on this, opening "use large models to generate data for training small models" synthetic data era.
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models — Jason Wei et al. (2022)

Reasoning, Applications

Prerequisites: Language Models are Few-Shot Learners

Introduces chain-of-thought prompting: adding intermediate reasoning steps to prompts dramatically improves LLM performance on math, logic, and commonsense reasoning tasks. This simple technique brought LLM performance on some reasoning benchmarks close to human level.
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models — Guangxuan Xiao et al. (2022)

Inference

Prerequisites: LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

Moves activation outliers to weights through equivalent mathematical transformation, making INT8 inference feasible. Key engineering discovery enabling GPU FP8/INT8 deployment.
ReAct: Synergizing Reasoning and Acting in Language Models — Shunyu Yao et al. (2022)

Applications

Prerequisites: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

ReAct interleaves reasoning and acting: LLM thinks (Thought), executes a tool call (Action), observes the result (Observation), and cycles. This is the prototype for modern AI agent frameworks, directly influencing LangChain, AutoGPT, and similar agent frameworks.
GLM-130B: An Open Bilingual Pre-trained Model — Aohan Zeng et al. (2022)

Architecture

Prerequisites: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Tsinghua + Zhipu's open Chinese-English bilingual 130B model, earliest representative technical report of Chinese LLM industrialization. Subsequent ChatGLM-6B/9B pushed open-source Chinese dialogue to mass adoption.
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models — Denny Zhou et al. (2022)

Reasoning

Prerequisites: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

"Break hard problems into easy ones, solve sequentially" is another reasoning paradigm parallel to CoT, especially effective for compositional generalization. Together with CoT/ToT forms the trio of "how to guide LLM step-by-step thinking".
Extracting Training Data from Large Language Models — Nicholas Carlini et al. (2021)

Safety

Prerequisites: Jailbroken: How Does LLM Safety Training Fail?

Demonstrates the feasibility of extracting training data fragments from language models like GPT-2. Through carefully designed decoding strategies, hundreds of verbatim memorized training examples can be recovered, revealing privacy risks in large language models.
Evaluating Large Language Models Trained on Code — Mark Chen et al. (2021)

Evaluation, Applications

Introduces the Codex model and the HumanEval benchmark (164 programming problems). HumanEval remains a basic "vital signs" metric for coding models today; this paper is also the root of GitHub Copilot.
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts — Nan Du et al. (2021)

Mixture of Experts, Pretraining

Prerequisites: Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

1.2T parameter MoE achieves GPT-3 quality with 1/3 training compute, early representative of MoE "cost-effectiveness wins". Mixtral/DeepSeek-V2/V3 are its spiritual descendants.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity — William Fedus et al. (2021)

Mixture of Experts, Pretraining

Prerequisites: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Switch Transformer is the first architecture to scale Transformers to trillion parameters in practice. Using Mixture-of-Experts (MoE), each token only activates a small fraction of parameters ("sparse activation"), achieving better performance than dense models at the same compute. Mixtral uses a similar sparse MoE design, and GPT-4 is widely believed to as well.
LoRA: Low-Rank Adaptation of Large Language Models — Edward J. Hu et al. (2021)

Alignment

Prerequisites: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

LoRA freezes the pretrained weights and trains only two low-rank matrices whose product is added to them (rank r much smaller than the original dimensions), reducing trainable parameters by up to 10,000x. This makes fine-tuning large models on consumer GPUs feasible and has become the dominant parameter-efficient fine-tuning (PEFT) method.
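A minimal sketch of the low-rank update in plain Python, assuming a frozen weight W of shape d_out x d_in, a trainable A of shape r x d_in, and a trainable B of shape d_out x r initialized to zero. The sizes and the alpha value are illustrative only:

```python
import random

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / r) * B (A x); W stays frozen, only A and B train."""
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

random.seed(0)
d_in, d_out, r = 8, 8, 2
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # zero init: the adapter starts as a no-op
x = [1.0] * d_in

full_params = d_in * d_out        # parameters touched by full fine-tuning
lora_params = r * (d_in + d_out)  # parameters trained by the adapter
```

With B at zero the output equals the frozen model's output, so training starts from the pretrained behavior. The parameter saving grows with the layer width because the adapter scales with r * (d_in + d_out) rather than d_in * d_out.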
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation — Ofir Press et al. (2021)

ArchitectureLong Context

Prerequisites: Attention Is All You Need

Encodes position as a linear, distance-proportional bias on attention scores, enabling extrapolation to several times the training length without any learned position parameters. A representative early long-context solution and the main alternative to RoPE.
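The linear bias can be sketched as a per-head penalty that grows with the query-key distance. The slope schedule below is the geometric sequence commonly used when the head count is a power of two, and the function names are illustrative:

```python
def alibi_slopes(num_heads):
    """Geometric slopes 2^(-8/n), 2^(-16/n), ...; assumes num_heads is a power of two."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """Causal bias matrix: [i][j] = -slope * (i - j) for j <= i, -inf for future keys."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]

slopes = alibi_slopes(8)
bias_head0 = alibi_bias(4, slopes[0])
```

The bias is added to the raw attention scores before the softmax. Because it depends only on distance, the same rule applies unchanged to positions beyond the training length.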
Learning Transferable Visual Models From Natural Language Supervision — Alec Radford et al. (2021)

Multimodal

Prerequisites: Language Models are Unsupervised Multitask Learners

The original CLIP paper, which learns transferable visual representations from natural language supervision. Trained contrastively on 400 million image-text pairs, CLIP yields a general-purpose vision encoder that performs zero-shot image classification and transfers well across tasks, pioneering a new paradigm for vision-language alignment. CLIP encoders remain a component of many multimodal systems today (DALL·E, Stable Diffusion, LLaVA).
RoFormer: Enhanced Transformer with Rotary Position Embedding — Jianlin Su et al. (2021)

ArchitectureLong Context

Prerequisites: Attention Is All You Need

RoPE (Rotary Position Embedding) is the position encoding scheme used in most major LLMs today (LLaMA, Mistral, Qwen, etc.). By applying position-dependent rotations to queries and keys in the attention computation, it makes attention scores depend on relative position, and it has proved easier to extend to longer contexts than absolute position encodings.
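A small sketch of the rotation, assuming adjacent dimensions are paired and the usual base of 10000 (some implementations pair dimensions differently). It illustrates the key property: the query-key dot product depends only on the position offset:

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each adjacent pair (vec[i], vec[i+1]) by an angle proportional to position."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower dimensions rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]
score_near_start = dot(rope(q, 3), rope(k, 1))   # offset 2
score_far_along = dot(rope(q, 12), rope(k, 10))  # offset 2 again
```

Both scores agree up to floating-point error, since rotating the query by m and the key by n leaves only the difference m - n in the product.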
Language Models are Few-Shot Learners — Tom Brown et al. (2020)

PretrainingReasoning

Prerequisites: Language Models are Unsupervised Multitask Learners

OpenAI's GPT-3 paper demonstrates that a 175B parameter language model can perform diverse tasks through few-shot in-context learning without fine-tuning. It established the paradigm that scale unlocks emergent capabilities and launched the era of prompt engineering.
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators — Kevin Clark et al. (2020)

Pretraining

Prerequisites: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Replaces MLM with replaced token detection, allowing small models to reach BERT-large-level performance. A representative result showing that the pre-training objective determines sample efficiency.
Don't Stop Pretraining: Adapt Language Models to Domains and Tasks — Suchin Gururangan et al. (2020)

Pretraining

Prerequisites: Universal Language Model Fine-tuning for Text Classification

Demonstrates that continuing pretraining on target-domain data (Domain-Adaptive Pretraining, DAPT) significantly improves task performance. Across biomedical, computer science, news, and reviews domains, DAPT improves over generic pretrained models by 4-8 percentage points on average.
Measuring Massive Multitask Language Understanding — Dan Hendrycks et al. (2020)

Evaluation

Prerequisites: Language Models are Few-Shot Learners

A benchmark of 57 subjects and about 14K exam questions. Since its release, "grinding MMLU" has become the de facto standard for measuring general LLM capability, and it is still a first-line metric in model cards in 2025; see also the later MMLU-Pro.
Scaling Laws for Neural Language Models — Jared Kaplan et al. (2020)

Pretraining

Prerequisites: Language Models are Few-Shot Learners

OpenAI's scaling laws paper finds that language model performance (cross-entropy loss) follows power laws with model parameters, dataset size, and compute. This enables predicting large-scale training results from small experiments and provided the theoretical basis for the LLM scale-up race, directly leading to GPT-3.
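Predicting large runs from small ones amounts to fitting a straight line in log-log space and extrapolating. The sketch below uses synthetic losses generated from a made-up power law (the constants 5.0 and 0.08 are illustrative, not values from the paper):

```python
import math

def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-alpha) by least squares on logs; return (a, alpha)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return math.exp(intercept), -slope

# Synthetic "small experiments": model sizes and their losses.
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * n ** -0.08 for n in sizes]

a, alpha = fit_power_law(sizes, losses)
predicted_loss_at_100b = a * 1e11 ** -alpha  # extrapolate two orders of magnitude
```

Real measurements are noisy and the fit only holds while no other factor (data or compute) becomes the bottleneck, so an extrapolation like this is a planning estimate rather than a guarantee.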
Dense Passage Retrieval for Open-Domain Question Answering — Vladimir Karpukhin et al. (2020)

Applications

Prerequisites: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

A dual-tower BERT encoder trained with in-batch negatives yields the first widely adopted dense retriever, clearly outperforming BM25 on open-domain QA retrieval. Today's vector search engineering paradigm (FAISS, pgvector) solidified here.
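In-batch negatives can be sketched as a softmax cross-entropy over one batch: question i's positive passage is passage i, and every other passage in the batch serves as a negative. The embeddings below are toy vectors, not encoder outputs:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def in_batch_negative_loss(questions, passages):
    """Mean negative log-likelihood of each question's own passage within the batch."""
    total = 0.0
    for i, q in enumerate(questions):
        scores = [dot(q, p) for p in passages]
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[i]  # -log softmax(scores)[i]
    return total / len(questions)

questions = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[4.0, 0.0], [0.0, 4.0]]  # passage i matches question i
swapped = [[0.0, 4.0], [4.0, 0.0]]  # positives point the wrong way
loss_aligned = in_batch_negative_loss(questions, aligned)
loss_swapped = in_batch_negative_loss(questions, swapped)
```

A batch of B pairs yields B - 1 negatives per question at no extra encoding cost, which is the main appeal of the trick.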
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — Patrick Lewis et al. (2020)

Applications

Prerequisites: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

RAG (Retrieval-Augmented Generation) combines pretrained LMs with information retrieval: for each query, retrieve relevant documents from a knowledge base, then generate answers with the documents in context. This addresses LLM knowledge staleness and hallucination, and is now a core architecture in enterprise AI applications.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer — Colin Raffel et al. (2020)

Architecture

Prerequisites: Attention Is All You Need

T5 unifies all NLP tasks into a "text-to-text" format (e.g., classification also outputs label text rather than class IDs) and systematically explores how dataset, architecture, pretraining objectives, and scale affect transfer learning. This unified paradigm became a key inspiration for instruction tuning and instruction-following models.
The Right Tool for the Job: Matching Model and Instance Complexities — Roy Schwartz et al. (2020)

Inference

Prerequisites: Attention Is All You Need

Proposes instance-adaptive computation: easy inputs need less computation than hard ones. Classifiers attached to intermediate layers of a pretrained model let inference exit early once a prediction is confident enough, giving speedups of up to 5x with minimal accuracy loss.
Learning to summarize from human feedback — Nisan Stiennon et al. (2020)

Alignment

Prerequisites: Deep Reinforcement Learning from Human Preferences

An early large-scale application of RLHF to language models by OpenAI (summarization), showing that RLHF-trained policies are consistently preferred by human raters over SFT/MLE baselines. Direct predecessor to InstructGPT.
RoBERTa: A Robustly Optimized BERT Pretraining Approach — Yinhan Liu et al. (2019)

Pretraining

Prerequisites: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Uses more data and longer training, and removes NSP, to show that BERT was far from fully trained. Important not just for producing a stronger model, but for being the first to clearly demonstrate that the "training recipe" itself is a core research question.
Language Models are Unsupervised Multitask Learners — Alec Radford et al. (2019)

Pretraining

Prerequisites: Improving Language Understanding by Generative Pre-Training (GPT-1)

GPT-2 shows that a 1.5B parameter language model trained only on unlabeled web text can perform various language tasks zero-shot without fine-tuning. This challenged the convention that NLP tasks require task-specific training, and the model famously became an early case of "staged release" due to misuse concerns.
Fast Transformer Decoding: One Write-Head is All You Need — Noam Shazeer (2019)

InferenceArchitecture

Prerequisites: Attention Is All You Need

Proposes Multi-Query Attention: all heads share the same K/V, reducing KV cache size to 1/h of the multi-head baseline (h = number of heads). Much of the modern story of KV cache optimization and long-context inference starts from this 5-page paper.
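The 1/h saving follows from simple arithmetic on the cache shape. The configuration below (32 layers, 32 heads, head dimension 128, 4096 tokens, 2-byte values) is a hypothetical example, not a specific model:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    """Size of the K and V caches: 2 tensors per layer, one vector per head per token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value

multi_head = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096)
multi_query = kv_cache_bytes(layers=32, kv_heads=1, head_dim=128, seq_len=4096)
```

Here the multi-head cache is 2 GiB per sequence and the multi-query cache is 64 MiB, a factor of h = 32. Query heads are unchanged; only the stored keys and values shrink.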
XLNet: Generalized Autoregressive Pretraining for Language Understanding — Zhilin Yang et al. (2019)

Architecture

Prerequisites: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Proposes permutation language modeling to combine the benefits of autoregressive (AR) and autoencoding (AE) objectives, paired with Transformer-XL for long sequences. Shows that the pre-training objective was still an open question, and is the most imaginative alternative to appear after BERT.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding — Jacob Devlin et al. (2018)

Architecture

Prerequisites: Attention Is All You Need

BERT uses masked language modeling (MLM) and next sentence prediction to pretrain a bidirectional Transformer on large text corpora, then fine-tunes for downstream tasks. It set new state-of-the-art results on 11 NLP tasks at once, establishing the "pretrain+finetune" paradigm that dominates modern NLP.
Universal Language Model Fine-tuning for Text Classification — Jeremy Howard et al. (2018)

Pretraining

Prerequisites: Efficient Estimation of Word Representations in Vector Space

First paper to explicitly propose the "universal language model pre-training → task fine-tuning" pipeline, with key tricks such as discriminative learning rates and a slanted triangular learning-rate schedule. Together with ELMo, it represents "the last mile before BERT".
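The two tricks can be sketched in a few lines. The schedule rises linearly for a short warm-up fraction and then decays linearly, and discriminative fine-tuning gives lower layers smaller learning rates. The defaults (cut_frac 0.1, ratio 32, peak 0.01, per-layer factor 2.6) follow commonly quoted settings and should be treated as assumptions; the sketch also assumes total_steps is large enough that the warm-up covers at least one step:

```python
import math

def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at step t: short linear increase, then long linear decay."""
    cut = math.floor(total_steps * cut_frac)  # step at which the peak is reached
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio

def discriminative_lrs(top_lr, num_layers, factor=2.6):
    """Per-layer rates, last layer first: each lower layer gets the rate above it / factor."""
    return [top_lr / factor ** depth for depth in range(num_layers)]

schedule = [slanted_triangular_lr(t, 1000) for t in (0, 100, 1000)]
layer_lrs = discriminative_lrs(0.01, 3)
```

The rate starts and ends at lr_max / ratio and peaks at lr_max after 10% of the steps under these defaults.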
Deep contextualized word representations — Matthew E. Peters et al. (2018)

Architecture

Prerequisites: GloVe: Global Vectors for Word Representation

ELMo introduced contextualized word embeddings: the same word has different vector representations in different contexts (e.g., "bank" in financial vs. riverbank contexts). Using bidirectional LSTMs, ELMo set new SOTA on multiple NLP tasks and laid the conceptual foundation for BERT and subsequent pretrained models.
Improving Language Understanding by Generative Pre-Training (GPT-1) — Alec Radford et al. (2018)

ArchitecturePretraining

Prerequisites: Attention Is All You Need

OpenAI's first proposal of the decoder-only, autoregressive pre-training plus task fine-tuning recipe, establishing the foundation for the GPT-2/3/4 series. Less popular than BERT at first, it proved to be the winning direction years later.
Deep Reinforcement Learning from Human Preferences — Paul Christiano et al. (2017)

Alignment

The foundational RLHF paper. The authors show that training a reward model from human pairwise preferences, then using it to guide reinforcement learning, enables agents to learn complex behaviors that are difficult to specify with explicit reward functions. This framework was directly adopted by InstructGPT/ChatGPT.
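The reward model is typically fit with a Bradley-Terry style objective: the probability that one segment is preferred is a logistic function of the difference in summed predicted rewards. A minimal sketch with hand-picked reward values (the function names are illustrative):

```python
import math

def preference_probability(rewards_a, rewards_b):
    """P(segment A preferred over B) from summed per-step predicted rewards."""
    diff = sum(rewards_a) - sum(rewards_b)
    return 1.0 / (1.0 + math.exp(-diff))

def preference_loss(rewards_preferred, rewards_rejected):
    """Cross-entropy loss for one human comparison; lower when the model agrees."""
    return -math.log(preference_probability(rewards_preferred, rewards_rejected))

agree = preference_loss([1.0, 0.5], [0.2, 0.1])     # model ranks the pair like the human
disagree = preference_loss([0.2, 0.1], [1.0, 0.5])  # model ranks the pair the other way
```

Minimizing this loss over many comparisons pushes the reward model to score preferred behavior higher, and the learned reward then stands in for a hand-written one during reinforcement learning.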
Attention Is All You Need — Ashish Vaswani et al. (2017)

Architecture

Prerequisites: Neural Machine Translation by Jointly Learning to Align and Translate, Effective Approaches to Attention-based Neural Machine Translation

The foundational paper that introduced the Transformer architecture. The authors replaced RNNs and CNNs entirely with attention mechanisms, proposing multi-head self-attention and positional encoding. It dramatically outperformed prior models on machine translation. Every major LLM today is built on this architecture.
Neural Machine Translation in Linear Time — Nal Kalchbrenner et al. (2016)

Architecture

Prerequisites: Sequence to Sequence Learning with Neural Networks

Uses dilated convolutions for seq2seq, freeing sequence modeling from the sequential computation that RNNs require. Together with ConvS2S, it represents the strongest attempt at parallel sequence modeling before the Transformer.
Neural Machine Translation of Rare Words with Subword Units — Rico Sennrich et al. (2016)

Architecture

Proposes applying BPE (Byte Pair Encoding) to tokenization for neural machine translation. By iteratively merging the most frequent pair of adjacent symbols, BPE balances vocabulary size against the ability to handle rare words. This is the direct prototype for tokenizers in GPT and most modern LLMs.
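The merge loop can be sketched as follows: words start as character sequences with an end-of-word marker, and each step fuses the most frequent adjacent pair. The four-word corpus is a toy example and ties are broken by first occurrence, which is an arbitrary choice in this sketch:

```python
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in vocab.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(vocab, pair):
    """Replace every occurrence of the pair with a single fused symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Return the learned merge list and the final segmented vocabulary."""
    vocab = {tuple(word) + ("</w>",): f for word, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(vocab)
        if not counts:
            break
        best = max(counts, key=counts.get)
        vocab = merge_pair(vocab, best)
        merges.append(best)
    return merges, vocab

merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
```

After three merges the frequent suffix "est" plus the end-of-word marker has become a single symbol, while rarer material stays split into characters. The number of merges directly sets the vocabulary size.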
Effective Approaches to Attention-based Neural Machine Translation — Minh-Thang Luong et al. (2015)

Architecture

Prerequisites: Neural Machine Translation by Jointly Learning to Align and Translate

Systematically compares global vs local attention and different scoring functions (dot/general/concat). The most commonly cited engineering reference when explaining "how attention scores are computed".
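The three scoring functions differ only in how the decoder state h_t and an encoder state h_s are combined. A plain-Python sketch, where W_a and v_a stand for the learned parameters (filled with placeholder values here):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, v):
    return [dot(row, v) for row in M]

def score_dot(h_t, h_s):
    """dot: h_t . h_s"""
    return dot(h_t, h_s)

def score_general(h_t, h_s, W_a):
    """general: h_t . (W_a h_s)"""
    return dot(h_t, matvec(W_a, h_s))

def score_concat(h_t, h_s, W_a, v_a):
    """concat: v_a . tanh(W_a [h_t; h_s])"""
    return dot(v_a, [math.tanh(z) for z in matvec(W_a, h_t + h_s)])

h_t, h_s = [1.0, 2.0], [0.5, -1.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
W_concat = [[0.1, 0.2, 0.3, 0.4], [0.0, 0.0, 0.0, 0.0]]  # placeholder weights
scores = (
    score_dot(h_t, h_s),
    score_general(h_t, h_s, identity),
    score_concat(h_t, h_s, W_concat, [1.0, 1.0]),
)
```

With W_a set to the identity, "general" reduces to "dot". In either case the scores over all source positions are passed through a softmax to obtain attention weights.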
Neural Machine Translation by Jointly Learning to Align and Translate — Dzmitry Bahdanau et al. (2014)

Architecture

The seminal attention mechanism paper (pre-Transformer). The authors found that seq2seq's fixed-length bottleneck vector limited translation quality, and proposed letting the decoder dynamically attend to all encoder hidden states when generating each word. This idea directly evolved into Transformer self-attention.
Convolutional Neural Networks for Sentence Classification — Yoon Kim (2014)

Architecture

Uses a CNN over pre-trained word vectors for text classification, showing that "pre-trained embedding + simple architecture" beats hand-crafted features. An early sign of the pre-training paradigm entering NLP.
GloVe: Global Vectors for Word Representation — Jeffrey Pennington et al. (2014)

Architecture

GloVe learns word vectors by factorizing word co-occurrence matrices, combining the advantages of count-based methods (LSA) and prediction-based methods (Word2Vec). It achieved state-of-the-art on word analogy and similarity tasks and remains a widely used baseline word vector in academia.
Sequence to Sequence Learning with Neural Networks — Ilya Sutskever et al. (2014)

Architecture

The foundational seq2seq (encoder-decoder) architecture paper. Using two LSTMs in a compress-then-generate structure, it enabled neural networks to perform variable-length sequence-to-sequence transformations for the first time, achieving breakthroughs in machine translation and directly inspiring the Transformer's encoder-decoder design.
Distributed Representations of Words and Phrases and their Compositionality — Tomas Mikolov et al. (2013)

Architecture

Prerequisites: Efficient Estimation of Word Representations in Vector Space

The NeurIPS version of word2vec, introducing Negative Sampling, subsampling of frequent words, and phrase-level vectors, alongside Hierarchical Softmax. Influenced all subsequent embedding training objectives including GloVe, fastText, and modern LLM embedding layers.
Efficient Estimation of Word Representations in Vector Space — Tomas Mikolov et al. (2013)

Architecture

Word2Vec popularized word embeddings by making them cheap to train at scale: simple neural models are trained on large text corpora so that semantically similar words cluster in vector space. The famous "king - man + woman ≈ queen" analogy demonstrated its power, laying the foundation for embedding layers in all subsequent language models.