Paper Library

95 carefully selected LLM papers, each with a bilingual TLDR.

  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning — DeepSeek-AI (2025)

    DeepSeek-R1 shows that o1-like chain-of-thought reasoning can be elicited largely through reinforcement learning, using GRPO instead of PPO: the R1-Zero variant is trained with RL alone (no supervised fine-tuning warmup), while R1 itself adds a small cold-start SFT stage before RL. With open weights and a detailed training report, it is reported to match OpenAI o1 on multiple reasoning benchmarks and is one of the most significant open-source LLM results of 2025.

  • Model Context Protocol (MCP) — Anthropic (2024)

    The Model Context Protocol (MCP) is an open standard proposed by Anthropic that defines how LLM applications communicate with external tools, data sources, and services. Through unified "resources/tools/prompts" interfaces, any MCP-compatible tool can connect to any MCP-compatible model, with the aim of becoming the "USB standard" for AI tool use.

  • The Llama 3 Herd of Models — Meta AI (2024)

    Meta's Llama 3 technical report, covering models from 8B to 405B parameters. It details data processing (15T tokens, multilingual), architecture improvements (GQA, extended RoPE), the post-training pipeline (SFT, rejection sampling, DPO), and the integration of multimodal extensions. Llama 3 405B is one of the most capable open-weight LLMs available.

  • Mixtral of Experts — Albert Q. Jiang et al. (2024)

    Mixtral 8x7B is the first widely open-sourced MoE language model: 8 expert networks, each token routes to 2, so ~13B parameters are activated with 47B total. At inference cost similar to a 13B dense model, it matches or surpasses LLaMA 2 70B, proving MoE viability for open-source models.
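
    The top-2 routing described above can be sketched in a few lines. This is a minimal illustration, not Mixtral's code: the function names, the toy scalar "experts", and the example logits are all assumptions made for the sketch.

```python
import math

def top2_route(router_logits):
    """Return [(expert_index, weight), (expert_index, weight)] for one token."""
    # Indices of the two largest router logits (ties keep the lower index first).
    top2 = sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)[:2]
    # Softmax over only the two selected logits, so the weights sum to 1.
    m = max(router_logits[i] for i in top2)
    exps = [math.exp(router_logits[i] - m) for i in top2]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top2, exps)]

def moe_layer(x, experts, router_logits):
    """Weighted sum of the two selected experts; the other experts never run."""
    return sum(w * experts[i](x) for i, w in top2_route(router_logits))

# Toy stand-ins for 8 expert networks (hypothetical): expert k multiplies by k.
experts = [lambda x, k=k: k * x for k in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]
routes = top2_route(logits)
```

    Only 2 of the 8 experts are evaluated per token, which is why the active parameter count stays far below the total.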

  • OpenAI o1 System Card — OpenAI (2024)

    OpenAI o1's system card reveals the approach of training "slow thinking" models via large-scale reinforcement learning: the model performs extended internal reasoning chains before answering, dramatically outperforming GPT-4 on math competitions and coding. This marks a paradigm shift from "fast thinking" to "slow thinking" LLMs.

  • GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints — Joshua Ainslie et al. (2023)

    GQA (Grouped Query Attention) is a middle ground between MHA and MQA: grouping KV heads so multiple query heads share the same KV, significantly reducing KV cache memory while maintaining near-MHA quality. LLaMA 2/3, Mistral, and other major models all use GQA.
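
    As a rough sketch of the head grouping and the resulting KV cache savings: the model shape below (32 layers, 32 query heads, head dimension 128, 4096 tokens, 2-byte values) is a hypothetical configuration chosen for illustration, not taken from any specific model.

```python
def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Index of the shared KV head that a given query head reads from."""
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """KV cache size for one sequence; the leading 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical configuration, for illustration only.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)
mqa = kv_cache_bytes(n_layers=32, n_kv_heads=1, head_dim=128, seq_len=4096)
```

    With 8 KV heads shared by 32 query heads, the cache is 4x smaller than MHA; MQA (a single KV head) is the limiting case.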

  • GPT-4 Technical Report — OpenAI (2023)

    An industry report rather than a full paper, but the first to explicitly treat "predictable scaling" as a product delivery commitment and to systematically disclose safety and red-teaming processes. It marks the turning point from LLMs as "research demo" to LLMs as "infrastructure".

  • GPT-4V(ision) System Card — OpenAI (2023)

    The first safety and capability disclosure document for a production-grade multimodal LLM. It brings combined image and text input into ChatGPT, a key step before GPT-4o's end-to-end handling of audio, images, and video.

  • Direct Preference Optimization: Your Language Model is Secretly a Reward Model — Rafael Rafailov et al. (2023)

    DPO (Direct Preference Optimization) shows that the two-step "reward model + RL" procedure in RLHF can be collapsed into a single supervised learning step: language model parameters are optimized directly on preference data with a loss whose optimum coincides with the optimal policy of the KL-regularized RLHF objective. DPO has become the dominant RLHF alternative in alignment research and the open-source community.
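
    A minimal sketch of the per-pair DPO loss, assuming the summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model are already available. The argument names and the default `beta=0.1` are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Loss for one preference pair: -log sigmoid(beta * log-ratio margin)."""
    # Implicit rewards: how much more likely the policy makes each response
    # than the reference model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))

# When the policy equals the reference, the margin is 0 and the loss is log(2).
baseline = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Raising the chosen response's log-probability lowers the loss.
improved = dpo_loss(-10.0, -15.0, -12.0, -15.0)
```

    No reward model or sampling loop appears anywhere: the preference data and two forward passes per model are enough to compute the gradient signal.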

  • Alpaca: A Strong, Replicable Instruction-Following Model — Rohan Taori et al. (2023)

    Fine-tunes LLaMA 7B on 52K self-instruct examples to reproduce GPT-3.5-style responses for a total cost of under $600. It launched the open-source instruction-tuning wave and was the starting point of 2023's "llama wars".

  • Tree of Thoughts: Deliberate Problem Solving with Large Language Models — Shunyu Yao et al. (2023)

    Tree of Thoughts (ToT) models problem solving as tree search: LLMs generate multiple "thought steps" as tree nodes, score them with an evaluator, and search with BFS/DFS. On tasks requiring complex planning (e.g., Game of 24), ToT massively outperforms CoT and is a precursor to o1-style slow thinking.

  • Constitutional AI: Harmlessness from AI Feedback — Yuntao Bai et al. (2022)

    Anthropic's Constitutional AI (CAI): use a set of explicit "constitution" principles to let the model self-critique and revise (SL-CAI phase), then use AI feedback instead of human feedback for RLHF (RLAIF phase). This reduces reliance on human annotation and is the core alignment technique behind the Claude model family.

  • FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Tri Dao et al. (2022)

    FlashAttention uses IO-aware tiled computation to reduce attention memory from O(N²) to O(N) while still computing exact attention (no approximation), achieving a 2-4x speedup. It fundamentally changed what's feasible for long-context training and is now an indispensable optimization in modern LLM training and inference.
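
    The key trick that makes tiling exact is an online softmax: each block of keys updates a running maximum, denominator, and weighted sum, so the full score row never has to be stored. The sketch below shows this for a single query with scalar values; it illustrates the arithmetic only and says nothing about the GPU memory hierarchy the real kernel is designed around. The function names are assumptions.

```python
import math

def naive_attention(scores, values):
    """Reference: softmax over all scores at once, then a weighted sum."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def tiled_attention(scores, values, block_size=2):
    """Same result, visiting keys block by block with O(1) running state."""
    running_max = float("-inf")
    running_sum = 0.0  # softmax denominator accumulated so far
    running_out = 0.0  # unnormalized weighted sum accumulated so far
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        new_max = max(running_max, max(s_blk))
        # Rescale earlier partial results to the new maximum (0.0 on first block).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        running_out *= scale
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max
    return running_out / running_sum

scores = [0.5, 2.0, -1.0, 3.0, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
```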

  • Training Compute-Optimal Large Language Models — Jordan Hoffmann et al. (2022)

    Proposes the Chinchilla scaling laws: given a fixed compute budget, model parameters and training tokens should scale equally (challenging the prior belief that parameters matter more). Chinchilla 70B outperformed Gopher 280B, redefining optimal LLM training strategy.
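
    A back-of-the-envelope version of the equal-scaling rule, under two assumptions that are not stated in the summary above: training compute is approximated as C ≈ 6·N·D FLOPs, and the compute-optimal ratio is taken as roughly 20 tokens per parameter (a commonly quoted rule of thumb). Treat both constants as approximations.

```python
import math

def compute_optimal(flops, tokens_per_param=20.0):
    """Split a FLOP budget into (parameters, tokens) under C ~ 6*N*D, D ~ 20*N."""
    # Substituting D = tokens_per_param * N into C = 6 * N * D and solving for N.
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A budget of 6 * 70e9 * 1.4e12 FLOPs gives back a 70B model on 1.4T tokens.
n, d = compute_optimal(6 * 70e9 * 1.4e12)
# Quadrupling compute doubles both parameters and tokens (equal scaling).
n4, d4 = compute_optimal(4 * 6 * 70e9 * 1.4e12)
```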

  • Training language models to follow instructions with human feedback — Long Ouyang et al. (2022)

    InstructGPT introduces the three-stage RLHF pipeline (SFT → reward model → PPO) that transforms language models from "predict next token" to "follow human intent." This is the direct blueprint for ChatGPT and established the dominant approach to LLM alignment.

  • Self-Consistency Improves Chain of Thought Reasoning in Language Models — Xuezhi Wang et al. (2022)

    Self-Consistency is a key improvement to CoT: instead of greedy decoding a single reasoning chain, sample multiple diverse reasoning paths and take the most frequent answer (majority vote). This simple trick improves accuracy by 10-20 percentage points on multiple reasoning benchmarks.
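
    The majority vote itself is tiny. In this sketch `sample_fn` is a hypothetical stand-in for "sample one reasoning chain at non-zero temperature and extract its final answer"; the canned answers are made up for the demonstration.

```python
from collections import Counter

def self_consistent_answer(sample_fn, prompt, n_samples=5):
    """Sample several reasoning paths and return the most frequent final answer."""
    answers = [sample_fn(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Stand-in for an LLM call: five sampled chains ending in these final answers.
canned = iter(["18", "26", "18", "18", "9"])
answer = self_consistent_answer(lambda prompt: next(canned), "toy question")
```

    Individual chains can be wrong in different ways, while correct chains tend to agree, which is what the vote exploits.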

  • Chain-of-Thought Prompting Elicits Reasoning in Large Language Models — Jason Wei et al. (2022)

    Introduces chain-of-thought prompting: adding intermediate reasoning steps to the few-shot exemplars in a prompt substantially improves LLM performance on math, logic, and commonsense reasoning tasks. With sufficiently large models, this simple technique reached state-of-the-art accuracy on benchmarks such as GSM8K.

  • ReAct: Synergizing Reasoning and Acting in Language Models — Shunyu Yao et al. (2022)

    ReAct interleaves reasoning and acting: LLM thinks (Thought), executes a tool call (Action), observes the result (Observation), and cycles. This is the prototype for modern AI agent frameworks, directly influencing LangChain, AutoGPT, and similar agent frameworks.

  • Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity — William Fedus et al. (2021)

    Switch Transformer is the first architecture to scale Transformers to a trillion parameters in practice. Using Mixture-of-Experts (MoE), each token activates only a small fraction of the parameters ("sparse activation"), achieving better performance than dense models at the same compute. Mixtral uses a related sparse MoE design, and GPT-4 is widely rumored to as well.

  • LoRA: Low-Rank Adaptation of Large Language Models — Edward J. Hu et al. (2021)

    LoRA freezes the pretrained weights and trains only a pair of low-rank matrices whose product is added to them (rank r much smaller than the original dimensions), reducing trainable parameters by up to 10,000x. This makes fine-tuning large models on consumer GPUs feasible and has become the dominant parameter-efficient fine-tuning (PEFT) method.
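
    A minimal pure-Python sketch of the low-rank update and the parameter saving for a single weight matrix. The 4096x4096 shape, rank 8, and the helper names are assumptions for illustration; the reduction for a whole model depends on which matrices are adapted.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters: full fine-tuning vs. LoRA for one weight matrix."""
    full = d_out * d_in
    lora = rank * (d_in + d_out)  # A is (rank x d_in), B is (d_out x rank)
    return full, lora

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, rank):
    """h = W x + (alpha / rank) * B (A x); W stays frozen, only A and B train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]

full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)

# Tiny example: with B initialised to zeros the adapted layer equals the base layer.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]    # rank 1
B = [[0.0], [0.0]]   # zero-initialised, as is conventional for B
h = lora_forward(W, A, B, [1.0, 1.0], alpha=2.0, rank=1)
```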

  • RoFormer: Enhanced Transformer with Rotary Position Embedding — Jianlin Su et al. (2021)

    RoPE (Rotary Position Embedding) is the position encoding scheme used in most major LLMs today (LLaMA, Mistral, Qwen, etc.). By incorporating position information as rotation matrices in attention computation, it elegantly handles relative positions and generalizes much better than absolute position encoding when extrapolating to longer context lengths.
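
    The relative-position property can be checked numerically. The sketch below rotates consecutive pairs of dimensions by position-dependent angles and assumes the commonly used base of 10000; real implementations operate on batched tensors and may pair dimensions differently.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair of dimensions by a position-dependent angle."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]
# Positions (3, 1) and (10, 8) both have a relative offset of 2.
score_a = dot(rope(q, 3), rope(k, 1))
score_b = dot(rope(q, 10), rope(k, 8))
```

    The two scores agree up to floating-point error, showing that the attention logit depends only on the offset between query and key positions.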

  • Language Models are Few-Shot Learners — Tom Brown et al. (2020)

    OpenAI's GPT-3 paper demonstrates that a 175B parameter language model can perform diverse tasks through few-shot in-context learning without fine-tuning. It established the paradigm that scale unlocks emergent capabilities and launched the era of prompt engineering.

  • Scaling Laws for Neural Language Models — Jared Kaplan et al. (2020)

    OpenAI's scaling laws paper finds that language model performance (cross-entropy loss) follows power laws with model parameters, dataset size, and compute. This enables predicting large-scale training results from small experiments and provided the theoretical basis for the LLM scale-up race, directly leading to GPT-3.

  • Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — Patrick Lewis et al. (2020)

    RAG (Retrieval-Augmented Generation) combines pretrained LMs with information retrieval: for each query, retrieve relevant documents from a knowledge base, then generate answers with the documents in context. This addresses LLM knowledge staleness and hallucination, and is now a core architecture in enterprise AI applications.
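
    The retrieve-then-generate flow can be sketched with a toy retriever. Note the substitution: this example ranks documents by plain word overlap rather than with a learned dense retriever, and the documents, function names, and prompt template are made up for illustration.

```python
def retrieve(query, documents, k=2):
    """Rank documents by word overlap with the query (toy stand-in for a retriever)."""
    q_words = set(query.lower().split())
    ranked = sorted(documents,
                    key=lambda doc: len(q_words & set(doc.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(query, documents, k=2):
    """Place the retrieved documents in the context ahead of the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "Paris is the capital of France",
    "Mitochondria produce energy for cells",
    "France borders Spain and Italy",
]
prompt = build_prompt("what is the capital of france", docs, k=2)
```

    The resulting prompt would then be passed to a generator model; updating the knowledge base changes the answers without retraining.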

  • Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer — Colin Raffel et al. (2020)

    T5 unifies all NLP tasks into a "text-to-text" format (e.g., classification also outputs label text rather than class IDs) and systematically explores how dataset, architecture, pretraining objectives, and scale affect transfer learning. This unified paradigm became a key inspiration for instruction tuning and instruction-following models.

  • Language Models are Unsupervised Multitask Learners — Alec Radford et al. (2019)

    GPT-2 shows that a 1.5B parameter language model trained only on unlabeled web text can perform various language tasks zero-shot without fine-tuning. This challenged the convention that NLP tasks require task-specific training, and the model famously became the first to be given a "staged release" due to misuse concerns.

  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding — Jacob Devlin et al. (2018)

    BERT uses masked language modeling (MLM) and next sentence prediction to pretrain a bidirectional Transformer on large text corpora, then fine-tunes for downstream tasks. It simultaneously surpassed SOTA on 11 NLP benchmarks, establishing the "pretrain+finetune" paradigm that dominates modern NLP.

  • Deep contextualized word representations — Matthew E. Peters et al. (2018)

    ELMo introduced contextualized word embeddings: the same word has different vector representations in different contexts (e.g., "bank" in financial vs. riverbank contexts). Using bidirectional LSTMs, ELMo set new SOTA on multiple NLP tasks and laid the conceptual foundation for BERT and subsequent pretrained models.

  • Improving Language Understanding by Generative Pre-Training (GPT-1) — Alec Radford et al. (2018)

    OpenAI's first proposal of the decoder-only, autoregressive pre-training plus task fine-tuning recipe, which laid the foundation for the GPT-2/3/4 series. It was less popular than BERT initially but proved to be the winning direction years later.

  • Deep Reinforcement Learning from Human Preferences — Paul Christiano et al. (2017)

    The foundational RLHF paper. The authors show that training a reward model from human pairwise preferences, then using it to guide reinforcement learning, enables agents to learn complex behaviors that are difficult to specify with explicit reward functions. This framework was directly adopted by InstructGPT/ChatGPT.

  • Attention Is All You Need — Ashish Vaswani et al. (2017)

    The foundational paper that introduced the Transformer architecture. The authors replaced RNNs and CNNs entirely with attention mechanisms, proposing multi-head self-attention and positional encoding. It dramatically outperformed prior models on machine translation. Every major LLM today is built on this architecture.

  • Neural Machine Translation by Jointly Learning to Align and Translate — Dzmitry Bahdanau et al. (2014)

    The seminal attention mechanism paper (pre-Transformer). The authors found that seq2seq's fixed-length bottleneck vector limited translation quality, and proposed letting the decoder dynamically attend to all encoder hidden states when generating each word. This idea directly evolved into Transformer self-attention.

  • Sequence to Sequence Learning with Neural Networks — Ilya Sutskever et al. (2014)

    The foundational seq2seq (encoder-decoder) architecture paper. Using two LSTMs in a compress-then-generate structure, it enabled neural networks to perform variable-length sequence-to-sequence transformations for the first time, achieving breakthroughs in machine translation and directly inspiring the Transformer's encoder-decoder design.

  • Efficient Estimation of Word Representations in Vector Space — Tomas Mikolov et al. (2013)

    Word2Vec popularized word embeddings: training shallow neural networks on large text corpora so that semantically similar words cluster in vector space. The famous "king - man + woman ≈ queen" analogy demonstrated its power, laying the foundation for embedding layers in all subsequent language models.

  • abdin2024-phi3

    The 3.8B Phi-3-mini approaches GPT-3.5 on multiple benchmarks and further validates the Phi recipe of high-quality synthetic data plus curriculum data. A representative work for edge and on-device LLMs.

  • ai2024-yi

    The full-stack technical report from Kai-Fu Lee's 01.AI, emphasizing "small but strong" models and data quality. Yi-34B was for a long time a first-tier open-source Chinese-English LLM and an early representative of open-source models with 200K context.

  • azar2023-ipo

    Unifies RLHF and DPO under the Ψ-PO framework, points out that DPO can overfit under the Bradley-Terry (BT) assumption, and proposes the more robust IPO loss. A theoretical must-read for understanding "why DPO doesn't always work"; see also KTO and SimPO.

  • alayrac2022-flamingo

    Uses a Perceiver Resampler to connect image features to a frozen LLM for few-shot visual QA. The ancestor of the mainstream "plug-in multimodal" approach (LLaVA, IDEFICS, etc.).

  • bai2022-hh

    Anthropic's early RLHF paper; its HH-RLHF dataset has since become the "MNIST" of open-source alignment research. The earliest systematic study of the tension between helpfulness and harmlessness.

  • bai2023-qwen

    Alibaba's first complete Qwen technical report, covering the full 1.8B-72B range and emphasizing Chinese-English bilingual ability and a tokenizer that handles both languages efficiently. The foundation of a representative Chinese open-source LLM series; the subsequent Qwen2 and Qwen2.5 were open-source SOTA in 2024-2025.

  • borgeaud2022-retro

    DeepMind introduces chunked retrieval during pre-training, letting a roughly 7B model match the 175B GPT-3. It shows that retrieval is not just a RAG inference trick but another possible pre-training paradigm.

  • chen2021-humaneval

    Proposes the Codex model and the HumanEval benchmark (164 programming problems). HumanEval remains a routine "vital signs" check for coding models today, and this paper is also the origin of GitHub Copilot.

  • chen2023-longlora

    Uses shifted sparse attention plus LoRA to extend a 7B model to 100K context on a single 8xA100 machine. An engineering reference point for long-context fine-tuning; see also YaRN and PoSE.

  • chen2023-spec-sampling

    DeepMind's concurrent, independent proposal of speculative sampling, with a proof that the speedup preserves the target model's sampling distribution. Together with Leviathan et al. it set the direction; see also the later Medusa and EAGLE.

  • chowdhery2022-palm

    Google's 540B model demonstrates "emergent" behaviors at larger scale (multi-step reasoning, joke explanation) and is the first large-scale use of the Pathways system. An independent engineering path after GPT-3.

  • clark2020-electra

    Uses replaced token detection instead of MLM, allowing small models to reach BERT-large-level performance. A representative work on the idea that the pre-training objective determines sample efficiency.

  • dao2023-flashattention2

    Uses better parallelism and warp-level work partitioning to roughly double FlashAttention's performance. Today the vLLM, SGLang, and Megatron backends have all upgraded to FA-2.

  • deepseek2024-v2

    Introduces Multi-head Latent Attention (MLA), which cuts the KV cache by about 93% and lets the 236B MoE undercut same-tier closed-source models on inference price. MLA is a core source of the inference cost-effectiveness of V3 and R1.

  • deepseek2024-v3

    A 671B-parameter MoE (37B activated) trained on 14.8T tokens; the first large-scale production LLM to combine FP8 training with Multi-Token Prediction, bringing the reported training cost down to about $5.6M. It shook the entire industry.

  • dettmers2022-llmint8

    Reveals "emergent outliers" in large-model activations and proposes a mixed-precision solution. The core work behind the bitsandbytes library, and the first to let 175B models fit on 8 A100s.

  • dettmers2023-qlora

    4-bit NF4 quantization, LoRA, and a paged optimizer together enable SFT of a 65B model on a single 48GB GPU. Most open-source community fine-tuning of LLaMA-2/3 and Qwen uses this approach.

  • du2021-glam

    A 1.2T-parameter MoE that achieves GPT-3 quality with about 1/3 of the training energy, an early example of MoE winning on cost-effectiveness. Mixtral and DeepSeek-V2/V3 are its spiritual descendants.

  • frantar2022-gptq

    The first to quantize a 175B model to 4 bits on a single GPU with almost no accuracy loss. It lowered the LLM inference hardware barrier from an 8xA100 server toward a single GPU, popularizing the idea of running open-source LLMs locally.

  • gao2022-hyde

    Has the LLM first "pretend" to answer by generating a hypothetical document, then uses that document's embedding to retrieve real documents. Zero-shot with strong generalization, it is one of the most reused retrieval enhancement tricks of the RAG era.

  • gemini2023-team

    Google's multimodal model family (Ultra/Pro/Nano), representative of the "natively multimodal" narrative. The later 1.5 series pushes context to 1M-10M tokens, a benchmark for industrial long-context deployment.

  • gunasekar2023-phi1

    Microsoft trains a 1.3B model on about 7B tokens of high-quality "textbook-level" filtered and synthetic data, approaching GPT-3.5 on HumanEval. It takes the "data quality >> data scale" story to the extreme and launched the Phi series.

  • hendrycks2020-mmlu

    57 subjects with about 14K exam questions; since then, "grinding MMLU" has become the de facto standard for measuring general LLM capability. Still a first-line metric in model cards in 2025; see also the later MMLU-Pro.

  • howard2018-ulmfit

    The first paper to explicitly propose the "universal language model pre-training → task fine-tuning" pipeline, with key tricks such as discriminative learning rates and a slanted triangular schedule. Together with ELMo, it represents "the last mile before BERT".

  • jiang2023-mistral7b

    Uses GQA and sliding window attention to make a 7B model outperform LLaMA-2 13B; it entered the scene with an "Apache 2.0 + direct weight release" stance and leads the European open-source LLM camp.

  • jimenez2024-swebench

    Uses 2,294 issues from 12 real Python repos to evaluate code models' end-to-end bug-fixing capability. It quickly became the coding agent industry's standard benchmark; almost every coding agent paper reports SWE-bench scores.

  • kalchbrenner2016-bytenet

    Uses dilated convolutions for seq2seq, freeing sequence modeling from the need for sequential RNN computation. Together with ConvS2S, it represents the strongest attempt at parallel sequence modeling before the Transformer.

  • karpukhin2020-dpr

    A dual-tower BERT trained with in-batch negatives gives the first industrial-grade dense retriever, outperforming BM25 on open-domain QA retrieval. Today's vector search engineering paradigm (FAISS, pgvector) solidified here.

  • kim2014-textcnn

    Uses a CNN over pre-trained word vectors for text classification, showing that "pre-trained embeddings + simple architecture" can beat hand-crafted features. An early sign of the pre-training paradigm entering NLP.

  • kojima2022-zeroshot-cot

    The single phrase "Let's think step by step" boosts accuracy on the MultiArith math benchmark from about 17% to about 78%. The finding that CoT capability is already latent in models and can be triggered by a prompt shocked the community.

  • kwon2023-vllm

    Brings the OS concept of paged memory to the KV cache, virtually eliminating memory fragmentation waste and raising throughput 2-4x. vLLM has since become the de facto standard open-source inference engine and a compute foundation for the MCP/agent era.

  • lee2023-rlaif

    Google systematically shows that RLAIF can match RLHF on several tasks, providing engineering evidence that replacing human feedback with AI feedback is a scalable alignment solution.

  • leviathan2023-spec-decoding

    A small draft model proposes several tokens and the large model verifies them in one pass, giving a 2-3x speedup while leaving the output distribution unchanged. A standard technique in inference engines such as vLLM and TensorRT-LLM today.
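
    The verification step can be sketched for a single drafted token. The rule below (accept with probability min(1, p/q), otherwise resample from the normalized positive part of p - q) is the standard speculative sampling acceptance test; the function name and the toy distributions in the usage note are assumptions.

```python
import random

def accept_or_resample(token, p_target, q_draft, rng=random):
    """Verify one drafted token; return the token that is finally emitted."""
    # Accept the draft with probability min(1, p(token) / q(token)).
    if rng.random() < min(1.0, p_target[token] / q_draft[token]):
        return token
    # On rejection, resample from the residual max(0, p - q), renormalized.
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    total = sum(residual)
    weights = [r / total for r in residual]
    return rng.choices(range(len(residual)), weights=weights)[0]
```

    Drawing a token from a draft distribution such as `q = [0.2, 0.5, 0.3]` and passing it through this function with target `p = [0.6, 0.3, 0.1]` yields samples distributed according to `p`, which is why the speedup does not change the model's outputs.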

  • liang2022-helm

    Stanford CRFM systematically evaluates 30+ LLMs across multiple metrics (accuracy, robustness, fairness, efficiency, ...), helping establish evaluation as a science. A representative work against "only looking at average scores".

  • lin2023-awq

    Observes that a small fraction of critical weights corresponds to large activations and applies per-channel scaling by importance. More robust and faster than GPTQ at 4-bit, it is one of the mainstream INT4 deployment solutions today.

  • liu2019-roberta

    Uses more data, longer training, and no NSP objective to show that BERT was significantly undertrained. Important not just as a stronger model, but for first clearly demonstrating that the "training recipe" itself is a core research question.

  • liu2023-llava

    A CLIP vision encoder, LLaMA, and GPT-4-synthesized multimodal instruction data yield the first open-source GPT-4V-style model with minimal compute. The starting point for the open-source multimodal ecosystem (LLaVA-1.5/1.6, Qwen-VL, InternVL).

  • luong2015-attention

    Systematically compares global vs local attention and different scoring functions (dot/general/concat). The most commonly cited engineering reference when explaining "how attention scores are computed".

  • mikolov2013-skipgram-negsampling

    The NeurIPS version of word2vec, introducing Negative Sampling, Hierarchical Softmax, and phrase-level vectors. Influenced all subsequent embedding training objectives including GloVe, fastText, and modern LLM embedding layers.

  • peng2023-yarn

    Applies NTK-aware interpolation plus an attention temperature correction to RoPE, extending context to 64K-128K with minimal training. Many open-source models today use YaRN or variants for length extension.

  • perez2022-redteaming

    DeepMind uses one LLM to automatically generate attack prompts for red-teaming another LLM, turning red-teaming into an engineering process. Safety and jailbreak research has since shifted from manual prompt search toward an automated paradigm.

  • press2021-alibi

    Turns position information into a linear bias on attention scores, enabling extrapolation to several times the training length with no added parameters. A representative early long-context solution and an alternative to RoPE.
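
    A small sketch of the bias itself, assuming the geometric slope schedule used when the number of heads is a power of two (for 8 heads: 1/2, 1/4, ..., 1/256). The function names are illustrative, and causal masking of future positions is assumed to happen elsewhere.

```python
def alibi_slopes(n_heads):
    """Per-head slopes: a geometric sequence starting at 2^(-8/n_heads)."""
    return [2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]

def alibi_bias(seq_len, slope):
    """bias[i][j] = -slope * (i - j) for keys at or before the query position.

    Entries for future keys (j > i) are left at 0.0 here because the causal
    mask is assumed to remove them.
    """
    return [[-slope * (i - j) if j <= i else 0.0 for j in range(seq_len)]
            for i in range(seq_len)]

slopes = alibi_slopes(8)
bias = alibi_bias(3, slopes[0])  # added to the attention logits before softmax
```

    Because the penalty grows linearly with distance and has no learned parameters, the same formula applies unchanged to sequence lengths never seen in training.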

  • qwen2024-qwen25

    Pre-trained on 18T tokens, with a full 0.5B-72B suite plus specialized Coder and Math sub-families; one of the most stable open-source Chinese-English LLMs of 2024-2025. Long a leader in Hugging Face downloads and fine-tunes.

  • radford2021-clip

    Uses contrastive learning on 400M image-text pairs to obtain a general-purpose vision encoder. CLIP encoders remain a standard component of many multimodal systems (DALL·E, Stable Diffusion, LLaVA) today.

  • schick2023-toolformer

    Has the model insert API calls into its own text and keeps those that prove useful under a self-supervised check. A foundational paper for the function-calling/tool-use training paradigm and an influence on later function calling designs such as GPT-4's.

  • shah2024-flashattention3

    Leverages the H100's asynchronous TMA and FP8 support to push attention to about 1.2 PFLOPs/s while maintaining numerical accuracy. A key dependency for long-context and FP8 training on the Hopper architecture.

  • shazeer2019-mqa

    Proposes Multi-Query Attention: all heads share the same K/V, reducing KV cache usage to 1/h. All modern KV cache optimization and long-context inference stories start from this 5-page paper.

  • shinn2023-reflexion

    Has the agent write a natural-language "post-mortem" after a failure and injects that reflection into the next attempt's prompt. A "gradient-free self-improvement" approach widely reused in coding agents such as SWE-agent.

  • snell2024-test-time-compute

    Systematically presents scaling laws for "spending more compute at inference time": with a fixed budget, adding inference-time search to small models is often more cost-effective than training larger models. A theoretical foundation for the o1/R1 era.

  • stiennon2020-summarize

    OpenAI's early large-scale application of RLHF to language models (summarization), showing that RLHF is systematically preferred by humans over SFT/MLE baselines. The direct predecessor to InstructGPT.

  • touvron2023-llama

    Meta applies the Chinchilla-style "smaller model, many more tokens" recipe and releases the weights. LLaMA 1 directly catalyzed the open-source LLM explosion (Alpaca, Vicuna, Mistral, and Qwen all benefited).

  • touvron2023-llama2

    The first high-quality open chat model with a commercially usable license; it publicly shares its RLHF recipe (PPO + GAtt) and moved the open-source ecosystem to a "near ChatGPT experience" stage.

  • wang2022-self-instruct

    Uses GPT-3 to generate instruction-output data and then fine-tunes GPT-3 on its own generations. Stanford Alpaca builds directly on this method, which opened the synthetic-data era of "use large models to generate data for training small models".

  • wei2023-jailbroken

    Systematically classifies jailbreak failure modes (competing objectives, mismatched generalization) and explains why RLHF struggles to eradicate them. A reference taxonomy for jailbreak research.

  • xiao2022-smoothquant

    Migrates the quantization difficulty of activation outliers into the weights through a mathematically equivalent transformation, making INT8 inference feasible. A key engineering result for INT8/FP8 deployment on GPUs.
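
    The equivalent transformation can be checked on a tiny example: dividing each activation channel by a factor s_j and multiplying the matching weight row by s_j leaves the product unchanged while shrinking the activation outlier. The per-channel formula s_j = max|X_j|^α / max|W_j|^(1-α) with α = 0.5 follows the usual description of the method; the matrices and helper names are made up for illustration.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def smooth(X, W, alpha=0.5):
    """Rescale so that (X / s) @ (s * W) == X @ W, with s chosen per input channel."""
    n_channels = len(W)
    s = []
    for j in range(n_channels):
        act_max = max(abs(row[j]) for row in X)  # activation range of channel j
        w_max = max(abs(w) for w in W[j])        # weight range of channel j
        s.append(act_max ** alpha / w_max ** (1.0 - alpha))
    X_smooth = [[row[j] / s[j] for j in range(n_channels)] for row in X]
    W_smooth = [[w * s[j] for w in W[j]] for j in range(n_channels)]
    return X_smooth, W_smooth

# Channel 1 of X carries an outlier that would dominate an INT8 quantization range.
X = [[1.0, 100.0], [2.0, -80.0]]
W = [[0.5, -0.2], [0.1, 0.3]]
X_s, W_s = smooth(X, W)
```

    After smoothing, activations and weights have more balanced ranges, so both can be quantized to 8 bits with less clipping error.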

  • yang2019-xlnet

    Proposes permutation language modeling to merge the benefits of autoregressive (AR) and autoencoding (AE) objectives, combined with Transformer-XL for long sequences. It shows that the pre-training objective was still an open question and is the most imaginative alternative after BERT.

  • yang2024-sweagent

    Proposes the Agent-Computer Interface (ACI) concept, arguing that the tools and interface an agent uses matter at least as much as the underlying model. GPT-4 with a good ACI improves SWE-bench results 6x, establishing a coding agent engineering methodology.

  • zeng2022-glm130b

    An open Chinese-English bilingual 130B model from Tsinghua and Zhipu, and the earliest representative technical report of Chinese LLM industrialization. The subsequent ChatGLM-6B/9B brought open-source Chinese dialogue models to mass adoption.

  • zhou2022-least-to-most

    "Break hard problems into easy ones, solve sequentially" is another reasoning paradigm parallel to CoT, especially effective for compositional generalization. Together with CoT/ToT forms the trio of "how to guide LLM step-by-step thinking".

  • zheng2023-mtbench

    Proposes GPT-4-as-judge evaluation and crowdsourced human preference voting (Chatbot Arena) for evaluating dialogue capability. MT-Bench and Arena Elo remain the community's de facto dual standards for comparing models' dialogue capability today.

  • zou2023-universal-attack

    Uses the GCG algorithm to find gibberish suffixes that break aligned LLaMA-2 and Vicuna models, with attacks transferring to multiple closed-source models. It shocked the security community and made "alignment fragility" a mainstream topic.