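Sampling and Decoding: From Probabilities to Text

Intuition: the model gives probabilities, you choose the strategy

At each step an LLM does not output a single definite answer; it outputs a probability table over the next token. The decoding strategy decides how a token is picked from that table. Greedy decoding always picks the highest-probability token, which is stable but tends to produce flat, generic text. Sampling introduces randomness and is better suited to writing, brainstorming, and generating diverse candidates.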

Example settings: Temperature 0.80, Top-k 10, Top-p 0.90

Next token: 的 (19.4%)

Candidate tokens and their probabilities:
的 19.4%
是 15.5%
在 13.1%
模型 11.9%
可以 9.5%
答案 8.1%
因为 7.3%
用户 5.8%
上下文 5.0%
生成 4.5%

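Temperature adjusts how sharp the distribution is; top-k keeps only the k highest-probability tokens; top-p keeps the smallest set of tokens whose cumulative probability reaches p. None of these is a capability of the model itself; they are control knobs applied at inference time.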
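To make the top-p rule concrete, the sketch below applies it to the ten probabilities listed above. It assumes those ten values are the full distribution after temperature scaling and top-k filtering; the helper name `nucleus` is made up for this illustration.

```python
import numpy as np

# Probabilities from the table above, already sorted in descending order.
tokens = ["的", "是", "在", "模型", "可以", "答案", "因为", "用户", "上下文", "生成"]
probs = np.array([19.4, 15.5, 13.1, 11.9, 9.5, 8.1, 7.3, 5.8, 5.0, 4.5]) / 100.0

def nucleus(probs, p):
    """Return indices of the smallest prefix whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]           # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    return order[:cutoff]

kept = nucleus(probs, p=0.90)
renormalized = probs[kept] / probs[kept].sum()  # probabilities actually sampled from

for i, q in zip(kept, renormalized):
    print(f"{tokens[i]}: {probs[i]:.3f} -> {q:.3f}")
```

With p = 0.90 the cumulative sum first reaches the threshold at the eighth token (0.906), so the two least likely tokens are dropped and the remaining eight are rescaled to sum to 1.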

Decoding strategy comparison:

graph LR
    A[Model output logits] --> B{Decoding strategy}

    B --> C[Greedy Decoding]
    B --> D[Beam Search]
    B --> E[Random sampling]

    C --> C1[Pick the highest-probability token]

    D --> D1[Maintain k candidate sequences]
    D1 --> D2[Expand each step and keep the best k]

    E --> E1[Apply Temperature]
    E1 --> E2{"Top-K / Top-P"}
    E2 --> E3[Sample from the filtered distribution]

    style C fill:#ff9,stroke:#333
    style D fill:#9ff,stroke:#333
    style E fill:#f9f,stroke:#333
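The beam search branch of the diagram can be sketched in a few lines. The "model" here is a hypothetical lookup table (`FIRST`, `NEXT`, and `toy_next_log_probs` are invented for this illustration), chosen so that greedy decoding and beam search disagree.

```python
import math

# Hypothetical toy "model": next-token probabilities depend only on the last token.
FIRST = {"A": 0.5, "B": 0.4, "C": 0.1}
NEXT = {
    "A": {"A": 0.4, "B": 0.3, "C": 0.3},
    "B": {"A": 0.9, "B": 0.05, "C": 0.05},
    "C": {"A": 0.34, "B": 0.33, "C": 0.33},
}

def toy_next_log_probs(prefix):
    """Return log-probabilities of each next token given the prefix."""
    table = FIRST if not prefix else NEXT[prefix[-1]]
    return {tok: math.log(p) for tok, p in table.items()}

def beam_search(beam_width, length):
    beams = [((), 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(length):
        candidates = []
        for seq, score in beams:
            for tok, logp in toy_next_log_probs(seq).items():
                candidates.append((seq + (tok,), score + logp))
        # Keep only the best `beam_width` partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

# A beam width of 1 is equivalent to greedy decoding.
greedy_seq, greedy_score = beam_search(beam_width=1, length=2)[0]
beam_seq, beam_score = beam_search(beam_width=2, length=2)[0]
print("greedy:", greedy_seq, math.exp(greedy_score))
print("beam-2:", beam_seq, math.exp(beam_score))
```

In this toy setup greedy decoding commits to A first and ends with a sequence of probability 0.20, while a beam of width 2 keeps B alive and finds B, A with probability 0.36.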

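Core formulas:

Softmax with temperature:

$P(x_i) = \frac{\exp(x_i / T)}{\sum_j \exp(x_j / T)}$

where $x_i$ is the logit of token $i$ and $T > 0$ is the temperature parameter:

$T \to 0$: the distribution sharpens and approaches greedy decoding
$T = 1$: standard softmax
$T > 1$: flatter distribution, more randomness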
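The effect of $T$ can be checked numerically. The following is a minimal sketch using made-up logits; it prints the largest probability and the entropy of the distribution for several temperatures.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax; temperature must be > 0."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()          # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    probs = probs[probs > 0]
    return float(-(probs * np.log(probs)).sum())

example_logits = [2.0, 1.0, 0.5, 0.1, -1.0]  # illustrative values only
temperatures = (0.1, 0.5, 1.0, 2.0)
entropies = []
for t in temperatures:
    p = softmax_with_temperature(example_logits, t)
    entropies.append(entropy(p))
    print(f"T={t}: max prob={p.max():.3f}, entropy={entropies[-1]:.3f}")
```

As the temperature rises the largest probability shrinks and the entropy grows; at very low temperature almost all mass sits on the top token, which is why $T \to 0$ behaves like greedy decoding.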

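Top-p (nucleus) sampling:

Sort the tokens by descending probability, $P(w_1) \geq P(w_2) \geq \dots$, and keep the smallest prefix whose cumulative probability reaches $p$:

$V^{(p)} = \{w_1, \dots, w_k\}, \quad k = \min\{i \mid \sum_{j=1}^{i} P(w_j) \geq p\}$

The next token is then sampled from the distribution renormalized over $V^{(p)}$:

$w_{next} \sim P'(w) = \frac{P(w)}{\sum_{w' \in V^{(p)}} P(w')}, \quad w \in V^{(p)}$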

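Engineering: reliability comes from constraints and evaluation

Production applications rarely tune temperature alone. Structured output combines JSON schema, tool calling, or constrained decoding; question-answering systems lower randomness and add citations; creative systems allow more diversity. Chain-of-thought prompting can change the reasoning path the model follows, but it also increases token cost and the risk of exposing intermediate errors.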
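One common way to think about constrained decoding is as a mask on the logits: tokens the format does not allow at this position get a logit of negative infinity before sampling. The sketch below assumes a hypothetical six-token vocabulary and a hand-written allowed set; a real system would derive the allowed set from a grammar or schema at every step.

```python
import numpy as np

def constrained_sample(logits, allowed_ids, rng, temperature=1.0):
    """Sample a token id restricted to `allowed_ids` by masking all other logits."""
    logits = np.asarray(logits, dtype=float)
    masked = np.full_like(logits, -np.inf)
    masked[allowed_ids] = logits[allowed_ids]   # only allowed tokens keep their logits
    scaled = masked / temperature
    scaled -= scaled.max()
    probs = np.exp(scaled)                      # masked tokens get probability 0
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Hypothetical vocabulary; suppose the format only permits a digit at this position.
vocab = ["{", "}", "\"", "0", "1", "hello"]
logits = np.array([1.2, 0.3, 0.8, 0.1, -0.4, 3.0])
allowed = [3, 4]  # ids of "0" and "1"

rng = np.random.default_rng(0)
samples = [constrained_sample(logits, allowed, rng) for _ in range(100)]
print({vocab[i] for i in samples})
```

Even though "hello" has by far the highest logit, it can never be emitted here, which is the sense in which the constraint guarantees format rather than merely encouraging it.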

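Record the decoding parameters during evaluation, because the same model can behave very differently under different temperature and top-p settings. Online regression tests should fix the random seed or use a deterministic strategy; open-ended products should instead measure a combination of diversity, factuality, refusal rate, and user satisfaction.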
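The seed-fixing advice can be sketched at the sampling level as follows. The distribution and the `sample_tokens` helper are illustrative assumptions; note that on a real serving stack a fixed seed may still not give bitwise-identical outputs if the underlying computation is nondeterministic.

```python
import numpy as np

def sample_tokens(probs, n, seed):
    """Draw n token ids from `probs` using a generator seeded with `seed`."""
    rng = np.random.default_rng(seed)
    return rng.choice(len(probs), size=n, p=probs).tolist()

probs = np.array([0.5, 0.3, 0.15, 0.05])  # illustrative distribution

# Decoding parameters worth logging next to every evaluation result.
decoding_config = {"temperature": 0.8, "top_p": 0.9, "seed": 1234}

run_a = sample_tokens(probs, n=20, seed=decoding_config["seed"])
run_b = sample_tokens(probs, n=20, seed=decoding_config["seed"])
greedy = int(np.argmax(probs))  # deterministic alternative: no seed needed

print("same seed, same draws:", run_a == run_b)
print("greedy choice:", greedy)
```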

Example code: implementing temperature, top-k, and top-p sampling

Runnable example

import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax with temperature applied (temperature must be > 0)."""
    scaled_logits = logits / temperature
    # Subtract the max before exponentiating for numerical stability
    exp_logits = np.exp(scaled_logits - np.max(scaled_logits))
    return exp_logits / exp_logits.sum()

def top_k_sampling(logits, k, temperature=1.0):
    """Top-k sampling: keep only the k highest-probability tokens."""
    logits = np.asarray(logits, dtype=float)  # -inf cannot be stored in an int array
    top_k_idx = np.argsort(logits)[-k:]
    filtered_logits = np.full_like(logits, -np.inf)
    filtered_logits[top_k_idx] = logits[top_k_idx]
    probs = softmax(filtered_logits, temperature)
    return np.random.choice(len(probs), p=probs)

def top_p_sampling(logits, p, temperature=1.0):
    """Top-p (nucleus) sampling: keep the smallest set whose cumulative probability reaches p."""
    probs = softmax(logits, temperature)
    sorted_idx = np.argsort(probs)[::-1]
    sorted_probs = probs[sorted_idx]
    cumsum_probs = np.cumsum(sorted_probs)

    # Find the cutoff: the first position where the cumulative probability reaches p
    cutoff_idx = np.searchsorted(cumsum_probs, p) + 1
    nucleus_idx = sorted_idx[:cutoff_idx]

    # Renormalize over the nucleus and sample
    nucleus_probs = probs[nucleus_idx]
    nucleus_probs = nucleus_probs / nucleus_probs.sum()
    return np.random.choice(nucleus_idx, p=nucleus_probs)

# Example
logits = np.array([2.0, 1.0, 0.5, 0.1, -1.0])  # logits for 5 tokens
print("Greedy choice:", np.argmax(logits))
print("Top-k (k=3) sample:", top_k_sampling(logits, k=3, temperature=0.8))
print("Top-p (p=0.9) sample:", top_p_sampling(logits, p=0.9, temperature=1.0))

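Research: decoding as search

From a research perspective, choosing a decoding strategy is essentially a trade-off between output quality and diversity, but recent work suggests that test-time compute can move beyond this trade-off. By letting the model spend more reasoning steps during decoding (for example iterative revision, multi-path search, or scoring by a verifier), a smaller model may match or even exceed the performance of a larger one.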
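Verifier scoring is often realized as best-of-N: sample several candidates and keep the one the verifier rates highest. The sketch below is a toy simulation only; `toy_generate` and `toy_verifier` are hypothetical stand-ins, and the verifier is assumed to be perfect, which real verifiers are not.

```python
import random

TRUTH = 17 * 24  # the "correct answer" in this toy task

def toy_generate(rng):
    """Hypothetical weak generator: correct 30% of the time, otherwise slightly off."""
    if rng.random() < 0.3:
        return TRUTH
    return TRUTH + rng.choice([-20, -10, -1, 1, 10, 20])

def toy_verifier(answer):
    """Hypothetical verifier score: higher is better (negative distance to the truth)."""
    return -abs(answer - TRUTH)

def best_of_n(n, rng):
    """Sample n candidates and return the one the verifier scores highest."""
    candidates = [toy_generate(rng) for _ in range(n)]
    return max(candidates, key=toy_verifier)

def accuracy(n, trials=2000, seed=0):
    rng = random.Random(seed)
    hits = sum(best_of_n(n, rng) == TRUTH for _ in range(trials))
    return hits / trials

for n in (1, 4, 16):
    print(f"N={n}: accuracy={accuracy(n):.3f}")
```

Under these assumptions accuracy climbs from roughly 0.3 at N=1 toward 1.0 at N=16, illustrating how extra inference-time sampling can substitute for a stronger generator when a reliable scorer is available.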

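Key questions include: does the optimal decoding strategy vary by task? Is there a general, task-independent decoding algorithm? And, from an information-theoretic point of view, how does sampling temperature relate to model confidence? The answers to these questions will shape the design of future LLM inference architectures.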

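🔬 Open research questions

Key questions and research directions in this area:

What is the optimal strategy for allocating test-time compute? How can the Pareto frontier between model size and inference time be characterized?
Related paper: Snell et al. (2024)

How can the tension between constrained decoding and autoregressive generation be balanced? Can output format be guaranteed without sacrificing too much diversity?
Related paper: Willard et al. (2023)

What is the theoretical upper bound on the speedup from speculative sampling? What is the best strategy for choosing the small draft model?
Related paper: Chen et al. (2023)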
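For readers unfamiliar with speculative sampling, the single-token accept/reject rule behind it can be sketched as follows: a draft distribution q proposes a token, the target distribution p accepts it with probability min(1, p(x)/q(x)), and on rejection a token is drawn from the normalized positive part of p - q. The two distributions here are made-up numbers; real systems draft several tokens and verify them in one forward pass of the large model.

```python
import numpy as np

def speculative_step(p_target, q_draft, rng):
    """One accept/reject step; the returned token is distributed according to p_target."""
    x = int(rng.choice(len(q_draft), p=q_draft))      # draft model proposes x ~ q
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x, True                                # proposal accepted
    residual = np.maximum(p_target - q_draft, 0.0)    # mass the draft under-covers
    residual /= residual.sum()
    return int(rng.choice(len(residual), p=residual)), False

# Illustrative distributions over a 3-token vocabulary.
p_target = np.array([0.6, 0.3, 0.1])
q_draft = np.array([0.3, 0.4, 0.3])

rng = np.random.default_rng(0)
results = [speculative_step(p_target, q_draft, rng) for _ in range(20000)]
sampled = np.array([tok for tok, _ in results])
empirical = np.bincount(sampled, minlength=3) / len(sampled)
acceptance_rate = float(np.mean([accepted for _, accepted in results]))

print("empirical distribution:", empirical.round(3))
print("acceptance rate:", round(acceptance_rate, 3))
```

The empirical distribution should track `p_target` even though proposals come from `q_draft`, and the acceptance rate should be close to the overlap of the two distributions (the sum of elementwise minima, 0.7 for these numbers). That overlap is what bounds the achievable speedup in this simplified view.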

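Papers cited in this article

Tom Brown et al. (2020). Language Models are Few-Shot Learners.
OpenAI's GPT-3 paper. It shows that a 175-billion-parameter language model can perform a wide range of tasks through few-shot in-context learning, without fine-tuning. The paper helped establish the view that scale brings capability and opened up prompt engineering as a direction.
Jason Wei et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.
Introduces chain-of-thought prompting: adding intermediate reasoning steps to the prompt substantially improves large language model performance on math, logic, and commonsense reasoning tasks.
Charlie Snell et al. (2024). Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters.
A systematic study of how to spend more compute at inference time: under a fixed budget, adding inference-time search to a smaller model can be more cost-effective than training a larger model. It is often cited as supporting evidence for o1/R1-style reasoning models.
Brandon T. Willard et al. (2023). Efficient Guided Generation for Large Language Models.
Proposes an efficient constrained decoding method that keeps a large language model's output consistent with a JSON Schema, a regular expression, or a context-free grammar as it is generated. By converting the constraints into finite-state machines, it guarantees well-formed output with very little added latency.
Charlie Chen et al. (2023). Accelerating Large Language Model Decoding with Speculative Sampling.
DeepMind's concurrent, independent proposal of speculative sampling, with a proof that decoding can be accelerated while leaving the sampling distribution unchanged. Together with Leviathan et al. it set the direction for this line of work; see also follow-ups such as Medusa and EAGLE.
Yaniv Leviathan et al. (2023). Fast Inference from Transformers via Speculative Decoding.
Uses a small draft model to predict several tokens and lets the large model verify them in a single pass, giving a 2-3x speedup without changing the output distribution. The technique is now supported by major inference engines such as vLLM and TensorRT-LLM.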