Sampling and Decoding: From Probabilities to Text

Intuition: the model gives probabilities; the strategy chooses

An LLM does not output a single deterministic answer at each step. It produces a probability distribution over the next token, and the decoding strategy decides how to pick from it. Greedy decoding always picks the highest-probability token; it is stable but tends to produce flat, repetitive text. Sampling introduces randomness, which suits writing, brainstorming, and generating diverse candidates.
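To make the difference concrete, here is a minimal sketch using a made-up four-token probability table. The tokens, the probabilities, and the helper names `greedy_pick` and `sample_pick` are illustrative assumptions, not model output. Greedy decoding returns the same token every time, while sampling returns tokens in proportion to their probabilities.

```python
import numpy as np

# Hypothetical next-token table; tokens and probabilities are illustrative only.
tokens = ["the", "a", "model", "answer"]
probs = np.array([0.5, 0.25, 0.15, 0.10])

def greedy_pick(probs):
    """Greedy decoding: always return the index of the most probable token."""
    return int(np.argmax(probs))

def sample_pick(probs, rng):
    """Sampling: draw an index at random, weighted by the probabilities."""
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
greedy_choices = [tokens[greedy_pick(probs)] for _ in range(5)]
sampled_choices = [tokens[sample_pick(probs, rng)] for _ in range(5)]

print("greedy: ", greedy_choices)   # the same token five times
print("sampled:", sampled_choices)  # a mix that depends on the seed
```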

Temperature rescales the distribution: values below 1 sharpen it toward the top tokens, and values above 1 flatten it. Top-k keeps only the k highest-probability tokens; top-p keeps the smallest set of tokens whose cumulative probability reaches p. These are not model capabilities; they are control knobs applied at inference time.
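These knobs can be stated compactly. The notation here is an assumption introduced for illustration: $z_i$ is the logit of token $i$, $T$ is the temperature, and $S$ is the set of tokens that survive filtering.

```latex
p_i(T) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}, \qquad T > 0

\tilde{p}_i =
\begin{cases}
  \dfrac{p_i(T)}{\sum_{j \in S} p_j(T)} & \text{if } i \in S \\[2ex]
  0 & \text{otherwise}
\end{cases}
```

For top-k, $S$ is the $k$ tokens with the largest $p_i(T)$. For top-p, $S$ is the shortest prefix of the tokens, sorted by descending probability, whose probabilities sum to at least $p$. As $T \to 0$ the distribution concentrates on the highest-logit token, which recovers greedy decoding; as $T$ grows it approaches uniform.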

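Interactive demo (settings: Temperature 0.80, Top-k 10, Top-p 0.90)

Next token: 的 (19.4%)

Candidate tokens and their probabilities:

的 (of): 19.4%
是 (is): 15.5%
在 (at): 13.1%
模型 (model): 11.9%
可以 (can): 9.5%
答案 (answer): 8.1%
因为 (because): 7.3%
用户 (user): 5.8%
上下文 (context): 5.0%
生成 (generate): 4.5%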
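As a worked example, the sketch below applies a top-p cutoff of 0.90 to the ten probabilities listed in the demo. It assumes those percentages are the distribution before top-p filtering, which the demo does not state; the helper name `nucleus_size` is hypothetical.

```python
import numpy as np

# Probabilities read off the demo above (they sum to 100.1% due to rounding).
demo_probs = np.array([19.4, 15.5, 13.1, 11.9, 9.5, 8.1, 7.3, 5.8, 5.0, 4.5]) / 100.0

def nucleus_size(probs, p):
    """Number of top tokens needed for the cumulative probability to reach p."""
    ordered = np.sort(probs)[::-1]          # highest probability first
    cumulative = np.cumsum(ordered)
    return int(np.searchsorted(cumulative, p) + 1)

kept = nucleus_size(demo_probs, 0.90)
print("tokens kept by top-p=0.90:", kept)

# Renormalize the surviving tokens so they sum to 1 before sampling.
survivors = np.sort(demo_probs)[::-1][:kept]
renormalized = survivors / survivors.sum()
print("renormalized top probability:", round(float(renormalized[0]), 3))
```

Under that assumption, the cumulative sum first reaches 0.90 at the eighth token (0.906), so the two least likely candidates are dropped and the remaining eight are rescaled.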

Engineering view: reliability comes from constraints and evaluation

Production systems rarely adjust only temperature. Structured output relies on JSON Schema, tool calling, or constrained decoding; QA systems lower randomness and add citations; creative systems allow more diversity. Chain-of-thought prompting makes the model generate intermediate reasoning steps before its answer, which can improve accuracy but also increases token cost and the risk of exposing intermediate errors.
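Constrained decoding can be sketched as masking: before sampling, every token the format does not allow at the current position gets a logit of negative infinity. The example below is a toy under stated assumptions, not how any particular library implements it. The vocabulary `VOCAB`, the position-based rule table `ALLOWED`, and the random stand-in logits are all hypothetical.

```python
import numpy as np

# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = ["{", "}", '"ok"', ":", "true", "false", "hello"]

# Hypothetical grammar for the object {"ok":true|false}, written as
# state -> set of tokens allowed next. Each state is the count of tokens emitted.
ALLOWED = {
    0: {"{"},
    1: {'"ok"'},
    2: {":"},
    3: {"true", "false"},
    4: {"}"},
}

def mask_logits(logits, allowed_tokens):
    """Set logits of disallowed tokens to -inf so they get zero probability."""
    masked = np.full_like(logits, -np.inf)
    for i, tok in enumerate(VOCAB):
        if tok in allowed_tokens:
            masked[i] = logits[i]
    return masked

def constrained_generate(rng):
    """Generate one token per state, sampling only from allowed tokens."""
    out = []
    for state in range(len(ALLOWED)):
        logits = rng.normal(size=len(VOCAB))      # stand-in for model logits
        masked = mask_logits(logits, ALLOWED[state])
        exp = np.exp(masked - np.max(masked))     # exp(-inf) is 0.0
        probs = exp / exp.sum()
        out.append(VOCAB[rng.choice(len(VOCAB), p=probs)])
    return "".join(out)

print(constrained_generate(np.random.default_rng(0)))
```

Whatever the logits are, the output is always one of the two valid strings, because the only position with a real choice is the boolean value.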

When evaluating, record decoding parameters, because the same model can behave very differently under different temperature and top-p settings. Online regression tests should fix random seeds or use deterministic strategies; open-ended products should measure diversity, factuality, refusal rate, and user satisfaction together.
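One way to follow this advice is to keep the decoding settings and the seed in a single record that is stored with each result. The sketch below is an illustration only: `DecodingConfig` and `sample_with_config` are hypothetical names, and the sampler applies temperature but merely records `top_k` and `top_p` to stay short.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

import numpy as np

@dataclass(frozen=True)
class DecodingConfig:
    """Hypothetical record of the settings used for one evaluation run."""
    temperature: float = 1.0
    top_k: Optional[int] = None    # recorded only; not applied in this sketch
    top_p: Optional[float] = None  # recorded only; not applied in this sketch
    seed: int = 0

def sample_with_config(logits, config):
    """Temperature sampling driven entirely by the recorded config."""
    rng = np.random.default_rng(config.seed)   # seeded generator, no global state
    scaled = np.asarray(logits, dtype=float) / config.temperature
    exp = np.exp(scaled - scaled.max())
    probs = exp / exp.sum()
    return int(rng.choice(len(probs), p=probs))

config = DecodingConfig(temperature=0.8, seed=1234)
logits = [2.0, 1.0, 0.5, -1.0]

first = sample_with_config(logits, config)
second = sample_with_config(logits, config)
# Store the config next to the result so the run can be reproduced later.
print(json.dumps({"config": asdict(config), "token_id": first}))
```

Because the generator is rebuilt from the recorded seed on every call, the two calls return the same token, which is the property a regression test needs.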

Example code: temperature, top-k, and top-p sampling

import numpy as np

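def apply_temperature(logits, temperature=1.0):
    """Scale logits by 1/temperature (T < 1 sharpens, T > 1 flattens)"""
    if temperature <= 0:
        # Dividing by zero is undefined; greedy decoding is the T -> 0 limit
        raise ValueError("temperature must be > 0; use np.argmax for greedy decoding")
    return np.asarray(logits, dtype=float) / temperature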

def top_k_filtering(logits, k):
    """Filter out all but the top k tokens"""
    top_k_indices = np.argsort(logits)[-k:]
    filtered_logits = np.full_like(logits, -np.inf)
    filtered_logits[top_k_indices] = logits[top_k_indices]
    return filtered_logits

def top_p_filtering(logits, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p"""
    sorted_indices = np.argsort(logits)[::-1]
    sorted_logits = logits[sorted_indices]
    # Subtract the largest logit before exponentiating to avoid overflow
    exp_logits = np.exp(sorted_logits - sorted_logits[0])
    probs = exp_logits / np.sum(exp_logits)
    cumulative_probs = np.cumsum(probs)

    # First position where the cumulative probability reaches p (inclusive)
    cutoff_idx = np.searchsorted(cumulative_probs, p) + 1
    filtered_logits = np.full_like(logits, -np.inf)
    filtered_logits[sorted_indices[:cutoff_idx]] = logits[sorted_indices[:cutoff_idx]]
    return filtered_logits

def sample_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Sample next token with temperature, top-k, and top-p"""
    # Apply temperature
    logits = apply_temperature(logits, temperature)

    # Apply top-k filtering
    if top_k is not None:
        logits = top_k_filtering(logits, top_k)

    # Apply top-p filtering
    if top_p is not None:
        logits = top_p_filtering(logits, top_p)

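    # Convert to probabilities with a max-subtracted softmax (avoids overflow;
    # filtered tokens have logit -inf and therefore probability 0), then sample
    exp_logits = np.exp(logits - np.max(logits))
    probs = exp_logits / np.sum(exp_logits)
    return np.random.choice(len(probs), p=probs)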

# Example: simulate next token prediction
np.random.seed(0)  # fix the seed so repeated runs print the same samples
vocab_size = 50
logits = np.random.randn(vocab_size)

print("Greedy (argmax):", np.argmax(logits))
print("Temperature=0.5:", sample_token(logits, temperature=0.5))
print("Temperature=1.5:", sample_token(logits, temperature=1.5))
print("Top-k=10:", sample_token(logits, top_k=10))
print("Top-p=0.9:", sample_token(logits, top_p=0.9))

Research view: decoding as search

Decoding strategy is usually framed as a trade-off between output quality and diversity, but recent work on test-time compute (Snell et al., 2024) suggests this trade-off can be relaxed. By spending more computation during decoding, for example on iterative correction, multi-path search, and verifier scoring, smaller models can match or even surpass larger models on some tasks.
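The simplest form of multi-path search with verifier scoring is best-of-N: sample several candidates and keep the one a verifier rates highest. The sketch below is a toy under loud assumptions: `generate_candidate` stands in for a model call by drawing a random number, and `verifier_score` is a hypothetical scorer, so the numbers say nothing about real models.

```python
import numpy as np

def generate_candidate(rng):
    """Stand-in for one sampled model answer: here just a random number."""
    return float(rng.normal(loc=0.0, scale=1.0))

def verifier_score(candidate, target=1.5):
    """Hypothetical verifier: higher is better, peaking when candidate == target."""
    return -abs(candidate - target)

def best_of_n(n, rng):
    """Sample n candidates and keep the one the verifier scores highest."""
    candidates = [generate_candidate(rng) for _ in range(n)]
    return max(candidates, key=verifier_score)

rng = np.random.default_rng(0)
for n in (1, 4, 16, 64):
    # Average the best score over repeated trials to smooth out noise.
    scores = [verifier_score(best_of_n(n, rng)) for _ in range(200)]
    print(f"N={n:>2}  mean verifier score = {np.mean(scores):.3f}")
```

The mean score rises with N because more samples give the verifier more to choose from, at the cost of N times the generation compute. How useful this is in practice depends on how well the verifier tracks real quality.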

Key questions include: does the optimal decoding strategy vary by task? Is there a task-agnostic universal decoding algorithm? And from an information-theoretic perspective, what is the relationship between sampling temperature and model confidence? Answers to these questions will shape future LLM inference architecture design.
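One narrow, well-defined piece of the temperature question can be checked numerically: how the Shannon entropy of the sampling distribution changes with temperature. The sketch below uses made-up logits and hypothetical helper names. It measures the spread of the sampling distribution only, and does not settle whether that spread reflects the model's confidence.

```python
import numpy as np

def softmax(logits, temperature):
    """Temperature-scaled softmax, max-subtracted for numerical stability."""
    scaled = np.asarray(logits, dtype=float) / temperature
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    nonzero = probs[probs > 0]
    return float(-(nonzero * np.log2(nonzero)).sum())

logits = np.array([3.0, 2.0, 1.0, 0.0, -1.0])  # illustrative values only
for t in (0.25, 0.5, 1.0, 2.0, 4.0):
    p = softmax(logits, t)
    print(f"T={t:<4}  max prob={p.max():.3f}  entropy={entropy_bits(p):.3f} bits")
```

Entropy increases with temperature and is bounded above by log2 of the vocabulary size, which is reached only when the distribution is uniform.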

🔬 Open Research Questions

Key questions and research directions in this area:

What is the optimal allocation strategy for test-time compute? How can we characterize the Pareto frontier between model scale and inference time?
Related: Snell et al. (2024)

How can we balance constrained decoding with autoregressive generation? Can we guarantee output format correctness without sacrificing too much diversity?
Related: Willard et al. (2023)

What is the theoretical upper bound of speedup for speculative sampling? What is the optimal strategy for selecting the small model?
Related: Chen et al. (2023)
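For the speculative sampling question, it helps to see the accept/reject rule that the speedup depends on. The sketch below covers a single drafted token, with plain arrays standing in for the two models' next-token distributions. The distributions and the function name `speculative_step` are illustrative assumptions.

```python
import numpy as np

def speculative_step(p_target, q_draft, rng):
    """One accept/reject step for a single drafted token.

    p_target and q_draft are next-token distributions from the large and
    small model respectively (here plain arrays standing in for model calls).
    """
    x = rng.choice(len(q_draft), p=q_draft)          # draft model proposes x
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return int(x), True                          # accepted as-is
    # Rejected: resample from the normalized positive part of (p - q).
    residual = np.maximum(p_target - q_draft, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(residual), p=residual)), False

p = np.array([0.6, 0.3, 0.1])   # illustrative target distribution
q = np.array([0.3, 0.5, 0.2])   # illustrative draft distribution
rng = np.random.default_rng(0)
draws = [speculative_step(p, q, rng) for _ in range(20000)]
freq = np.bincount([t for t, _ in draws], minlength=3) / len(draws)
accept_rate = np.mean([ok for _, ok in draws])
print("empirical token frequencies:", np.round(freq, 3))
print("acceptance rate:", round(float(accept_rate), 3))
```

The empirical frequencies track the target distribution `p` rather than the draft `q`, which is the sense in which the method preserves the sampling distribution. The acceptance rate approaches the overlap between the two distributions (the sum of elementwise minima, 0.7 for these values), so a draft model that agrees with the target more often leaves more room for speedup.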

References

Language Models are Few-Shot Learners — Tom Brown et al. (2020)
OpenAI's GPT-3 paper demonstrates that a 175B parameter language model can perform diverse tasks through few-shot in-context learning without fine-tuning. It established the paradigm that scale unlocks emergent capabilities and launched the era of prompt engineering.
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models — Jason Wei et al. (2022)
Introduces chain-of-thought prompting: adding intermediate reasoning steps to prompts substantially improves LLM performance on math, logic, and commonsense reasoning tasks, with the largest gains appearing in large models.
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters — Charlie Snell et al. (2024)
Systematically studies how performance scales with "spending more compute at inference time": under a fixed compute budget, adding inference-time search to a smaller model is often more cost-effective than training a larger one. Frequently cited as background for o1/R1-style reasoning models.
Efficient Guided Generation for Large Language Models — Brandon T. Willard et al. (2023)
Proposes efficient constrained decoding that enforces JSON Schema, regular expressions, or context-free grammars during generation. Converts syntax constraints into finite-state automata, guaranteeing correct output format with minimal latency overhead.
Accelerating Large Language Model Decoding with Speculative Sampling — Charlie Chen et al. (2023)
DeepMind's concurrent, independent proposal of speculative sampling, with a proof that the acceleration preserves the target model's sampling distribution. Together with Leviathan et al., it set the direction for later work such as Medusa and EAGLE.
Fast Inference from Transformers via Speculative Decoding — Yaniv Leviathan et al. (2023)
Uses a small draft model to propose multiple tokens, which the large model then verifies in a single pass, reporting a 2-3x speedup without changing the output distribution. Now supported by major inference engines such as vLLM and TensorRT-LLM.
