Attention: Choosing the Relevant Context

Intuition: focus changes with the token

Attention lets each position decide which previous or neighboring tokens matter most. When translating a pronoun, the model may attend to a noun; when answering a question, it may attend to supporting evidence. The weights are dynamic: they depend on the actual input, not on a fixed rule.

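Rows are the token being updated and columns are the context tokens it attends to; the darker the color, the higher the weight. The zeros above the diagonal show that each token attends only to itself and to earlier tokens.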

query \ key	large	language	model	attends	context	generates	answer
large	1.00	0.00	0.00	0.00	0.00	0.00	0.00
language	0.25	0.75	0.00	0.00	0.00	0.00	0.00
model	0.12	0.38	0.50	0.00	0.00	0.00	0.00
attends	0.08	0.24	0.28	0.40	0.00	0.00	0.00
context	0.05	0.18	0.22	0.25	0.30	0.00	0.00
generates	0.04	0.12	0.20	0.18	0.26	0.20	0.00
answer	0.03	0.10	0.22	0.12	0.23	0.18	0.12
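The lower-triangular shape of the matrix above can be reproduced with a causal mask. The following is a minimal sketch, assuming random raw scores in place of real query-key similarities; the function name `causal_attention_weights` and the random inputs are illustrative and not part of the original page.

```python
import numpy as np

def causal_attention_weights(scores):
    """Row-wise softmax restricted to the current and earlier positions."""
    n = scores.shape[0]
    # True above the diagonal marks "future" keys that must stay hidden.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = masked - masked.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(7, 7))  # hypothetical raw scores for 7 tokens
weights = causal_attention_weights(scores)
print(np.round(weights, 2))
```

Each row still sums to 1, the first token can only attend to itself, and every entry above the diagonal is exactly zero, which is the same pattern as in the table.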

Engineering view: Q, K, V and the quadratic bottleneck

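Each token is projected into Query, Key, and Value vectors. Query-Key dot products, scaled by the square root of the key dimension, pass through a softmax to give a distribution over context tokens; the output is the corresponding weighted sum of Values. Multi-head attention runs this operation several times in parallel with separate learned projections, so different heads can learn different relations.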
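To make the multi-head idea concrete, here is a minimal NumPy sketch. It assumes random matrices in place of learned projections, no masking, and no batching; the names `multi_head_self_attention`, `W_q`, `W_k`, `W_v`, and `W_o` are chosen for illustration only.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """
    X: [seq_len, d_model]; W_q, W_k, W_v, W_o: [d_model, d_model].
    Returns (output [seq_len, d_model], weights [num_heads, seq_len, seq_len]).
    """
    seq_len, d_model = X.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    def split_heads(M):
        # [seq_len, d_model] -> [num_heads, seq_len, d_head]
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q = split_heads(X @ W_q)
    K = split_heads(X @ W_k)
    V = split_heads(X @ W_v)

    # Every head computes its own scaled dot-product attention.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores)
    heads = weights @ V  # [num_heads, seq_len, d_head]

    # Concatenate the heads and mix them with the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

output, weights = multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(output.shape)   # (4, 8)
print(weights.shape)  # (2, 4, 4)
```

Because each head works on its own slice of the projected vectors, the two heads produce two different attention matrices over the same four tokens.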

Full self-attention compares every token with every other token, so compute and the memory for the score matrix grow roughly with the square of the sequence length. Inference systems therefore rely on KV caching, efficient kernels, grouped-query attention, sliding windows, or sparse patterns to keep latency and memory manageable.
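As one illustration of these techniques, the sketch below shows the idea behind a KV cache during step-by-step decoding: keys and values of earlier tokens are stored and reused, so each new step only computes one row of attention. It is a simplified sketch under stated assumptions: the Q, K, and V rows are random and precomputed, whereas a real model would compute each row from the newly generated token, and the helper name `attend` is hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, K, V):
    """One query vector attending over all cached keys and values."""
    scores = K @ q / np.sqrt(q.shape[-1])  # one score per cached token
    return softmax(scores) @ V

rng = np.random.default_rng(0)
seq_len, d = 5, 8
Q = rng.normal(size=(seq_len, d))
K = rng.normal(size=(seq_len, d))
V = rng.normal(size=(seq_len, d))

# Incremental decoding: append one key/value per step and reuse earlier ones.
k_cache, v_cache, cached_outputs = [], [], []
for t in range(seq_len):
    k_cache.append(K[t])
    v_cache.append(V[t])
    cached_outputs.append(attend(Q[t], np.stack(k_cache), np.stack(v_cache)))
cached_outputs = np.stack(cached_outputs)

# Reference: full causal attention recomputed over the whole sequence.
scores = Q @ K.T / np.sqrt(d)
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
full_outputs = softmax(np.where(future, -np.inf, scores)) @ V

print(np.allclose(cached_outputs, full_outputs))  # True
```

The cached version produces the same outputs as full causal attention, but step t touches only t + 1 cached entries instead of recomputing the whole score matrix.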

Example code: simplified self-attention implementation

import numpy as np

def softmax(x):
    """Numerically stable softmax"""
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """
    Compute scaled dot-product attention.
    Q, K, V: [seq_len, d_k]
    Returns: (output [seq_len, d_k], attention_weights [seq_len, seq_len])
    """
    d_k = Q.shape[-1]
    # Compute attention scores
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)
    # Apply softmax to get attention weights
    attention_weights = softmax(scores)
    # Weighted sum of values
    output = np.matmul(attention_weights, V)
    return output, attention_weights

# Example usage
seq_len, d_k = 4, 8
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)

output, weights = scaled_dot_product_attention(Q, K, V)
print("Attention weights shape:", weights.shape)  # (4, 4)
print("Output shape:", output.shape)  # (4, 8)
print("Row sums of weights:", weights.sum(axis=1))  # Each row sums to 1.0

Research view: interpretability of attention patterns

Can attention weights themselves explain model decisions? Early work assumed that attention provides a transparent signal of “where the model is looking,” but subsequent research shows that attention distributions do not straightforwardly correspond to feature importance: a model's output can stay nearly unchanged when its attention weights are substantially altered, and a token with a high attention weight is not necessarily the one that most influences the prediction.
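A toy construction makes the gap between attention weights and importance easy to see. The numbers below are invented for illustration and are not taken from any cited study: two context tokens carry identical value vectors, so shifting attention between them changes the "explanation" but not the output.

```python
import numpy as np

# Three context tokens; tokens 0 and 1 carry identical value vectors.
V = np.array([[1.0, 2.0],
              [1.0, 2.0],
              [5.0, -1.0]])

weights_a = np.array([0.7, 0.2, 0.1])  # appears to focus on token 0
weights_b = np.array([0.2, 0.7, 0.1])  # appears to focus on token 1

out_a = weights_a @ V
out_b = weights_b @ V

print(out_a, out_b)
print(np.allclose(out_a, out_b))  # True
```

Two quite different attention distributions yield the same output vector, so reading importance directly off the weights would give conflicting stories for an identical prediction.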

Deeper research directions include: specialization of different heads in multi-head attention (syntax heads, position heads, rare-token heads); dynamic evolution of attention patterns across layers (shallow vs. deep); and the relationship between attention and gradient-based attribution methods. Understanding these helps design sparser, more efficient, and more interpretable attention variants.
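One simple way to start probing head specialization is to compute summary statistics per head. The sketch below assumes two hand-built heads rather than weights from a trained model, and the two metrics (mean row entropy and mean weight on the preceding token) are illustrative choices, not established definitions from the source.

```python
import numpy as np

def head_statistics(weights):
    """
    weights: [num_heads, seq_len, seq_len], each row a probability distribution.
    Returns per-head mean entropy and mean weight on the immediately preceding token.
    """
    eps = 1e-12  # avoids log(0) at masked (zero-weight) positions
    entropy = -(weights * np.log(weights + eps)).sum(axis=-1).mean(axis=-1)
    # Sub-diagonal entries weights[h, i, i-1] are the "previous token" weights.
    prev_token = np.diagonal(weights, offset=-1, axis1=1, axis2=2).mean(axis=-1)
    return entropy, prev_token

seq_len = 6

# Head 0: spreads attention uniformly over all visible (causal) tokens.
uniform_head = np.tril(np.ones((seq_len, seq_len)))
uniform_head /= uniform_head.sum(axis=-1, keepdims=True)

# Head 1: always attends to the immediately preceding token.
previous_head = np.eye(seq_len, k=-1)
previous_head[0, 0] = 1.0  # the first token has no predecessor

weights = np.stack([uniform_head, previous_head])
entropy, prev_token = head_statistics(weights)
print("mean entropy per head:", np.round(entropy, 3))
print("previous-token score per head:", np.round(prev_token, 3))
```

The diffuse head has high entropy and a low previous-token score, while the positional head has near-zero entropy and a score of 1. Statistics like these describe patterns only; whether a head is functionally specialized needs further evidence, such as ablation.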

🔬 Open Research Questions

Key questions and research directions in this area:

Can attention weights reliably explain model decisions? How can the relationship between attention distributions and feature importance be characterized accurately?
Related: Attention Is All You Need, Vaswani et al. (2017)

How can the specialization of different heads in multi-head attention be quantified? Do universal "syntax heads" or "position heads" exist?
Related: Attention Is All You Need, Vaswani et al. (2017)

How can sparser, more efficient, and more interpretable attention variants be designed to reduce computational cost?

References

Attention Is All You Need — Ashish Vaswani et al. (2017)
The foundational paper that introduced the Transformer architecture. The authors replaced recurrence and convolution entirely with attention, proposing multi-head self-attention combined with positional encoding. It outperformed the prior state of the art on machine translation benchmarks, and most major LLMs today build on this architecture.
Neural Machine Translation by Jointly Learning to Align and Translate — Dzmitry Bahdanau et al. (2014)
The seminal attention mechanism paper (pre-Transformer). The authors found that seq2seq's fixed-length bottleneck vector limited translation quality, and proposed letting the decoder dynamically attend to all encoder hidden states when generating each word. This idea directly evolved into Transformer self-attention.
Effective Approaches to Attention-based Neural Machine Translation — Minh-Thang Luong et al. (2015)
Systematically compares global and local attention and several scoring functions (dot, general, concat). A commonly cited engineering reference for explaining how attention scores are computed.
Sequence to Sequence Learning with Neural Networks — Ilya Sutskever et al. (2014)
The foundational seq2seq (encoder-decoder) architecture paper. Using two LSTMs in a compress-then-generate structure, it was among the first to show that neural networks can perform variable-length sequence-to-sequence transformations end to end, achieving strong results in machine translation and influencing the Transformer's encoder-decoder design.
Neural Machine Translation in Linear Time — Nal Kalchbrenner et al. (2016)
Uses dilated convolutions for seq2seq, freeing sequence modeling from the sequential computation that RNNs require. Together with ConvS2S, it represents one of the strongest attempts at parallel sequence modeling before the Transformer.
