Attention: Choosing the Relevant Context
Intuition: focus changes with the token
Attention lets each position decide which other tokens matter most: only earlier tokens in a causal (decoder-style) model, tokens on both sides in a bidirectional one. When translating a pronoun, the model may attend to the noun it refers to; when answering a question, it may attend to supporting evidence. The weights are dynamic: they are computed from the actual input rather than fixed by a rule.
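Rows are the token being updated and columns are the context tokens it attends to; a larger value (a darker cell when rendered as a heatmap) means a higher weight. Each row sums to 1, and the entries above the diagonal are zero because in this example a token can only attend to itself and earlier tokens.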
| query \ key | large | language | model | attends to | context | generates | answer |
|---|---|---|---|---|---|---|---|
| large | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| language | 0.25 | 0.75 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| model | 0.12 | 0.38 | 0.50 | 0.00 | 0.00 | 0.00 | 0.00 |
| attends to | 0.08 | 0.24 | 0.28 | 0.40 | 0.00 | 0.00 | 0.00 |
| context | 0.05 | 0.18 | 0.22 | 0.25 | 0.30 | 0.00 | 0.00 |
| generates | 0.04 | 0.12 | 0.20 | 0.18 | 0.26 | 0.20 | 0.00 |
| answer | 0.03 | 0.10 | 0.22 | 0.12 | 0.23 | 0.18 | 0.12 |
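A table with this shape (rows summing to 1, zeros above the diagonal) is what a row-wise softmax with a causal mask produces. The sketch below is only illustrative: the function name `causal_attention_weights` and the raw scores are made up for this example and are not the values behind the table above.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over a square score matrix with a causal mask.

    scores[i][j] is the raw (pre-softmax) score of query i against key j.
    Position i may only attend to positions j <= i.
    """
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                    # keys at or before position i
        peak = max(visible)                       # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        masked = [0.0] * (len(row) - i - 1)       # future positions get zero weight
        weights.append([e / total for e in exps] + masked)
    return weights

# Hypothetical raw scores for a 3-token sequence.
scores = [
    [0.2, 1.5, -0.3],
    [0.1, 1.2, 0.9],
    [-0.5, 0.6, 0.9],
]
weights = causal_attention_weights(scores)
for row in weights:
    print(" ".join(f"{w:.2f}" for w in row))
```

Changing any score changes the whole row of weights, which is the sense in which attention is input-dependent rather than a fixed rule.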
Engineering view: Q, K, V and the quadratic bottleneck
Each token is projected into Query, Key, and Value vectors. Query-Key dot products, scaled by the square root of the key dimension, are passed through a softmax to give a distribution over positions; the output is the corresponding weighted sum of Values. Multi-head attention runs this operation several times in parallel, each head with its own projections, so different heads can learn different relations.
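The following is a minimal pure-Python sketch of that pipeline, meant to make the data flow concrete rather than to be efficient. The helper names (`matmul`, `single_head_attention`, `multi_head_attention`) and the toy identity projections are assumptions made for illustration; the sketch applies no causal mask and omits the output projection that normally follows head concatenation.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)                               # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def single_head_attention(x, w_q, w_k, w_v):
    """x: (n_tokens, d_model); w_q, w_k, w_v: (d_model, d_head)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_head = len(q[0])
    k_t = [list(col) for col in zip(*k)]          # transpose keys: (d_head, n_tokens)
    scores = matmul(q, k_t)                       # (n_tokens, n_tokens) Query-Key dot products
    weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
    return matmul(weights, v), weights            # weighted sum of Values, plus the weights

def multi_head_attention(x, heads):
    """heads: list of (w_q, w_k, w_v) tuples, one per head; head outputs are concatenated."""
    outputs = [single_head_attention(x, *h)[0] for h in heads]
    return [sum((out[i] for out in outputs), []) for i in range(len(x))]

# Toy input: 3 tokens with d_model = 2, and identity projections (hypothetical values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = single_head_attention(x, identity, identity, identity)
combined = multi_head_attention(x, [(identity, identity, identity)] * 2)
print(weights[0])
print(len(combined), len(combined[0]))
```

With identity projections, the first token's query matches its own key and the third token's key equally well, so those two positions receive equal weight and the second receives less.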
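Full self-attention compares every token with every other token, so compute and the size of the score matrix grow with the square of the sequence length. Inference systems therefore combine several techniques to keep latency and memory manageable: a KV cache to avoid recomputing Keys and Values for earlier tokens, efficient kernels, grouped-query attention to shrink that cache, and sliding windows or sparse patterns to limit how many token pairs are scored.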
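Some back-of-the-envelope arithmetic shows why these techniques matter. The sketch below counts score-matrix entries and estimates KV cache size; the function names and the model configuration (32 layers, 32 query heads, head dimension 128, 2-byte values, 8 KV heads for the grouped-query case) are hypothetical numbers chosen for illustration, not taken from any particular model.

```python
def score_matrix_entries(seq_len):
    """Query-key pairs scored by full self-attention, per head and per layer."""
    return seq_len * seq_len

def sliding_window_entries(seq_len, window):
    """Pairs scored when each token sees at most `window` recent tokens, itself included."""
    return sum(min(i + 1, window) for i in range(seq_len))

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Memory for cached Keys and Values of one sequence (the factor 2 covers K and V)."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value

seq_len = 4096
full_entries = score_matrix_entries(seq_len)
windowed_entries = sliding_window_entries(seq_len, window=512)

# Hypothetical configuration: every query head has its own KV head ...
mha_bytes = kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128)
# ... versus grouped-query attention, where 4 query heads share each of 8 KV heads.
gqa_bytes = kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128)

print("full attention pairs:  ", full_entries)
print("sliding-window pairs:  ", windowed_entries)
print(f"KV cache, 32 KV heads: {mha_bytes / 2**30:.2f} GiB")
print(f"KV cache, 8 KV heads:  {gqa_bytes / 2**30:.2f} GiB")
```

Doubling the sequence length quadruples the full-attention pair count but only doubles the KV cache, which is why the two costs are usually attacked with different techniques.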
References
- Attention Is All You Need
The foundational paper that introduced the Transformer architecture. The authors dispensed with recurrence and convolutions and relied on attention alone, proposing multi-head self-attention and positional encoding. The model set new state-of-the-art BLEU scores on the WMT 2014 English-to-German and English-to-French translation tasks. Nearly every major LLM today builds on this architecture.
- Neural Machine Translation by Jointly Learning to Align and Translate
The paper that introduced attention for neural machine translation, before the Transformer. The authors found that seq2seq's fixed-length bottleneck vector limited translation quality, and proposed letting the decoder dynamically attend to all encoder hidden states when generating each word. This idea is a direct precursor of Transformer self-attention.