Attention: Choosing the Relevant Context

Attention lets each position decide which previous or neighboring tokens matter most. When translating a pronoun, the model may attend to a noun; when answering a question, it may attend to supporting evidence. The weights are dynamic: they depend on the actual input, not on a fixed rule.

Rows are the token being updated and columns are the context tokens it attends to; the darker the cell, the higher the weight.

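query \ key  (no label)  language  model  attends  context  generate  answer
(no label)   1.00        0.00      0.00   0.00     0.00     0.00      0.00
language     0.25        0.75      0.00   0.00     0.00     0.00      0.00
model        0.12        0.38      0.50   0.00     0.00     0.00      0.00
attends      0.08        0.24      0.28   0.40     0.00     0.00      0.00
context      0.05        0.18      0.22   0.25     0.30     0.00      0.00
generate     0.04        0.12      0.20   0.18     0.26     0.20      0.00
answer       0.03        0.10      0.22   0.12     0.23     0.18      0.12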
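
The table can also be read as data. The sketch below copies its numbers into a nested list and checks two properties the layout suggests: every row sums to 1, as a softmax output would, and no query puts weight on a later token, which is what a causal mask produces. The helper names (`is_row_stochastic`, `is_causal`, `strongest_key`) are made up for this illustration and are not part of any library.

```python
# Attention weights copied from the table above; row i is query i,
# column j is key j, in the same order as the table.
WEIGHTS = [
    [1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.25, 0.75, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.12, 0.38, 0.50, 0.00, 0.00, 0.00, 0.00],
    [0.08, 0.24, 0.28, 0.40, 0.00, 0.00, 0.00],
    [0.05, 0.18, 0.22, 0.25, 0.30, 0.00, 0.00],
    [0.04, 0.12, 0.20, 0.18, 0.26, 0.20, 0.00],
    [0.03, 0.10, 0.22, 0.12, 0.23, 0.18, 0.12],
]


def is_row_stochastic(w, tol=1e-9):
    """True if every row is non-negative and sums to 1 (like a softmax output)."""
    return all(min(row) >= 0 and abs(sum(row) - 1.0) <= tol for row in w)


def is_causal(w):
    """True if no query puts weight on a key that comes after it."""
    n = len(w)
    return all(w[i][j] == 0 for i in range(n) for j in range(i + 1, n))


def strongest_key(w):
    """Index of the most heavily weighted key for each query."""
    return [max(range(len(row)), key=row.__getitem__) for row in w]


print(is_row_stochastic(WEIGHTS), is_causal(WEIGHTS))
print(strongest_key(WEIGHTS))
```

In this table the first five queries weight themselves most heavily, while the last two put their largest weight on the fifth token (index 4) rather than on themselves.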

Engineering view: Q, K, V and the quadratic bottleneck

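Each token is projected into Query, Key, and Value vectors. Each Query is compared with every Key by a scaled dot product, a softmax turns those similarity scores into weights that sum to 1, and the output is the weighted sum of the Values. Multi-head attention runs this operation several times in parallel, each head with its own projections, so that different heads can learn different relations.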
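
As a rough illustration, here is a minimal single-head sketch in plain Python. It assumes the Query, Key, and Value vectors have already been produced by the projection step (the learned projection matrices are omitted), uses the 1/sqrt(d_k) scaling from the Transformer paper, and adds an optional causal mask. The function names and the toy inputs are invented for this example; a real implementation would use a tensor library and batched matrix multiplies.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V, causal=False):
    """Scaled dot-product attention for one head.

    Q and K are lists of n vectors of length d_k; V is a list of n vectors.
    Returns (outputs, weights), where weights[i][j] is how much query i
    attends to key j.
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        # Query-Key similarity, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Hide keys that come after the query position.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        # Output is the weighted sum of the Value vectors.
        out = [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Toy inputs, chosen only for illustration.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = attention(Q, K, V, causal=True)
for row in weights:
    print([round(w, 2) for w in row])
```

Multi-head attention would call something like `attention` once per head on differently projected inputs and concatenate the per-head outputs; that step is left out here to keep the sketch short.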

Full self-attention compares every token with every other token, so compute and attention-score memory grow roughly with the square of the sequence length. Inference systems therefore use a KV cache, efficient kernels, grouped-query attention, sliding windows, or sparse patterns to keep latency and memory manageable.
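
A little arithmetic makes the scaling concrete. The sketch below counts the query-key pairs scored under a causal mask, with and without a sliding window, and estimates the memory needed to cache keys and values for one sequence. The model shape (32 layers, 32 heads, head dimension 128, 2 bytes per value) and the choice of 8 key/value heads for grouped-query attention are hypothetical numbers picked for illustration, not figures from any particular model.

```python
def causal_pairs(n, window=None):
    """Number of (query, key) pairs scored under a causal mask.

    With window=None each query sees itself and every earlier position;
    with a window of w it sees at most the w most recent positions.
    """
    if window is None:
        return n * (n + 1) // 2
    return sum(min(i + 1, window) for i in range(n))


def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The leading 2 accounts for storing both a key and a value per position.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
N_LAYERS, N_HEADS, HEAD_DIM = 32, 32, 128

full = causal_pairs(4096)
doubled = causal_pairs(8192)               # about 4x the work for 2x the tokens
windowed = causal_pairs(4096, window=512)

mha_cache = kv_cache_bytes(4096, N_LAYERS, n_kv_heads=N_HEADS, head_dim=HEAD_DIM)
gqa_cache = kv_cache_bytes(4096, N_LAYERS, n_kv_heads=8, head_dim=HEAD_DIM)

print(full, doubled, windowed)
print(mha_cache / 2**30, "GiB vs", gqa_cache / 2**30, "GiB")
```

Under these assumptions, doubling the sequence length roughly quadruples the number of scored pairs, a 512-token window cuts the pair count to under a quarter of the full causal count at 4096 tokens, and sharing key/value heads four ways shrinks the cache from 2 GiB to 0.5 GiB.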

References

  • Attention Is All You Need — Ashish Vaswani et al. (2017)

    The foundational paper that introduced the Transformer architecture. The authors replaced the recurrent and convolutional layers of earlier sequence models with attention alone, proposing multi-head self-attention and positional encoding. It set new state-of-the-art BLEU scores on the WMT 2014 English-to-German and English-to-French translation tasks. Nearly every major LLM today builds on this architecture.

  • Neural Machine Translation by Jointly Learning to Align and Translate — Dzmitry Bahdanau et al. (2014)

    The seminal attention mechanism paper (pre-Transformer). The authors found that seq2seq's fixed-length bottleneck vector limited translation quality, and proposed letting the decoder dynamically attend to all encoder hidden states when generating each word. This idea directly evolved into Transformer self-attention.