
Positional Encoding: Where Order Comes From


The intuition: attention by itself does not know which word comes where


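Self-attention compares all tokens in a sequence at once. Without position information, it has a hard time telling apart sentences such as “我爱你” (“I love you”) and “你爱我” (“you love me”), which contain the same three tokens in a different order. Positional encoding tells the model where each token sits in the sequence, so that it can make sense of order, distance, and local structure.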
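To make the point concrete, here is a minimal sketch of single-head self-attention without any position signal, written in plain Python. The embedding values are made up, and the Query/Key/Value projections are replaced by the identity for brevity; both are assumptions for illustration, not part of any real model.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (a simplification)."""
    d = len(tokens[0])
    outputs = []
    for q in tokens:
        # Scaled dot-product score of this query against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return outputs

# Toy embeddings (made-up values) for the tokens "I", "love", "you".
emb = {"我": [1.0, 0.0, 0.5], "爱": [0.0, 1.0, 0.5], "你": [0.5, 0.5, 1.0]}

out_a = self_attention([emb[t] for t in ["我", "爱", "你"]])  # "I love you"
out_b = self_attention([emb[t] for t in ["你", "爱", "我"]])  # "you love me"

# The output for each token is the same in both sentences, only reordered.
same_middle = all(math.isclose(x, y, abs_tol=1e-9) for x, y in zip(out_a[1], out_b[1]))
same_first_last = all(math.isclose(x, y, abs_tol=1e-9) for x, y in zip(out_a[0], out_b[2]))
print(same_middle, same_first_last)
```

Because each token's output depends only on the set of tokens around it, swapping the first and last token just swaps their output rows. Any layer that reads these outputs sees no trace of the original order unless position is injected somewhere.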

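The original Transformer uses sinusoidal (sine/cosine) absolute positional encodings; later models lean more on relative-position ideas. RoPE folds position into a rotation of the Query and Key vectors, so the attention score depends on relative distance by construction, which makes it friendlier to long-context extension.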
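The two schemes can be sketched in a few lines of plain Python. This is an illustrative sketch, not a production implementation: the function names are hypothetical, the base of 10000 follows the published papers, and the rotation pairs adjacent dimensions, whereas some real implementations pair the first and second halves of the vector instead.

```python
import math

def sinusoidal_encoding(pos, d_model, base=10000.0):
    """Absolute encoding: sin on even dimensions, cos on odd dimensions."""
    enc = []
    for i in range(0, d_model, 2):
        angle = pos / (base ** (i / d_model))
        enc.extend([math.sin(angle), math.cos(angle)])
    return enc

def rope_rotate(vec, pos, base=10000.0):
    """Rotate each adjacent pair of dimensions by an angle proportional to pos."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Made-up query and key vectors with an even number of dimensions.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same relative offset (3) at very different absolute positions.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))
# A different relative offset (1) gives a different score.
score_other = dot(rope_rotate(q, 5), rope_rotate(k, 4))

print(math.isclose(score_near, score_far, abs_tol=1e-9))
```

The sinusoidal encoding is added to the token embedding once, at the input. The rotation is applied to Query and Key inside every attention layer, and since rotating both vectors by position-dependent angles leaves only the angle difference in their dot product, the score is a function of the offset between the two positions.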

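Absolute position embeddings are simple to implement, but they usually extrapolate poorly beyond the training length. Schemes such as relative position bias, ALiBi, and RoPE try to make the model handle unseen lengths more stably. Practical long-context systems also combine these with interpolation, scaling, continued training, and retrieval augmentation.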
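Two of these ideas are small enough to sketch directly. The helpers below are hypothetical names for illustration: one builds an ALiBi-style causal bias, assuming the geometric slope schedule described in the ALiBi paper for a power-of-two head count, and the other shows linear position interpolation, which rescales positions so that a longer sequence falls back inside the range seen during training.

```python
def alibi_slopes(num_heads):
    """Geometric slopes per head, assuming num_heads is a power of two."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """Causal bias added to attention scores: 0 on the diagonal, more
    negative for more distant past keys, -inf for future keys."""
    return [
        [slope * (j - i) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]

def interpolate_position(pos, train_len, target_len):
    """Linear position interpolation: map [0, target_len) onto [0, train_len)."""
    return pos * train_len / target_len

slopes = alibi_slopes(8)
bias = alibi_bias(3, slopes[0])

# Position 4096 in an 8192-token window is treated like position 1024
# of the 2048-token range assumed to have been seen in training.
scaled = interpolate_position(4096, train_len=2048, target_len=8192)
print(slopes[0], bias[2], scaled)
```

The scaled position would then be fed to the positional scheme in place of the raw index. Interpolation of this kind is usually paired with some continued training, since the model has not seen fractional positions before.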

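Positional encoding is not an isolated module: together with the tokenizer, training length, attention kernel, KV cache, and evaluation set, it determines the final result. Before enlarging the context window, test “needle-in-a-haystack” retrieval, long-document question answering, code localization, and multi-hop dependencies, rather than only checking whether the model accepts longer input.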
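A needle-in-a-haystack check can be sketched as a small sweep over context length and needle depth. Everything here is a hypothetical harness for illustration: `model_fn` stands for whatever callable wraps the system under test, and `truncating_model` is a stand-in that reads only the last 300 characters of its prompt, to mimic a system that accepts long input but ignores most of it.

```python
def build_needle_prompt(filler_sentence, needle, num_fillers, depth):
    """Place `needle` at relative `depth` (0.0 = start, 1.0 = end) among fillers."""
    sentences = [filler_sentence] * num_fillers
    sentences.insert(round(depth * num_fillers), needle)
    return " ".join(sentences)

def run_needle_sweep(model_fn, question, expected, needle, filler_sentence, lengths, depths):
    """Return {(num_fillers, depth): passed} for every combination."""
    results = {}
    for n in lengths:
        for depth in depths:
            context = build_needle_prompt(filler_sentence, needle, n, depth)
            answer = model_fn(context + "\n\nQuestion: " + question)
            results[(n, depth)] = expected.lower() in answer.lower()
    return results

def truncating_model(prompt):
    """Stand-in model: only the last 300 characters of the prompt are visible."""
    visible = prompt[-300:]
    return "7421" if "7421" in visible else "I don't know"

results = run_needle_sweep(
    truncating_model,
    question="What is the secret code?",
    expected="7421",
    needle="The secret code is 7421.",
    filler_sentence="The sky was clear that day.",
    lengths=[5, 200],
    depths=[0.0, 1.0],
)
for key, passed in sorted(results.items()):
    print(key, "pass" if passed else "FAIL")
```

The stand-in accepts the 200-sentence prompt without complaint, yet fails when the needle sits at the start. That is the gap between accepting longer input and actually using it, and it only shows up when length and depth are varied together.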

References

  • Attention Is All You Need — Ashish Vaswani et al. (2017)

    The foundational paper that introduced the Transformer architecture. The authors replaced RNNs and CNNs entirely with attention mechanisms, proposing multi-head self-attention and positional encoding. It outperformed prior models on machine translation benchmarks. Nearly every major LLM today builds on this architecture.

  • RoFormer: Enhanced Transformer with Rotary Position Embedding — Jianlin Su et al. (2021)

    RoPE (Rotary Position Embedding) is the position encoding scheme used in many major LLMs today (LLaMA, Mistral, Qwen, etc.). By applying position-dependent rotations to the Query and Key vectors in the attention computation, it encodes relative position directly in the attention scores. Combined with techniques such as interpolation or scaling, it tends to extend to longer context lengths more gracefully than absolute position embeddings.