Positional Encoding: Where Does Order Come From

Intuition: attention alone does not know “which word”

Self-attention compares all tokens at once and is, by itself, order-agnostic: permuting the input tokens simply permutes the outputs. Without position information, “I love you” and “you love I” look like the same bag of words. Positional encoding tells the model where each token sits in the sequence, so it can understand order, distance, and local structure.
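To make the order-blindness concrete, here is a minimal NumPy sketch of single-head self-attention with no positional information. The random weights and the three toy embeddings standing in for “I”, “love”, “you” are assumptions for illustration only. Reversing the input order only reverses the output rows; each token's output vector is unchanged.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention without any positional information."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(3, d))  # toy embeddings for "I", "love", "you"
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

perm = [2, 1, 0]  # "you love I"
out = self_attention(x, w_q, w_k, w_v)
out_perm = self_attention(x[perm], w_q, w_k, w_v)

# Each token gets the same output vector regardless of where it sits
equivariant = np.allclose(out[perm], out_perm)
print(f"Outputs are just permuted: {equivariant}")
```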

The original Transformer used sinusoidal absolute positional encodings. Later models moved toward relative-position ideas. RoPE bakes position into Query and Key via rotation, so the attention score between two tokens depends on their relative distance rather than on their absolute positions, which tends to help when extending to longer contexts.
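The rotation idea can be sketched in a few lines. This is a simplified single-vector version, assuming an even head dimension, rotation of consecutive dimension pairs, and the commonly used base of 10000; the function name `rope` is hypothetical and real implementations differ in layout and batching.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate consecutive pairs (x[0], x[1]), (x[2], x[3]), ... by position-dependent angles."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # one frequency per pair, shape [d/2]
    angle = position * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=16), rng.normal(size=16)

score_a = rope(q, 5) @ rope(k, 2)      # query at 5, key at 2: offset 3
score_b = rope(q, 105) @ rope(k, 102)  # shifted by 100, same offset 3
score_c = rope(q, 5) @ rope(k, 3)      # offset 2

print(f"offset 3: {score_a:.6f} and {score_b:.6f}; offset 2: {score_c:.6f}")
```

The two offset-3 scores agree up to floating-point error, while the offset-2 score will generally differ: the score is a function of the offset, not of where the pair sits in the sequence.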

Engineering view: position schemes affect extrapolation

Absolute position embeddings are simple to implement, but extrapolation beyond training length is usually poor. Relative position bias, ALiBi, RoPE, and other schemes try to let the model handle unseen lengths more stably. Real long-context systems also combine interpolation, scaling, continued pretraining, and retrieval augmentation.
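As one illustration of the interpolation idea, the sketch below rescales position indices linearly so that a longer sequence reuses the range of rotation angles seen during training. The lengths, head dimension, and the helper name `rope_angles` are assumptions chosen for the example; in practice this kind of rescaling is usually followed by some fine-tuning.

```python
import numpy as np

def rope_angles(positions, d_head, base=10000.0):
    """Rotation angles for each (position, frequency) pair, shape [len, d_head / 2]."""
    theta = base ** (-np.arange(0, d_head, 2) / d_head)
    return np.outer(positions, theta)

train_len, target_len, d_head = 2048, 8192, 64  # hypothetical lengths

positions = np.arange(target_len)
scaled = positions * (train_len / target_len)  # linear position interpolation

plain_angles = rope_angles(positions, d_head)
interp_angles = rope_angles(scaled, d_head)

# Unscaled positions run far past the trained range; scaled ones stay inside it
print(f"Largest raw position:    {positions.max()}")
print(f"Largest scaled position: {scaled.max():.2f} (trained on < {train_len})")
```

The trade-off is resolution: neighbouring tokens are now only a quarter of a position apart, which is one reason the model usually needs to adapt to the rescaled positions.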

Positional encoding is not an isolated module: it interacts with the tokenizer, training length, attention kernels, KV cache, and evaluation sets. Before expanding the context window, test needle-in-a-haystack retrieval, long-document QA, code navigation, and multi-hop reasoning, not just whether the model accepts longer input.
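A needle-in-a-haystack check can start from something as small as the prompt builder below. It is a sketch only: the function name, the filler text, and the “secret code” needle are all hypothetical, and sending the prompts to a model and scoring the answers is left out.

```python
def build_haystack_prompt(filler_sentences, needle, depth, question):
    """Insert `needle` at a relative depth in [0, 1] (0 = start, 1 = end) of the filler text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    insert_at = round(depth * len(filler_sentences))
    sentences = filler_sentences[:insert_at] + [needle] + filler_sentences[insert_at:]
    return " ".join(sentences) + "\n\nQuestion: " + question

filler = [f"Filler sentence number {i}." for i in range(1000)]
needle = "The secret code is 7421."
question = "What is the secret code?"

# One prompt per depth; a full test would also sweep the total context length
prompts = {
    depth: build_haystack_prompt(filler, needle, depth, question)
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0)
}
for depth, prompt in prompts.items():
    print(f"depth={depth:.2f}: needle at character {prompt.index(needle)}")
```

Sweeping both depth and total length gives a grid of results, which makes it easier to see whether failures cluster at particular positions rather than only at the longest inputs.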

Example code: sinusoidal positional encoding

import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """
    Generate sinusoidal positional encoding
    seq_len: sequence length
    d_model: model dimension (assumed even, so sin and cos columns pair up)
    Returns: array of shape [seq_len, d_model]
    """
    position = np.arange(seq_len)[:, np.newaxis]  # [seq_len, 1]
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))

    pe = np.zeros((seq_len, d_model))
    # Apply sin to even indices
    pe[:, 0::2] = np.sin(position * div_term)
    # Apply cos to odd indices
    pe[:, 1::2] = np.cos(position * div_term)

    return pe

# Example usage
seq_len, d_model = 10, 64
pe = sinusoidal_positional_encoding(seq_len, d_model)
print(f"Positional encoding shape: {pe.shape}")

# Inspect the first few dimensions of two positions
print(f"Position 0 encoding: {pe[0, :4]}")
print(f"Position 1 encoding: {pe[1, :4]}")

# Check relative position relationship
# The dot product of two encodings depends only on their offset,
# so pairs that are the same distance apart give the same value
dot_0_1 = np.dot(pe[0], pe[1])
dot_1_2 = np.dot(pe[1], pe[2])
print(f"Dot product of adjacent positions: {dot_0_1:.3f}, {dot_1_2:.3f}")
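The two dot products printed at the end of the example come out equal, and a short derivation shows why. Writing \omega_i for the i-th entry of div_term in the code above (a symbol introduced here for convenience), each sin/cos pair contributes a term that depends only on the offset p - q:

```latex
\begin{aligned}
\omega_i &= 10000^{-2i/d_{\text{model}}}, \qquad i = 0, \dots, d_{\text{model}}/2 - 1 \\
PE(p) \cdot PE(q)
  &= \sum_i \bigl[ \sin(\omega_i p)\,\sin(\omega_i q) + \cos(\omega_i p)\,\cos(\omega_i q) \bigr] \\
  &= \sum_i \cos\bigl(\omega_i (p - q)\bigr)
\end{aligned}
```

So the dot product is a function of the distance between two positions, not of the positions themselves. This is the sense in which an absolute sinusoidal encoding still carries relative-position information.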

Research view: theoretical limits of positional encoding

The core research question is: how can models generalize to lengths unseen during training? Sinusoidal encoding has a clear closed form but tends to extrapolate poorly in practice; RoPE achieves relative positional encoding through rotation matrices, and can be extended to some degree with interpolation or scaling. ALiBi adds a bias to the attention scores that becomes linearly more negative as the query-key distance grows, which keeps it simple and comparatively stable during extrapolation.
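The ALiBi bias is easy to write down. The sketch below assumes a power-of-two number of heads, for which the per-head slopes form the geometric sequence 2^(-8/n), 2^(-16/n), ...; the function names are hypothetical, and the causal mask is assumed to be applied separately.

```python
import numpy as np

def alibi_slopes(n_heads):
    """Per-head slopes; this simple form assumes n_heads is a power of two."""
    return 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)

def alibi_bias(seq_len, n_heads):
    """Bias of shape [n_heads, seq_len, seq_len], added to attention scores before softmax."""
    pos = np.arange(seq_len)
    distance = pos[:, None] - pos[None, :]  # query index minus key index
    distance = np.maximum(distance, 0)      # future keys are left to the causal mask
    return -alibi_slopes(n_heads)[:, None, None] * distance

bias = alibi_bias(seq_len=6, n_heads=8)
print(bias[0])  # head with the steepest slope (0.5): penalty grows by 0.5 per step of distance
```

Because the bias is a fixed function of distance with no learned position parameters, it is defined for any sequence length, including lengths never seen in training.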

A deeper question is: must position information be added explicitly? Some studies suggest that sufficiently deep networks can infer position indirectly. In causal decoder-only models, for example, each token attends to a different number of predecessors, and that asymmetry alone leaks positional information. Moreover, architectures without positional encoding (such as certain state-space models) also demonstrate sequential modeling capability, challenging the traditional assumption that positional encoding is essential.
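One way a causal model could recover position without explicit encodings is shown in the toy sketch below. It assumes content-independent (all-zero) attention scores and a marker value on the first token, both chosen for illustration; it demonstrates a possible mechanism, not what trained models necessarily do.

```python
import numpy as np

seq_len = 6
scores = np.zeros((seq_len, seq_len))  # no content or position signal in the scores
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
scores = np.where(causal, scores, -np.inf)  # each token sees only itself and earlier tokens

weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)  # uniform over the visible prefix

# Value is 1 for the first token (e.g. a begin-of-sequence marker) and 0 elsewhere
values = np.zeros(seq_len)
values[0] = 1.0

out = weights @ values  # out[t] == 1 / (t + 1)
print(out)
```

The output at index t equals 1 / (t + 1), a monotone function of position, so later layers could in principle decode where each token sits from the causal mask alone.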

🔬 Open Research Questions

Key questions and research directions in this area:

Why does RoPE's rotation matrix formulation naturally support relative position encoding? What are its extrapolation limits?

Related: Su et al. (2021), RoFormer

What are the fundamental differences in training dynamics between ALiBi's linear bias approach and explicit position encodings (e.g., sinusoidal/RoPE)?

Related: Press et al. (2021), ALiBi

In ultra-long context scenarios, is position encoding still the bottleneck? Are there alternative architectures that do not require position encoding?

Related: Yang et al. (2019), XLNet

References

Attention Is All You Need — Ashish Vaswani et al. (2017)
The foundational paper that introduced the Transformer architecture. The authors replaced RNNs and CNNs entirely with attention mechanisms, proposing multi-head self-attention and positional encoding. It outperformed prior models on machine translation, and most major LLMs today are built on this architecture.
RoFormer: Enhanced Transformer with Rotary Position Embedding — Jianlin Su et al. (2021)
RoPE (Rotary Position Embedding) is the position encoding scheme used in many major LLMs today (LLaMA, Mistral, Qwen, etc.). By applying position as a rotation of Query and Key in the attention computation, it handles relative positions elegantly and, combined with interpolation or scaling, extends to longer context lengths more gracefully than learned absolute position embeddings.
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation — Ofir Press et al. (2021)
Converts position information into a linear bias on attention scores, enabling extrapolation to several times the training length with no learned position parameters. A representative early long-context solution and an alternative approach to RoPE.
XLNet: Generalized Autoregressive Pretraining for Language Understanding — Zhilin Yang et al. (2019)
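Proposes permutation language modeling to combine the benefits of autoregressive (AR) and autoencoding (AE) objectives, and builds on Transformer-XL to handle long sequences. It shows that the choice of pre-training objective was still an open question after BERT, and it is among the more imaginative alternatives from that period.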
GLM-130B: An Open Bilingual Pre-trained Model — Aohan Zeng et al. (2022)
An open Chinese-English bilingual 130B-parameter model from Tsinghua University and Zhipu AI, and one of the earliest representative technical reports on building Chinese LLMs at scale. The subsequent ChatGLM-6B/9B models brought open-source Chinese dialogue models to wide adoption.
