Transformer Architecture: The Skeleton of Modern LLMs

Intuition: repeated layers of communication and transformation

A Transformer stacks many structurally identical blocks. Within each block, attention lets tokens exchange information, the feed-forward network transforms each token representation independently, and residual connections and normalization keep optimization stable. With enough layers, the model can combine local syntax, long-range dependencies, and abstract patterns.
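The "communicate, then transform" pattern can be sketched in a few lines of plain Python. This is a toy illustration, not a real Transformer: `mix_tokens` (uniform averaging) is only a stand-in for attention, `transform_token` (a ReLU) is only a stand-in for the feed-forward network, and all function names are hypothetical. What it does show is the pre-norm residual structure that each block repeats.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned scale)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def mix_tokens(tokens):
    """Stand-in for attention: every token receives the average of all tokens."""
    n = len(tokens)
    avg = [sum(col) / n for col in zip(*tokens)]
    return [avg[:] for _ in tokens]

def transform_token(vec):
    """Stand-in for the feed-forward network: a per-token ReLU."""
    return [max(0.0, v) for v in vec]

def add(a, b):
    """Element-wise sum, used for the residual connections."""
    return [x + y for x, y in zip(a, b)]

def block(tokens):
    # Communication: tokens exchange information, added back through a residual.
    normed = [layer_norm(t) for t in tokens]
    tokens = [add(t, m) for t, m in zip(tokens, mix_tokens(normed))]
    # Transformation: each token is updated on its own, again with a residual.
    tokens = [add(t, transform_token(layer_norm(t))) for t in tokens]
    return tokens

def run_stack(tokens, num_layers):
    """Apply the same kind of block repeatedly, as a Transformer stack does."""
    for _ in range(num_layers):
        tokens = block(tokens)
    return tokens

tokens = [[1.0, 0.0, -1.0], [0.5, 2.0, 0.0]]  # 2 tokens, 3 features each
out = run_stack(tokens, num_layers=4)
```

Because every sublayer only adds to its input, a stack of zero layers returns the input unchanged, and each extra layer refines the representation from the previous one instead of replacing it.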

BERT popularized bidirectional encoders for understanding tasks, while GPT-2 showed that decoder-only next-token prediction can learn broad generative abilities.

Engineering view: the critical path inside a decoder block

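A modern decoder block usually contains normalization, causal self-attention, an MLP, and residual connections. The causal mask prevents a position from seeing future tokens, matching autoregressive generation. MLPs often dominate parameter count: in a standard block with a 4x hidden expansion, the MLP holds roughly 8d² weights against roughly 4d² for the attention projections, where d is the model width. Attention, in turn, dominates long-context interaction cost, because the number of query-key scores grows quadratically with sequence length.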
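A minimal sketch of the causal mask, assuming a single attention head and no learned projections (queries, keys, and values are passed in directly). The function name `causal_attention` is hypothetical, and real implementations use batched tensor operations; the loop form is only meant to make the masking explicit.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask."""
    d = len(queries[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Position i scores only positions 0..i; later positions are masked out.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        # Masked (future) positions get exactly zero attention weight.
        weights = softmax(scores) + [0.0] * (len(keys) - i - 1)
        out = [
            sum(w * v[k] for w, v in zip(weights, values))
            for k in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 positions, 2 features each
outputs, weights = causal_attention(x, x, x)
```

The first position can attend only to itself, so its weight row is `[1.0, 0.0, 0.0]` and its output equals its own value vector. The nested loop over positions `i` and `j <= i` also makes the quadratic growth in score computations visible.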

Architecture choices such as RoPE, grouped-query attention, activation functions, MoE layers, and normalization placement affect both training stability and inference throughput. Compare systems by memory, KV-cache size, latency, and quality, not by parameter count alone.
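As a rough illustration of why KV-cache size belongs in the comparison, the sketch below estimates cache memory for a hypothetical configuration (32 layers, 32 query heads, head dimension 128, 4096-token context, 16-bit values). These numbers are assumptions chosen for easy arithmetic and do not describe any particular model. The estimate counts only the cached keys and values and ignores weights, activations, and allocator overhead.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Estimate KV-cache size: one key and one value vector per layer,
    per KV head, per cached token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size

# Hypothetical configuration, for illustration only.
layers, query_heads, head_dim, context = 32, 32, 128, 4096

# Multi-head attention: every query head has its own key/value head.
mha = kv_cache_bytes(layers, num_kv_heads=query_heads,
                     head_dim=head_dim, seq_len=context)

# Grouped-query attention: 8 KV heads are shared among the 32 query heads.
gqa = kv_cache_bytes(layers, num_kv_heads=8,
                     head_dim=head_dim, seq_len=context)

print(f"MHA cache: {mha / 1024**3:.2f} GiB")  # 2.00 GiB
print(f"GQA cache: {gqa / 1024**3:.2f} GiB")  # 0.50 GiB
```

Under these assumptions the two variants have the same number of query heads, yet the grouped-query version needs a quarter of the cache memory per sequence, a difference that parameter count alone would not reveal.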

References

  • Attention Is All You Need — Ashish Vaswani et al. (2017)

    The foundational paper that introduced the Transformer architecture. The authors replaced recurrence and convolution entirely with attention, proposing multi-head self-attention and positional encoding. The model set new state-of-the-art BLEU scores on the WMT 2014 English-to-German and English-to-French translation tasks at a lower training cost than prior models. Nearly every major LLM today builds on this architecture.

  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding — Jacob Devlin et al. (2018)

    BERT uses masked language modeling (MLM) and next sentence prediction to pretrain a bidirectional Transformer on large text corpora, and is then fine-tuned for downstream tasks. It set new state-of-the-art results on 11 NLP tasks, helping establish the pretrain-then-fine-tune paradigm that shaped much of modern NLP.

  • Language Models are Unsupervised Multitask Learners — Alec Radford et al. (2019)

    GPT-2 shows that a 1.5B-parameter language model trained only on unlabeled web text can perform a variety of language tasks zero-shot, without fine-tuning. This challenged the convention that NLP tasks require task-specific training, and the model was famously released in stages because of misuse concerns.