Transformer Architecture: The Skeleton of Modern LLMs
Intuition: repeated layers of communication and transformation
A Transformer stacks similar blocks. Attention lets tokens exchange information; the feed-forward network transforms each token representation; residual connections and normalization keep optimization stable. With enough layers, the model combines local syntax, long-range dependencies, and abstract patterns.
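The "communicate, then transform" pattern can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: it assumes a single attention head, a ReLU feed-forward network, pre-norm placement, no learned normalization parameters, and no causal mask. All names (`block`, `init_params`, `transformer`) and sizes are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance
    # (learned scale and shift are omitted in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, p):
    # Communication: each token reads a weighted mix of all tokens.
    q, k, v = x @ p["wq"], x @ p["wk"], x @ p["wv"]
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (weights @ v) @ p["wo"]

def mlp(x, p):
    # Transformation: applied to every token independently.
    return np.maximum(x @ p["w1"], 0.0) @ p["w2"]

def block(x, p):
    # Residual connections add each sublayer's output back to its input.
    x = x + attention(layer_norm(x), p)
    x = x + mlp(layer_norm(x), p)
    return x

def init_params(rng, d_model, d_ff, scale=0.02):
    shapes = {
        "wq": (d_model, d_model), "wk": (d_model, d_model),
        "wv": (d_model, d_model), "wo": (d_model, d_model),
        "w1": (d_model, d_ff), "w2": (d_ff, d_model),
    }
    return {name: rng.normal(0.0, scale, shape) for name, shape in shapes.items()}

def transformer(x, layers):
    # The same block structure is repeated with different weights.
    for p in layers:
        x = block(x, p)
    return x

rng = np.random.default_rng(0)
layers = [init_params(rng, d_model=16, d_ff=64) for _ in range(4)]
x = rng.normal(size=(5, 16))   # 5 tokens, 16 features each
y = transformer(x, layers)
print(y.shape)                 # (5, 16)
```

Because every block maps a `(tokens, d_model)` array to an array of the same shape, blocks can be stacked to any depth. Only `attention` mixes information across rows; `mlp` would give the same result for a token whether or not the other tokens were present.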
BERT popularized bidirectional encoders for understanding tasks, while GPT-2 showed that decoder-only next-token prediction can learn broad generative abilities.
Engineering view: the critical path inside a decoder block
A modern decoder block usually contains normalization, causal self-attention, an MLP, and residual connections. The causal mask prevents a position from seeing future tokens, matching autoregressive generation. MLPs often dominate parameter count, while attention dominates long-context interaction cost.
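The effect of the causal mask can be checked directly. The following NumPy sketch assumes standard multi-head attention without biases or dropout; the function names, shapes, and the `d_ff = 4 * d_model` ratio used in the parameter count are illustrative assumptions, not properties of any particular model.

```python
import numpy as np

def causal_self_attention(x, wq, wk, wv, wo, n_heads):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split_heads(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ wq), split_heads(x @ wk), split_heads(x @ wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, T, T)

    # True above the diagonal marks "future" positions, which are blocked.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                              # exp(-inf) == 0
    weights = weights / weights.sum(axis=-1, keepdims=True)

    out = weights @ v                                     # (n_heads, T, d_head)
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ wo

def block_param_counts(d_model, d_ff):
    # Weight matrices only: Q, K, V, O projections versus a two-layer MLP.
    return {"attention": 4 * d_model * d_model, "mlp": 2 * d_model * d_ff}

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 32, 4
wq, wk, wv, wo = (rng.normal(0.0, 0.1, (d_model, d_model)) for _ in range(4))
x = rng.normal(size=(seq_len, d_model))

out = causal_self_attention(x, wq, wk, wv, wo, n_heads)

# Change only the last token: earlier positions must not be affected.
x_perturbed = x.copy()
x_perturbed[-1] += 1.0
out_perturbed = causal_self_attention(x_perturbed, wq, wk, wv, wo, n_heads)
print(np.allclose(out[:-1], out_perturbed[:-1]))  # True

print(block_param_counts(d_model=512, d_ff=2048))
# {'attention': 1048576, 'mlp': 2097152}
```

With these assumed shapes the MLP holds twice as many weights as attention. The `(T, T)` score matrix is what makes attention cost grow with context length; gated MLP variants or different `d_ff` ratios would change the parameter split.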
Architecture choices such as RoPE, grouped-query attention, activation functions, MoE layers, and normalization placement affect both training stability and inference throughput. Compare systems by memory footprint, KV-cache size, latency, and quality, not only by parameter count.
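As a rough worked example of why KV-cache size matters, the cache can be estimated from a handful of configuration values. The formula below assumes one key tensor and one value tensor cached per layer, with no quantization, paging, or sliding-window tricks. The configuration (32 layers, head dimension 128, 4096-token context, 16-bit values) is hypothetical and does not describe a specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    # Factor of 2: keys and values are both cached in every layer.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

GIB = 1024 ** 3

# Multi-head attention: every query head has its own K/V head.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)

# Grouped-query attention: 32 query heads share 8 K/V heads.
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)

print(f"MHA: {mha / GIB:.2f} GiB")  # MHA: 2.00 GiB
print(f"GQA: {gqa / GIB:.2f} GiB")  # GQA: 0.50 GiB
```

Under these assumptions, the two configurations could have nearly the same parameter count, yet the grouped-query variant needs a quarter of the cache memory per sequence. That difference translates into larger batches or longer contexts on the same hardware.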
References
- Attention Is All You Need
The foundational paper that introduced the Transformer architecture. The authors dispensed with recurrence and convolution in favor of attention alone, proposing multi-head self-attention and positional encoding. It set new state-of-the-art results on the WMT 2014 English-to-German and English-to-French translation tasks. Most major LLMs today build on this architecture.
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT pretrains a bidirectional Transformer on large text corpora with masked language modeling (MLM) and next sentence prediction, then fine-tunes it for downstream tasks. It set new state-of-the-art results on 11 NLP tasks and established the "pretrain+finetune" paradigm that shaped later NLP work.
- Language Models are Unsupervised Multitask Learners
GPT-2 shows that a 1.5B-parameter language model trained only on unlabeled web text can perform a range of language tasks zero-shot, without fine-tuning. This challenged the convention that NLP tasks require task-specific training. The model also became a well-known case of staged release: OpenAI published progressively larger versions over time because of misuse concerns.