Transformer Architecture: The Skeleton of Modern LLMs

Intuition: repeated layers of communication and transformation

A Transformer stacks similar blocks. Attention lets tokens exchange information; the feed-forward network transforms each token representation; residual connections and normalization keep optimization stable. With enough layers, the model combines local syntax, long-range dependencies, and abstract patterns.

BERT popularized bidirectional encoders for understanding tasks, while GPT-2 showed that decoder-only next-token prediction can learn broad generative abilities.

Engineering view: the critical path inside a decoder block

A modern decoder block usually contains normalization, causal self-attention, an MLP, and residual connections. The causal mask prevents each position from attending to later tokens, which matches autoregressive generation. The MLP often accounts for most of a block's parameters, while attention cost grows quadratically with sequence length and so dominates at long context.
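As a rough back-of-the-envelope check of the parameter and cost claim, the sketch below counts weights and multiply-adds for one block. It assumes four d_model × d_model attention projections, a two-matrix MLP, and no biases; the function names are hypothetical, and the sizes are illustrative values that happen to match the example code later in this section.

```python
def block_param_counts(d_model, d_ff):
    """Weights in one decoder block, ignoring biases and normalization gains."""
    attn = 4 * d_model * d_model   # W_q, W_k, W_v, W_o
    mlp = 2 * d_model * d_ff       # up-projection and down-projection
    return attn, mlp


def attention_score_flops(seq_len, d_model):
    """Approximate multiply-adds for Q @ K.T plus weights @ V (all heads combined)."""
    return 2 * seq_len * seq_len * d_model


def mlp_flops(seq_len, d_model, d_ff):
    """Approximate multiply-adds for the two MLP matrix products."""
    return 2 * seq_len * d_model * d_ff


d_model, d_ff = 512, 2048
attn_params, mlp_params = block_param_counts(d_model, d_ff)
print(f"attention params: {attn_params:,}  MLP params: {mlp_params:,}")

for seq_len in (1_024, 8_192, 65_536):
    ratio = attention_score_flops(seq_len, d_model) / mlp_flops(seq_len, d_model, d_ff)
    print(f"seq_len={seq_len:>6}: attention-score / MLP multiply-adds = {ratio:.2f}")
```

Under these assumptions the ratio reduces to seq_len / d_ff, so the MLP dominates for short sequences and the attention scores take over once the context is longer than d_ff. Real systems differ (gated MLPs, grouped-query attention, fused kernels), so treat the numbers as orders of magnitude only.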

Architecture choices such as RoPE, grouped-query attention, activation functions, MoE layers, and normalization placement affect both training stability and inference throughput. Compare systems by memory footprint, KV-cache size, latency, and quality, not by parameter count alone.
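To make the KV-cache comparison concrete, here is a small sizing sketch. The helper kv_cache_bytes and the configuration (32 layers, head dimension 128, 4096 tokens, 2-byte elements) are assumptions chosen for illustration, not the settings of any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and position."""
    # Factor 2: one tensor for keys, one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Hypothetical configuration, chosen only for illustration.
config = dict(n_layers=32, head_dim=128, seq_len=4096)

mha = kv_cache_bytes(n_kv_heads=32, **config)  # every query head has its own K/V head
gqa = kv_cache_bytes(n_kv_heads=8, **config)   # 4 query heads share each K/V head

gib = 1024 ** 3
print(f"MHA cache: {mha / gib:.2f} GiB, GQA cache: {gqa / gib:.2f} GiB")
```

In this sketch grouped-query attention shrinks the cache in proportion to the number of key/value heads while the parameter count barely changes, which is one reason parameter count alone is a poor basis for comparison.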

Example code: simplified Transformer block

import numpy as np

class TransformerBlock:
    """Simplified Transformer decoder block (runnable educational demo)"""

    def __init__(self, d_model, n_heads, d_ff):
        self.d_model = d_model
        self.n_heads = n_heads  # stored for reference; this demo uses a single head
        self.d_ff = d_ff
        # Initialize simplified weight matrices
        # (In real training use Xavier/Kaiming initialization)
        rng = np.random.default_rng(0)
        self.W_q = rng.normal(0, 0.01, (d_model, d_model))
        self.W_k = rng.normal(0, 0.01, (d_model, d_model))
        self.W_v = rng.normal(0, 0.01, (d_model, d_model))
        self.W_o = rng.normal(0, 0.01, (d_model, d_model))
        self.W1 = rng.normal(0, 0.01, (d_model, d_ff))
        self.W2 = rng.normal(0, 0.01, (d_ff, d_model))

    def layer_norm(self, x):
        """Layer normalization without the learnable scale and shift"""
        mean = x.mean(axis=-1, keepdims=True)
        std = x.std(axis=-1, keepdims=True)
        return (x - mean) / (std + 1e-5)

    def self_attention(self, x):
        """Simplified single-head causal scaled dot-product attention"""
        # x: [seq_len, d_model]
        seq_len = x.shape[0]
        Q = x @ self.W_q  # [seq_len, d_model]
        K = x @ self.W_k
        V = x @ self.W_v

        # Scaled dot-product: softmax(Q·K^T / sqrt(d)) · V
        scores = Q @ K.T / np.sqrt(self.d_model)  # [seq_len, seq_len]
        # Causal mask: position i may only attend to positions j <= i
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
        # Numerical stability: subtract max before exp
        exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
        attn_weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)
        return attn_weights @ V @ self.W_o

    def feed_forward(self, x):
        """Feed-forward: d_model → d_ff → d_model with ReLU activation"""
        # FFN(x) = ReLU(x · W1) · W2  (row-vector convention, no biases)
        hidden = np.maximum(0, x @ self.W1)  # ReLU, [seq_len, d_ff]
        return hidden @ self.W2

    def forward(self, x):
        """
        Transformer block forward pass
        x: [seq_len, d_model]
        """
        # 1. Pre-norm + self-attention + residual
        residual = x
        x = self.layer_norm(x)
        x = self.self_attention(x)
        x = x + residual

        # 2. Pre-norm + FFN + residual
        residual = x
        x = self.layer_norm(x)
        x = self.feed_forward(x)
        x = x + residual

        return x

# Example usage
seq_len, d_model = 10, 512
block = TransformerBlock(d_model=d_model, n_heads=8, d_ff=2048)
x = np.random.default_rng(1).normal(size=(seq_len, d_model))  # seeded for reproducibility
output = block.forward(x)
print(f"Input shape: {x.shape}, Output shape: {output.shape}")
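The example above uses single-head attention and only stores n_heads. As a hedged sketch of how the heads could be split, the standalone function below (multi_head_causal_attention is a hypothetical name, not part of the class above) reshapes the projections into heads, applies a causal mask, and merges the heads again. It also checks that changing the last token leaves earlier positions untouched.

```python
import numpy as np


def multi_head_causal_attention(x, W_q, W_k, W_v, W_o, n_heads):
    """Multi-head causal self-attention for x of shape [seq_len, d_model]."""
    seq_len, d_model = x.shape
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    d_head = d_model // n_heads

    def split_heads(t):
        # [seq_len, d_model] -> [n_heads, seq_len, d_head]
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)

    # Per-head scores, scaled by the head dimension: [n_heads, seq_len, seq_len]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)

    # Causal mask: position i may attend only to positions j <= i.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    # Merge heads back to [seq_len, d_model] and apply the output projection.
    merged = (weights @ V).transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ W_o


rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 16, 4
W_q, W_k, W_v, W_o = (rng.normal(0, 0.1, (d_model, d_model)) for _ in range(4))
x = rng.normal(size=(seq_len, d_model))
out = multi_head_causal_attention(x, W_q, W_k, W_v, W_o, n_heads)

# Changing the last token must leave every earlier position unchanged.
x_changed = x.copy()
x_changed[-1] += 1.0
out_changed = multi_head_causal_attention(x_changed, W_q, W_k, W_v, W_o, n_heads)
print(out.shape, np.allclose(out[:-1], out_changed[:-1]))
```

Each head works in a d_model / n_heads subspace, so the total projection cost stays the same as in the single-head version while the heads can attend to different positions.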

Research view: architecture evolution and mixture of experts

The dominance of decoder-only architectures was not preordained; it emerged as an empirical choice. Encoder-decoder models like T5 can still hold advantages in translation and summarization, while pure decoders win on simplicity of generation and ease of scaling.

Mixture of Experts (MoE) is a hot topic in architecture research: through sparse activation, a model can grow its total parameter count without a proportional increase in per-token compute, although all experts must still be held in memory. MoE also introduces new challenges in routing stability, load balancing, communication overhead, and fine-tuning difficulty. Future architectures may be modular, composable systems that dynamically select sub-networks per task, rather than today’s “one giant model for everything.”
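A minimal sketch of sparse activation with top-2 routing follows. It assumes a linear gate, small ReLU experts, and a softmax over the selected experts only; moe_layer and all sizes are hypothetical, and production routers add capacity limits and auxiliary load-balancing losses that are omitted here.

```python
import numpy as np


def moe_layer(x, W_gate, experts, top_k=2):
    """Sparse MoE feed-forward: each token is processed by its top_k experts only.

    x: [n_tokens, d_model]; W_gate: [d_model, n_experts];
    experts: list of (W1, W2) pairs, each a small ReLU MLP.
    """
    logits = x @ W_gate                                # [n_tokens, n_experts]
    top_idx = np.argsort(logits, axis=-1)[:, -top_k:]  # indices of chosen experts
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)

    # Softmax over the selected experts only, so gate weights sum to 1 per token.
    gates = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)

    out = np.zeros_like(x)
    for e, (W1, W2) in enumerate(experts):
        token_ids, slot = np.nonzero(top_idx == e)     # tokens routed to expert e
        if token_ids.size == 0:
            continue                                   # an unused expert costs no compute
        hidden = np.maximum(0, x[token_ids] @ W1)
        out[token_ids] += gates[token_ids, slot][:, None] * (hidden @ W2)
    return out, top_idx


rng = np.random.default_rng(0)
n_tokens, d_model, d_ff, n_experts = 5, 8, 16, 4
x = rng.normal(size=(n_tokens, d_model))
W_gate = rng.normal(size=(d_model, n_experts))
experts = [(rng.normal(0, 0.1, (d_model, d_ff)), rng.normal(0, 0.1, (d_ff, d_model)))
           for _ in range(n_experts)]

out, top_idx = moe_layer(x, W_gate, experts)
load = np.bincount(top_idx.ravel(), minlength=n_experts)
print("output shape:", out.shape)
print("tokens per expert:", load)
```

Each token touches only two of the four expert MLPs, so per-token compute is roughly half that of running every expert. The per-expert token counts are often uneven, which is the load-balancing problem mentioned above.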

🔬 Open Research Questions

Key questions and research directions in this area:

Which components of the Transformer architecture are essentially necessary? Can they be further simplified or replaced?

Related: Attention Is All You Need (Vaswani et al., 2017)
Pre-LN vs Post-LN: What is the mechanism by which different normalization positions affect training stability and model performance?
Can the role of FFN be replaced with more efficient structures? Is MoE the only viable direction for sparsification?
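For the Pre-LN versus Post-LN question, the two placements differ only in where normalization sits relative to the residual add. The toy comparison below is a sketch under strong assumptions: a fixed random ReLU layer stands in for attention or the MLP, normalization has no learned parameters, and nothing is trained, so it shows the structural difference only and says nothing about training stability.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance (no learned gain/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def pre_ln(x, sublayer):
    # Normalize the sublayer input; the residual path stays an identity map.
    return x + sublayer(layer_norm(x))


def post_ln(x, sublayer):
    # Normalize after the residual add, as in the original Transformer.
    return layer_norm(x + sublayer(x))


rng = np.random.default_rng(0)
d_model = 16
W = rng.normal(0, 0.1, (d_model, d_model))
sublayer = lambda h: np.maximum(0, h @ W)  # stand-in for attention or the MLP

x = rng.normal(size=(4, d_model))
h_pre, h_post = x, x
for _ in range(12):  # stack 12 identical sublayers
    h_pre = pre_ln(h_pre, sublayer)
    h_post = post_ln(h_post, sublayer)

print("Pre-LN  per-token std:", h_pre.std(axis=-1).round(2))
print("Post-LN per-token std:", h_post.std(axis=-1).round(2))
```

Post-LN renormalizes the residual stream at every layer, so its per-token scale stays near 1, whereas Pre-LN leaves the stream unnormalized and lets each sublayer add to it. How these properties translate into gradient behavior and final quality is the open question.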

References

Attention Is All You Need — Ashish Vaswani et al. (2017)
The foundational paper that introduced the Transformer architecture. The authors replaced recurrence and convolution entirely with attention, proposing multi-head self-attention and positional encoding. The model set new state-of-the-art BLEU scores on WMT 2014 English-to-German and English-to-French translation at a fraction of the training cost of prior models. Nearly every major LLM today builds on this architecture.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding — Jacob Devlin et al. (2018)
BERT uses masked language modeling (MLM) and next sentence prediction to pretrain a bidirectional Transformer on large text corpora, then fine-tunes it for downstream tasks. It set new state-of-the-art results on eleven NLP tasks and helped establish the pretrain-then-fine-tune paradigm.
Language Models are Unsupervised Multitask Learners — Alec Radford et al. (2019)
GPT-2 shows that a 1.5B-parameter language model trained only on unlabeled web text can perform a variety of language tasks zero-shot, without fine-tuning. This challenged the convention that NLP tasks require task-specific training. The model was also famously released in stages because of misuse concerns.
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts — Nan Du et al. (2021)
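The largest GLaM model has 1.2T parameters. The authors report that it used about one third of the energy needed to train GPT-3 and about half the inference FLOPs, while achieving better overall zero-shot and one-shot performance across 29 NLP tasks. It is an early representative of MoE winning on cost-effectiveness; Mixtral and DeepSeek-V2/V3 are its spiritual descendants.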
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity — William Fedus et al. (2021)
Switch Transformer scaled Transformers to the trillion-parameter range using Mixture-of-Experts (MoE). Its router sends each token to a single expert, so only a small fraction of parameters is activated per token ("sparse activation"), and the authors report better performance than dense models at the same compute. Mixtral uses a related sparse MoE design with top-2 routing; GPT-4 is often rumored to use MoE, but its architecture has not been publicly disclosed.
Mixtral of Experts — Albert Q. Jiang et al. (2024)
Mixtral 8x7B is one of the first widely used open-weight MoE language models: each layer has 8 expert networks and each token is routed to 2 of them, so about 13B of the 47B total parameters are active per token. With per-token compute similar to a 13B dense model, it is reported to match or surpass LLaMA 2 70B on most benchmarks, demonstrating the viability of MoE for open models.
