Pretraining and Scaling Law: How Models Learn

Intuition: read the internet and learn to guess the next word

Pretraining is the foundation of LLM capabilities. The model is trained on massive amounts of text (web pages, books, code, papers) with a simple task: given the preceding words (in practice, tokens), predict the next one. By repeating this task across the corpus, it gradually learns grammar, common sense, reasoning patterns, and world knowledge.
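To make the "guess the next word" task concrete, here is a toy sketch that uses simple bigram counts. Real LLMs use neural networks over tokens rather than count tables, so this only illustrates the training objective; the corpus and the function names `train_bigram` and `predict_next` are made up for this example.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Hypothetical three-sentence "corpus" standing in for web-scale text.
corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the dog sat on the rug",
]
model = train_bigram(corpus)
print(predict_next(model, "sat"))   # most common word after "sat"
print(predict_next(model, "the"))   # most common word after "the"
```

Even this tiny table picks up regularities from raw text alone ("sat" is usually followed by "on"). Pretraining does the same thing with far richer context and far more data.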

This is like a child learning language by listening to adults: not by being taught rules directly, but by spontaneously discovering statistical patterns from massive input. The larger the model, the more diverse the data, and the longer the training, the richer the “implicit knowledge” it acquires. Scaling laws describe the observation that model performance (measured as loss) improves predictably with model size, data volume, and compute.
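As a rough illustration of what "predictably" means, Kaplan et al. (2020) report power-law forms like the following when the other two factors are not the bottleneck. Here N is the parameter count, D the dataset size in tokens, and C the training compute. The constants N_c, D_c, C_c and the exponents are fitted empirically, and the notation is simplified from the paper.

```latex
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}
```

Each relation is a straight line on a log-log plot, which is why results from small runs can be extrapolated to larger ones.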

Engineering view: data, stability, and compute efficiency

In practice, the core challenge of pretraining is not “making it run,” but “making it stable, efficient, and reproducible.” Data engineering dominates: deduplication, filtering low-quality content, balancing multilingual and code ratios, and handling privacy and copyright concerns. Good data mixtures can let a smaller model outperform a larger one trained on poor data.
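A minimal sketch of two of these steps, exact deduplication and a crude quality filter, might look like the following. It is an assumption-laden illustration: production pipelines typically add near-duplicate detection and learned quality classifiers, and the `min_words` threshold and function names are hypothetical.

```python
import hashlib
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.strip().lower())

def clean_corpus(docs, min_words=5):
    """Drop very short documents and exact duplicates (after normalization)."""
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:
            continue  # crude quality filter: too short to carry much signal
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # already kept a document with the same normalized text
        seen.add(digest)
        kept.append(doc)
    return kept

# Hypothetical input: one duplicate (differing in case/spacing) and one junk line.
docs = [
    "Scaling laws relate loss to model size and data.",
    "scaling laws  relate loss to model size and data.",
    "Click here",
    "Gradient clipping limits the norm of the update.",
]
cleaned = clean_corpus(docs)
print(len(cleaned))
```

Hashing the normalized text keeps memory use proportional to the number of unique documents rather than their total size, which matters at web scale.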

Training stability is another hurdle. Large, deep models are prone to loss spikes, exploding gradients, or attention collapse. Common mitigations include learning-rate warmup, gradient clipping, careful mixed-precision training, and improved normalization. Activation checkpointing and model/pipeline parallelism address a related constraint, fitting the model into memory, rather than stability itself.
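Two of these mitigations are easy to sketch without a deep-learning framework. The schedule below uses linear warmup followed by cosine decay, and the clipping function rescales gradients by their global L2 norm. All default values (`peak_lr`, `warmup_steps`, `max_norm`, and so on) are illustrative assumptions, not recommendations from any particular paper.

```python
import math

def lr_at_step(step, peak_lr=3e-4, warmup_steps=1000,
               total_steps=100_000, min_lr=3e-5):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads), norm
    scale = max_norm / norm
    return [g * scale for g in grads], norm

# A gradient with norm 5 is scaled down to norm 1; the direction is preserved.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(clipped, norm_before)
```

Warmup keeps early updates small while optimizer statistics are still noisy, and clipping bounds the size of any single update. Logging the pre-clip norm is a common way to spot an impending loss spike.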

Chinchilla (Hoffmann et al., 2022) showed that, for a fixed compute budget, many models of the time were too large and trained on too few tokens. Compute-optimal training suggests that parameters and data should grow together, rather than parameters alone. This is especially important for teams with limited budgets.
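A back-of-the-envelope version of this trade-off can be sketched with two widely used approximations, both assumptions here rather than exact results: training compute is roughly C ≈ 6·N·D FLOPs for N parameters and D tokens, and a compute-optimal run uses on the order of 20 tokens per parameter. The function name and defaults are hypothetical.

```python
import math

def compute_optimal_split(flops_budget, tokens_per_param=20.0):
    """Split a FLOPs budget into parameters and tokens.

    Assumes C ~= 6 * N * D and D ~= tokens_per_param * N, so
    N = sqrt(C / (6 * tokens_per_param)). Both are rules of thumb.
    """
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Budget matching a 70B-parameter model trained on 1.4T tokens (Chinchilla's setup).
budget = 6.0 * 70e9 * 1.4e12
n_params, n_tokens = compute_optimal_split(budget)
print(f"{n_params:.3g} parameters, {n_tokens:.3g} tokens")
```

Under these assumptions, both the parameter count and the token count grow with the square root of compute. That is the sense in which they "grow together".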

Research view: from empirical laws to mechanistic understanding

Scaling laws provide empirical predictive tools, but their theoretical explanations are still developing. Key questions include: why does loss follow a power law with compute? Is this universal across architectures, data distributions, and optimizers? Are there “phase transitions” beyond pure scale?
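In practice, the "predictive tool" is often a straight-line fit in log-log space. The sketch below fits L = a · C^(-b) to synthetic points and extrapolates to a larger budget. The data is generated rather than measured, and real fits often add an irreducible-loss constant that this simple form omits.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by ordinary least squares on (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    slope = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
             / sum((u - mx) ** 2 for u in lx))
    intercept = my - slope * mx
    return math.exp(intercept), -slope

# Synthetic "small-scale runs": compute in FLOPs, loss generated from a known law.
compute = [1e17, 1e18, 1e19, 1e20]
loss = [10.0 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)
predicted = a * 1e22 ** -b  # extrapolate two orders of magnitude beyond the data
print(a, b, predicted)
```

With noiseless synthetic data the fit recovers the generating constants. With real training runs, the open questions above are about when, and why, such extrapolation keeps working.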

Current research directions also include: can curriculum learning accelerate convergence? How does continual pretraining learn new domains while preserving old knowledge? And what is the causal link between knowledge learned during pretraining and downstream task performance?

🔬 Open Research Questions

Key questions and research directions in this area:

When do Scaling Laws break down? Are there theoretical limits like "data walls" or "parameter walls"?

Related: Hoffmann et al. (2022), Training Compute-Optimal Large Language Models (Chinchilla)
How to optimize compute budget allocation under resource constraints? Can small models approach large model performance with more data?
Do multimodal, long-context, and sparse models follow the same Scaling Laws?

References

Scaling Laws for Neural Language Models — Jared Kaplan et al. (2020)
OpenAI's scaling laws paper finds that language model performance (cross-entropy loss) follows power laws in model parameters, dataset size, and compute. This makes it possible to predict large-scale training results from small experiments. The paper provided the empirical basis for the LLM scale-up race and helped motivate larger models such as GPT-3.
Training Compute-Optimal Large Language Models — Jordan Hoffmann et al. (2022)
Proposes the Chinchilla scaling laws: given a fixed compute budget, model parameters and training tokens should be scaled in roughly equal proportion (challenging the prior belief that parameters matter more). Chinchilla 70B, trained with the same compute budget as Gopher 280B but on more data, outperformed it, redefining optimal LLM training strategy.
Language Models are Few-Shot Learners — Tom Brown et al. (2020)
OpenAI's GPT-3 paper demonstrates that a 175B parameter language model can perform diverse tasks through few-shot in-context learning without fine-tuning. It established the paradigm that scale unlocks emergent capabilities and launched the era of prompt engineering.
PaLM: Scaling Language Modeling with Pathways — Aakanksha Chowdhery et al. (2022)
Google's 540B parameter PaLM model trained on the Pathways system. The paper details training stability techniques, data mixture strategies, and observations of emergent capabilities, serving as an important reference for large-model pretraining engineering.
RoBERTa: A Robustly Optimized BERT Pretraining Approach — Yinhan Liu et al. (2019)
Uses more data, trains longer, and removes the next sentence prediction (NSP) objective to show that BERT was far from fully trained. It matters not only for producing a stronger model, but also as an early, clear demonstration that the "training recipe" itself is a core research question.
Deduplicating Training Data Makes Language Models Better — Katherine Lee et al. (2022)
Systematically demonstrates that deduplicating training data significantly improves language model performance and reduces memorization. By removing near-duplicate and exact-duplicate examples from C4 and RealNews, models perform better on downstream tasks and are far less likely to emit training data verbatim.
