Pretraining and Scaling Laws: How Models Learn

Intuition: read the internet and learn to guess the next word

Pretraining is the foundation of LLM capabilities. The model is trained on a massive text corpus (web pages, books, code, papers) with a simple task: given the preceding words (in practice, tokens), predict the next one. Over billions of such predictions, it gradually picks up grammar, common sense, reasoning patterns, and world knowledge.
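The next-word objective described above is usually implemented as a cross-entropy loss over the model's predicted distribution. The following is a minimal sketch using only the standard library; the three-word vocabulary, the logits, and the function names are made up for illustration and do not come from any particular framework.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, target_ids):
    """Average cross-entropy between predicted distributions and true next tokens."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Hypothetical toy vocabulary: index 0 = "the", 1 = "cat", 2 = "sat".
# The text is "the cat sat", so the targets after "the" and "cat" are 1 and 2.
targets = [1, 2]

# An untrained model that assigns equal scores to every word.
uniform_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
uniform_loss = next_token_loss(uniform_logits, targets)

# A model that strongly favors the correct next word at each position.
confident_logits = [[0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]
confident_loss = next_token_loss(confident_logits, targets)

print(f"uniform loss:   {uniform_loss:.4f}")
print(f"confident loss: {confident_loss:.4f}")
```

With equal scores the loss equals the logarithm of the vocabulary size; training pushes it down by shifting probability toward the words that actually follow.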

This is like a child learning language by listening to adults: not by being taught rules directly, but by discovering statistical patterns in massive input. The larger the model, the more diverse the data, and the longer the training, the richer the “implicit knowledge” it acquires. Scaling laws capture the observation that model performance, usually measured as loss, improves predictably as model size, data volume, and compute increase.

Engineering view: data, stability, and compute efficiency

In practice, the core challenge of pretraining is not “making it run,” but “making it stable, efficient, and reproducible.” Data engineering dominates: deduplication, filtering low-quality content, balancing multilingual and code ratios, and handling privacy and copyright concerns. Good data mixtures can let a smaller model outperform a larger one trained on poor data.
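Two of the data steps named above, deduplication and low-quality filtering, can be sketched in a few lines. This is a deliberately simple illustration under stated assumptions: it removes only exact duplicates after whitespace and case normalization, and it uses a word-count threshold (`min_words`, a hypothetical parameter) as a stand-in for a real quality filter. Production pipelines typically add near-duplicate detection and learned quality classifiers.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())

def clean_corpus(docs, min_words=5):
    """Drop very short documents and exact duplicates (after normalization)."""
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        # Crude quality filter: very short documents are assumed to be noise.
        if len(norm.split()) < min_words:
            continue
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # already kept an equivalent document
        seen.add(digest)
        kept.append(doc)
    return kept

# Made-up example documents.
docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",  # duplicate after normalization
    "click here",                                     # too short
    "Scaling laws relate loss to parameters, data, and compute.",
]
cleaned = clean_corpus(docs)
print(len(cleaned), "of", len(docs), "documents kept")
```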

Training stability is another hurdle. Large, deep models are prone to loss spikes, exploding gradients, and attention collapse. Common mitigations include learning-rate warmup, gradient clipping, careful handling of numeric precision in mixed-precision training, and improved normalization. Activation checkpointing and model/pipeline parallelism address a related but separate problem: fitting the model into memory and keeping hardware utilization high.
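Two of the mitigations listed above, learning-rate warmup and gradient clipping, are simple enough to sketch directly. The version below assumes a linear warmup to a constant peak rate and clipping by global L2 norm over a flat list of gradient values; the function names and the numbers are illustrative, and real training code would apply the same logic to tensors through its framework's utilities.

```python
import math

def warmup_lr(step, peak_lr, warmup_steps):
    """Linear warmup from near zero to peak_lr, then hold constant."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr

def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their combined L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm or total_norm == 0.0:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

# Illustrative hyperparameters, not recommendations.
lr_first = warmup_lr(0, peak_lr=3e-4, warmup_steps=100)
lr_after = warmup_lr(500, peak_lr=3e-4, warmup_steps=100)

# A gradient with norm 5.0 is scaled down to norm 1.0; its direction is unchanged.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(lr_first, lr_after, norm_before, clipped)
```

Warmup limits the size of early updates while optimizer statistics are still unreliable, and clipping bounds the damage from an occasional very large gradient.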

The Chinchilla study (Hoffmann et al., 2022) showed that, under a fixed compute budget, many large models of the time were trained on too few tokens for their size. Compute-optimal training means growing parameters and training tokens together rather than growing parameters alone. This is especially important for teams with limited budgets.
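The compute-optimal idea can be turned into a back-of-the-envelope calculation. The sketch below rests on two widely quoted approximations that are assumptions here, not statements from this page: training compute is roughly 6 FLOPs per parameter per token (C ≈ 6·N·D), and the compute-optimal ratio is roughly 20 training tokens per parameter. Both constants are rules of thumb, and the actual fitted values vary with data and setup.

```python
import math

# Rule-of-thumb constants (assumptions, see the note above).
FLOPS_PER_PARAM_TOKEN = 6.0   # C ~ 6 * N * D
TOKENS_PER_PARAM = 20.0       # D ~ 20 * N at the compute-optimal point

def compute_optimal_allocation(compute_flops):
    """Split a FLOP budget into parameters N and tokens D with D = 20 * N.

    Substituting D = 20 * N into C = 6 * N * D gives N = sqrt(C / 120).
    """
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

# Illustrative budget of 5.76e23 FLOPs.
n_params, n_tokens = compute_optimal_allocation(5.76e23)
print(f"parameters: {n_params / 1e9:.1f}B, tokens: {n_tokens / 1e12:.2f}T")
```

For this budget the estimate comes out near 70 billion parameters and about 1.4 trillion tokens, which is the same order as the Chinchilla configuration cited in the references.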

Research view: from empirical laws to mechanistic understanding

Scaling laws provide empirical predictive tools, but their theoretical explanations are still developing. Key questions include: why does loss follow a power law with compute? Is this universal across architectures, data distributions, and optimizers? Are there “phase transitions” beyond pure scale?
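To make the "predictive tool" concrete: a power law L = a · C^(-alpha) is a straight line in log-log space, so its exponent can be estimated with ordinary least squares on the logarithms and then extrapolated. The sketch below uses synthetic data generated from a known exponent, not measurements from any real model, and it assumes a pure power law with no irreducible-loss term, which is a simplification of the forms used in the literature.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**(-alpha) by least squares on (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    cov = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    var = sum((u - mean_x) ** 2 for u in lx)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)

# Synthetic "small-scale runs": loss generated from a = 10, alpha = 0.05.
compute = [1e18, 1e19, 1e20, 1e21]
loss = [10.0 * c ** -0.05 for c in compute]

a, alpha = fit_power_law(compute, loss)

# Extrapolate one order of magnitude beyond the largest "run".
predicted = a * 1e22 ** (-alpha)
print(f"a = {a:.3f}, alpha = {alpha:.4f}, predicted loss at 1e22 FLOPs = {predicted:.4f}")
```

On real runs the points scatter around the line, and how far such a fit can be trusted outside the measured range is one of the open questions raised here.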

Current research directions also include: can curriculum learning accelerate convergence? How does continual pretraining learn new domains while preserving old knowledge? And what is the causal link between knowledge learned during pretraining and downstream task performance?

🔬 Open Research Questions

Key questions and research directions in this area:

  1. When do scaling laws break down? Are there hard limits such as "data walls" or "parameter walls"?
  2. How should a compute budget be allocated under resource constraints? Can small models approach large-model performance when trained on more data?
  3. Do multimodal, long-context, and sparse models follow the same scaling laws?

References

  • Scaling Laws for Neural Language Models — Jared Kaplan et al. (2020)

    OpenAI's scaling-laws paper finds that language model performance (cross-entropy loss) follows power laws in model parameters, dataset size, and compute. This makes it possible to extrapolate large-scale training results from small experiments, and it supplied much of the empirical motivation for scaling up LLMs, including GPT-3.

  • Training Compute-Optimal Large Language Models — Jordan Hoffmann et al. (2022)

    Proposes the Chinchilla scaling laws: given a fixed compute budget, model parameters and training tokens should be scaled in roughly equal proportion (challenging the earlier view that parameters should grow faster than data). Chinchilla (70B parameters) outperformed the four-times-larger Gopher (280B), reshaping how LLM training budgets are allocated.

  • Language Models are Few-Shot Learners — Tom Brown et al. (2020)

    OpenAI's GPT-3 paper demonstrates that a 175B-parameter language model can perform diverse tasks through few-shot in-context learning, without fine-tuning. It popularized the view that scale can unlock new capabilities and helped launch the era of prompt engineering.

  • PaLM: Scaling Language Modeling with Pathways — Aakanksha Chowdhery et al. (2022)

    Describes Google's 540B-parameter PaLM model, trained using the Pathways system. The paper details training stability techniques, data mixture strategies, and observations of emergent capabilities, and it serves as an important reference for large-model pretraining engineering.