Why Abilities Emerge in LLMs

Intuition: more practice brings new combinatorial abilities

Many LLM abilities come from a simple objective: predict the next token on massive text. When the model is large enough, the data is diverse enough, and training is long enough, it learns grammar, facts, formatting, reasoning templates, and tool-use traces. “Emergence” often refers to abilities that suddenly become observable as scale increases.

But emergence is not magic. Many phenomena are shaped by the choice of evaluation metric and prompting style: a continuously improving ability can look like a sudden jump when measured by a pass/fail threshold. The right attitude is to acknowledge qualitative changes from scale while guarding against anthropomorphism.
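
The metric effect described above can be sketched with a toy calculation. Everything here is an assumption made for illustration: the model scales, the per-token accuracies, the 20-token answer length, and the simplification that token errors are independent. Under those assumptions, a per-token accuracy that rises smoothly produces an exact-match score that stays near zero and then climbs steeply.

```python
import numpy as np

# Hypothetical model scales (parameter counts); illustrative values only.
scales = np.logspace(7, 11, 9)

# Assumption: per-token accuracy improves smoothly from 0.50 to 0.98
# as scale grows (evenly spaced across the log-spaced scales).
per_token_acc = np.linspace(0.50, 0.98, len(scales))

# Exact match on a 20-token answer needs every token to be right, so with
# independent token errors it equals per_token_acc ** answer_len.
answer_len = 20
exact_match = per_token_acc ** answer_len

for n, p, em in zip(scales, per_token_acc, exact_match):
    print(f"{n:9.1e} params | per-token acc {p:.2f} | exact match {em:.4f}")
```

The continuous column improves by the same amount at every step, while the pass/fail column shows almost nothing for the first half of the range. The underlying ability is identical in both columns; only the measurement differs.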

Engineering view: scaling laws and compute-optimal training

Scaling laws study the empirical relationship between model size, data volume, compute, and loss. Early results (Kaplan et al., 2020) favored the “bigger is better” route of growing parameter count; Chinchilla (Hoffmann et al., 2022) then showed that many large models were trained on too few tokens for their size, and that under a fixed compute budget parameters and training tokens should grow in roughly equal proportion.

In practice, capability is not determined by parameter count alone. Data quality, deduplication, mixing ratios, context length, training stability, alignment, and inference strategy all change outcomes. Chain-of-thought and similar prompting methods show that existing capabilities may need the right interface to be unlocked.
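
To make the "right interface" point concrete, here is a minimal sketch of the difference between a direct prompt and a few-shot chain-of-thought prompt. The helper names `direct_prompt` and `cot_prompt`, the `Q:`/`A:` layout, and the worked example are all hypothetical and made up for this sketch; no model is called.

```python
def direct_prompt(question: str) -> str:
    """Ask for the answer only."""
    return f"Q: {question}\nA:"


def cot_prompt(question: str, worked_examples: list[tuple[str, str]]) -> str:
    """Prepend worked examples whose answers spell out intermediate steps."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in worked_examples)
    return f"{shots}\n\nQ: {question}\nA:"


# Illustrative worked example (invented for this sketch).
examples = [(
    "A shelf holds 3 boxes with 4 books each. 2 books are removed. "
    "How many books remain?",
    "3 boxes of 4 books is 12 books. 12 - 2 = 10. The answer is 10.",
)]
question = ("A crate holds 5 bags with 6 apples each. 7 apples are eaten. "
            "How many apples remain?")

print(direct_prompt(question))
print("---")
print(cot_prompt(question, examples))
```

The model weights are the same in both cases. The second prompt only changes what the model is conditioned on, which is why prompting results say more about how a capability is elicited than about whether it exists.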

Example code: simple scaling law visualization

import numpy as np

def scaling_law_loss(N, D, alpha=0.34, beta=0.28):
    """
    Simplified scaling law: L(N, D) ≈ N^(-alpha) + D^(-beta)
    N: model parameters (in billions)
    D: training data (in billions of tokens)
    """
    return N**(-alpha) + D**(-beta)

# Sweep parameter counts at a fixed token budget (these arrays can be passed to a plotting library)
params = np.logspace(0, 3, 50)  # 1B to 1000B parameters
data_fixed = 300  # Fixed at 300B tokens
losses = scaling_law_loss(params, data_fixed)  # Vectorized over the NumPy array

print(f"Predicted loss for 1B param model: {scaling_law_loss(1, data_fixed):.4f}")
print(f"Predicted loss for 10B param model: {scaling_law_loss(10, data_fixed):.4f}")
print(f"Predicted loss for 100B param model: {scaling_law_loss(100, data_fixed):.4f}")

# Show a compute-optimal split between parameters and tokens
print("\nCompute-optimal examples:")
TOKENS_PER_PARAM = 20  # Common rule of thumb drawn from Chinchilla; an approximation
for compute_budget in [1e18, 1e20, 1e22]:  # FLOPs
    # Approximation: C ≈ 6 * N * D with D ≈ 20 * N, so N ≈ sqrt(C / 120)
    N_opt = (compute_budget / (6 * TOKENS_PER_PARAM)) ** 0.5  # Parameters
    D_opt = TOKENS_PER_PARAM * N_opt  # Tokens
    print(f"Compute budget {compute_budget:.0e} FLOPs: "
          f"optimal ~{N_opt / 1e9:.2f}B params, {D_opt / 1e9:.1f}B tokens")
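
Scaling laws are useful in practice because a power law fitted on small runs can be extrapolated to larger ones. The sketch below shows the mechanics on synthetic data only: the constants `true_a` and `true_alpha`, the model sizes, and the noise-free losses are assumptions chosen for illustration, not measurements from any real model.

```python
import numpy as np

# Synthetic "small experiment" results generated from an assumed power law
# L(N) = a * N**(-alpha). These are not real training results.
true_a, true_alpha = 10.0, 0.08
small_N = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
small_loss = true_a * small_N ** (-true_alpha)

# A power law is a straight line in log-log space:
# log L = log a - alpha * log N, so a degree-1 least-squares fit recovers both.
slope, intercept = np.polyfit(np.log(small_N), np.log(small_loss), deg=1)
alpha_hat, a_hat = -slope, np.exp(intercept)

# Extrapolate two orders of magnitude beyond the largest fitted model.
big_N = 1e10
predicted = a_hat * big_N ** (-alpha_hat)

print(f"fitted alpha = {alpha_hat:.3f}, a = {a_hat:.2f}")
print(f"extrapolated loss at N = {big_N:.0e}: {predicted:.3f}")
```

Real measurements are noisy and include an irreducible-loss term that this sketch omits, so extrapolations of this kind should be reported with uncertainty rather than treated as exact.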

Research view: treat emergence as a testable hypothesis

Research on emergence should report continuous metrics, calibration curves, task difficulty, and prompt sensitivity. Distinguish statistical patterns learned during pretraining, in-context learning, tool externalization, and post-training alignment. The better question is not “does the model truly understand?” but “under what distributions, constraints, and interventions does it stably exhibit which predictable capabilities?”
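
As one example of a continuous measurement, the sketch below computes a binned expected calibration error (ECE): the gap between stated confidence and observed accuracy, averaged over confidence bins. The function name, the equal-width binning with 10 bins, and the synthetic predictions are assumptions for illustration; other binning schemes are also in use.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean |accuracy - confidence| over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Using only the interior edges maps every confidence to a bin 0..n_bins-1.
    bin_ids = np.digitize(confidences, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # Weight by the fraction of samples in the bin
    return ece


# Synthetic predictions (illustrative, not model outputs).
conf = [0.8] * 10
calibrated = [1] * 8 + [0] * 2      # 80% correct at 80% confidence
overconfident = [1] * 5 + [0] * 5   # 50% correct at 80% confidence

ece_cal = expected_calibration_error(conf, calibrated)
ece_over = expected_calibration_error(conf, overconfident)
print(f"ECE (calibrated):    {ece_cal:.3f}")
print(f"ECE (overconfident): {ece_over:.3f}")
```

Tracking a number like this across model scales alongside accuracy gives a smoother picture than a single pass/fail score, and it can be recomputed per prompt variant to expose prompt sensitivity.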

🔬 Open Research Questions

Key questions and research directions in this area:

Is emergent ability a continuous phase transition or an artifact of measurement methodology? How can we design more sensitive evaluations to test this?
Related: wei2022 cot

Small models (e.g., Phi-1/3) can approach large model capabilities through high-quality data. How does this revise the scaling law narrative?
Related: gunasekar2023 phi1, hoffmann2022 chinchilla

Can reasoning abilities (e.g., chain-of-thought) be acquired through pure pretraining, or do they require explicit guidance during fine-tuning?
Related: wei2022 cot, brown2020 gpt3

References

Scaling Laws for Neural Language Models — Jared Kaplan et al. (2020)
OpenAI's scaling laws paper finds that language model performance (cross-entropy loss) follows power laws in model parameters, dataset size, and compute. This makes it possible to predict large-scale training results from small experiments, and it provided an empirical basis for the LLM scale-up that followed, including GPT-3.
Language Models are Few-Shot Learners — Tom Brown et al. (2020)
OpenAI's GPT-3 paper demonstrates that a 175B parameter language model can perform diverse tasks through few-shot in-context learning without fine-tuning. It established the paradigm that scale unlocks emergent capabilities and launched the era of prompt engineering.
Training Compute-Optimal Large Language Models — Jordan Hoffmann et al. (2022)
Proposes the Chinchilla scaling laws: given a fixed compute budget, model parameters and training tokens should be scaled in equal proportion (challenging the prior belief that parameters matter more). Chinchilla 70B, trained on about 4x more data under the same compute budget, outperformed Gopher 280B, redefining optimal LLM training strategy.
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models — Jason Wei et al. (2022)
Introduces chain-of-thought prompting: including worked examples with intermediate reasoning steps in the prompt substantially improves LLM performance on arithmetic, symbolic, and commonsense reasoning tasks. The paper reports that the gains appear mainly in sufficiently large models.
GPT-4 Technical Report — OpenAI (2023)
An industry technical report rather than a full research paper. It presents "predictable scaling", in which aspects of GPT-4's performance were forecast from much smaller training runs, and it systematically describes safety evaluation and red-teaming processes. It marks a turning point from LLMs as "research demo" to LLMs as "infrastructure".
Textbooks Are All You Need — Suriya Gunasekar et al. (2023)
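Microsoft trains phi-1, a 1.3B-parameter code model, on about 7B tokens of "textbook-quality" data (filtered web data plus synthetic textbooks and exercises) and reports 50.6% pass@1 on HumanEval, competitive with much larger models. It takes the "data quality over data scale" argument to an extreme and launched the Phi series.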
