New Training, Inference, and Applications Modules

LLM Primer began with foundational topics such as Tokenization, Attention, and the Transformer. As the content matured, we added three new modules:

Training

Pretraining and Scaling Law: How models “learn to predict the next word,” from data engineering to compute-optimal training.
Fine-Tuning and Alignment: How SFT, RLHF, DPO, and related methods turn a “generalist” into an “obedient assistant.”
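To give a feel for one of these methods, here is a minimal sketch of the DPO (direct preference optimization) loss for a single preference pair, assuming the standard formulation in which `beta` scales the policy-versus-reference log-ratio margin. The function and argument names are hypothetical; a real trainer would compute the log-probabilities from model outputs and average the loss over batches.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (prompt, chosen, rejected) preference pair.

    Each *_logp is the summed log-probability of a whole response, under
    either the policy being trained or the frozen reference model.
    """
    # Log-ratio of policy to reference for each response.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Scaled margin: positive when the policy has shifted toward the chosen response.
    margin = beta * (chosen_ratio - rejected_ratio)
    # Loss is -log(sigmoid(margin)) = softplus(-margin), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy is identical to the reference, the margin is 0 and the loss is log 2 ≈ 0.693; `dpo_loss(-5.0, -15.0, -10.0, -10.0)` has a margin of 1.0 and a loss of about 0.313, so training pushes the policy to prefer chosen responses more strongly than the reference does.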

A useful first approximation for training compute is:

$$C \approx 6ND$$

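where $N$ is the number of model parameters, $D$ is the number of training tokens, and $C$ is the total training compute in FLOPs. The factor 6 comes from roughly 2 FLOPs per parameter per token in the forward pass and about 4 in the backward pass.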
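As a worked example, the sketch below plugs illustrative numbers into $C \approx 6ND$ (a hypothetical 7B-parameter model trained on 2T tokens) and then inverts the formula to split a fixed compute budget. The split assumes a fixed tokens-per-parameter ratio; the default of 20 is a commonly cited Chinchilla-style heuristic, treated here as an assumption rather than a law. Function names are hypothetical.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute C = 6 * N * D, in FLOPs."""
    # ~2 FLOPs per parameter per token forward, ~4 backward.
    return 6.0 * n_params * n_tokens


def compute_optimal_split(flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into (N, D), assuming D = tokens_per_param * N.

    Substituting D = k * N into C = 6 * N * D gives C = 6 * k * N**2,
    so N = sqrt(C / (6 * k)). The default k = 20 is an assumed heuristic.
    """
    n = (flops / (6.0 * tokens_per_param)) ** 0.5
    return n, tokens_per_param * n


if __name__ == "__main__":
    c = training_flops(7e9, 2e12)      # hypothetical 7B model, 2T tokens
    print(f"{c:.2e} FLOPs")            # 8.40e+22 FLOPs
    n, d = compute_optimal_split(c)
    print(f"N ~ {n:.2e} params, D ~ {d:.2e} tokens")
```

The inversion makes the trade-off explicit: under a fixed ratio, doubling the budget grows both $N$ and $D$ by about $\sqrt{2}$ rather than growing either one alone.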

Inference

KV Cache and Quantization: The two pillars of inference optimization, namely caching the attention keys and values already computed for earlier tokens and lowering numerical precision to save memory and compute.
Efficient Attention: FlashAttention, sparse attention, and frontier research toward attention whose cost grows linearly, rather than quadratically, with sequence length.
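To make the two pillars concrete, here is a back-of-the-envelope sketch: one function estimates KV cache memory from model shape, and a naive symmetric int8 scheme shows how quantization trades precision for space. The model configuration is hypothetical, and the single per-tensor scale is a simplifying assumption; practical quantization schemes often use finer-grained (per-channel or per-group) scales.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: float) -> float:
    """Memory held by the KV cache for one decoding batch.

    Every layer stores one key and one value vector (hence the factor 2)
    per KV head, per cached token, per sequence in the batch.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


def quantize_int8(values):
    """Symmetric per-tensor int8 quantization: q = round(x / scale).

    Expects a non-empty sequence of floats.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Clamp to the symmetric int8 range [-127, 127].
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize_int8(q, scale):
    """Approximate reconstruction: x ~ q * scale."""
    return [qi * scale for qi in q]


if __name__ == "__main__":
    # Hypothetical configuration, roughly the shape of a 7B-class model.
    cfg = dict(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096, batch_size=1)
    fp16 = kv_cache_bytes(**cfg, bytes_per_elem=2)
    int8 = kv_cache_bytes(**cfg, bytes_per_elem=1)
    print(f"fp16 KV cache: {fp16 / 2**30:.1f} GiB, int8: {int8 / 2**30:.1f} GiB")
    # fp16 KV cache: 2.0 GiB, int8: 1.0 GiB
```

Cache size grows linearly with both sequence length and batch size, which is why halving the bytes per element through quantization directly raises how many tokens or requests fit in memory; the cost is a rounding error of at most half a quantization step per value.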

Applications

RAG and Retrieval Augmentation: Giving models updatable external memory to reduce hallucinations and improve traceability.
Agents and Tool Use: From chat to action, how models interact with the real world through tools.
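Both ideas come down to glue code around the model. The toy sketch below assumes bag-of-words cosine similarity in place of a learned embedding model and a made-up JSON tool-call format; `retrieve`, `build_prompt`, `TOOLS`, and `dispatch_tool_call` are hypothetical names, not any framework's API. Real systems typically use a vector index for retrieval and whatever tool-calling schema their model provider defines.

```python
import json
import math
import re
from collections import Counter


def _bag_of_words(text: str) -> Counter:
    """Lower-cased word counts; a stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: list[str], k: int = 2) -> list[int]:
    """Return indices of the k documents most similar to the query."""
    q = _bag_of_words(query)
    scores = [_cosine(q, _bag_of_words(d)) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:k]


def build_prompt(query: str, docs: list[str], k: int = 2) -> str:
    """Prepend numbered sources so the answer can cite them (traceability)."""
    hits = retrieve(query, docs, k)
    context = "\n".join(f"[{i}] {docs[i]}" for i in hits)
    return (
        "Answer using only the sources below and cite them by number.\n"
        f"{context}\n\nQuestion: {query}"
    )


# Tool use: the model emits a structured call; the application executes it.
# This registry and its single tool are purely illustrative.
TOOLS = {
    "add": lambda a, b: a + b,
}


def dispatch_tool_call(call_json: str):
    """Run a call of the hypothetical form {"name": ..., "arguments": {...}}."""
    call = json.loads(call_json)
    tool = TOOLS[call["name"]]
    return tool(**call["arguments"])
```

For example, with documents about KV caches, RAG, and bananas, `retrieve("how does RAG retrieve documents", docs, k=1)` selects the RAG document, and `dispatch_tool_call('{"name": "add", "arguments": {"a": 2, "b": 3}}')` returns 5. Updating the model's "memory" is then just editing `docs`, with no retraining.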

Each article follows LLM Primer’s three-tier design: intuition first, then engineering trade-offs, then research questions where relevant. Contributions via pull request are always welcome, whether new content, corrections, or adopting papers from the inbox.

Pick the module you need

Training maps to model learning, Inference maps to cost and latency, and Applications maps to product integration. If you need to understand how models are trained, start with Training; if you need to reduce inference cost or latency, start with Inference; if you need to connect models to a real product workflow, start with Applications.