KV Cache and Quantization: Making Large Models Faster
Intuition: remember what you already computed
LLMs generate text one token at a time. Without a cache, every decoding step would recompute the Key and Value projections for all previous tokens, which wastes an enormous amount of compute. The intuition behind the KV cache is simple: store the Key and Value vectors already computed for previous tokens and reuse them directly, so each step only computes them for the newest token.
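To make the reuse concrete, here is a minimal single-head sketch in plain Python. The names `KVCache`, `project`, `attend`, and `step_without_cache`, the 2x2 weight matrices, and the toy token vectors are all illustrative assumptions, not any library's API. It counts how many Key/Value projections each strategy performs and checks that both produce the same attention output.

```python
import math

def project(x, w):
    """Multiply vector x (length d) by matrix w (d rows, each a list)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over all keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

class KVCache:
    """Stores K and V for past tokens; projects only the newest token."""
    def __init__(self):
        self.keys, self.values = [], []
        self.projections = 0  # number of tokens whose K/V were computed

    def step(self, x, wq, wk, wv):
        self.keys.append(project(x, wk))
        self.values.append(project(x, wv))
        self.projections += 1
        return attend(project(x, wq), self.keys, self.values)

def step_without_cache(xs, wq, wk, wv):
    """Recompute K and V for every token seen so far, then attend."""
    keys = [project(x, wk) for x in xs]
    values = [project(x, wv) for x in xs]
    return attend(project(xs[-1], wq), keys, values), len(xs)

# Toy weights and token embeddings (made up for illustration).
wq = [[1.0, 0.0], [0.0, 1.0]]
wk = [[0.5, -0.5], [0.25, 1.0]]
wv = [[1.0, 2.0], [-1.0, 0.5]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

cache = KVCache()
recomputed_projections = 0
max_diff = 0.0
for t in range(len(tokens)):
    cached_out = cache.step(tokens[t], wq, wk, wv)
    full_out, n = step_without_cache(tokens[: t + 1], wq, wk, wv)
    recomputed_projections += n
    max_diff = max(max_diff, max(abs(a - b) for a, b in zip(cached_out, full_out)))

print("projections with cache:", cache.projections)
print("projections without cache:", recomputed_projections)
print("largest output difference:", max_diff)
```

The cached path does one K/V projection per token, while the uncached path does 1 + 2 + ... + n. The attention step itself still reads every cached entry, which is why the cache trades compute for memory.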
Quantization is another cost-reduction idea. Model weights are typically trained as 32-bit or 16-bit floats. Reducing them to 8-bit or even 4-bit shrinks the memory footprint and memory-bandwidth demand significantly, and can also lower compute cost when the hardware has matching low-precision kernels. Some precision is lost, but the trade-off is often acceptable in practice.
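As a rough illustration of the precision trade-off, the sketch below applies symmetric "absmax" quantization to a handful of made-up weights. The helper names `quantize_absmax` and `dequantize` are hypothetical, and real schemes add refinements such as per-channel scales and calibration.

```python
def quantize_absmax(weights, bits=8):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]

# Illustrative weights, not taken from any real model.
weights = [0.12, -0.53, 0.98, -1.27, 0.004, 0.31]

for bits in (8, 4):
    codes, scale = quantize_absmax(weights, bits)
    restored = dequantize(codes, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit codes: {codes}  worst-case error: {worst:.4f}")
```

With round-to-nearest, the error per weight is bounded by half the scale, so halving the bit width from 8 to 4 makes the grid roughly 18 times coarser for the same value range.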
Engineering view: the memory wall and precision trade-offs
The KV cache is one of the main memory consumers during inference, especially in long-context scenarios, because its size grows linearly with both sequence length and batch size. Optimization directions include:
- PagedAttention (vLLM): Manage KV Cache in pages to reduce memory fragmentation and improve batching efficiency.
- KV Cache compression: Reduce cache size through quantization, pruning, sliding windows, or eviction policies.
- GQA / MQA: Share Key/Value heads across groups of query heads (GQA) or across all query heads (MQA) to reduce cache volume.
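The effect of sharing Key/Value heads can be estimated with simple arithmetic: the cache holds one Key and one Value vector per layer, per KV head, per token. The sketch below uses a hypothetical model shape (32 layers, 32 query heads, head dimension 128, FP16 cache, 4096-token context); the function name `kv_cache_bytes` and all of these numbers are assumptions chosen for illustration.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Size of the KV cache: 2 tensors (K and V) per layer, per KV head, per token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

GIB = 1024 ** 3

# Hypothetical configuration; only the number of KV heads changes between rows.
configs = {
    "MHA (32 KV heads)": 32,
    "GQA (8 KV heads)": 8,
    "MQA (1 KV head)": 1,
}

for name, kv_heads in configs.items():
    size = kv_cache_bytes(layers=32, kv_heads=kv_heads, head_dim=128, seq_len=4096)
    print(f"{name}: {size / GIB:.4f} GiB per sequence")
```

Under these assumptions, the cache shrinks in direct proportion to the number of KV heads, and quantizing the cache to 8 bits (`bytes_per_elem=1`) would halve each figure again.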
Quantization techniques by category:
- PTQ (Post-Training Quantization): GPTQ, AWQ, SmoothQuant, etc. Quantize an already-trained model without retraining.
- QAT (Quantization-Aware Training): Simulate low precision during training; usually better results but higher cost.
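- GGML/GGUF: Community file formats from the llama.cpp ecosystem that support a range of low-bit quantization types (4-bit variants are especially popular) and let large models run on laptops.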
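Many low-bit schemes quantize weights in small groups, each with its own scale, so that one outlier does not wash out the resolution of every other weight. The sketch below illustrates that idea only; it is not the actual GPTQ, AWQ, or GGUF algorithm or storage layout, and the function names and sample weights are hypothetical.

```python
def quantize_groups(weights, group_size=4, bits=4):
    """Quantize each group of weights with its own absmax scale."""
    qmax = 2 ** (bits - 1) - 1
    groups = []
    for start in range(0, len(weights), group_size):
        chunk = weights[start:start + group_size]
        max_abs = max(abs(w) for w in chunk)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        codes = [max(-qmax, min(qmax, round(w / scale))) for w in chunk]
        groups.append((scale, codes))
    return groups

def dequantize_groups(groups):
    """Flatten (scale, codes) groups back into approximate floats."""
    return [scale * c for scale, codes in groups for c in codes]

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Made-up weights: four small values followed by a group containing an outlier.
weights = [0.02, -0.03, 0.01, 0.04, 0.5, -0.7, 2.0, 0.9]

# One scale for the whole tensor versus one scale per group of four.
per_tensor = dequantize_groups(quantize_groups(weights, group_size=len(weights)))
per_group = dequantize_groups(quantize_groups(weights, group_size=4))

tensor_err = mean_abs_error(weights, per_tensor)
group_err = mean_abs_error(weights, per_group)
print(f"per-tensor 4-bit error: {tensor_err:.4f}")
print(f"per-group 4-bit error:  {group_err:.4f}")
```

With a single scale, the outlier 2.0 forces the four small weights to round to zero. Per-group scales keep them distinguishable at the cost of storing one extra scale per group.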
In practice, evaluate perplexity, downstream task accuracy, and end-to-end latency after quantization, not just memory savings. Different layers have different sensitivity to precision; mixed precision or per-layer tuning often works best.
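Perplexity is the exponential of the average negative log-likelihood per token, so comparing it before and after quantization takes only a few lines once per-token log-probabilities are available. The probabilities below are invented for illustration; in a real evaluation they would come from running both models on the same held-out text.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over the evaluated tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical probabilities each model assigns to the correct next token.
baseline_probs = [0.5, 0.25, 0.5, 0.125]
quantized_probs = [0.5, 0.25, 0.25, 0.125]

ppl_base = perplexity([math.log(p) for p in baseline_probs])
ppl_quant = perplexity([math.log(p) for p in quantized_probs])

print(f"baseline perplexity:  {ppl_base:.3f}")
print(f"quantized perplexity: {ppl_quant:.3f}")
print(f"relative increase:    {(ppl_quant / ppl_base - 1) * 100:.1f}%")
```

A small perplexity increase does not guarantee unchanged downstream accuracy, which is why task-level evaluation and latency measurements belong in the same report.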
Research view: the boundary of precision and efficiency
Research questions include: what are the limits of quantization? Can 1-bit or ternary weights still preserve language ability? How can activation distribution analysis find optimal clipping thresholds and scaling factors?
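One simple way to frame the clipping question is as a search: try several clipping thresholds, quantize with each, and keep the one with the lowest reconstruction error. The sketch below does this by brute force on synthetic "activations" (many small values plus a single outlier). The function names, the data, and the mean-squared-error criterion are all assumptions for illustration; published methods use more refined objectives and calibration data.

```python
def quantize_with_clip(values, clip, bits):
    """Clamp values to [-clip, clip], then round onto a symmetric integer grid."""
    qmax = 2 ** (bits - 1) - 1
    scale = clip / qmax
    return [round(max(-clip, min(clip, v)) / scale) * scale for v in values]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def best_clip(values, bits, steps=100):
    """Grid-search the clipping threshold that minimizes reconstruction MSE."""
    max_abs = max(abs(v) for v in values)
    candidates = [max_abs * i / steps for i in range(1, steps + 1)]
    return min(candidates, key=lambda c: mse(values, quantize_with_clip(values, c, bits)))

# Synthetic data: 1001 values spread over [-0.5, 0.5] and one outlier at 8.0.
activations = [i / 1000 - 0.5 for i in range(1001)] + [8.0]
max_abs = max(abs(v) for v in activations)

clip = best_clip(activations, bits=4)
err_no_clip = mse(activations, quantize_with_clip(activations, max_abs, 4))
err_clipped = mse(activations, quantize_with_clip(activations, clip, 4))

print(f"no clipping (threshold {max_abs}): MSE {err_no_clip:.4f}")
print(f"best threshold {clip:.2f}: MSE {err_clipped:.4f}")
```

Clipping sacrifices accuracy on the outlier to gain resolution for the bulk of the distribution; where the balance falls depends on the bit width and on how heavy the tails are.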
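Another frontier is speculative decoding: a small draft model rapidly generates candidate tokens, and the large model verifies them in parallel, keeping the ones it agrees with and correcting the first one it rejects. Reported speedups are around 2-3x, and with an exact verification rule the output distribution matches that of the large model alone. This is essentially a rebalancing between computation and memory: autoregressive decoding is usually limited by memory bandwidth, so extra parallel compute is spent to reduce the number of sequential passes through the large model.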
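The draft-then-verify loop can be sketched with toy deterministic "models". This is the greedy variant, in which a drafted token is accepted only if it equals the large model's own greedy choice; sampling-based versions use a probabilistic acceptance rule instead. The functions `target_next` and `draft_next` are hypothetical stand-ins for real models, and the verification loop here stands in for what would be a single batched forward pass.

```python
def target_next(seq):
    """Toy 'large model': deterministic next token from the context."""
    return (3 * seq[-1] + len(seq)) % 10

def draft_next(seq):
    """Toy 'small model': agrees with the target except at every 4th position."""
    guess = target_next(seq)
    return (guess + 1) % 10 if len(seq) % 4 == 0 else guess

def greedy_decode(prompt, n_new):
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq

def speculative_decode(prompt, n_new, k=4):
    """Greedy speculative decoding; returns tokens and the number of target passes."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(seq) < goal:
        # 1) The draft model proposes k tokens one after another.
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2) The target scores all k + 1 positions (one parallel pass in practice).
        target_passes += 1
        verified = [target_next(seq + draft[:i]) for i in range(k + 1)]
        # 3) Accept the longest agreeing prefix, then append the target's own token.
        n_accept = 0
        while n_accept < k and draft[n_accept] == verified[n_accept]:
            n_accept += 1
        seq.extend(draft[:n_accept])
        seq.append(verified[n_accept])
    return seq[:goal], target_passes

baseline = greedy_decode([1], 12)
spec_out, passes = speculative_decode([1], 12, k=4)
print("same output as greedy decoding:", spec_out == baseline)
print("target passes:", passes, "for 12 new tokens")
```

Every pass yields at least one token (the target's own), so the worst case matches ordinary decoding in pass count; the gain depends entirely on how often the draft model agrees with the target.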
References
- kwon2023-vllm
Brings the operating-system concept of paged memory to the KV cache, nearly eliminating memory waste from fragmentation and raising throughput by a reported 2-4x. vLLM has since become a de facto standard open-source inference engine and a compute foundation for the MCP/Agent era.
- dettmers2022-llmint8
Reveals "emergent outliers" in large-model activations and proposes a mixed-precision solution that handles them separately. The core work behind the bitsandbytes library, and the first to let 175B models fit on 8 A100s.
- dettmers2023-qlora
4-bit NF4 quantization + LoRA + paged optimizers enable supervised fine-tuning of a 65B model on a single 48GB GPU. The open-source community relies heavily on this approach for fine-tuning LLaMA-2/3 and Qwen.
- frantar2022-gptq
Among the first to quantize a 175B model to 4 bits with almost no accuracy loss, with the quantization itself running on a single GPU. Sharply lowered the hardware barrier for LLM inference and helped popularize running open-source LLMs locally.
- lin2023-awq
Observes that a small set of critical weights corresponds to large activations, and applies per-channel scaling by importance to protect them. Reported to be more robust and faster than GPTQ at 4-bit; one of the mainstream INT4 deployment solutions today.
- xiao2022-smoothquant
Moves activation outliers into the weights through a mathematically equivalent transformation, making INT8 inference for both weights and activations feasible. A key engineering result for INT8/FP8 deployment on GPUs.