KV Cache and Quantization: Making Large Models Faster

Intuition: remember what you already computed

LLMs generate text one token at a time. Each new token attends to every previous token, and without a cache the model would have to recompute the Key and Value vectors of all those earlier tokens at every step, which wastes enormous compute. The intuition behind the KV cache is simple: store the Key and Value vectors already computed for previous tokens and reuse them, so each step only computes the projections for the newest token.
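To make the idea concrete, here is a minimal single-head sketch in NumPy. The names `KVCache`, `attend`, `decode_with_cache`, and `decode_without_cache` are hypothetical, and the random matrices stand in for a real model's projection weights. It only illustrates that caching yields the same attention outputs as re-projecting every earlier token at each step.

```python
import numpy as np

def attend(q, K, V):
    """Scaled dot-product attention for one query over all stored positions."""
    scores = K @ q / np.sqrt(q.shape[-1])        # shape (t,)
    weights = np.exp(scores - scores.max())      # numerically stable softmax
    weights /= weights.sum()
    return weights @ V                           # shape (d,)

class KVCache:
    """Append-only store of per-token Key and Value vectors (single head)."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def arrays(self):
        return np.stack(self.keys), np.stack(self.values)

def decode_with_cache(xs, Wq, Wk, Wv):
    """Project each token once; reuse the cached K/V at every later step."""
    cache, outputs = KVCache(), []
    for x in xs:
        cache.append(x @ Wk, x @ Wv)             # only the newest token is projected
        K, V = cache.arrays()
        outputs.append(attend(x @ Wq, K, V))
    return np.stack(outputs)

def decode_without_cache(xs, Wq, Wk, Wv):
    """Baseline: re-project every previous token at every step."""
    outputs = []
    for t in range(1, len(xs) + 1):
        K, V = xs[:t] @ Wk, xs[:t] @ Wv          # redundant work grows with t
        outputs.append(attend(xs[t - 1] @ Wq, K, V))
    return np.stack(outputs)

rng = np.random.default_rng(0)
d = 8
xs = rng.normal(size=(5, d))                     # 5 toy token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
cached = decode_with_cache(xs, Wq, Wk, Wv)
baseline = decode_without_cache(xs, Wq, Wk, Wv)
```

Both functions still attend over all previous positions. The saving comes from not recomputing the Key and Value projections, which in a full model also means not re-running earlier tokens through every layer.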

Quantization is another cost-reduction idea. Model weights are traditionally 32-bit floats (most LLMs today ship in 16-bit). Reducing them to 16-bit, 8-bit, or even 4-bit shrinks the memory footprint and memory traffic, and can cut compute where the hardware supports low-precision arithmetic. Some precision is lost, but the trade-off is often acceptable in practice.
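As a rough illustration of the precision-for-memory trade, the sketch below applies symmetric absmax INT8 quantization to a random weight matrix. The function names and the synthetic weights are assumptions made for the example. Production methods add per-channel or per-group scales and outlier handling.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric absmax quantization: map [-max|w|, max|w|] onto [-127, 127].

    Assumes w is not all zeros (the scale would be 0).
    """
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximation of the original weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(256, 256)).astype(np.float32)  # synthetic weights
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

max_err = float(np.abs(w - w_hat).max())   # round-to-nearest keeps this near scale / 2
bytes_fp32, bytes_int8 = w.nbytes, q.nbytes
```

Here `bytes_int8` is a quarter of `bytes_fp32`, and the worst-case error per weight is about half a quantization step.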

Engineering view: the memory wall and precision trade-offs

KV Cache is one of the main memory consumers during inference, especially in long-context scenarios. Optimization directions include:

  • PagedAttention (vLLM): Manage KV Cache in pages to reduce memory fragmentation and improve batching efficiency.
  • KV Cache compression: Reduce cache size through quantization, pruning, sliding windows, or eviction policies.
  • GQA / MQA: Share each Key/Value head across multiple query heads (GQA), or a single Key/Value head across all of them (MQA), to reduce cache volume.
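A back-of-the-envelope calculation shows why the cache dominates memory at long context and how sharing Key/Value heads helps. The helper `kv_cache_bytes` and the model shape below are hypothetical, chosen only for illustration. The formula counts one Key and one Value vector per layer, per KV head, per token.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes for Keys plus Values across all layers (the leading 2 is K and V)."""
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Hypothetical model shape with 32 query heads and 16-bit cache entries.
shape = dict(n_layers=32, head_dim=128, seq_len=4096)
mha = kv_cache_bytes(n_kv_heads=32, **shape)   # every query head has its own K/V
gqa = kv_cache_bytes(n_kv_heads=8, **shape)    # 4 query heads share each K/V head
mqa = kv_cache_bytes(n_kv_heads=1, **shape)    # all query heads share one K/V head

gib = 1024 ** 3
print(f"MHA {mha / gib:.2f} GiB, GQA {gqa / gib:.2f} GiB, MQA {mqa / gib:.4f} GiB")
# prints: MHA 2.00 GiB, GQA 0.50 GiB, MQA 0.0625 GiB
```

The cache grows linearly with sequence length and batch size, which is why paging, compression, and head sharing are usually combined rather than treated as alternatives.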

Quantization techniques by category:

  • PTQ (Post-Training Quantization): GPTQ, AWQ, SmoothQuant, etc. Quantize an already-trained model without retraining.
  • QAT (Quantization-Aware Training): Simulate low precision during training; usually better results but higher cost.
  • GGML/GGUF: Community file formats from the llama.cpp ecosystem that support a range of low-bit quantization levels (4-bit variants are common) and let large models run on laptops.
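For orientation, the sketch below shows the simplest post-training baseline: round-to-nearest 4-bit quantization with one scale and offset per group of weights. It is not GPTQ, AWQ, or any GGUF scheme, all of which add error compensation, activation-aware scaling, or more elaborate block layouts. The function names and the group size of 32 are assumptions.

```python
import numpy as np

def quantize_4bit_groups(w, group_size=32):
    """Asymmetric round-to-nearest 4-bit quantization, one scale/offset per group.

    Assumes w.size is divisible by group_size. Codes are kept one per byte for
    clarity; real formats pack two 4-bit codes into each byte.
    """
    groups = w.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 15.0     # 4 bits give 16 levels: 0..15
    codes = np.clip(np.round((groups - lo) / scale), 0, 15).astype(np.uint8)
    return codes, scale, lo

def dequantize_4bit_groups(codes, scale, lo, shape):
    """Map codes back to approximate float weights."""
    return (codes * scale + lo).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(64, 64)).astype(np.float32)   # synthetic weights
codes, scale, lo = quantize_4bit_groups(w)
w_hat = dequantize_4bit_groups(codes, scale, lo, w.shape)
group_err = np.abs(w - w_hat).reshape(-1, 32)    # error per weight, grouped
```

Smaller groups track local weight ranges more closely but store more scales and offsets, so group size is itself a memory/accuracy knob.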

In practice, evaluate perplexity, downstream task accuracy, and end-to-end latency after quantization, not just memory savings. Different layers have different sensitivity to precision, so mixed precision or per-layer tuning often works best.
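Perplexity is the exponential of the mean negative log-likelihood per token, so comparing it before and after quantization takes only a few lines. The log-probabilities below are made-up numbers standing in for real model outputs on the same evaluation text.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token probabilities of the same text under two model variants.
fp16_logprobs = [math.log(p) for p in (0.50, 0.25, 0.40, 0.10)]
int4_logprobs = [math.log(p) for p in (0.45, 0.25, 0.35, 0.10)]

ppl_fp16 = perplexity(fp16_logprobs)
ppl_int4 = perplexity(int4_logprobs)
relative_increase = (ppl_int4 - ppl_fp16) / ppl_fp16
```

A small relative increase in perplexity does not guarantee unchanged downstream accuracy, which is why task-level evaluation and latency measurements are listed alongside it.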

Research view: the boundary of precision and efficiency

Research questions include: what are the limits of quantization? Can 1-bit or ternary weights still preserve language ability? How can activation distribution analysis find optimal clipping thresholds and scaling factors?
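One simple way to frame the clipping-threshold question is as a search: try several thresholds, quantize, and keep the one with the lowest reconstruction error. The sketch below does this with a grid search over mean squared error on synthetic activations with a few injected outliers. The function names, the MSE objective, and the data are assumptions; published methods use other objectives and calibration data.

```python
import numpy as np

def quant_mse(x, clip, n_bits=8):
    """MSE after clipping to [-clip, clip] and symmetric round-to-nearest."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = clip / qmax
    x_hat = np.clip(np.round(x / scale), -qmax, qmax) * scale
    return float(np.mean((x - x_hat) ** 2))

def search_clip(x, n_bits=8, n_candidates=100):
    """Grid-search the clipping threshold that minimizes quantization MSE."""
    absmax = float(np.abs(x).max())
    candidates = np.linspace(absmax / n_candidates, absmax, n_candidates)
    errors = [quant_mse(x, float(c), n_bits) for c in candidates]
    best = int(np.argmin(errors))
    return float(candidates[best]), errors[best]

rng = np.random.default_rng(0)
acts = rng.normal(size=10_000)
acts[:10] *= 10.0                      # a few synthetic outliers

best_clip, best_mse = search_clip(acts, n_bits=4)
absmax = float(np.abs(acts).max())
absmax_mse = quant_mse(acts, absmax, n_bits=4)   # no clipping beyond the max value
```

Because the largest candidate equals the absmax threshold, the searched result is never worse than plain absmax scaling under this metric. With outliers present, a tighter threshold can spend the few available levels on the bulk of the distribution.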

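Another frontier is speculative decoding: a small draft model rapidly proposes several candidate tokens, and the large model verifies them in a single parallel pass, accepting those consistent with its own predictions and correcting the first mismatch. Reported speedups are roughly 2-3x with no loss in output quality. In essence, it trades extra computation for fewer sequential, memory-bound passes through the large model.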
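The propose-then-verify loop can be sketched with toy stand-ins for the two models. This is the greedy variant only: the sampling variant uses a rejection-sampling rule to preserve the large model's output distribution, and real systems also emit a bonus token when every proposal is accepted. The functions `target_next` and `draft_next` are hypothetical deterministic "models" that map a token sequence to the next token.

```python
def greedy_decode(target_next, prompt, n_new):
    """Reference: ask the target model for one token at a time."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq

def speculative_decode(target_next, draft_next, prompt, n_new, k=4):
    """Greedy speculative decoding: keep draft tokens while the target agrees."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # 1. The draft model proposes up to k tokens autoregressively.
        proposal = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft_next(seq + proposal))
        # 2. The target checks each proposed position. A real system scores all
        #    positions in one batched forward pass; this loop is sequential.
        for i, tok in enumerate(proposal):
            expected = target_next(seq + proposal[:i])
            if tok != expected:
                # 3. First disagreement: keep the target's token, drop the rest.
                seq.extend(proposal[:i] + [expected])
                break
        else:
            seq.extend(proposal)       # every proposed token was accepted
    return seq

def target_next(seq):
    """Toy 'large model': a deterministic function of the sequence so far."""
    return (31 * sum(seq) + len(seq)) % 50

def draft_next(seq):
    """Toy 'small model': agrees with the target except at some positions."""
    guess = target_next(seq)
    return guess if len(seq) % 3 else (guess + 1) % 50

spec = speculative_decode(target_next, draft_next, [1, 2, 3], n_new=20)
ref = greedy_decode(target_next, [1, 2, 3], n_new=20)
```

In this sketch `spec` matches `ref` token for token, which mirrors the "no quality loss" property: the draft model only affects how many target passes are needed, not what is generated.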

References

  • kwon2023-vllm

    Applies the operating-system concept of paged memory to the KV cache, nearly eliminating memory wasted to fragmentation and raising throughput 2-4x. vLLM has since become a de facto standard open-source inference engine and a compute foundation for the MCP/Agent era.

  • dettmers2022-llmint8

    Reveals "emergent outliers" in large-model activations and proposes a mixed-precision solution. Core work behind the bitsandbytes library; first to enable 175B models to fit on 8 A100s.

  • dettmers2023-qlora

    4-bit NF4 + LoRA + paged optimizers enable SFT of a 65B model on a single 48GB GPU. The approach is widely used in open-source community fine-tuning of LLaMA-2/3 and Qwen.

  • frantar2022-gptq

    First to achieve 4-bit quantization of a 175B model on a single GPU with almost no accuracy loss. Lowered the hardware barrier for LLM inference from 8xA100 to a single consumer GPU, popularizing running open-source LLMs locally.

  • lin2023-awq

    Observes that a small fraction of critical weights corresponds to large activations and applies per-channel scaling by importance. Reported to be more robust and faster than GPTQ at 4-bit; one of the mainstream INT4 deployment solutions today.

  • xiao2022-smoothquant

    Migrates activation outliers into the weights through a mathematically equivalent transformation, making INT8 inference feasible. A key engineering result enabling GPU FP8/INT8 deployment.