Efficient Memory Management for Large Language Model Serving with PagedAttention

Authors: Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica (2023)

arXiv: 2309.06180

Domains

Inference

TLDR (English)

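Applies the operating-system idea of paged virtual memory to the KV cache: each request's cache is stored in fixed-size blocks that need not be contiguous, which brings memory waste from fragmentation and over-reservation close to zero. At the same level of latency, this improves serving throughput by 2-4x over prior systems such as FasterTransformer and Orca. vLLM, the engine built on PagedAttention, has since become a de facto standard open-source inference engine and a compute foundation for the MCP/Agent era.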
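The paging idea can be illustrated with a small block-table sketch. This is a toy model, not vLLM's actual API: the names `BlockAllocator`, `Sequence`, `append_token`, and `wasted_slots`, as well as the block size of 16 tokens, are assumptions made for illustration. It shows only the bookkeeping (logical token positions mapped to non-contiguous physical blocks that are allocated on demand), not the attention kernel itself.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

BLOCK_SIZE = 16  # tokens per KV block; an illustrative value, not a vLLM constant


class BlockAllocator:
    """Hypothetical pool of fixed-size physical KV blocks shared by all requests."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks: List[int] = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("no free KV blocks")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


@dataclass
class Sequence:
    """One request's block table: logical block index -> physical block id."""

    block_table: List[int] = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self, allocator: BlockAllocator) -> Tuple[int, int]:
        """Reserve a slot for one new token; return (physical block, offset)."""
        offset = self.num_tokens % BLOCK_SIZE
        if offset == 0:
            # The last block is full (or none exists yet): grab a new one
            # from anywhere in the pool. Contiguity is not required.
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1
        return self.block_table[-1], offset

    def release(self, allocator: BlockAllocator) -> None:
        """Return every block to the pool when the request finishes."""
        for block_id in self.block_table:
            allocator.free(block_id)
        self.block_table.clear()
        self.num_tokens = 0


def wasted_slots(seq: Sequence) -> int:
    """Reserved but unused token slots; always fewer than BLOCK_SIZE."""
    return len(seq.block_table) * BLOCK_SIZE - seq.num_tokens


if __name__ == "__main__":
    allocator = BlockAllocator(num_blocks=4)
    seq = Sequence()
    for _ in range(20):
        seq.append_token(allocator)
    print(len(seq.block_table), wasted_slots(seq))
```

In this sketch, unused memory is confined to the last block of each sequence, so waste per request stays below one block regardless of how long the output grows. A scheme that pre-reserves one contiguous region sized for the maximum possible length would instead leave most of that region idle for short outputs.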

Appears in These Articles

KV Cache 与量化：让大模型跑得更快
KV Cache and Quantization: Making Large Models Faster
