
shazeer2019-mqa

arXiv: 1911.02150

TLDR (English)

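Proposes Multi-Query Attention (MQA): all query heads share a single set of keys and values instead of each head having its own. This shrinks the KV cache to 1/h of its multi-head size (h = number of heads) and cuts the memory bandwidth spent reloading that cache at every incremental decoding step. Modern KV cache optimization and long-context inference work largely starts from this 5-page paper.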
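A minimal NumPy sketch of one incremental decoding step can make the idea concrete. It is an illustration under assumptions, not code from the paper. The helper names `mqa_decode_step`, `mha_decode_step` and `kv_cache_floats`, the tensor shapes, and the 1/sqrt(k) logit scaling (common practice) are all chosen here. The sketch checks that MQA gives the same output as a multi-head layer whose h key/value heads are identical copies, while caching h times fewer numbers.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mqa_decode_step(x, Wq, Wk, Wv, Wo, k_cache, v_cache):
    """Hypothetical MQA decoding step.

    x: (d,) newest token; Wq: (h, d, k) per-head queries;
    Wk, Wv: (d, k) ONE key/value projection shared by all heads;
    Wo: (h, k, d) per-head outputs; k_cache, v_cache: (m, k).
    """
    q = np.einsum("d,hdk->hk", x, Wq)              # h distinct queries
    k_cache = np.vstack([k_cache, x @ Wk])         # one shared key   -> (m+1, k)
    v_cache = np.vstack([v_cache, x @ Wv])         # one shared value -> (m+1, k)
    # Every head attends over the same cached keys and values.
    logits = q @ k_cache.T / np.sqrt(q.shape[-1])  # (h, m+1)
    o = softmax(logits) @ v_cache                  # (h, k)
    return np.einsum("hk,hkd->d", o, Wo), k_cache, v_cache


def mha_decode_step(x, Wq, Wk_h, Wv_h, Wo, k_cache_h, v_cache_h):
    # Standard multi-head baseline: per-head K/V, cache shape (h, m, k).
    q = np.einsum("d,hdk->hk", x, Wq)
    k_new = np.einsum("d,hdk->hk", x, Wk_h)[:, None, :]   # (h, 1, k)
    v_new = np.einsum("d,hdk->hk", x, Wv_h)[:, None, :]
    k_cache_h = np.concatenate([k_cache_h, k_new], axis=1)  # (h, m+1, k)
    v_cache_h = np.concatenate([v_cache_h, v_new], axis=1)
    logits = np.einsum("hk,hmk->hm", q, k_cache_h) / np.sqrt(q.shape[-1])
    o = np.einsum("hm,hmk->hk", softmax(logits), v_cache_h)
    return np.einsum("hk,hkd->d", o, Wo), k_cache_h, v_cache_h


def kv_cache_floats(n_layers, seq_len, n_kv_heads, head_dim):
    # Factor 2: one cached tensor for keys, one for values.
    return 2 * n_layers * seq_len * n_kv_heads * head_dim


rng = np.random.default_rng(0)
d, h, k, m = 16, 4, 8, 5
x = rng.standard_normal(d)
Wq = rng.standard_normal((h, d, k))
Wk = rng.standard_normal((d, k))
Wv = rng.standard_normal((d, k))
Wo = rng.standard_normal((h, k, d))
k_cache = rng.standard_normal((m, k))
v_cache = rng.standard_normal((m, k))

y_mqa, kc, vc = mqa_decode_step(x, Wq, Wk, Wv, Wo, k_cache, v_cache)

# An MHA layer whose h K/V heads are copies of the shared one computes
# the same output, but its cache holds h times as many numbers.
y_mha, kc_h, vc_h = mha_decode_step(
    x, Wq,
    np.broadcast_to(Wk, (h, d, k)), np.broadcast_to(Wv, (h, d, k)), Wo,
    np.broadcast_to(k_cache, (h, m, k)), np.broadcast_to(v_cache, (h, m, k)),
)
print(np.allclose(y_mqa, y_mha), kc_h.size // kc.size)
```

The design choice shows up in the shapes: queries keep a head axis while the cached keys and values do not, so each new token adds k key values and k value values per layer instead of h·k of each. With 8 heads, `kv_cache_floats(6, 1024, 8, 64)` versus `kv_cache_floats(6, 1024, 1, 64)` is the same 8x gap at the scale of a small model.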
