Fast Transformer Decoding: One Write-Head is All You Need

Authors: Noam Shazeer (2019)

arXiv: 1911.02150

Domains

Inference, Architecture

TLDR (English)

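Proposes multi-query attention (MQA): every query head attends over a single shared set of keys and values, so the KV cache shrinks to 1/h of its multi-head size, where h is the number of attention heads. Much of today's work on KV cache optimization and long-context inference traces back to this 5-page paper.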
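
As a rough illustration of the 1/h claim, the sketch below counts KV cache bytes and runs one multi-query decoding step in plain Python. The function names, the 1/sqrt(d) score scaling (the usual scaled dot-product convention), and the example model shape (32 layers, 32 heads, head dimension 128, 4096 tokens, 2-byte elements) are all assumptions made for illustration, not values taken from the paper.

```python
import math


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for seq_len tokens (hypothetical helper)."""
    # Factor of 2: one cached tensor for K and one for V in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


def mqa_decode_step(queries, keys, values):
    """One decoding step of multi-query attention (illustrative sketch).

    queries: h query vectors, one per head, each of length d
    keys, values: cached vectors for the previous positions, shared by all heads
    Returns h output vectors, one per head.
    """
    d = len(queries[0])
    outputs = []
    for q in queries:  # every head reads the same keys and values
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Assumed model shape, chosen only to make the arithmetic concrete.
mha_bytes = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
mqa_bytes = kv_cache_bytes(n_layers=32, n_kv_heads=1, head_dim=128, seq_len=4096)
print(mha_bytes // 2**20, "MiB with multi-head attention")
print(mqa_bytes // 2**20, "MiB with multi-query attention")

# Two heads sharing a two-position cache.
heads_out = mqa_decode_step(
    queries=[[1.0, 0.0], [0.0, 1.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1.0, 0.0], [0.0, 1.0]],
)
```

Only the number of cached K/V heads changes between the two calls to `kv_cache_bytes`; the query side still has h heads, which is why the saving applies to the cache and its memory traffic rather than to the query projections.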

Appears in These Articles

Efficient Attention: Breaking the Quadratic Sequence Bottleneck
