Fast Transformer Decoding: One Write-Head is All You Need

Author: Noam Shazeer (2019)

arXiv: 1911.02150

Field

Inference architecture

TLDR

Proposes Multi-Query Attention (MQA): all attention heads share a single set of keys and values (K/V) while keeping separate queries per head, which shrinks the KV cache to 1/h of its multi-head size, where h is the number of heads. Much of today's work on KV cache optimization and long-context inference traces back to this 5-page paper.
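
To make the 1/h claim concrete, here is a minimal sketch of one multi-query decoding step plus the cache-size arithmetic. It is an illustration under stated assumptions, not the paper's code: the function names (`mqa_decode_step`, `kv_cache_elements`), the 1/sqrt(d_k) score scaling (the usual scaled dot-product convention), and the model dimensions in the usage lines are all hypothetical choices made for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mqa_decode_step(queries, k_cache, v_cache):
    """One decoding step of multi-query attention (illustrative sketch).

    queries: h query vectors, one per head, each of length d_k
    k_cache: n cached key vectors, shared by ALL heads, each of length d_k
    v_cache: n cached value vectors, shared by ALL heads, each of length d_v
    Returns h output vectors, each of length d_v.
    """
    d_k = len(k_cache[0])
    d_v = len(v_cache[0])
    outputs = []
    for q in queries:  # every head reads the same k_cache and v_cache
        # Assumption: standard 1/sqrt(d_k) scaling of the dot-product scores.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in k_cache
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, v_cache)) for j in range(d_v)]
        )
    return outputs


def kv_cache_elements(n_layers, seq_len, n_kv_heads, d_head):
    """Number of cached scalars; the factor 2 counts keys and values."""
    return 2 * n_layers * seq_len * n_kv_heads * d_head


# Hypothetical model: 32 layers, 32 query heads, head size 128, 4096 tokens.
mha_size = kv_cache_elements(n_layers=32, seq_len=4096, n_kv_heads=32, d_head=128)
mqa_size = kv_cache_elements(n_layers=32, seq_len=4096, n_kv_heads=1, d_head=128)
print(mha_size // mqa_size)  # ratio equals the number of heads, h
```

The only structural change from multi-head attention is that `k_cache` and `v_cache` have no head dimension, so the memory read at every decoding step drops by a factor of h while the per-head queries stay distinct.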

Appears in these articles

Efficient Attention: Breaking the Quadratic Sequence Bottleneck

Co-cited

These papers appear in the same articles as this paper
