GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Authors: Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai (2023)

Domains

Inference, Architecture

TLDR (English)

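GQA (Grouped-Query Attention) is a middle ground between multi-head attention (MHA) and multi-query attention (MQA): the query heads are divided into groups, and every query head in a group shares a single key-value (KV) head. The KV cache shrinks in proportion to the number of KV heads, while quality stays close to MHA. The larger LLaMA 2 variants, LLaMA 3, Mistral 7B, and other widely used models use GQA.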
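As a rough illustration of the sharing scheme and the memory effect, the sketch below maps query heads to KV heads and estimates KV cache size. The function names, the 32-layer / 32-head / 128-dim / 4096-token configuration, and the 2-byte (fp16) element size are illustrative assumptions, not values from the paper or from any particular model.

```python
def kv_head_for_query_head(query_head: int, num_query_heads: int, num_kv_heads: int) -> int:
    """Return the index of the KV head that a given query head reads from.

    Hypothetical helper: assumes consecutive query heads form a group.
    """
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be divisible by num_kv_heads")
    group_size = num_query_heads // num_kv_heads  # query heads per shared KV head
    return query_head // group_size


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size in bytes; the leading 2 counts keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Assumed configuration, chosen only for illustration.
NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 4096

mha = kv_cache_bytes(NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, SEQ_LEN)  # one KV head per query head
gqa = kv_cache_bytes(NUM_LAYERS, 8, HEAD_DIM, SEQ_LEN)                # 8 KV heads, 4 query heads each
mqa = kv_cache_bytes(NUM_LAYERS, 1, HEAD_DIM, SEQ_LEN)                # a single shared KV head

for name, size in [("MHA", mha), ("GQA-8", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB")

# With 32 query heads and 8 KV heads, the first eight query heads map to:
print([kv_head_for_query_head(h, NUM_QUERY_HEADS, 8) for h in range(8)])
```

Setting the number of KV heads equal to the number of query heads recovers MHA, and setting it to 1 recovers MQA, so the one parameter spans the whole range between the two.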

Appears in These Articles

Efficient Attention: Breaking the Quadratic Sequence Bottleneck