GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
arXiv: 2305.13245
TLDR
GQA (Grouped-Query Attention) is a middle ground between MHA (multi-head attention) and MQA (multi-query attention): the query heads are divided into groups, and every query head in a group shares a single key/value (KV) head. With G KV heads in place of H, the KV cache shrinks by a factor of H/G while model quality stays close to MHA. Setting G = H recovers MHA, and G = 1 recovers MQA. Llama 2 (the 34B and 70B models), Llama 3, and Mistral 7B, among other widely used models, adopt GQA.
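The head sharing described in the summary can be sketched in a few lines of NumPy. This is a minimal illustration rather than the paper's implementation. The function names `grouped_query_attention` and `kv_cache_bytes`, the tensor layout `(heads, seq_len, head_dim)`, the absence of a batch dimension and causal mask, and the example configuration are all assumptions made here for clarity.

```python
import numpy as np


def grouped_query_attention(q, k, v):
    """Unmasked grouped-query attention (illustrative sketch).

    q:    (n_q_heads,  seq_len, head_dim)
    k, v: (n_kv_heads, seq_len, head_dim), with n_q_heads % n_kv_heads == 0
    """
    n_q_heads, _, head_dim = q.shape
    n_kv_heads = k.shape[0]
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads

    # Query head i reads KV head i // group_size, so each KV head is
    # repeated group_size times along the head axis.
    k_rep = np.repeat(k, group_size, axis=0)
    v_rep = np.repeat(v, group_size, axis=0)

    # Scaled dot-product attention, computed per query head.
    scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(head_dim)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v_rep


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of the KV cache for one sequence; the factor 2 covers K and V."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical configuration: 32 layers, 32 query heads, head_dim 128,
    # 4096 cached tokens, 2 bytes per element (fp16).
    for name, n_kv in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
        size_mib = kv_cache_bytes(32, n_kv, 128, 4096) / 2**20
        print(f"{name}: {n_kv} KV heads, {size_mib:.0f} MiB per sequence")
```

Only `k` and `v` at their reduced head count need to be stored in the cache. The `np.repeat` expansion is done here for readability, and practical implementations typically avoid materializing the repeated tensors. In this hypothetical configuration, going from 32 to 8 KV heads cuts the cache by a factor of 4.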