FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Author: Tri Dao (2023)

Field

Inference

TLDR

Uses more aggressive warp-level parallelism and better work partitioning to roughly double FlashAttention's speed. Inference engines such as vLLM and SGLang and training stacks such as Megatron have by now largely moved to FA-2.

Appears in these articles

Efficient Attention: Breaking the Quadratic Sequence Bottleneck

Co-cited

These papers appear in the same articles as this paper
