FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Authors: Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré (2022)

Field

Inference

TLDR

FlashAttention computes exact attention with IO-aware tiling: it processes the inputs block by block, so the full N×N attention matrix is never materialized in GPU high-bandwidth memory. This reduces attention memory from O(N²) to O(N) in the sequence length N with no approximation, while achieving a 2-4x speedup. It changed what is feasible for long-context training and is now a core low-level optimization in modern LLM training and inference.
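The tiling idea can be illustrated with a small pure-Python sketch. It is not the paper's CUDA kernel: it only shows how a running maximum and a running softmax denominator allow keys and values to be consumed one block at a time while still producing the exact softmax-attention output. The function names, the `block_size` parameter, and the toy inputs are assumptions made for illustration. The real kernel also tiles over queries and keeps the blocks in on-chip SRAM.

```python
import math

def naive_attention(Q, K, V):
    """Reference: computes the full row of scores for each query before softmax."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out

def tiled_attention(Q, K, V, block_size=2):
    """Sketch: visits K/V in blocks and keeps only O(block_size) scores at a time."""
    d = len(Q[0])
    dv = len(V[0])
    out = []
    for q in Q:
        m = float("-inf")   # running maximum of the scores seen so far
        l = 0.0             # running sum of exp(score - m)
        acc = [0.0] * dv    # running unnormalized weighted sum of values
        for start in range(0, len(K), block_size):
            K_blk = K[start:start + block_size]
            V_blk = V[start:start + block_size]
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                      for k in K_blk]
            m_new = max(m, max(scores))
            # Rescale earlier partial results to the new maximum.
            # On the first block this is exp(-inf) == 0.0, and l and acc are 0.
            scale = math.exp(m - m_new)
            l *= scale
            acc = [a * scale for a in acc]
            for s, v in zip(scores, V_blk):
                w = math.exp(s - m_new)
                l += w
                for j in range(dv):
                    acc[j] += w * v[j]
            m = m_new
        out.append([a / l for a in acc])
    return out

# Toy inputs (hypothetical): 3 queries, 5 keys/values, head dim 2, value dim 3.
Q = [[1.0, 0.0], [0.5, -1.0], [2.0, 1.0]]
K = [[1.0, 2.0], [0.0, 1.0], [-1.0, 0.5], [3.0, -2.0], [0.5, 0.5]]
V = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [2.0, 2.0, 0.0],
     [-1.0, 0.5, 1.0], [0.5, -0.5, 3.0]]

reference = naive_attention(Q, K, V)
tiled = tiled_attention(Q, K, V, block_size=2)
```

In this sketch `tiled` and `reference` agree up to floating-point rounding for any block size, which is the sense in which the tiled computation is exact rather than approximate.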

Appears in these articles

Efficient Attention: Breaking the Quadratic Sequence Bottleneck
