FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Authors: Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré (2022)

arXiv: 2205.14135

TLDR

FlashAttention computes exact attention with IO-aware tiling: it processes the inputs in blocks that fit in fast on-chip memory and never materializes the full N×N attention matrix in GPU high-bandwidth memory. This reduces attention's memory footprint from O(N²) to O(N) in the sequence length N with no approximation, and yields roughly a 2-4x speedup over standard attention. It changed what is feasible for long-context training and is now a core low-level optimization in modern LLM training and inference.
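To make the tiling idea concrete, here is a minimal pure-Python sketch of block-wise attention with a running ("online") softmax. It is an illustration under simplifying assumptions, not the paper's CUDA kernel: it runs on the CPU, tiles only over keys and values, and the names `naive_attention`, `tiled_attention`, `random_matrix`, and `block_size` are hypothetical. It shows only that the result can be accumulated block by block without ever holding all N scores for a query at once.

```python
import math
import random


def naive_attention(Q, K, V):
    """Reference version: holds all N scores for a query row at once."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block_size=4):
    """Block-wise version: per query, keeps only a running max, a running
    denominator, and a running weighted sum of V rows."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    dv = len(V[0])
    out = []
    for q in Q:
        m = -math.inf     # running max of the scores seen so far
        l = 0.0           # running softmax denominator
        acc = [0.0] * dv  # running unnormalized output
        for start in range(0, len(K), block_size):
            k_blk = K[start:start + block_size]
            v_blk = V[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
            m_new = max(m, max(scores))
            # Rescale earlier partial results to the new max.
            # On the first block this is exp(-inf) == 0.0.
            correction = math.exp(m - m_new)
            p = [math.exp(s - m_new) for s in scores]
            l = l * correction + sum(p)
            acc = [a * correction + sum(pi * v[j] for pi, v in zip(p, v_blk))
                   for j, a in enumerate(acc)]
            m = m_new
        out.append([a / l for a in acc])
    return out


def random_matrix(rows, cols):
    """Hypothetical helper: a rows x cols matrix of Gaussian samples."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)  # arbitrary seed so the demo is reproducible
Q, K, V = (random_matrix(10, 3) for _ in range(3))
ref = naive_attention(Q, K, V)
tiled = tiled_attention(Q, K, V, block_size=4)
max_diff = max(abs(a - b) for r, t in zip(ref, tiled) for a, b in zip(r, t))
print("max abs difference:", max_diff)
```

The two functions should agree up to floating-point rounding, which is the sense in which the tiled computation is exact rather than approximate. The speedup described above comes from running this kind of loop inside a fused GPU kernel that keeps each block in on-chip memory; this sketch does not attempt to reproduce that.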