FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
arXiv: 2205.14135
TLDR
FlashAttention computes exact attention with an IO-aware tiled algorithm: it processes the query, key, and value matrices in blocks that fit in fast on-chip memory and never materializes the full N×N attention matrix in GPU high-bandwidth memory. This reduces attention memory from O(N²) to O(N) in sequence length N with no loss of precision, and achieves a 2-4x speedup. It fundamentally changed what is feasible for long-context training and is now an indispensable low-level optimization in modern LLM training and inference.
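The tiling idea can be illustrated with a minimal NumPy sketch. This is not the paper's CUDA kernel: it tiles only over key/value blocks, runs on the CPU, and says nothing about on-chip memory placement. The names `naive_attention`, `tiled_attention`, and the `block` parameter are hypothetical, chosen for illustration. What the sketch does show is the running-softmax rescaling that lets attention be computed block by block while giving the same result as the full computation.

```python
import numpy as np

def naive_attention(q, k, v):
    # Reference version: materializes the full N x N score matrix.
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def tiled_attention(q, k, v, block=64):
    # Illustrative sketch: visits keys/values one block at a time and keeps
    # only running per-row statistics, so at most N x block scores exist at once.
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, v.shape[-1]))      # unnormalized output accumulator
    row_max = np.full((n, 1), -np.inf)    # running max of scores per query row
    row_sum = np.zeros((n, 1))            # running softmax denominator per row
    for start in range(0, k.shape[0], block):
        k_blk = k[start:start + block]
        v_blk = v[start:start + block]
        s = (q @ k_blk.T) * scale
        new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))
        # Rescale earlier partial results to the new running max.
        correction = np.exp(row_max - new_max)
        p = np.exp(s - new_max)
        row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ v_blk
        row_max = new_max
    return out / row_sum

rng = np.random.default_rng(0)
q = rng.standard_normal((200, 32))
k = rng.standard_normal((200, 32))
v = rng.standard_normal((200, 32))
same = np.allclose(tiled_attention(q, k, v, block=48), naive_attention(q, k, v))
```

Here `same` compares the blockwise result against the reference, which is the sense in which the method is exact rather than approximate. The speedup claimed in the summary comes from keeping each block in fast on-chip memory, which a NumPy sketch cannot reproduce.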