FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

Authors: Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao (2024)

Domains

Inference

TLDR (English)

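Exploits the asynchrony of the H100's Tensor Cores and TMA (Tensor Memory Accelerator), together with FP8 low precision, to push attention throughput to close to 1.2 PFLOPs/s while keeping numerical error low. Key dependency for long-context and FP8 training on the Hopper architecture.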

Appears in These Articles

高效注意力：突破序列长度平方瓶颈
Efficient Attention: Breaking the Quadratic Sequence Bottleneck
