Efficient Streaming Language Models with Attention Sinks

Authors: Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis (2023)

arXiv: 2309.17453

Domains

Inference, Long Context

TLDR (English)

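Identifies the attention sink phenomenon: during autoregressive generation, models consistently place a large share of attention on the first few tokens of the sequence, regardless of what those tokens contain. StreamingLLM exploits this by keeping the key/value states of those initial tokens alongside a sliding window of recent tokens, which lets it process arbitrarily long input streams without recomputation while keeping performance stable.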
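
The cache policy this implies can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the class name `SinkWindowCache`, its parameters `n_sink` and `window`, and the use of integers as stand-ins for per-token key/value tensors are all assumptions made for the example.

```python
from collections import deque


class SinkWindowCache:
    """Toy eviction policy (hypothetical names): keep the first `n_sink`
    entries forever, plus only the most recent `window` entries."""

    def __init__(self, n_sink: int = 4, window: int = 8):
        self.n_sink = n_sink
        self.sinks = []                     # entries for the initial tokens, never evicted
        self.recent = deque(maxlen=window)  # rolling window; the oldest entry drops out

    def append(self, entry):
        # The first n_sink entries become permanent "sink" entries;
        # everything after that goes through the rolling window.
        if len(self.sinks) < self.n_sink:
            self.sinks.append(entry)
        else:
            self.recent.append(entry)

    def contents(self):
        # What attention would see at the current step.
        return self.sinks + list(self.recent)


cache = SinkWindowCache(n_sink=2, window=3)
for token_position in range(10):
    cache.append(token_position)  # stand-in for that token's (key, value) pair

kept = cache.contents()
print(kept)  # [0, 1, 7, 8, 9]
```

The cache size stays bounded at `n_sink + window` entries no matter how long the stream runs, which is why nothing has to be recomputed. A real implementation would also have to handle positional encoding for the retained entries, which this sketch leaves out.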
