Efficient Streaming Language Models with Attention Sinks

Authors: Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis (2023)

Domains

Inference, Long Context

TLDR (English)

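Identifies the attention sink phenomenon: in autoregressive generation, models consistently assign strong attention to the first few tokens, even when those tokens carry little semantic weight. StreamingLLM exploits this by keeping the key-value cache entries of those initial tokens alongside a sliding window of recent tokens, which lets a model process effectively unbounded input streams without recomputation or fine-tuning while keeping performance stable.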
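
The cache policy behind this idea can be illustrated with a toy sketch. It is not the authors' implementation: the class name `SinkWindowCache`, its parameters `num_sink` and `window`, and the values used are all assumptions made for illustration. Integers stand in for per-token key-value entries, and positional-encoding handling is ignored entirely.

```python
from collections import deque


class SinkWindowCache:
    """Toy eviction policy: keep the first `num_sink` entries (the "sinks")
    plus the most recent `window` entries. Hypothetical helper, not a real API."""

    def __init__(self, num_sink=4, window=8):
        self.num_sink = num_sink
        self.sinks = []                      # never evicted
        self.recent = deque(maxlen=window)   # oldest entry drops out when full

    def append(self, entry):
        # The first few entries become permanent sinks; everything after
        # that goes through the bounded sliding window.
        if len(self.sinks) < self.num_sink:
            self.sinks.append(entry)
        else:
            self.recent.append(entry)

    def entries(self):
        # What attention would see at the current step: sinks + recent window.
        return self.sinks + list(self.recent)


cache = SinkWindowCache(num_sink=4, window=8)
for token_id in range(100):   # stream 100 placeholder "tokens"
    cache.append(token_id)

kept = cache.entries()
print(kept)  # [0, 1, 2, 3, 92, 93, 94, 95, 96, 97, 98, 99]
```

However long the stream runs, the cache holds at most `num_sink + window` entries, so memory and per-step attention cost stay constant.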

Appears in These Articles

长上下文：让模型读得更远
Long Context: Helping Models Read Farther