Efficient Streaming Language Models with Attention Sinks
arXiv: 2309.17453
TLDR
Identifies the attention sink phenomenon: during autoregressive generation, models consistently assign high attention scores to the first few tokens of the sequence. StreamingLLM exploits this by always keeping those initial tokens in the KV cache alongside a window of the most recent tokens, which lets it process infinite-length input streams without recomputation while keeping performance stable.
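
As a rough illustration of the idea, the sketch below models a cache that pins the first few entries (the attention sinks) and keeps a rolling window of the most recent ones, evicting everything in between. The class name SinkKVCache, the parameters num_sinks and window, and the use of integer token positions in place of real key/value tensors are all assumptions made for this example, not the paper's code.

```python
from collections import deque


class SinkKVCache:
    """Toy cache (hypothetical): pin the first `num_sinks` entries, roll the rest."""

    def __init__(self, num_sinks=4, window=8):
        self.num_sinks = num_sinks
        self.sinks = []                     # initial tokens, never evicted
        self.recent = deque(maxlen=window)  # most recent tokens only

    def append(self, entry):
        # The first few entries become permanent attention sinks.
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(entry)
        else:
            # A full deque with maxlen drops its oldest entry on append.
            self.recent.append(entry)

    def entries(self):
        # What the model would attend over at the current step.
        return self.sinks + list(self.recent)


cache = SinkKVCache(num_sinks=4, window=8)
for position in range(20):  # stand-in for a stream of 20 tokens
    cache.append(position)

print(cache.entries())  # [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]
```

The cache size stays fixed at num_sinks + window no matter how long the stream runs, so no past context has to be recomputed. A real implementation would hold per-layer key and value tensors and would also have to deal with positional encoding, both of which this sketch leaves out.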