Fast Inference from Transformers via Speculative Decoding

Authors: Yaniv Leviathan, Matan Kalman, Yossi Matias (2023)

Field

Inference

TLDR

A small draft model proposes several tokens ahead, and the large target model verifies them all in a single forward pass. This gives a roughly 2-3x speedup with essentially no loss in output quality (the paper's accept/reject rule is designed to preserve the target model's output distribution). Speculative decoding is now one of the standard techniques in mainstream inference engines such as vLLM and TensorRT-LLM.
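
The draft-then-verify idea summarized in the TLDR can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' code or any inference engine's API: `draft_probs` and `target_probs` are hypothetical stand-ins for the two models that return fixed next-token distributions, `gamma` is an assumed name for the number of drafted tokens, and the target model is queried in a plain loop where a real system would use one batched forward pass.

```python
import random

# Hypothetical toy "models": each returns a next-token distribution over a
# 4-token vocabulary. Real models would condition on `seq`; these ignore it.
def target_probs(seq):
    return [0.5, 0.3, 0.15, 0.05]

def draft_probs(seq):
    return [0.4, 0.4, 0.1, 0.1]

def sample(weights, rng):
    """Draw an index in proportion to non-negative weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

def speculative_step(prefix, draft_probs, target_probs, gamma, rng):
    """Run one draft-then-verify step and return the newly produced tokens."""
    # 1. The draft model proposes `gamma` tokens autoregressively.
    drafted, q_dists = [], []
    seq = list(prefix)
    for _ in range(gamma):
        q = draft_probs(seq)
        tok = sample(q, rng)
        drafted.append(tok)
        q_dists.append(q)
        seq.append(tok)

    # 2. The target model scores every drafted position plus one extra.
    #    (A loop here; in practice this is a single batched forward pass.)
    p_dists = [target_probs(list(prefix) + drafted[:i]) for i in range(gamma + 1)]

    # 3. Accept each drafted token with probability min(1, p/q).
    out = []
    for i, tok in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # First rejection: resample from the residual max(0, p - q) and stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(sample(residual, rng))
        return out

    # 4. Every draft was accepted: take one bonus token from the target model.
    out.append(sample(p_dists[gamma], rng))
    return out

if __name__ == "__main__":
    rng = random.Random(0)
    print(speculative_step([0], draft_probs, target_probs, gamma=4, rng=rng))
```

Each call yields between 1 and `gamma + 1` tokens for one target-model pass, which is where the speedup comes from. The accept/resample rule is what keeps the emitted tokens distributed according to the target model rather than the draft model.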

Appears in these articles

Sampling and Decoding: From Probabilities to Text
