Fast Inference from Transformers via Speculative Decoding

Authors: Yaniv Leviathan, Matan Kalman, Yossi Matias (2023)

Domains

Inference

TLDR (English)

A small draft model proposes several tokens ahead, and the large target model verifies them in a single parallel pass, giving a 2-3x speedup without changing the output distribution. It is now a standard technique in major inference engines such as vLLM and TensorRT-LLM.
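The draft-then-verify loop summarized in the TLDR can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `speculative_step`, `toy_target`, and `toy_draft` are hypothetical names, the "models" are fixed toy distributions over a 3-token vocabulary, and the target is called once per position here, where a real engine would score all positions in one batched forward pass. The accept rule (keep a drafted token with probability min(1, p/q), otherwise resample from the normalized positive part of p - q) follows the speculative sampling scheme in the paper.

```python
import random

def toy_target(ctx):
    # Hypothetical stand-in for the large model: next-token probabilities.
    return [0.5, 0.3, 0.2]

def toy_draft(ctx):
    # Hypothetical stand-in for the small draft model.
    return [0.4, 0.4, 0.2]

def speculative_step(target_probs, draft_probs, prefix, gamma, rng):
    """Run one speculative step and return the newly generated tokens (1 to gamma+1)."""
    # 1. The draft model proposes gamma tokens autoregressively.
    drafted, q_dists = [], []
    ctx = list(prefix)
    for _ in range(gamma):
        q = draft_probs(ctx)
        tok = rng.choices(range(len(q)), weights=q)[0]
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # 2. The target model scores every drafted position plus one extra.
    #    (Assumption: in practice this is a single parallel forward pass.)
    p_dists = [target_probs(list(prefix) + drafted[:i]) for i in range(gamma + 1)]

    # 3. Accept each drafted token with probability min(1, p/q).
    out = []
    for i, tok in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
        else:
            # First rejection: resample from the residual max(0, p - q) and stop.
            residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            out.append(rng.choices(range(len(p)), weights=residual)[0])
            return out

    # 4. Every draft was accepted: take one bonus token from the target.
    last = p_dists[gamma]
    out.append(rng.choices(range(len(last)), weights=last)[0])
    return out

if __name__ == "__main__":
    print(speculative_step(toy_target, toy_draft, [], gamma=4, rng=random.Random(0)))
```

Each step emits between 1 and gamma+1 tokens for roughly the cost of one target-model pass, which is where the speedup comes from. The rejection branch is what keeps the samples distributed according to the target model rather than the draft.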

Appears in These Articles

Sampling 与 Decoding：从概率到文字
Sampling and Decoding: From Probabilities to Text
