leviathan2023-spec-decoding

arXiv: 2211.17192

TLDR (English)

A small draft model proposes several tokens ahead, and the large target model verifies them in a single forward pass, giving a 2-3x speedup without changing the output distribution. Speculative decoding is now a standard feature of major inference engines such as vLLM and TensorRT-LLM.

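TLDR (translated from Chinese)

A small draft model predicts several tokens, and the large model then checks them in one pass, giving a 2-3x speedup with essentially no loss in output quality. It is one of the standard techniques in today's inference engines (vLLM, TensorRT-LLM).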
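
The draft-then-verify idea summarized above can be sketched on a toy vocabulary. This is a minimal illustration, not the paper's or any engine's implementation: the function name `speculative_step` and its arguments are hypothetical, and the per-position distributions are passed in as fixed lists, whereas a real system would obtain them from the draft model and from one parallel forward pass of the target model over the drafted prefix. The accept/reject rule shown (accept a drafted token `x` with probability `min(1, p(x)/q(x))`, otherwise resample from the normalized positive part of `p - q`) is the standard speculative sampling rule.

```python
import random


def speculative_step(draft_probs, target_probs, rng):
    """Run one draft-then-verify step and return the emitted tokens.

    draft_probs[i]  -- draft distribution q_i for the i-th drafted token
                       (gamma entries, one per drafted position)
    target_probs[i] -- target distribution p_i at the same position
                       (gamma + 1 entries; the last one is used only if
                       every drafted token is accepted)
    Assumption: distributions are plain lists of probabilities over a
    small vocabulary and do not depend on the sampled prefix.
    """
    gamma = len(draft_probs)
    vocab = list(range(len(target_probs[0])))
    emitted = []
    for i in range(gamma):
        q, p = draft_probs[i], target_probs[i]
        # Draft model proposes a token from q.
        x = rng.choices(vocab, weights=q)[0]
        # Target model accepts it with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p[x] / q[x]):
            emitted.append(x)
            continue
        # On rejection, resample from the positive part of (p - q) and
        # discard the remaining drafted tokens.
        residual = [max(0.0, pj - qj) for pj, qj in zip(p, q)]
        emitted.append(rng.choices(vocab, weights=residual)[0])
        return emitted
    # All drafts accepted: the same target pass yields one extra token.
    emitted.append(rng.choices(vocab, weights=target_probs[gamma])[0])
    return emitted


if __name__ == "__main__":
    rng = random.Random(0)
    q = [0.7, 0.2, 0.1]  # toy draft distribution
    p = [0.2, 0.5, 0.3]  # toy target distribution
    print(speculative_step([q, q], [p, p, p], rng))
```

Each step emits between 1 and `gamma + 1` tokens for a single target-model pass, which is where the speedup comes from. Because rejected tokens are replaced by a sample from the residual distribution, every emitted token is still distributed according to the target model.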