Neural Machine Translation of Rare Words with Subword Units

Authors: Rico Sennrich, Barry Haddow, Alexandra Birch (2016)

Field

Architecture

TLDR (English)

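Proposes applying BPE (Byte Pair Encoding) to word segmentation for neural machine translation. Starting from individual characters, BPE iteratively merges the most frequent pair of adjacent symbols into a new symbol, so frequent words end up as single tokens while rare words are split into smaller subword units. The number of merges sets the vocabulary size, which lets BPE trade off vocabulary size against the ability to handle rare words. This is the direct precursor of the tokenizers used in the GPT series and most modern LLMs.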
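
The merge procedure summarized above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released implementation: the function name `learn_bpe`, the `</w>` end-of-word marker convention, the tie-breaking rule (first pair encountered wins), and the toy word counts are all assumptions made for the example.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations from a {word: count} dict (toy sketch)."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties go to the pair seen first (an assumption).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab


# Toy corpus: word -> number of occurrences (illustrative values).
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(toy_counts, num_merges=4)
print(merges)
print(vocab)
```

On this toy input the first merges build up the frequent suffix `est</w>` before the prefix `lo`, which shows how common character sequences become single vocabulary entries while a rarer word such as "lower" stays split into several pieces. Raising `num_merges` grows the vocabulary and shortens the segmented text.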

Appears in these articles

Tokenization: How Models See Text

Co-cited

These papers appear in the same article as this one.
