Neural Machine Translation of Rare Words with Subword Units
arXiv: 1508.07909
TLDR (English)
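Proposes applying byte pair encoding (BPE), originally a data compression algorithm, to word segmentation for neural machine translation. Starting from individual characters, BPE iteratively merges the most frequent pair of adjacent symbols into a new symbol, so the number of merges directly controls the vocabulary size. Frequent words end up as single tokens, while rare and unseen words are represented as sequences of subword units. This approach is the direct precursor of the tokenizers used in the GPT models and many other modern LLMs.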
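The merge procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's reference implementation: the function names `learn_bpe`, `merge_pair`, and `segment`, the toy word counts, and the tie-breaking rule (first pair encountered wins) are all assumptions made for the example. The `</w>` end-of-word marker follows the convention commonly used with this method.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn an ordered list of merges from a {word: frequency} table."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Assumption: ties go to the pair that was counted first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def segment(word, merges):
    """Split a (possibly unseen) word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Toy word-frequency table, for illustration only.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(toy_counts, num_merges=4)
print(merges)
print(segment("lowest", merges))
```

With four merges, the unseen word "lowest" is split into known subword units instead of being mapped to an unknown-word token, which is the trade-off between vocabulary size and rare-word coverage that the summary describes.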