Tokenization: How Models See Text

LLMs do not read raw characters in the way humans do. A tokenizer first maps text into tokens: words, subwords, punctuation, spaces, Chinese characters, emoji fragments, or code pieces. The model then predicts the next token from the previous token sequence.

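BPE-style tokenization starts from small units such as characters or bytes and repeatedly merges the most frequent adjacent pair into a new vocabulary entry, balancing character-level coverage with word-level efficiency. A good vocabulary keeps common text compact while still representing rare words and new strings as sequences of smaller pieces.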
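The merge loop can be sketched in a few lines of Python. This is a toy, character-level illustration under simplifying assumptions (no byte fallback, no pre-tokenization, no special tokens), not the procedure of any particular production tokenizer; the function names `merge_pair`, `learn_bpe_merges`, and `encode` are made up for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE training)."""
    # Each distinct word starts as a tuple of single characters, weighted by frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the new merge everywhere before counting again.
        updated = Counter()
        for symbols, freq in corpus.items():
            updated[merge_pair(symbols, best)] += freq
        corpus = updated
    return merges


def encode(word, merges):
    """Tokenize one word by replaying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

Trained on a corpus where "ab" is frequent, the sketch learns to treat "ab" as one token, while a string it has never seen still encodes as individual characters. That is the coverage versus efficiency trade-off described above.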

Engineering view: the tokenizer is an interface contract

The tokenizer determines how much text fits in the context window, what a request costs, where truncation falls, which cache keys match, and whether stored data stays compatible. Changing it changes token IDs, so old prompts, tokenized fine-tuning data, embeddings, and KV-cache assumptions may break. Production systems should pin tokenizer versions and estimate budgets in tokens rather than characters.
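One way to enforce the pinning advice is to fingerprint the tokenizer's vocabulary and merge rules and compare that fingerprint at startup. The sketch below assumes the vocabulary is available as a plain token-to-ID dictionary and the merges as an ordered list of pairs. `tokenizer_fingerprint` and `assert_pinned` are hypothetical helper names, not part of any tokenizer library.

```python
import hashlib
import json


def tokenizer_fingerprint(vocab, merges=()):
    """Return a SHA-256 hex digest over the token-to-ID mapping and the ordered merge rules."""
    payload = json.dumps(
        {
            # Sort vocabulary entries so dictionary insertion order does not matter.
            "vocab": sorted(vocab.items()),
            # Merge order is significant in BPE, so it is preserved as given.
            "merges": [list(pair) for pair in merges],
        },
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def assert_pinned(vocab, merges, expected_fingerprint):
    """Raise if the loaded tokenizer does not match the fingerprint recorded at deployment."""
    actual = tokenizer_fingerprint(vocab, merges)
    if actual != expected_fingerprint:
        raise RuntimeError(
            f"tokenizer mismatch: expected {expected_fingerprint}, got {actual}"
        )
```

Storing the fingerprint next to cached prompts and tokenized datasets makes a silent tokenizer change fail loudly instead of producing mismatched token IDs.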

Multilingual and code-heavy workloads need extra care: the same semantic content can require very different token counts across languages. RAG pipelines should reserve token budget for instructions, retrieved evidence, and the model answer.
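The budget split can be made explicit in code. The following sketch assumes token counts have already been measured with the pinned tokenizer and that retrieved chunks arrive in ranking order. `select_chunks` and its parameters are illustrative names, and stopping at the first chunk that does not fit is one possible policy among several.

```python
def select_chunks(context_window, instruction_tokens, answer_reserve, chunk_token_counts):
    """Pick retrieved chunks, in ranking order, that fit the evidence budget.

    Returns (selected_indices, unused_tokens). All quantities are in tokens.
    """
    # Whatever is left after instructions and the reserved answer space is for evidence.
    evidence_budget = context_window - instruction_tokens - answer_reserve
    if evidence_budget < 0:
        raise ValueError("instructions and answer reserve exceed the context window")

    selected, used = [], 0
    for index, count in enumerate(chunk_token_counts):
        if used + count > evidence_budget:
            break  # preserve ranking order: stop at the first chunk that does not fit
        selected.append(index)
        used += count
    return selected, evidence_budget - used
```

For example, with a 1000-token window, 200 tokens of instructions, and 300 tokens reserved for the answer, chunks of 200, 250, and 100 tokens yield the first two chunks and 50 unused tokens. Because counts differ across languages, the same call can admit fewer chunks for one language than for another.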