Tokenization: How Models See Text
Intuition: text becomes building blocks
LLMs do not read raw characters the way humans do. A tokenizer first maps text into tokens: words, subwords, punctuation, spaces, Chinese characters, emoji fragments, or pieces of code. Each token is assigned an integer ID, and the model predicts the next token from the sequence of tokens that came before it.
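As a rough illustration of the text-to-token step, the sketch below encodes a string by greedy longest-match against a tiny, made-up vocabulary. The `VOCAB` table and the `encode`/`decode` helpers are hypothetical and exist only for this example; production tokenizers use learned vocabularies and their own matching rules.

```python
# Toy vocabulary (hypothetical). Real vocabularies hold tens of thousands of entries.
VOCAB = {"token": 0, "iz": 1, "ation": 2, " ": 3, "is": 4, "fun": 5, "!": 6}
ID_TO_PIECE = {token_id: piece for piece, token_id in VOCAB.items()}


def encode(text: str) -> list[int]:
    """Map text to token IDs by always taking the longest matching piece."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate substring first, then shrink it.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            # No piece in the toy vocabulary starts at this position.
            raise ValueError(f"no vocabulary entry covers {text[pos]!r}")
    return ids


def decode(ids: list[int]) -> str:
    """Concatenate the pieces behind each ID to recover the text."""
    return "".join(ID_TO_PIECE[token_id] for token_id in ids)


ids = encode("tokenization is fun!")
print(ids)
print(decode(ids))
```

The model only ever sees the integer list, which is why the same sentence can cost a different number of tokens under a different vocabulary.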
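BPE-style tokenization starts from small units such as characters or bytes and repeatedly merges the most frequent adjacent pair into a new vocabulary entry, balancing character-level coverage with word-level efficiency. A good vocabulary keeps common text compact while still being able to represent rare words and previously unseen strings.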
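The merge loop can be sketched in a few lines. This is a simplified version under stated assumptions: words are already split, there are no end-of-word markers or byte-level fallback, and ties are broken by first-seen order. The function names `learn_bpe_merges` and `merge_pair` are illustrative, not taken from any library.

```python
from collections import Counter


def merge_pair(symbols: tuple[str, ...], pair: tuple[str, str]) -> tuple[str, ...]:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each distinct word starts as a tuple of single characters, with its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the new merge rule to the whole corpus.
        merged_corpus = Counter()
        for symbols, freq in corpus.items():
            merged_corpus[merge_pair(symbols, best)] += freq
        corpus = merged_corpus
    return merges


words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
print(learn_bpe_merges(words, num_merges=3))
```

Frequent fragments such as a shared suffix are merged early, so common words end up as one or two tokens while rare words still decompose into smaller known pieces.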
Engineering view: the tokenizer is an interface contract
The tokenizer determines effective context length, cost, truncation points, cache keys, and data compatibility. Changing the tokenizer changes token IDs, so existing prompts, fine-tuning data, embedding tables, and KV-cache assumptions may break. Production systems should pin tokenizer versions and estimate budgets in tokens rather than characters.
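One possible way to enforce a pinned tokenizer is to fingerprint its vocabulary and merge rules and compare the result against a recorded value at startup. The sketch below assumes the tokenizer can be exported as a `vocab` mapping and a `merges` list; `tokenizer_fingerprint` and `assert_pinned` are hypothetical helpers, not part of any tokenizer library.

```python
import hashlib
import json


def tokenizer_fingerprint(vocab: dict[str, int], merges: list[tuple[str, str]]) -> str:
    """Return a SHA-256 hex digest over the vocabulary and the ordered merge rules."""
    payload = json.dumps(
        {
            # Sort vocabulary entries so dict insertion order does not matter.
            "vocab": sorted(vocab.items()),
            # Merge order is significant, so it is kept as given.
            "merges": [list(pair) for pair in merges],
        },
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def assert_pinned(vocab: dict[str, int], merges: list[tuple[str, str]], expected: str) -> None:
    """Raise if the loaded tokenizer does not match the recorded fingerprint."""
    actual = tokenizer_fingerprint(vocab, merges)
    if actual != expected:
        raise RuntimeError(
            f"tokenizer mismatch: expected {expected[:12]}..., got {actual[:12]}..."
        )


vocab = {"lo": 0, "w": 1, "est": 2}
merges = [("l", "o"), ("es", "t")]
pinned = tokenizer_fingerprint(vocab, merges)  # record this value alongside the model
assert_pinned(vocab, merges, pinned)           # passes while nothing has changed
```

Storing the fingerprint next to cached prompts or fine-tuning data makes a silent tokenizer change fail loudly instead of producing mismatched token IDs.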
Multilingual and code-heavy workloads need extra care: the same semantic content can require very different token counts across languages. RAG pipelines should reserve token budget for instructions, retrieved evidence, and the model answer.
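The budget reservation can be sketched as simple arithmetic over token counts. This example assumes the counts were already measured with the pinned tokenizer and that retrieved chunks arrive in ranked order; `select_chunks` and its parameters are hypothetical names used only for illustration.

```python
def select_chunks(
    context_window: int,
    instruction_tokens: int,
    answer_reserve: int,
    chunk_token_counts: list[int],
) -> tuple[list[int], int]:
    """Pick ranked chunks that fit after reserving instructions and answer space.

    Returns the indices of the selected chunks and the unused evidence budget.
    """
    evidence_budget = context_window - instruction_tokens - answer_reserve
    if evidence_budget < 0:
        raise ValueError("instructions and answer reserve exceed the context window")

    selected = []
    used = 0
    for index, count in enumerate(chunk_token_counts):
        if used + count > evidence_budget:
            break  # stop at the first chunk that does not fit, preserving rank order
        selected.append(index)
        used += count
    return selected, evidence_budget - used


chosen, leftover = select_chunks(
    context_window=4096,
    instruction_tokens=300,
    answer_reserve=1000,
    chunk_token_counts=[1200, 900, 800, 400],
)
print(chosen, leftover)
```

Stopping at the first chunk that does not fit is one design choice among several; a pipeline could instead skip the oversized chunk and keep trying smaller ones, at the cost of breaking rank order.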
References
- Efficient Estimation of Word Representations in Vector Space
Word2Vec popularized word embeddings by showing that shallow neural networks trained on large text corpora place semantically similar words close together in vector space. The well-known "king - man + woman ≈ queen" analogy illustrated the structure these vectors capture and helped establish learned embeddings as a standard input layer in later language models.