Tokenization: How Models See Text

Intuition: text becomes building blocks

LLMs do not read raw characters in the way humans do. A tokenizer first maps text into tokens: words, subwords, punctuation, spaces, Chinese characters, emoji fragments, or code pieces. The model then predicts the next token from the previous token sequence.

BPE-style tokenization starts from characters (or bytes) and repeatedly merges the most frequent adjacent pair of symbols, balancing character-level coverage with word-level efficiency. A good vocabulary keeps common text compact while still representing rare words and new strings.

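Example input: "大语言模型把文本切成 token，再预测下一个 token。" ("Large language models split text into tokens, then predict the next token.")

Characters: 30, Tokens: 14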

Engineering view: the tokenizer is an interface contract

The tokenizer determines how much text fits in the context window, what a request costs, where truncation falls, which cache keys match, and whether stored data stays compatible. Changing it changes token IDs, so old prompts, fine-tuning data, embeddings, and KV-cache assumptions may break. Production systems should pin tokenizer versions and estimate budgets in tokens rather than characters.
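One way to make the pinning concrete is to fingerprint the tokenizer and fold that fingerprint into every cache key. The sketch below is illustrative only: `tokenizer_fingerprint`, `cache_key`, and the toy vocabularies are hypothetical names and data, not part of any particular library.

```python
import hashlib
import json

def tokenizer_fingerprint(vocab, merges):
    """Return a stable hash of a tokenizer's vocabulary and merge list."""
    payload = json.dumps(
        # Sort vocab items so dict ordering does not change the hash;
        # merge order is kept because it affects segmentation.
        {"vocab": sorted(vocab.items()), "merges": [list(m) for m in merges]},
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cache_key(prompt, fingerprint):
    """Mix the tokenizer fingerprint into the key so a tokenizer change misses the cache."""
    return hashlib.sha256(f"{fingerprint}:{prompt}".encode("utf-8")).hexdigest()

# Toy tokenizers (hypothetical): same merges, but the token IDs are swapped.
vocab_v1 = {"low": 0, "est</w>": 1}
vocab_v2 = {"low": 1, "est</w>": 0}
merges = [("l", "o"), ("lo", "w")]

fp1 = tokenizer_fingerprint(vocab_v1, merges)
fp2 = tokenizer_fingerprint(vocab_v2, merges)
print(fp1 != fp2)                                          # True
print(cache_key("hello", fp1) != cache_key("hello", fp2))  # True
```

Because the fingerprint is part of the key, entries written under the old tokenizer are never returned after an upgrade; they simply stop matching.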

Multilingual and code-heavy workloads need extra care: the same semantic content can require very different token counts across languages. RAG pipelines should reserve token budget for instructions, retrieved evidence, and the model answer.
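The budgeting rule for RAG can be sketched as a small packing function. This is a minimal sketch under stated assumptions: `count_tokens` is a placeholder that splits on whitespace and must be replaced with the production tokenizer, and `select_evidence`, the limits, and the passages are hypothetical.

```python
def count_tokens(text):
    """Placeholder token counter (whitespace split); swap in the real tokenizer."""
    return len(text.split())

def select_evidence(passages, context_limit, instruction_tokens, answer_reserve):
    """Keep passages, in ranked order, that fit in the remaining token budget."""
    budget = context_limit - instruction_tokens - answer_reserve
    if budget <= 0:
        raise ValueError("no room left for retrieved evidence")
    selected, used = [], 0
    for passage in passages:
        cost = count_tokens(passage)
        if used + cost > budget:
            break  # stop at the first passage that does not fit
        selected.append(passage)
        used += cost
    return selected, used

instructions = "Answer using only the evidence below."
passages = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
selected, used = select_evidence(
    passages,
    context_limit=20,
    instruction_tokens=count_tokens(instructions),
    answer_reserve=8,
)
print(selected, used)  # ['alpha beta gamma', 'delta epsilon'] 5
```

Reserving the answer budget up front means retrieved evidence is trimmed first, rather than the model's reply being cut off at the context limit.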

Example code: BPE tokenization

Below is a simplified BPE implementation showing how frequent pairs are iteratively merged to build a subword vocabulary:

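import re
from collections import Counter

def get_vocab(corpus):
    """Represent each word as space-separated characters plus an end-of-word marker"""
    vocab = Counter()
    for word in corpus:
        vocab[' '.join(word) + ' </w>'] += 1
    return vocab

def get_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency"""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i+1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge every occurrence of the given pair into a single symbol"""
    new_vocab = {}
    replacement = ''.join(pair)
    # Match the pair only at whole-symbol boundaries. A plain str.replace
    # would also match inside longer symbols (e.g. 'a b' inside 'xa b').
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    for word, freq in vocab.items():
        # A function replacement avoids backslash escapes being interpreted.
        new_word = pattern.sub(lambda match: replacement, word)
        new_vocab[new_word] = freq
    return new_vocab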

# Example: training BPE
corpus = ['low', 'lower', 'newest', 'widest']
vocab = get_vocab(corpus)
print("Initial vocab:", vocab)

# Iteratively merge most frequent pair
num_merges = 3
for i in range(num_merges):
    pairs = get_pairs(vocab)
    if not pairs:
        break
    best_pair = pairs.most_common(1)[0][0]
    vocab = merge_vocab(best_pair, vocab)
    print(f"Merge {best_pair}: {vocab}")
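Training produces an ordered list of merges; encoding new text replays those merges in the same order. The sketch below illustrates that step under stated assumptions: `apply_merges` is a hypothetical helper, and the `merges` list is written by hand for the example rather than taken from the training run above.

```python
def apply_merges(word, merges):
    """Segment a word by replaying BPE merges in the order they were learned."""
    symbols = list(word) + ['</w>']
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Merge only when both symbols match exactly at adjacent positions.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hand-written merge list for illustration (an assumption, not learned output).
merges = [('l', 'o'), ('lo', 'w'), ('e', 's'), ('es', 't'), ('est', '</w>')]

print(apply_merges('lowest', merges))  # ['low', 'est</w>']
print(apply_merges('slow', merges))    # ['s', 'low', '</w>']
```

Neither "lowest" nor "slow" needs to appear in the training corpus: both are covered by reusing subwords, with single characters as the fallback.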

Research view: linguistic boundaries of vocabulary construction

Tokenization is not only an engineering problem; it also encodes linguistic assumptions. Subword algorithms (BPE, WordPiece, Unigram) and the toolkits that implement them, such as SentencePiece, shape how models learn morphology and word formation. For example, does BPE’s greedy merging bias models toward frequent compound words while ignoring rare but meaningful prefixes and suffixes?

Tokenization fairness in multilingual models is an active research area: “tokens per word” varies dramatically across languages, which may systematically lower representation quality for low-resource languages. Byte-level BPE (as used in GPT-2) uses bytes rather than Unicode characters as base units, which improves coverage of unknown characters and code but can also produce longer sequences.
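The length cost of byte-level base units can be seen before any merges are learned by comparing character counts with UTF-8 byte counts. The sketch below is a rough illustration only: the sample sentences and the `base_units` helper are assumptions chosen for the example, and real token counts depend on the trained vocabulary.

```python
def base_units(text):
    """Return (characters, UTF-8 bytes): sequence lengths before any merges."""
    return len(text), len(text.encode("utf-8"))

# Illustrative sentences with roughly the same meaning (assumed examples).
samples = {
    "English": "Models predict the next token.",
    "Russian": "Модели предсказывают следующий токен.",
    "Chinese": "大语言模型把文本切成 token，再预测下一个 token。",
}

for language, text in samples.items():
    chars, n_bytes = base_units(text)
    # ASCII text needs one byte per character; other scripts need more.
    print(f"{language}: {chars} chars, {n_bytes} bytes, {n_bytes / chars:.2f} bytes/char")
```

A byte-level tokenizer has to recover this gap through learned merges, which is one reason scripts that are rare in the training data tend to end up with more tokens for the same content.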

🔬 Open Research Questions

Key questions and research directions in this area:

Does BPE's greedy merging systematically bias vocabularies toward high-frequency compounds? How can the impact of this bias on linguistic learning be quantified?
Related: Sennrich et al. (2016), BPE

How can multilingual tokenization fairness be quantified? How does variance in "tokens per word" affect model performance across languages?
Related: Mikolov et al. (2013), Word2Vec

Byte-level vs character-level tokenization: can the tradeoff between sequence length and coverage be further optimized?

Do specialized domains (code, math symbols, emoji) require custom tokenization strategies?

References

Efficient Estimation of Word Representations in Vector Space — Tomas Mikolov et al. (2013)
Word2Vec popularized word embeddings by showing how to train them efficiently: shallow neural networks are trained on large text corpora so that semantically similar words cluster in vector space. The famous "king - man + woman ≈ queen" analogy demonstrated its power and helped establish the learned embedding layers used in later language models.
Neural Machine Translation of Rare Words with Subword Units — Rico Sennrich et al. (2016)
Proposes applying BPE (Byte Pair Encoding) to word segmentation for neural machine translation. By iteratively merging the most frequent pair of adjacent symbols, BPE balances vocabulary size against the ability to handle rare words. This approach is the direct ancestor of the tokenizers in GPT and many modern LLMs.

