Embeddings: Putting Discrete Symbols into Continuous Space

Intuition: similar meanings cluster together

An embedding turns a token, sentence, or document into a vector of numbers. In an ideal embedding space, semantically similar items are closer together: “cat” and “dog” are nearer to each other than “cat” and “rocket”. This lets models process discrete language with continuous math, and enables search, clustering, recommendation, and RAG.

Early word vectors showed that vector differences can encode relationships (e.g., king - man + woman ≈ queen). Later contextual representations addressed a limitation of these static vectors: a static model assigns one vector per word type, yet the same word can mean different things in different sentences. For example, “Apple” in “Apple released a phone” and “apple” in “I ate an apple” should not map to identical vectors.
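
The contrast between static and contextual vectors can be sketched with a toy example. The 3-dimensional vectors and the `toy_contextual_embed` helper below are hypothetical and invented for illustration; blending a token with the mean of its neighbors is only a stand-in for what a real contextual encoder (such as ELMo or BERT) learns.

```python
import numpy as np

# Hypothetical static vectors, made up for this illustration.
static = {
    "apple":    np.array([0.5, 0.5, 0.0]),
    "released": np.array([0.9, 0.1, 0.0]),
    "phone":    np.array([1.0, 0.0, 0.1]),
    "ate":      np.array([0.0, 0.9, 0.1]),
    "i":        np.array([0.1, 0.1, 0.1]),
    "a":        np.array([0.1, 0.1, 0.1]),
    "an":       np.array([0.1, 0.1, 0.1]),
}

def static_embed(tokens, index):
    """A static model ignores context: one vector per word type."""
    return static[tokens[index]]

def toy_contextual_embed(tokens, index, mix=0.5):
    """Blend a token's static vector with the mean of the other tokens.

    This is a crude assumption standing in for a trained contextual encoder.
    """
    own = static[tokens[index]]
    others = [static[t] for i, t in enumerate(tokens) if i != index]
    context = np.mean(others, axis=0)
    return (1 - mix) * own + mix * context

s1 = ["apple", "released", "a", "phone"]
s2 = ["i", "ate", "an", "apple"]

same_static = np.allclose(static_embed(s1, 0), static_embed(s2, 3))
same_contextual = np.allclose(toy_contextual_embed(s1, 0),
                              toy_contextual_embed(s2, 3))
print(f"static vectors identical:     {same_static}")
print(f"contextual vectors identical: {same_contextual}")
```

In this sketch the static lookup returns the same vector for "apple" in both sentences, while the context-dependent version does not.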

Engineering view: vector quality depends on the objective

Training embeddings can use many objectives: predicting neighboring words, language modeling, contrastive learning, or supervised fine-tuning. Different objectives produce different geometric structures. Retrieval systems care about recall and ranking, so you must evaluate vector dimension, normalization, distance function, chunking strategy, and negative-sample quality.
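
As a rough illustration of why normalization and the distance function belong in the evaluation, the sketch below computes recall@1 under dot-product and cosine scoring. The document and query vectors, and the helper names `top_k` and `recall_at_k`, are hypothetical; it also assumes exactly one relevant document per query.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (eps guards against zero vectors)."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

def top_k(query_vecs, doc_vecs, k, metric="cosine"):
    """Return the indices of the k highest-scoring documents per query."""
    if metric == "cosine":
        query_vecs = l2_normalize(query_vecs)
        doc_vecs = l2_normalize(doc_vecs)
    elif metric != "dot":
        raise ValueError(f"unknown metric: {metric}")
    scores = query_vecs @ doc_vecs.T
    return np.argsort(-scores, axis=1)[:, :k]

def recall_at_k(retrieved, relevant):
    """Fraction of queries whose single relevant doc id is in the top-k."""
    hits = [rel in row for row, rel in zip(retrieved, relevant)]
    return float(np.mean(hits))

# Toy data: doc 2 has a large norm but points in a less relevant direction.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
queries = np.array([[0.9, 0.1], [0.1, 0.9]])
relevant = [0, 1]

recall_dot = recall_at_k(top_k(queries, docs, k=1, metric="dot"), relevant)
recall_cos = recall_at_k(top_k(queries, docs, k=1, metric="cosine"), relevant)
print(f"recall@1 with dot product: {recall_dot:.2f}")
print(f"recall@1 with cosine:      {recall_cos:.2f}")
```

With unnormalized vectors the long document dominates dot-product ranking; after normalization the direction decides. Which behavior is desirable depends on how the embedding model was trained.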

In an LLM, the input embedding layer maps token IDs to vectors; the output layer is often tied to or derived from the embedding weights. In RAG, the embedding model is the entry point to the external index. Do not treat embeddings as the whole story: they capture similarity well, but their handling of numbers, negation, time, and compositional logic needs separate evaluation.
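
A minimal sketch of the lookup and of weight tying, using NumPy with a randomly initialized matrix. The sizes and names (`vocab_size`, `d_model`, `E`) are assumptions for illustration, and the final hidden state is faked by reusing the input vectors; a real model applies its transformer layers in between.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 6, 4

# Embedding matrix: one row per token ID (random here, learned in practice).
E = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([2, 5, 2])
x = E[token_ids]  # lookup: shape (3, d_model)

# The lookup is equivalent to multiplying one-hot rows by E.
one_hot = np.eye(vocab_size)[token_ids]
lookup_matches_matmul = np.allclose(one_hot @ E, x)

# Weight tying: reuse E (transposed) as the output projection.
h = x  # stand-in for the final hidden states
logits = h @ E.T  # shape (3, vocab_size)

# Numerically stable softmax over the vocabulary.
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print(x.shape, logits.shape, lookup_matches_matmul)
```

Tying means the same parameters define both how a token is read in and how it is scored on the way out, which reduces parameter count.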

Example code: semantic relationships in word vectors

import numpy as np

def cosine_similarity(v1, v2):
    """Compute cosine similarity between two vectors"""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Simplified toy word vectors (real embeddings typically have hundreds to thousands of dimensions)
embeddings = {
    'king': np.array([0.5, 0.8, 0.2, 0.9]),
    'queen': np.array([0.4, 0.7, 0.3, 0.85]),
    'man': np.array([0.6, 0.3, 0.1, 0.4]),
    'woman': np.array([0.5, 0.2, 0.2, 0.35]),
    'cat': np.array([0.1, 0.1, 0.9, 0.2]),
    'dog': np.array([0.15, 0.12, 0.85, 0.25]),
}

# Show analogy: king - man + woman ≈ queen
king_man_woman = embeddings['king'] - embeddings['man'] + embeddings['woman']
similarity = cosine_similarity(king_man_woman, embeddings['queen'])
print(f"'king-man+woman' similarity to 'queen': {similarity:.3f}")

# Show semantic similarity
cat_dog_sim = cosine_similarity(embeddings['cat'], embeddings['dog'])
cat_king_sim = cosine_similarity(embeddings['cat'], embeddings['king'])
print(f"cat-dog similarity: {cat_dog_sim:.3f}")
print(f"cat-king similarity: {cat_king_sim:.3f}")
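
The listing above compares the offset vector directly against "queen". Analogy evaluation is usually framed as a nearest-neighbor search that excludes the three query words, since the offset vector otherwise tends to land closest to one of its own inputs. A sketch of that search, reusing the same toy vectors (the `analogy` helper is a hypothetical name):

```python
import numpy as np

# Same toy vectors as in the listing above.
vectors = {
    'king': np.array([0.5, 0.8, 0.2, 0.9]),
    'queen': np.array([0.4, 0.7, 0.3, 0.85]),
    'man': np.array([0.6, 0.3, 0.1, 0.4]),
    'woman': np.array([0.5, 0.2, 0.2, 0.35]),
    'cat': np.array([0.1, 0.1, 0.9, 0.2]),
    'dog': np.array([0.15, 0.12, 0.85, 0.25]),
}

def analogy(a, b, c, vectors):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = vectors[a] - vectors[b] + vectors[c]
    best_word, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # skip the query words themselves
        sim = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word, best_sim

word, sim = analogy('king', 'man', 'woman', vectors)
print(f"king - man + woman -> {word} (cosine {sim:.3f})")
```

These toy vectors were chosen so that the offset lands on "queen" by construction; with real pretrained vectors the match is approximate and depends on the vocabulary searched.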

Research view: geometry and semantics of vector space

Research on embedding space geometry reveals statistical regularities of language, but also systematic biases. Word2Vec’s “king - man + woman ≈ queen” demonstrated linear encoding of analogical relations, but do these relations hold fairly across all sociocultural concepts? Studies show that pretrained word vectors often carry gender, racial, and occupational biases that propagate to downstream tasks.
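
One common diagnostic for such bias is to define a direction from a word pair and project other words onto it. The sketch below uses hypothetical 3-dimensional vectors invented for illustration; real analyses use pretrained vectors and typically average over many definitional pairs rather than a single one.

```python
import numpy as np

def unit(v):
    """Return v scaled to unit length."""
    return v / np.linalg.norm(v)

# Hypothetical toy vectors; not taken from any trained model.
toy = {
    "he":       np.array([1.0, 0.2, 0.1]),
    "she":      np.array([-1.0, 0.2, 0.1]),
    "engineer": np.array([0.4, 0.9, 0.1]),
    "nurse":    np.array([-0.5, 0.8, 0.2]),
    "teacher":  np.array([0.0, 0.9, 0.3]),
}

# Direction defined by a single pair (an assumption made for brevity).
gender_direction = unit(toy["he"] - toy["she"])

def projection(word):
    """Cosine between a word vector and the chosen direction."""
    return float(np.dot(unit(toy[word]), gender_direction))

for w in ("engineer", "nurse", "teacher"):
    print(f"{w:>9}: {projection(w):+.3f}")
```

A projection far from zero for a word that should be neutral is the kind of signal that can propagate to downstream tasks.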

Contextual representations (such as ELMo and BERT) mitigate some ambiguity problems of static word vectors, but introduce new questions: representations from different layers capture different levels of linguistic structure (surface syntax vs. deep semantics). Probing tasks attempt to determine what models “know,” but does a successful probe mean the model truly grasps a concept, or merely exploits statistical cues? This remains a central debate in representation learning.
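
A probing setup can be sketched as a linear classifier trained on frozen representations. Everything below is synthetic and hypothetical: the "representations" are random vectors with a binary property planted in one dimension, and the probe is plain logistic regression in NumPy. The shuffled-label run is one simple sanity baseline, not a resolution of the debate above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "frozen representations": dimension 0 carries a binary property.
n, d = 400, 8
labels = rng.integers(0, 2, size=n)
reps = rng.normal(size=(n, d))
reps[:, 0] += 2.0 * (2 * labels - 1)

def train_linear_probe(x, y, steps=500, lr=0.1):
    """Fit logistic regression with batch gradient descent."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        grad = p - y
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def accuracy(w, b, x, y):
    """Share of examples where the probe's decision matches the label."""
    return float(np.mean(((x @ w + b) > 0) == (y == 1)))

train, test = slice(0, 300), slice(300, n)

# Probe on the real labels.
w, b = train_linear_probe(reps[train], labels[train])
probe_acc = accuracy(w, b, reps[test], labels[test])

# Control: same probe, labels shuffled so they no longer relate to reps.
shuffled = rng.permutation(labels)
w_c, b_c = train_linear_probe(reps[train], shuffled[train])
control_acc = accuracy(w_c, b_c, reps[test], shuffled[test])

print(f"probe accuracy:   {probe_acc:.2f}")
print(f"control accuracy: {control_acc:.2f}")
```

High probe accuracy here only shows that the property is linearly decodable from the vectors; it does not show that a model uses the property, which is exactly the gap the debate is about.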

🔬 Open Research Questions

Key questions and research directions in this area:

What is the fundamental difference between static word vectors (word2vec, GloVe) and contextualized representations (ELMo)? Is this evolution inevitable?
Related: mikolov2013 word2vec, pennington2014 glove, peters2018 elmo

What is the mathematical connection between negative sampling's objective function and implicit matrix factorization?
Related: mikolov2013 skipgram

How can embedding quality for multilingual or low-resource languages be systematically improved? How does vocabulary construction affect embedding learning?
Related: mikolov2013 word2vec
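
On the negative-sampling question, a known starting point is the analysis by Levy and Goldberg (2014), sketched below. The notation is an assumption of this sketch: #(w,c) is the number of times word w occurs with context c, |D| is the total number of pairs, k is the number of negative samples, and negatives are drawn from the unigram distribution P_D(c) = #(c)/|D|. Practical word2vec implementations use a smoothed noise distribution, so the result holds only approximately there.

```latex
% Skip-gram with negative sampling (SGNS) objective over the corpus
\ell = \sum_{w}\sum_{c} \#(w,c)\,\Big(
         \log\sigma(\vec{w}\cdot\vec{c})
         + k\,\mathbb{E}_{c_N \sim P_D}\big[\log\sigma(-\vec{w}\cdot\vec{c}_N)\big]
       \Big)

% With enough dimensions for each dot product to be set independently,
% the optimum satisfies
\vec{w}\cdot\vec{c}
  = \log\frac{\#(w,c)\,|D|}{\#(w)\,\#(c)} - \log k
  = \mathrm{PMI}(w,c) - \log k
```

Read this way, SGNS implicitly factorizes a word-context matrix of PMI values shifted by log k, which connects it to count-based methods such as the co-occurrence factorization used by GloVe.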

References

Efficient Estimation of Word Representations in Vector Space — Tomas Mikolov et al. (2013)
Word2Vec made word embeddings practical at scale: shallow neural models (CBOW and skip-gram) are trained on large text corpora so that semantically similar words cluster in vector space. The famous "king - man + woman ≈ queen" analogy demonstrated its power and helped establish learned embeddings as a standard component of later language models.
Deep contextualized word representations — Matthew E. Peters et al. (2018)
ELMo introduced contextualized word embeddings: the same word has different vector representations in different contexts (e.g., "bank" in financial vs. riverbank contexts). Using bidirectional LSTMs, ELMo set new SOTA on multiple NLP tasks and laid the conceptual foundation for BERT and subsequent pretrained models.
GloVe: Global Vectors for Word Representation — Jeffrey Pennington et al. (2014)
GloVe learns word vectors by factorizing word co-occurrence matrices, combining the advantages of count-based methods (LSA) and prediction-based methods (Word2Vec). It achieved state-of-the-art results on word analogy and similarity tasks at the time of publication and remains a widely used baseline in academia.
Distributed Representations of Words and Phrases and their Compositionality — Tomas Mikolov et al. (2013)
The NIPS 2013 (now NeurIPS) word2vec paper. It presents negative sampling as a simpler alternative to hierarchical softmax and adds subsampling of frequent words and phrase-level vectors. It influenced many later embedding training objectives, including fastText.
Convolutional Neural Networks for Sentence Classification — Yoon Kim (2014)
Uses a CNN over pre-trained word vectors for sentence classification, showing that pre-trained embeddings plus a simple architecture achieve strong results on multiple benchmarks. An early sign of the pre-training paradigm entering NLP.
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators — Kevin Clark et al. (2020)
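Uses replaced token detection instead of masked language modeling (MLM), which makes pre-training markedly more compute-efficient: small ELECTRA models outperform comparably sized BERT models. Representative work on the idea that the pre-training objective determines sample efficiency.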

Related Reading

Tokenization: How Models See Text

Positional Encoding: Where Does Order Come From

RAG and Retrieval Augmentation: Giving Models External Memory

Attention: Choosing the Relevant Context
