RAG and Retrieval Augmentation: Giving Models External Memory

Intuition: open-book exams are more accurate than closed-book

An LLM’s knowledge comes from pretraining data, which has a cutoff date and may contain errors. RAG (Retrieval-Augmented Generation) works by looking up relevant documents before answering, inserting them into the prompt, and generating an answer grounded in those sources. It is like an open-book exam: the model does not rely solely on memory, but can cite external information.
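The retrieve-then-prompt loop described above can be sketched in a few lines. This is a minimal illustration, not a production design: it assumes a bag-of-words cosine similarity in place of a learned embedding model, and the names `retrieve`, `build_prompt`, and `DOCS` are hypothetical. The final generation call is left out, since it depends on whichever model is in use.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, k=2):
    """Return the indices of the k documents most similar to the query."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(d))), i) for i, d in enumerate(docs)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties by index
    return [i for score, i in scored[:k] if score > 0]


def build_prompt(query, docs, k=2):
    """Insert the retrieved documents into the prompt as numbered sources."""
    hits = retrieve(query, docs, k)
    sources = "\n".join(f"[{n}] {docs[i]}" for n, i in enumerate(hits, start=1))
    return (
        "Answer using only the sources below and cite them by number.\n"
        f"Sources:\n{sources}\n"
        f"Question: {query}\nAnswer:"
    )


# Toy knowledge base (assumed data, for illustration only).
DOCS = [
    "The library opens at 9 am on weekdays.",
    "Parking permits are issued by the campus office.",
    "On weekends the library opens at 11 am.",
]

prompt = build_prompt("When does the library open on weekends?", DOCS)
print(prompt)
```

Swapping the knowledge base only means changing `DOCS`; the model itself is untouched, which is the updatability property discussed next.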

RAG’s core advantages: knowledge can be updated (just swap the database), answers are traceable (you know which document they came from), and hallucinations are reduced (answers are grounded rather than invented).

Engineering view: chunking, embedding, reranking, and generation

A complete RAG system has multiple stages, each with engineering trade-offs:

Document processing: Long documents must be chunked. Chunks that are too large may exceed the context window; chunks that are too small may lose contextual semantics. Common strategies include fixed-length, paragraph-based, semantic-boundary, or recursive chunking.
Embedding and indexing: Use an embedding model to turn text into vectors and store them in a vector database (e.g., FAISS, Milvus, Pinecone). Evaluate recall: does Top-k retrieval include the truly relevant documents?
Query optimization: The user’s original query may be poorly expressed. HyDE (Hypothetical Document Embeddings) has the model generate a hypothetical answer first, then retrieve using that answer; query rewriting, expansion, and routing are also common.
Reranking: Use a lightweight model to recall many candidates, then a stronger cross-encoder to precisely rerank, balancing speed and accuracy.
Generation and citation: When concatenating retrieved results into the prompt, watch for ordering, redundancy, and conflicts. Ask the model to provide citations so users can verify.
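Two of the stages above, chunking and two-stage reranking, can be sketched as follows. This is an illustrative sketch under stated assumptions: `chunk_text` implements only the fixed-length strategy (with overlap, so text cut at a boundary still appears whole in a neighbouring chunk), and the two scoring functions are toy stand-ins. In a real system the cheap scorer would typically be an embedding similarity and the strong scorer a cross-encoder. All function names here are hypothetical.

```python
def chunk_text(text, size=200, overlap=50):
    """Fixed-length character chunking with overlap between neighbours."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):  # the last chunk reached the end
            break
    return chunks


def word_overlap(query, chunk):
    """Cheap first-stage score: number of distinct shared words."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def bigram_overlap(query, chunk):
    """Toy stand-in for a stronger reranker: shared adjacent word pairs."""
    def bigrams(text):
        words = text.lower().split()
        return set(zip(words, words[1:]))
    return len(bigrams(query) & bigrams(chunk))


def retrieve_then_rerank(query, chunks, cheap_score, strong_score,
                         recall_k=20, final_k=3):
    """Recall many candidates cheaply, then rerank only those candidates."""
    candidates = sorted(chunks, key=lambda c: cheap_score(query, c),
                        reverse=True)[:recall_k]
    return sorted(candidates, key=lambda c: strong_score(query, c),
                  reverse=True)[:final_k]


# Toy corpus (assumed data, for illustration only).
CHUNKS = [
    "a vector database stores an index of embeddings",
    "the index of a database can be a vector",
    "cats sleep all day",
    "vector database index tuning guide",
]

top = retrieve_then_rerank("vector database index", CHUNKS,
                           word_overlap, bigram_overlap,
                           recall_k=3, final_k=2)
print(top)
```

The design point is that the expensive scorer runs on at most `recall_k` candidates rather than on the whole corpus, which is where the speed and accuracy trade-off comes from.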

Evaluating RAG requires more than generation quality: you must also measure retrieval recall, answer faithfulness, and end-to-end latency. A common failure mode is “retrieved but not used,” indicating a misalignment between the retrieval and generation stages.
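Two of these checks are easy to sketch. The snippet below assumes that each test question comes with a labeled set of relevant document IDs, and that the generator cites sources with bracketed numbers such as `[2]`; both are assumptions about the setup, not requirements of RAG. `recall_at_k` measures the retrieval stage alone, while `uncited_sources` is a rough signal for the "retrieved but not used" failure mode. The function names are hypothetical.

```python
import re


def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def uncited_sources(answer, num_sources):
    """Source numbers (1-based) that were in the prompt but never cited."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return [n for n in range(1, num_sources + 1) if n not in cited]


# One relevant document (d1) is in the top 2, the other (d2) is missed.
print(recall_at_k(["d3", "d1", "d7"], {"d1", "d2"}, k=2))

# Three sources were supplied, but the answer cites only source 2.
print(uncited_sources("The library opens at 11 am [2].", 3))
```

A citation check of this kind only shows that a source was mentioned, not that the answer is faithful to it, so it complements rather than replaces a faithfulness evaluation.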

Research view: the boundary and new directions of RAG

A hot research topic is the relationship between RAG and long-context models: if a model can read an entire book, is retrieval still needed? A common view is that retrieval retains advantages in precision, updatability, and computational efficiency, and that the two approaches are converging: models can increasingly decide for themselves when and what to retrieve.

Frontier directions include: adaptive retrieval (only look up when uncertain), multi-hop reasoning (tracing clues across documents), structured RAG (combining knowledge graphs, SQL databases, etc.), and end-to-end differentiable retrieval (letting the model learn how to retrieve).
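As a rough illustration of adaptive retrieval, the sketch below retrieves only when a draft answer contains a low-confidence token. Everything here is an assumption made for illustration: the threshold rule, the `draft`, `retrieve`, and `generate` callables, and the idea that per-token probabilities are available from the model. Published adaptive-retrieval methods use more refined triggers.

```python
def needs_retrieval(token_probs, threshold=0.5):
    """True when any token of the draft answer falls below the threshold."""
    return any(p < threshold for p in token_probs)


def answer_adaptively(question, draft, retrieve, generate, threshold=0.5):
    """Answer from parametric memory when confident, otherwise retrieve first.

    draft(question)          -> (text, list of per-token probabilities)
    retrieve(question)       -> list of documents
    generate(question, docs) -> answer grounded in the documents
    Returns (answer, used_retrieval).
    """
    text, token_probs = draft(question)
    if not needs_retrieval(token_probs, threshold):
        return text, False
    return generate(question, retrieve(question)), True


# Hypothetical stand-ins for a model and a retriever.
def confident_draft(question):
    return "Paris", [0.98, 0.95]


def unsure_draft(question):
    return "Maybe 1887", [0.9, 0.2]


def fake_retrieve(question):
    return ["The Eiffel Tower was completed in 1889."]


def fake_generate(question, docs):
    return f"According to the source: {docs[0]}"


print(answer_adaptively("Capital of France?", confident_draft,
                        fake_retrieve, fake_generate))
print(answer_adaptively("When was the Eiffel Tower completed?", unsure_draft,
                        fake_retrieve, fake_generate))
```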

🔬 Open Research Questions

Key questions and research directions in this area:

How can retrieval and generation be co-optimized? When should more documents be retrieved vs. trusting the model's parametric memory?

Related: lewis2020 rag, borgeaud2022 retro

What is the theoretical foundation of query expansion methods like HyDE? Under what conditions can they stably improve recall?

Related: gao2022 hyde

What unique challenges does multimodal RAG (e.g., combining CLIP) face? How can the alignment quality of image-text retrieval be guaranteed?

Related: radford2021 clip

References

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — Patrick Lewis et al. (2020)
RAG (Retrieval-Augmented Generation) combines pretrained LMs with information retrieval: for each query, retrieve relevant documents from a knowledge base, then generate answers with the documents in context. This addresses LLM knowledge staleness and hallucination, and is now a core architecture in enterprise AI applications.
Dense Passage Retrieval for Open-Domain Question Answering — Vladimir Karpukhin et al. (2020)
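Trains a dual-encoder BERT retriever with in-batch negatives and shows that dense retrieval can outperform BM25 on most of the open-domain QA benchmarks it evaluates, although BM25 remains a strong baseline and is often combined with dense retrieval in hybrid systems. The paper helped establish the embed-and-search pattern behind today's vector search tooling (FAISS, pgvector).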
Precise Zero-Shot Dense Retrieval without Relevance Labels — Luyu Gao et al. (2022)
Has the LLM generate a hypothetical answer first, then uses that answer's embedding to retrieve real documents. It works zero-shot, without relevance labels, and generalizes well, making it one of the most widely reused retrieval tricks of the RAG era.
Improving language models by retrieving from trillions of tokens — Sebastian Borgeaud et al. (2022)
DeepMind's RETRO adds chunk-level retrieval during pretraining, letting a 7.5B-parameter model reach performance comparable to the 175B-parameter GPT-3 on the Pile. It shows that retrieval is not only an inference-time RAG technique but also a possible pretraining paradigm.
Learning Transferable Visual Models From Natural Language Supervision — Alec Radford et al. (2021)
Trains image and text encoders contrastively on 400M image-text pairs to obtain general-purpose visual representations. CLIP encoders are still used in or alongside many multimodal systems today (DALL·E, Stable Diffusion, LLaVA).
Visual Instruction Tuning — Haotian Liu et al. (2023)
Combines a CLIP vision encoder with a LLaMA-based LLM (Vicuna) and trains on multimodal instruction data synthesized with GPT-4, producing one of the first open-source GPT-4V-style models with modest compute. It became a starting point for the open-source multimodal ecosystem (LLaVA-1.5/1.6, Qwen-VL, InternVL).

