RAG and Retrieval Augmentation: Giving Models External Memory
Intuition: open-book exams are more accurate than closed-book
An LLM’s knowledge comes from pretraining data, which has a cutoff date and may contain errors. RAG (Retrieval-Augmented Generation) works by looking up relevant documents before answering, inserting them into the prompt, and generating an answer grounded in those sources. It is like an open-book exam: the model does not rely solely on memory, but can cite external information.
RAG’s core advantages: knowledge can be updated (just swap the database), answers are traceable (you know which document they came from), and hallucinations are reduced (answers are grounded rather than invented).
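The retrieve-then-generate loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production design: the bag-of-words cosine score stands in for a real embedding model, the `docs` dictionary, its ids, and the helper names `retrieve` and `build_prompt` are hypothetical, and the final call to an LLM is left out.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    # Cosine similarity between two bag-of-words count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=2):
    # Toy retriever (assumption): rank documents by word-overlap similarity.
    q = Counter(tokenize(query))
    ranked = sorted(docs.items(),
                    key=lambda item: cosine(q, Counter(tokenize(item[1]))),
                    reverse=True)
    return ranked[:k]

def build_prompt(query, hits):
    # Insert the retrieved text into the prompt and ask for citations by id.
    sources = "\n".join(f"[{doc_id}] {text}" for doc_id, text in hits)
    return (
        "Answer using only the sources below and cite their ids.\n"
        f"Sources:\n{sources}\n"
        f"Question: {query}\nAnswer:"
    )

# Hypothetical knowledge base; swapping these entries updates what the model can cite.
docs = {
    "doc1": "The warranty period for the X100 laptop is two years.",
    "doc2": "The cafeteria opens at eight in the morning.",
    "doc3": "X100 laptop batteries are covered by warranty for one year.",
}
query = "How long is the X100 laptop warranty?"
hits = retrieve(query, docs, k=2)
prompt = build_prompt(query, hits)
print(prompt)
```

Because each source carries its id into the prompt, an answer that cites `[doc1]` can be traced back to the document it came from.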
Engineering view: chunking, embedding, reranking, and generation
A complete RAG system has multiple stages, each with engineering trade-offs:
- Document processing: Long documents must be chunked. Chunks that are too large may exceed the context window; chunks that are too small may lose contextual semantics. Common strategies include fixed-length, paragraph-based, semantic-boundary, or recursive chunking.
- Embedding and indexing: Use an embedding model to turn text into vectors and store them in a vector index or database (e.g., FAISS, Milvus, Pinecone). Evaluate recall: do the top-k retrieved results include the truly relevant documents?
- Query optimization: The user’s original query may be poorly expressed. HyDE (Hypothetical Document Embeddings) has the model generate a hypothetical answer first, then retrieve using that answer; query rewriting, expansion, and routing are also common.
- Reranking: Use a lightweight model to recall many candidates, then a stronger cross-encoder to precisely rerank, balancing speed and accuracy.
- Generation and citation: When concatenating retrieved results into the prompt, watch for ordering, redundancy, and conflicts. Ask the model to provide citations so users can verify.
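The stages in the list above can be strung together in a small sketch. Everything here is a simplifying assumption: `chunk_by_sentence` is one possible sentence-boundary chunker, `embed` is a bag-of-words stand-in for an embedding model, `first_stage` replaces a vector-index lookup, `rerank` imitates a cross-encoder with a simple term-coverage score, and `fake_generate` is a hypothetical stub for the LLM call that HyDE would make.

```python
import math
import re
from collections import Counter

def chunk_by_sentence(text, max_words=14):
    """Pack whole sentences into chunks of at most max_words words
    (a single longer sentence still becomes its own chunk)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

def embed(text):
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def first_stage(search_text, chunks, k):
    # Cheap recall step; a real system would query a vector index here.
    q = embed(search_text)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def rerank(query, candidates, top_n):
    # Stand-in for a cross-encoder: scores each (query, chunk) pair jointly,
    # here by the fraction of query terms that appear in the chunk.
    q_terms = set(embed(query))
    def score(chunk):
        return len(q_terms & set(embed(chunk))) / len(q_terms) if q_terms else 0.0
    return sorted(candidates, key=score, reverse=True)[:top_n]

def hyde_text(query, generate):
    # HyDE: search with a model-written hypothetical answer instead of the raw query.
    return generate(query)

def fake_generate(query):
    # Hypothetical stub in place of an LLM call.
    return "Refunds are usually issued to your payment method within a few days."

policy = (
    "Orders ship within two business days. Shipping is free above fifty dollars. "
    "Returns are accepted within thirty days of delivery. "
    "Refunds are issued to the original payment method within fourteen days. "
    "Gift cards cannot be refunded or exchanged for cash."
)
chunks = chunk_by_sentence(policy, max_words=14)
query = "How long do refunds take to be issued?"

candidates = first_stage(hyde_text(query, fake_generate), chunks, k=3)
best = rerank(query, candidates, top_n=1)
print(best[0])
```

The split between `first_stage` and `rerank` mirrors the speed and accuracy trade-off: the first pass only needs to be good enough to keep the right chunk among its `k` candidates, and the more expensive scorer runs on that short list alone.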
Evaluating RAG requires more than generation quality: you must also measure retrieval recall, answer faithfulness, and end-to-end latency. A common failure mode is “retrieved but not used,” indicating a misalignment between the retrieval and generation stages.
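Two of these checks are easy to sketch. The example below computes retrieval recall at k and flags the "retrieved but not used" case. It assumes a labeled set of relevant document ids per query and assumes answers cite sources in a `[doc_id]` format; both the ids and the citation convention are hypothetical.

```python
import re

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def retrieved_but_not_used(retrieved_ids, relevant_ids, answer):
    """Relevant documents that were retrieved but never cited in the answer."""
    cited = set(re.findall(r"\[([^\[\]]+)\]", answer))
    return sorted((set(retrieved_ids) & set(relevant_ids)) - cited)

# Hypothetical evaluation record for a single query.
retrieved = ["d3", "d7", "d1", "d9"]   # ranked retriever output
relevant = ["d1", "d4"]                # human-labeled relevant documents
answer = "The limit is 10 requests per second [d7]."

print(recall_at_k(retrieved, relevant, k=2))
print(recall_at_k(retrieved, relevant, k=4))
print(retrieved_but_not_used(retrieved, relevant, answer))
```

A citation check like this only shows whether a document was referenced, not whether the answer is faithful to it, so it complements rather than replaces a faithfulness evaluation.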
Research view: the boundary and new directions of RAG
A hot research topic is the relationship between RAG and long-context models: if a model can read an entire book, is retrieval still needed? The prevailing view is that retrieval still has advantages in precision, updatability, and computational efficiency, but the two approaches are converging: models are increasingly able to decide on their own when and what to retrieve.
Frontier directions include: adaptive retrieval (only look up when uncertain), multi-hop reasoning (tracing clues across documents), structured RAG (combining knowledge graphs, SQL databases, etc.), and end-to-end differentiable retrieval (letting the model learn how to retrieve).
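As one illustration of adaptive retrieval, a system might gate the lookup on how confident the model was in a draft answer. The sketch below is a simple heuristic, not the method of any particular paper: it assumes per-token probabilities for the draft are available, and both the function name `should_retrieve` and the threshold value are hypothetical.

```python
import math

def should_retrieve(token_probs, threshold=1.0):
    """Return True when the draft answer looks uncertain.

    token_probs: probabilities the model assigned to each token of a draft answer.
    threshold: hypothetical cutoff on average negative log-likelihood (nats per token).
    """
    if not token_probs:
        return True  # no evidence of confidence, so fall back to retrieval
    avg_nll = -sum(math.log(max(p, 1e-12)) for p in token_probs) / len(token_probs)
    return avg_nll > threshold

confident_draft = [0.9, 0.95, 0.8]   # model was sure of each token
uncertain_draft = [0.2, 0.3, 0.1]    # model was guessing

print(should_retrieve(confident_draft))
print(should_retrieve(uncertain_draft))
```

In practice the threshold would need tuning on held-out queries, since token-level confidence is only a rough proxy for whether the model actually knows the answer.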
References
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
RAG (Retrieval-Augmented Generation) combines pretrained LMs with information retrieval: for each query, retrieve relevant documents from a knowledge base, then generate answers with the documents in context. This addresses LLM knowledge staleness and hallucination, and is now a core architecture in enterprise AI applications.
- karpukhin2020-dpr
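A dual-encoder (two-tower) BERT trained with in-batch negatives became the first widely adopted dense retriever, outperforming BM25 on most of the open-domain QA benchmarks it was tested on, although BM25 remains a common baseline and hybrid component. Much of today's vector-search engineering practice (FAISS, pgvector) builds on this setup.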
- gao2022-hyde
Has the LLM "pretend" to answer first by generating a hypothetical document, then uses that document's embedding to retrieve real documents. It works zero-shot and generalizes well, making it one of the most reused retrieval-enhancement tricks of the RAG era.
- borgeaud2022-retro
DeepMind's RETRO adds chunked retrieval during pre-training; its 7.5B-parameter model performs comparably to the 175B-parameter GPT-3 on the Pile. It shows that retrieval is not just an inference-time RAG technique but also a possible pre-training paradigm.