# Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is a technique that enhances [[Large Language Models (LLMs)]] by retrieving relevant information from external data sources before generating responses. Instead of relying solely on training data, RAG models reference authoritative knowledge bases, enabling more accurate, up-to-date, and verifiable outputs.

RAG addresses key LLM limitations: hallucination, outdated knowledge, and lack of domain-specific information, all without expensive model retraining.

## How RAG Works

```
┌─────────────────────────────────────────────────────────┐
│                      RAG Pipeline                       │
│                                                         │
│  ┌─────────┐    ┌─────────────┐    ┌───────────────┐    │
│  │  User   │───►│  Retriever  │───►│   Generator   │    │
│  │  Query  │    │             │    │     (LLM)     │    │
│  └─────────┘    └──────┬──────┘    └───────┬───────┘    │
│                        │                   │            │
│                        ▼                   ▼            │
│                 ┌─────────────┐      ┌───────────┐      │
│                 │  Knowledge  │      │ Response  │      │
│                 │    Base     │      │ + Sources │      │
│                 └─────────────┘      └───────────┘      │
└─────────────────────────────────────────────────────────┘
```

1. **Query**: User submits a question
2. **Retrieve**: System searches knowledge base for relevant documents
3. **Augment**: Retrieved context is added to the prompt
4. **Generate**: LLM produces response using both its training and retrieved context

## Key Benefits

- **Reduced hallucination**: Grounded responses with verifiable sources
- **Current information**: Access to data beyond training cutoff
- **Domain specificity**: Leverage organizational knowledge bases
- **Cost efficiency**: No need to retrain or fine-tune models
- **Citability**: Models can reference sources like footnotes
- **Trust**: Users can verify claims against original documents

## RAG Architectures

### Naive RAG

Simple retrieve-then-generate pipeline:

- Embed query → Search vector store → Pass top-k results to LLM

### Advanced RAG

Improvements to address naive RAG limitations:

- **Query rewriting**: Reformulate queries for better retrieval
- **Hybrid search**: Combine semantic and keyword search
- **Re-ranking**: Score and reorder retrieved documents
- **Chunk optimization**: Better document segmentation

### Agentic RAG

LLM-assisted query planning with multi-source access:

- Agent decides what to retrieve and when
- Multiple retrieval steps based on reasoning
- Structured responses optimized for downstream use

## Core Components

| Component | Purpose | Examples |
|-----------|---------|----------|
| Embeddings | Convert text to vectors | OpenAI, Cohere, Sentence Transformers |
| Vector Store | Store and search embeddings | Pinecone, Weaviate, Chroma, pgvector |
| Retriever | Find relevant documents | Dense Passage Retrieval (DPR), BM25 |
| Generator | Produce final response | GPT-4, Claude, Llama |

## Challenges

- **Low precision**: Retrieved chunks may not align with query intent
- **Low recall**: May miss relevant documents
- **Chunk boundaries**: Important context split across chunks
- **Outdated indices**: Knowledge base may lag behind source data
- **Latency**: Retrieval adds processing time
- **Evaluation**: Hard to measure retrieval + generation quality together

## RAG Frameworks

- **[[LangChain]]**: Comprehensive RAG building blocks
- **LlamaIndex**: Specialized for RAG and data indexing
- **Haystack**: End-to-end NLP framework with RAG support
- **Anthropic Citations API**: Built-in source referencing for Claude

## Use Cases

- **Enterprise knowledge management**: Query internal documentation
- **Customer support**: Answer questions from product docs
- **Legal research**: Search case law and contracts
- **Medical information**: Reference clinical guidelines
- **Code assistance**: Retrieve relevant code examples

## References

- https://en.wikipedia.org/wiki/Retrieval-augmented_generation
- https://aws.amazon.com/what-is/retrieval-augmented-generation/
- https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/

## Related

- [[Large Language Models (LLMs)]]
- [[LangChain]]
- [[LangGraph]]
- [[AI Agents]]
- [[Database]]
- [[SQL]]
- [[PostgreSQL]]
- [[Vector Store]]
- [[RAG Pipelines]]
- [[Embeddings]]
- [[LLM Wiki]]
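
## Example: Minimal Naive RAG Sketch

The query, retrieve, and augment steps from "How RAG Works" can be illustrated with a small standard-library sketch. It is not a production design: the bag-of-words `embed` function stands in for a real embedding model, the `KNOWLEDGE_BASE` list stands in for a vector store, and all names here (`embed`, `cosine`, `retrieve`, `build_prompt`) are hypothetical helpers invented for illustration. The generate step is left out; the resulting prompt would be passed to whichever LLM is in use.

```python
import math
import re
from collections import Counter

# Hypothetical in-memory knowledge base; a real system would use a vector store.
KNOWLEDGE_BASE = [
    "RAG retrieves documents from a knowledge base before generating a response.",
    "BM25 is a keyword-based ranking function used in search engines.",
    "Vector stores index embeddings for similarity search.",
]


def embed(text: str) -> Counter:
    """Toy embedding: term counts (an assumption standing in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Retrieve step: rank documents by similarity to the query, keep the top k."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Augment step: prepend numbered sources so the model can cite them."""
    numbered = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(context, start=1))
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"Sources:\n{numbered}\n\nQuestion: {query}"
    )


query = "How do vector stores search embeddings?"
top_docs = retrieve(query, KNOWLEDGE_BASE, k=2)
prompt = build_prompt(query, top_docs)
print(prompt)  # In a full pipeline this prompt would be sent to the LLM.
```

Numbering the sources in the prompt is one simple way to support the citability benefit listed earlier, since the model can refer back to `[1]`, `[2]`, and so on. Swapping `embed` for a real embedding model and `KNOWLEDGE_BASE` for a vector store would leave the overall shape of the pipeline unchanged.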