# Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) is a technique that enhances [[Large Language Models (LLMs)]] by retrieving relevant information from external data sources before generating responses. Instead of relying solely on training data, a RAG system consults authoritative knowledge bases at query time, enabling more accurate, up-to-date, and verifiable outputs.
RAG addresses key LLM limitations, including hallucination, outdated knowledge, and missing domain-specific information, without expensive model retraining.
## How RAG Works
```
┌─────────────────────────────────────────────────────────┐
│                      RAG Pipeline                       │
│                                                         │
│  ┌─────────┐    ┌─────────────┐    ┌───────────────┐    │
│  │  User   │───►│  Retriever  │───►│   Generator   │    │
│  │  Query  │    │             │    │     (LLM)     │    │
│  └─────────┘    └──────┬──────┘    └───────┬───────┘    │
│                        │                   │            │
│                        ▼                   ▼            │
│                 ┌─────────────┐      ┌───────────┐      │
│                 │  Knowledge  │      │ Response  │      │
│                 │    Base     │      │ + Sources │      │
│                 └─────────────┘      └───────────┘      │
└─────────────────────────────────────────────────────────┘
```
1. **Query**: User submits a question
2. **Retrieve**: System searches knowledge base for relevant documents
3. **Augment**: Retrieved context is added to the prompt
4. **Generate**: LLM produces a response using both its trained knowledge and the retrieved context
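
The four steps can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: `KNOWLEDGE_BASE`, `retrieve`, `augment`, `generate`, and `answer` are hypothetical names, retrieval is a toy word-overlap score, and `generate` is a stub standing in for a real LLM call.

```python
import re

# Hypothetical in-memory knowledge base; a real system would use a document store.
KNOWLEDGE_BASE = [
    {"id": "doc1", "text": "RAG retrieves documents before the model generates an answer."},
    {"id": "doc2", "text": "Vector stores index embeddings for similarity search."},
    {"id": "doc3", "text": "Bananas are rich in potassium."},
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, k: int = 2) -> list[dict]:
    """Step 2: rank documents by shared words with the query (toy scoring)."""
    query_tokens = tokenize(query)
    scored = [(len(query_tokens & tokenize(doc["text"])), doc) for doc in KNOWLEDGE_BASE]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep at most k documents and drop those with no overlap at all.
    return [doc for score, doc in scored[:k] if score > 0]


def augment(query: str, docs: list[dict]) -> str:
    """Step 3: put the retrieved context into the prompt ahead of the question."""
    context = "\n".join(f"[{doc['id']}] {doc['text']}" for doc in docs)
    return (
        "Answer using only the context below and cite source ids.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


def generate(prompt: str) -> str:
    """Step 4: stand-in for an LLM call; it only reports the prompt size."""
    return f"(stub LLM response to a {len(prompt)}-character prompt)"


def answer(query: str) -> dict:
    """Step 1 onward: take the user query and return a response plus its sources."""
    docs = retrieve(query)
    prompt = augment(query, docs)
    return {"response": generate(prompt), "sources": [doc["id"] for doc in docs]}


result = answer("How does RAG generate an answer?")
```

Returning the source ids alongside the response is what lets a caller show citations next to the generated text.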
## Key Benefits
- **Reduced hallucination**: Grounded responses with verifiable sources
- **Current information**: Access to data beyond training cutoff
- **Domain specificity**: Leverage organizational knowledge bases
- **Cost efficiency**: Knowledge can be updated without retraining or fine-tuning the model
- **Citability**: Models can reference sources like footnotes
- **Trust**: Users can verify claims against original documents
## RAG Architectures
### Naive RAG
Simple retrieve-then-generate pipeline:
- Embed query → Search vector store → Pass top-k results to LLM
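
The embed, search, and top-k steps might look like the sketch below. It assumes a toy word-count "embedding" and a brute-force in-memory store (`embed`, `InMemoryVectorStore`, and the sample texts are all made up for illustration); a real pipeline would call a learned embedding model and a vector database instead.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a sparse word-count vector.

    Assumption for illustration only; real systems use a learned model
    that returns dense vectors.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryVectorStore:
    """Hypothetical minimal vector store: a list scanned by brute force."""

    def __init__(self) -> None:
        self._items: list[tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        self._items.append((text, embed(text)))

    def search(self, query: str, k: int = 2) -> list[tuple[float, str]]:
        """Return the k (score, text) pairs most similar to the query."""
        query_vec = embed(query)
        scored = [(cosine_similarity(query_vec, vec), text) for text, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


store = InMemoryVectorStore()
store.add("pgvector adds vector similarity search to PostgreSQL")
store.add("BM25 is a keyword ranking function")
store.add("Llama is a family of language models")

top_k = store.search("vector search in PostgreSQL", k=2)
# The top-k texts are what gets passed to the LLM as context.
prompt_context = "\n".join(text for _, text in top_k)
```

Note that naive top-k always returns k results, even when some have a similarity of zero, which is one of the weaknesses the more advanced variants try to address.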
### Advanced RAG
Improvements to address naive RAG limitations:
- **Query rewriting**: Reformulate queries for better retrieval
- **Hybrid search**: Combine semantic and keyword search
- **Re-ranking**: Score and reorder retrieved documents
- **Chunk optimization**: Better document segmentation
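
As one illustration of hybrid search combined with re-ranking, the sketch below merges a semantic ranking and a keyword ranking with Reciprocal Rank Fusion (RRF). RRF is one common fusion method, not necessarily what any given framework uses; the document ids and the two input rankings are made-up examples.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default, not a tuned value.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical outputs of two retrievers for the same query.
semantic_ranking = ["doc_a", "doc_b", "doc_c"]
keyword_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([semantic_ranking, keyword_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
```

Because RRF uses only rank positions, it avoids having to normalize similarity scores and keyword scores onto a shared scale.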
### Agentic RAG
LLM-assisted query planning with multi-source access:
- Agent decides what to retrieve and when
- Multiple retrieval steps based on reasoning
- Structured responses optimized for downstream use
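
A rough sketch of the agentic loop follows. Everything here is hypothetical: `SOURCES` holds made-up snippets, and `plan_next_source` is a rule-based stand-in for what would normally be an LLM planning call. The point is only the control flow: decide, retrieve, reassess, and return a structured result.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical sources with made-up content; each maps a keyword to a snippet.
SOURCES = {
    "docs": {"refund": "Refunds are processed within 5 business days."},
    "tickets": {"refund": "A customer asked about refund status last week."},
}


@dataclass
class AgentState:
    question: str
    evidence: list = field(default_factory=list)
    visited: set = field(default_factory=set)


def plan_next_source(state: AgentState, enough: int = 2) -> Optional[str]:
    """Stand-in for an LLM planning step.

    Stops once `enough` evidence items are collected; otherwise picks the
    next source that has not been queried yet.
    """
    if len(state.evidence) >= enough:
        return None
    for name in SOURCES:
        if name not in state.visited:
            return name
    return None


def search_source(name: str, question: str) -> list:
    """Toy lookup: return snippets whose keyword appears in the question."""
    lowered = question.lower()
    return [snippet for keyword, snippet in SOURCES[name].items() if keyword in lowered]


def run_agent(question: str, max_steps: int = 5) -> dict:
    """Run retrieval steps until the planner stops or the step budget runs out."""
    state = AgentState(question)
    for _ in range(max_steps):
        source = plan_next_source(state)
        if source is None:
            break
        state.visited.add(source)
        for snippet in search_source(source, question):
            state.evidence.append({"source": source, "text": snippet})
    # Structured output that downstream code can consume directly.
    return {"question": question, "evidence": state.evidence, "steps": len(state.visited)}


result = run_agent("What is the refund policy?")
```

The `max_steps` budget is a design choice worth keeping in any real version, since an LLM-driven planner can otherwise loop indefinitely.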
## Core Components
| Component | Purpose | Examples |
|-----------|---------|----------|
| Embeddings | Convert text to vectors | OpenAI, Cohere, Sentence Transformers |
| Vector Store | Store and search embeddings | Pinecone, Weaviate, Chroma, pgvector |
| Retriever | Find relevant documents | Dense Passage Retrieval (DPR), BM25 |
| Generator | Produce final response | GPT-4, Claude, Llama |
## Challenges
- **Low precision**: Retrieved chunks may not align with query intent
- **Low recall**: Retrieval may miss relevant documents
- **Chunk boundaries**: Important context split across chunks
- **Outdated indices**: Knowledge base may lag behind source data
- **Latency**: Retrieval adds processing time
- **Evaluation**: Hard to measure retrieval + generation quality together
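
The precision and recall problems above can be quantified on the retrieval side with precision@k and recall@k, given a hand-labeled set of relevant documents. The sketch below uses made-up ids and covers retrieval only; judging generation quality needs separate methods.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved ids that are relevant.

    Divides by the number of ids actually returned (at most k), which is one
    common convention; some definitions always divide by k.
    """
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


# Hypothetical retriever output and hand-labeled relevant set for one query.
retrieved_ids = ["doc_a", "doc_x", "doc_b", "doc_y"]
relevant_ids = {"doc_a", "doc_b", "doc_c"}

p_at_3 = precision_at_k(retrieved_ids, relevant_ids, k=3)  # 2 of the top 3 are relevant
r_at_3 = recall_at_k(retrieved_ids, relevant_ids, k=3)     # 2 of the 3 relevant ids were found
```

Tracking both numbers matters because raising k usually improves recall while lowering precision.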
## RAG Frameworks
- **[[LangChain]]**: Comprehensive RAG building blocks
- **LlamaIndex**: Specialized for RAG and data indexing
- **Haystack**: End-to-end NLP framework with RAG support
- **Anthropic Citations**: Built-in source referencing for Claude (an API feature rather than a full RAG framework)
## Use Cases
- **Enterprise knowledge management**: Query internal documentation
- **Customer support**: Answer questions from product docs
- **Legal research**: Search case law and contracts
- **Medical information**: Reference clinical guidelines
- **Code assistance**: Retrieve relevant code examples
## References
- https://en.wikipedia.org/wiki/Retrieval-augmented_generation
- https://aws.amazon.com/what-is/retrieval-augmented-generation/
- https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/
## Related
- [[Large Language Models (LLMs)]]
- [[LangChain]]
- [[LangGraph]]
- [[AI Agents]]
- [[Database]]
- [[SQL]]
- [[PostgreSQL]]
- [[Vector Store]]
- [[RAG Pipelines]]
- [[Embeddings]]
- [[LLM Wiki]]