# Embeddings
Embeddings are dense numerical representations (vectors) that capture the semantic meaning of data such as text, images, audio, or other content. They map high-dimensional data into a continuous vector space where similar items are positioned close together.
Embeddings are foundational to modern AI applications, enabling [[Semantic search]], [[Retrieval-Augmented Generation (RAG)]], recommendations, and more.
## How Embeddings Work
```
"The cat sat on the mat" → [0.23, -0.45, 0.12, 0.89, ...]
(typically 384-3072 dimensions)
Similar meaning = Similar vectors
"The kitten rested on the rug" → [0.21, -0.42, 0.15, 0.87, ...]
(nearby in vector space)
```
Neural networks learn to map inputs to vectors where:
- **Distance** reflects semantic similarity
- **Direction** can encode relationships (e.g., king - man + woman ≈ queen)
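The direction idea can be illustrated with a toy sketch. The 2-D vectors below are hand-made for illustration (they do not come from any real model), and `analogy` is a hypothetical helper name; real embeddings have hundreds of dimensions and the analogy only holds approximately.

```python
import math

# Hand-made toy vectors, NOT real model output.
# Axis 0 loosely stands for "royalty", axis 1 for "gender".
toy_vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.0],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))

result = analogy("king", "man", "woman", toy_vectors)
print(result)
```

With these toy values, `king - man + woman` lands on the `queen` vector, so `queen` is the nearest remaining word.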
## Types of Embeddings
| Type | Input | Use Cases |
|------|-------|-----------|
| Text | Words, sentences, documents | Search, RAG, classification |
| Image | Photos, diagrams | Visual search, multimodal AI |
| Audio | Speech, music | Voice search, transcription |
| Code | Source code | Code search, similarity |
| Multimodal | Text + images | Cross-modal retrieval |
## Popular Embedding Models
### Text Embeddings
- **OpenAI**: `text-embedding-3-small`, `text-embedding-3-large`
- **Cohere**: `embed-english-v3.0`, `embed-multilingual-v3.0`
- **Sentence Transformers**: Open-source models (SBERT)
- **Voyage AI**: Specialized domain embeddings
### Dimensions
- Smaller (384-512): Faster search, less storage
- Larger (1024-3072): More semantic nuance, at higher storage and compute cost
- Trade-off: retrieval quality vs. speed and cost
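The storage side of this trade-off is simple arithmetic. A rough sketch, assuming vectors are stored as 32-bit floats (4 bytes per value) and ignoring any index overhead; `index_size_bytes` is a hypothetical helper and the corpus size of one million vectors is only an example.

```python
def index_size_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw storage for the vectors alone (assumes float32, no index overhead)."""
    return num_vectors * dimensions * bytes_per_value

# Compare a few dimension counts for one million vectors.
for dims in (384, 1024, 1536):
    size_gb = index_size_bytes(1_000_000, dims) / 1e9
    print(f"{dims} dims: {size_gb:.2f} GB")
```

Under these assumptions, moving from 384 to 1536 dimensions quadruples raw storage, and search cost generally grows with dimension as well.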
## Similarity Metrics
| Metric | Measures | Best For |
|--------|----------|----------|
| Cosine | Angle between vectors (ignores length) | Text similarity where vector length should be ignored |
| Euclidean | Straight-line distance | General purpose |
| Dot Product | Magnitude × alignment | When vector length carries signal |
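The three metrics can be written out in a few lines. This is a minimal plain-Python sketch for illustration; production systems usually rely on a vector library or the vector store's built-in metrics instead.

```python
import math

def dot_product(a, b):
    """Sum of element-wise products: grows with both alignment and length."""
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    """Straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Dot product divided by both lengths: depends only on the angle."""
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))

a = [1.0, 2.0, 2.0]
b = [2.0, 4.0, 4.0]  # same direction as a, twice the length

print(cosine_similarity(a, b))   # direction only
print(dot_product(a, b))         # affected by length
print(euclidean_distance(a, b))  # non-zero even though directions match
```

Here `a` and `b` point the same way, so cosine similarity treats them as identical while Euclidean distance and dot product still register the difference in length.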
## Common Applications
- **[[Semantic search]]**: Find content by meaning, not keywords
- **[[Retrieval-Augmented Generation (RAG)]]**: Retrieve relevant context for LLMs
- **Recommendations**: Find similar items
- **Clustering**: Group related content
- **Anomaly detection**: Identify outliers
- **Deduplication**: Find near-duplicates
## Embedding Pipeline
```
Indexing: Raw Data → Chunking → Embedding Model → Vectors → Vector Store
                                                              ↓
Querying: Query → Embedding Model → Query Vector → Similarity Search → Results
```
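The two paths in the diagram can be sketched end to end. Everything here is an assumption for illustration: `toy_embed` is a hypothetical stand-in for a real embedding model (a hashed bag-of-words that captures word overlap only, not meaning), `chunk` splits on a fixed word count, and the "vector store" is a plain in-memory list.

```python
import hashlib
import math

DIMENSIONS = 64  # arbitrary size for the toy embedding

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: hashed bag-of-words, unit length."""
    vector = [0.0] * DIMENSIONS
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % DIMENSIONS
        vector[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]

def chunk(text, max_words=8):
    """Naive chunking: fixed-size windows of words, no overlap."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def build_index(documents):
    """Indexing path: raw data -> chunks -> vectors -> in-memory store."""
    return [(piece, toy_embed(piece)) for doc in documents for piece in chunk(doc)]

def search(index, query, top_k=1):
    """Query path: embed the query with the same function, rank chunks by dot product."""
    query_vector = toy_embed(query)
    scored = [(sum(q * v for q, v in zip(query_vector, vec)), piece) for piece, vec in index]
    scored.sort(reverse=True)
    return [piece for _, piece in scored[:top_k]]

documents = [
    "the cat sat on the mat",
    "stock prices fell sharply today",
]
index = build_index(documents)
results = search(index, "cat on the mat")
print(results)
```

Because the vectors are unit length, the dot product used for ranking is equivalent to cosine similarity. Swapping `toy_embed` for a real model and the list for a vector store keeps the same overall shape.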
## Best Practices
- **Chunk wisely**: Balance context vs. precision
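- **Match models**: Use the same model for indexing and queries; vectors from different models are not comparable
- **Normalize vectors**: On unit-length vectors, dot product equals cosine similarity, so search can use the cheaper dot product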
- **Cache embeddings**: Avoid recomputing
- **Monitor drift**: Models and data change over time
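Two of these practices, normalizing and caching, can be combined in a small wrapper. This is a sketch under stated assumptions: `EmbeddingCache` and `fake_embed` are hypothetical names, `fake_embed` is a placeholder for a real model call, and the cache lives in memory only. Keying on the model name means a model change does not silently reuse stale vectors.

```python
import hashlib
import math

def normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else list(vector)

class EmbeddingCache:
    """Caches normalized vectors by (model name, text hash)."""

    def __init__(self, embed_fn, model_name):
        self.embed_fn = embed_fn
        self.model_name = model_name
        self.store = {}
        self.misses = 0  # number of times the model was actually called

    def get(self, text):
        key = (self.model_name, hashlib.sha256(text.encode("utf-8")).hexdigest())
        if key not in self.store:
            self.misses += 1
            self.store[key] = normalize(self.embed_fn(text))
        return self.store[key]

def fake_embed(text):
    # Hypothetical placeholder for a real embedding model call.
    return [float(len(text)), float(text.count(" ") + 1)]

cache = EmbeddingCache(fake_embed, model_name="fake-model-v1")
first = cache.get("hello world")
second = cache.get("hello world")  # served from the cache, no second model call
print(cache.misses)
```

A persistent cache (for example a key-value store on disk) would follow the same pattern with a different `store`.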
## References
- https://platform.openai.com/docs/guides/embeddings
- https://www.sbert.net/
## Related
- [[Vector Store]]
- [[Retrieval-Augmented Generation (RAG)]]
- [[RAG Pipelines]]
- [[Large Language Models (LLMs)]]
- [[LangChain]]
- [[Semantic search]]