Embeddings - DeveloPassion

# Embeddings

Embeddings are dense numerical representations (vectors) that capture the semantic meaning of data: text, images, audio, or other content. They map high-dimensional data into a continuous vector space where similar items are positioned close together.

Embeddings are foundational to modern AI applications, enabling [[Semantic search]], [[Retrieval-Augmented Generation (RAG)]], recommendations, and more.

## How Embeddings Work

```
"The cat sat on the mat" → [0.23, -0.45, 0.12, 0.89, ...] (384-1536 dimensions)

Similar meaning = Similar vectors

"The kitten rested on the rug" → [0.21, -0.42, 0.15, 0.87, ...] (nearby in vector space)
```

Neural networks learn to map inputs to vectors where:

- **Distance** reflects semantic similarity
- **Direction** can encode relationships (e.g., king - man + woman ≈ queen)

## Types of Embeddings

| Type | Input | Use Cases |
|------|-------|-----------|
| Text | Words, sentences, documents | Search, RAG, classification |
| Image | Photos, diagrams | Visual search, multimodal AI |
| Audio | Speech, music | Voice search, transcription |
| Code | Source code | Code search, similarity |
| Multimodal | Text + images | Cross-modal retrieval |

## Popular Embedding Models

### Text Embeddings

- **OpenAI**: `text-embedding-3-small`, `text-embedding-3-large`
- **Cohere**: `embed-english-v3.0`, `embed-multilingual-v3.0`
- **Sentence Transformers**: Open-source models (SBERT)
- **Voyage AI**: Specialized domain embeddings

### Dimensions

- Smaller (384-512): Faster, less storage
- Larger (1024-1536): More semantic nuance
- Trade-off: precision vs. performance/cost

## Similarity Metrics

| Metric | Measures | Best For |
|--------|----------|----------|
| Cosine | Angle between vectors | Normalized text |
| Euclidean | Straight-line distance | General purpose |
| Dot Product | Magnitude × alignment | When length matters |

## Common Applications

- **[[Semantic search]]**: Find content by meaning, not keywords
- **[[Retrieval-Augmented Generation (RAG)]]**: Retrieve relevant context for LLMs
- **Recommendations**: Find similar items
- **Clustering**: Group related content
- **Anomaly detection**: Identify outliers
- **Deduplication**: Find near-duplicates

## Embedding Pipeline

```
Raw Data → Chunking → Embedding Model → Vectors → [[Vector Store]]
                                                        ↓
Query → Embedding Model → Query Vector → Similarity Search → Results
```

## Best Practices

- **Chunk wisely**: Balance context vs. precision
- **Match models**: Use the same model for indexing and queries
- **Normalize vectors**: With unit-length vectors, the dot product equals cosine similarity
- **Cache embeddings**: Avoid recomputing
- **Monitor drift**: Models and data change over time

## References

- https://platform.openai.com/docs/guides/embeddings
- https://www.sbert.net/

## Related

- [[Vector Store]]
- [[Retrieval-Augmented Generation (RAG)]]
- [[RAG Pipelines]]
- [[Large Language Models (LLMs)]]
- [[LangChain]]
- [[Semantic Search]]
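
## Example: Comparing Vectors

The similarity metrics and the query side of the embedding pipeline described in this note can be sketched in plain Python. This is a minimal illustration, not a production implementation. The 4-dimensional vectors are toy values: `cat` and `kitten` reuse the illustrative numbers from "How Embeddings Work", while `invoice`, the `store` dictionary, and the `search` helper are hypothetical names invented for this example. A real system would get its vectors from an embedding model and keep them in a [[Vector Store]].

```python
import math


def dot(a, b):
    """Dot product: grows with both vector length and alignment."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, in [-1, 1]."""
    return dot(a, b) / (norm(a) * norm(b))


def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy vectors (assumed values, far shorter than real embeddings).
cat = [0.23, -0.45, 0.12, 0.89]
kitten = [0.21, -0.42, 0.15, 0.87]
invoice = [-0.70, 0.10, 0.65, -0.20]  # hypothetical unrelated item

# Hypothetical in-memory stand-in for a vector store.
store = {"kitten": kitten, "invoice": invoice}


def search(query, store, top_k=1):
    """Rank stored vectors by cosine similarity to the query.

    Both sides are normalized first, so the dot product used for
    scoring is the cosine similarity.
    """
    q = normalize(query)
    scored = [(dot(q, normalize(v)), name) for name, v in store.items()]
    return sorted(scored, reverse=True)[:top_k]


if __name__ == "__main__":
    print("cosine(cat, kitten):   ", cosine_similarity(cat, kitten))
    print("cosine(cat, invoice):  ", cosine_similarity(cat, invoice))
    print("euclidean(cat, kitten):", euclidean_distance(cat, kitten))
    print("best match for cat:    ", search(cat, store))
```

With these toy values, `cat` and `kitten` score close to 1 on cosine similarity and sit a short Euclidean distance apart, while `invoice` points in a roughly opposite direction. Normalizing once at indexing time and once per query lets the search use a plain dot product for scoring.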