# AI Attention

AI Attention is a mechanism in [[Neural Networks (NNs)]] that enables models to dynamically focus on relevant parts of input data when processing information. Inspired by human cognitive [[Attention types|attention]], it assigns different weights to different parts of the input, allowing the model to "pay attention" to what matters most for the current task.

The attention mechanism revolutionized [[Deep Learning]], particularly in natural language processing and computer vision, by solving the bottleneck problem of fixed-length representations in sequence-to-sequence models.

## How Attention Works

### Basic Mechanism

At its core, attention computes a weighted sum of values based on their relevance:

1. **Query (Q)**: What we're looking for
2. **Key (K)**: What we're comparing against
3. **Value (V)**: What we retrieve

The attention score determines how much focus to place on each part of the input:

```
Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V
```

Where `d_k` is the dimension of the key vectors. Dividing by `√d_k` keeps the dot products from growing with the dimension and saturating the softmax.

### Self-Attention

In self-attention, all queries, keys, and values come from the same input sequence. This allows each position to attend to all positions in the sequence, capturing relationships between words regardless of their distance.

Example: In the sentence "The animal didn't cross the street because it was too tired", self-attention helps the model understand that "it" refers to "animal", not "street".

### Multi-Head Attention

Used in [[Transformers]], multi-head attention runs multiple attention mechanisms in parallel, allowing the model to focus on different representation subspaces simultaneously:

- **Head 1** might focus on syntactic relationships
- **Head 2** might focus on semantic relationships
- **Head 3** might focus on positional relationships

The outputs of all heads are concatenated and linearly transformed.

## Types of Attention

### Soft Attention

Assigns continuous weights to all input elements. Differentiable and trainable via backpropagation. Used in most modern architectures.

### Hard Attention

Selects specific positions to focus on (discrete choice). Non-differentiable; requires reinforcement learning or sampling methods.

### Global vs Local Attention

- **Global**: Attends to all source positions
- **Local**: Attends to a subset of positions (window-based)

### Cross-Attention

Queries come from one sequence, keys and values from another. Essential for encoder-decoder architectures in translation and generation tasks.

## Applications

### Natural Language Processing

- [[Large Language Models (LLMs)]] like GPT, BERT, and Claude use attention as their core mechanism
- Machine translation
- Text summarization
- Question answering

### Computer Vision

- Vision Transformers (ViT)
- Object detection
- Image captioning
- Visual question answering

### Multimodal Systems

- Text-to-image generation (DALL-E, Stable Diffusion)
- Image-to-text (vision-language models)
- Video understanding

## The Transformer Architecture

[[Transformers]] rely on attention rather than recurrence to relate positions in a sequence:

**Encoder**: Processes input using self-attention and feed-forward layers

**Decoder**: Generates output using masked self-attention, cross-attention, and feed-forward layers

Key components:

- **Positional encoding**: Needed because attention has no inherent notion of order
- **Masked attention**: Prevents the decoder from seeing future tokens
- **Layer normalization**: Stabilizes training
- **Residual connections**: Enable deep architectures

## Advantages

1. **Parallelization**: Unlike RNNs, attention can process entire sequences simultaneously
2. **Long-range dependencies**: Directly connects distant positions without information decay
3. **Interpretability**: Attention weights give a partial view of what the model focuses on
4. **Flexibility**: Works across modalities (text, images, audio)

## Limitations

1. **Computational cost**: O(n²) time for sequence length n
2. **Memory requirements**: Storing attention matrices for long sequences
3. **Context window limits**: [[AI context is finite with diminishing returns]]
4. **Hallucination**: Can attend to irrelevant context or to spurious patterns learned from training data

## Efficiency Improvements

To address quadratic complexity:

- **[[AI Sparse Attention|Sparse attention]]**: Only compute interactions for a subset of token pairs (Longformer, BigBird, DeepSeek Sparse Attention)
- **Linear attention**: Approximate attention with linear complexity
- **Flash Attention**: Memory-efficient implementation of exact attention
- **Chunked attention**: Process inputs in segments

## Attention vs Human Attention

While inspired by human [[Attention types|attention]], AI attention differs:

- Human attention is selective and resource-constrained by nature
- AI attention is computed exhaustively (soft attention) or sampled (hard attention)
- Human attention involves conscious and unconscious processes
- AI attention is purely mathematical: weighted averaging based on learned similarity

However, both serve the same purpose: **allocating limited processing resources to relevant information**.

## Impact on AI

Attention mechanisms are central to the current AI revolution:

- Enabled [[Large Language Models (LLMs)]] to achieve unprecedented language understanding
- Foundation of modern [[AI Agents]] and AI assistants
- Core building block in [[AI Major Techniques]]
- Powers most state-of-the-art results in NLP and vision

The attention mechanism transformed AI from processing fixed representations to dynamically focusing on relevant context, mirroring a fundamental aspect of human cognition.

## References

- Bahdanau et al. (2014): "Neural Machine Translation by Jointly Learning to Align and Translate"
- Vaswani et al. (2017): "Attention Is All You Need" (introduced Transformers)
  - https://arxiv.org/abs/1706.03762

## Related

- [[AI (MoC)]]
- [[Attention (MoC)]]
- [[Transformers]]
- [[Large Language Models (LLMs)]]
- [[Neural Networks (NNs)]]
- [[Deep Learning]]
- [[AI Major Techniques]]
- [[Attention types]]
- [[AI context is finite with diminishing returns]]
- [[AI Agents]]
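
## Example: Scaled Dot-Product Attention in Code

As a minimal sketch of the formula `softmax(Q·K^T / √d_k) · V` from the Basic Mechanism section, the pure-Python version below computes attention weights and outputs for small lists of vectors. The function names and the `causal` flag are illustrative assumptions for this note, not the API of any library; real implementations use batched tensor operations and learned projection matrices for Q, K, and V, which are omitted here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V, causal=False):
    """Return (outputs, weights) for softmax(Q·K^T / sqrt(d_k))·V.

    Q, K, V are lists of row vectors (lists of floats).
    `causal` is a hypothetical convenience flag for this sketch:
    when True, position i may only attend to positions j <= i,
    which mimics the masked attention used in a decoder.
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Masked positions get -inf, so their softmax weight is exactly 0.
            scores = [s if j <= i else -math.inf for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors, one output column at a time.
        outputs.append(
            [sum(w_j * v[c] for w_j, v in zip(w, V)) for c in range(len(V[0]))]
        )
    return outputs, weights


# Self-attention on a toy sequence: Q, K, and V all come from the same input.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w = scaled_dot_product_attention(X, X, X, causal=True)
# w[0] is [1.0, 0.0, 0.0]: under the causal mask the first position sees only itself.
```

Passing the same list as `Q`, `K`, and `V` gives self-attention; passing queries from one sequence and keys and values from another gives cross-attention. The explicit `n × n` loop over queries and keys is also where the O(n²) cost noted under Limitations comes from.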