# Transformers

The Transformer is a [[Deep Learning]] architecture introduced in the 2017 paper "Attention Is All You Need" by researchers at Google ([[Ashish Vaswani]] et al.). Unlike previous sequence models (RNNs, LSTMs), Transformers process entire sequences in parallel using self-attention mechanisms, enabling massive scaling and dramatically better performance on [[Natural Language Processing (NLP)]] tasks. This architecture underlies most modern large language models, including GPT, BERT, and Claude, as well as many vision models.

The key innovation is **self-attention**: each element in a sequence can attend to all other elements, learning which parts are relevant for the current task. Because any two positions are connected by a single attention step, this eases the long-range dependency problem that limited RNNs, and it enables efficient GPU parallelization. Transformers have become the foundation of [[Large Language Models (LLMs)]] and [[Generative AI (Gen AI)]], with models scaling from millions to trillions of parameters.

## Architecture Overview

```
Input: "The cat sat on the mat"
         ↓
[Input Embedding + Positional Encoding]
         ↓
┌──────────────────────────────────┐
│          Encoder (×N)            │
│  ┌────────────────────────────┐  │
│  │   Multi-Head Attention     │  │
│  │           ↓                │  │
│  │   Add & Normalize          │  │
│  │           ↓                │  │
│  │   Feed Forward             │  │
│  │           ↓                │  │
│  │   Add & Normalize          │  │
│  └────────────────────────────┘  │
└──────────────────────────────────┘
         ↓
┌──────────────────────────────────┐
│          Decoder (×N)            │
│  ┌────────────────────────────┐  │
│  │   Masked Multi-Head Attn   │  │
│  │           ↓                │  │
│  │   Cross-Attention          │  │
│  │           ↓                │  │
│  │   Feed Forward             │  │
│  └────────────────────────────┘  │
└──────────────────────────────────┘
         ↓
[Linear + Softmax]
         ↓
Output Probabilities
```

The diagram is simplified: the encoder output feeds the decoder's cross-attention layers, and each decoder sublayer is also followed by an Add & Normalize step.

## Self-Attention Mechanism

Self-attention computes relationships between all positions in a sequence:

```
For each token:
  Query (Q) = What am I looking for?
  Key (K)   = What do I contain?
  Value (V) = What information do I provide?

Attention(Q, K, V) = softmax(QK^T / √d_k) × V
```

| Component | Purpose |
|-----------|---------|
| **Query** | The current token's "question" |
| **Key** | Each token's searchable identifier |
| **Value** | The actual information to retrieve |
| **Scaling (√d_k)** | Keeps dot products from growing with d_k and saturating the softmax |
| **Softmax** | Normalizes attention weights so they sum to 1 |

## Multi-Head Attention

Run multiple attention operations in parallel:

```
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) × W_O

Each head can learn different relationships, for example:
- Head 1: Syntactic relationships
- Head 2: Semantic relationships
- Head 3: Positional patterns
- ...
```

## Key Components

| Component | Function |
|-----------|----------|
| **Embedding** | Convert tokens to vectors |
| **Positional Encoding** | Inject sequence position information |
| **Multi-Head Attention** | Learn multiple relationship types |
| **Feed-Forward Network** | Process attended representations |
| **Layer Normalization** | Stabilize training |
| **Residual Connections** | Enable deep networks |

## Transformer Variants

| Model | Type | Year | Innovation |
|-------|------|------|------------|
| **BERT** | Encoder-only | 2018 | Bidirectional pretraining |
| **GPT** | Decoder-only | 2018 | Autoregressive generation |
| **T5** | Encoder-decoder | 2019 | Text-to-text framework |
| **ViT** | Vision | 2020 | Images as patch sequences |
| **CLIP** | Multimodal | 2021 | Image-text alignment |
| **LLaMA** | Decoder-only | 2023 | Efficient open-weight LLM |

## Why Transformers Succeeded

| Advantage | Compared to RNNs |
|-----------|------------------|
| **Parallelization** | All positions computed simultaneously during training |
| **Long-range deps** | Direct attention to any position |
| **Scalability** | Scales to billions of parameters and beyond |
| **Transfer learning** | Pretrain once, fine-tune for tasks |

## Transformers Beyond NLP

| Domain | Application |
|--------|-------------|
| **Vision** | ViT, DINO, Swin Transformer |
| **Audio** | Whisper, AudioLM |
| **Video** | VideoMAE, Sora |
| **Proteins** | AlphaFold 2 |
| **Robotics** | RT-2, Gato |
| **Code** | Codex, StarCoder |

## References

- Vaswani et al. (2017). "Attention Is All You Need"
- https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)
- https://jalammar.github.io/illustrated-transformer/

## Related

- [[Deep Learning]]
- [[Natural Language Processing (NLP)]]
- [[Large Language Models (LLMs)]]
- [[BERT]]
- [[GPT]]
- [[Attention Mechanism]]
- [[Neural Networks (NNs)]]
- [[Generative AI (Gen AI)]]
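
## Worked Example: Scaled Dot-Product Attention

The formula from the Self-Attention Mechanism section, `softmax(QK^T / √d_k) × V`, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a production implementation: the function names `softmax` and `scaled_dot_product_attention` and the toy `tokens` values are made up for this example, and the learned projection matrices that normally produce Q, K, and V are omitted, so the same vectors serve as all three.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors.

    Returns (outputs, weights): one output vector and one row of
    attention weights per query.
    """
    d_k = len(keys[0])
    scale = math.sqrt(d_k)
    outputs, weights = [], []
    for q in queries:
        # Dot product of this query with every key, divided by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        row = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(row, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        weights.append(row)
    return outputs, weights


# Toy input: 3 tokens with d_k = 2 (hypothetical numbers, not learned embeddings).
# Assumption: no learned projections, so Q = K = V = tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)

for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, and the first token attends equally to itself and to the third token because both keys have the same dot product with its query. Real implementations replace the Python loops with batched matrix operations so that all positions are computed at once.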