# Transformers
The Transformer is a [[Deep Learning]] architecture introduced in the 2017 paper "Attention Is All You Need" by researchers at Google ([[Ashish Vaswani]] et al.). Unlike earlier sequence models (RNNs, LSTMs), which consume tokens one at a time, Transformers process all positions of a sequence in parallel using self-attention. This made much larger models and datasets practical and led to markedly better results on [[Natural Language Processing (NLP)]] tasks. The architecture underlies most modern large language models, including GPT, BERT, and Claude, as well as many vision models.
The core mechanism is **self-attention**: each element in a sequence can attend to every other element, learning which parts are relevant to it. Because any two positions are connected in a single step, this eases the long-range dependency problem that plagued RNNs; and because no position has to wait for the previous position's output, the computation parallelizes well on GPUs. Transformers have become the foundation of [[Large Language Models (LLMs)]] and [[Generative AI (Gen AI)]], with models scaling from millions to hundreds of billions of parameters, and past a trillion in some sparse models.
## Architecture Overview
```
Input: "The cat sat on the mat"
                 ↓
[Input Embedding + Positional Encoding]
                 ↓
┌──────────────────────────────────┐
│           Encoder (×N)           │
│  ┌────────────────────────────┐  │
│  │    Multi-Head Attention    │  │
│  │             ↓              │  │
│  │      Add & Normalize       │  │
│  │             ↓              │  │
│  │        Feed Forward        │  │
│  │             ↓              │  │
│  │      Add & Normalize       │  │
│  └────────────────────────────┘  │
└──────────────────────────────────┘
                 ↓  (encoder output feeds each decoder layer's cross-attention)
┌──────────────────────────────────┐
│           Decoder (×N)           │
│  ┌────────────────────────────┐  │
│  │   Masked Multi-Head Attn   │  │
│  │             ↓              │  │
│  │      Add & Normalize       │  │
│  │             ↓              │  │
│  │      Cross-Attention       │  │
│  │             ↓              │  │
│  │      Add & Normalize       │  │
│  │             ↓              │  │
│  │        Feed Forward        │  │
│  │             ↓              │  │
│  │      Add & Normalize       │  │
│  └────────────────────────────┘  │
└──────────────────────────────────┘
                 ↓
[Linear + Softmax]
                 ↓
Output Probabilities
```
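The "Add & Normalize" steps in the diagram wrap each sublayer in a residual connection followed by layer normalization. A minimal sketch of that data flow in plain Python follows, assuming the post-norm ordering of the original paper (many later models normalize before the sublayer instead). `toy_attention` and `toy_feed_forward` are hypothetical stand-ins, and the learned scale and shift of layer normalization are omitted.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    """Post-norm residual wrapper: LayerNorm(x + Sublayer(x)), applied per token."""
    out = sublayer(x)
    return [layer_norm([a + b for a, b in zip(xi, oi)]) for xi, oi in zip(x, out)]

def encoder_layer(x, attention, feed_forward):
    """One encoder layer: attention sublayer, then feed-forward sublayer."""
    x = add_and_norm(x, attention)
    return add_and_norm(x, feed_forward)

# Hypothetical stand-ins for the real sublayers, only to show the data flow.
def toy_attention(x):
    # Replace every token with the mean over all tokens (uniform attention).
    n = len(x)
    mean = [sum(col) / n for col in zip(*x)]
    return [mean[:] for _ in x]

def toy_feed_forward(x):
    # Position-wise: each token is transformed independently of the others.
    return [[max(0.0, 2.0 * v) for v in tok] for tok in x]

tokens = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0], [0.0, 1.0, 0.0, 1.0]]
out = encoder_layer(tokens, toy_attention, toy_feed_forward)
```

The output has the same shape as the input, which is what lets N such layers be stacked.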
## Self-Attention Mechanism
Self-attention computes relationships between all positions in a sequence:
```
For each token:
Query (Q) = What am I looking for?
Key (K) = What do I contain?
Value (V) = What information do I provide?
Attention(Q, K, V) = softmax(QK^T / √d_k) × V
```
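The √d_k divisor in the formula above has a short justification, given in the original paper under the assumption that the components of a query q and a key k are independent with mean 0 and variance 1:

$$
q \cdot k = \sum_{i=1}^{d_k} q_i k_i, \qquad \mathbb{E}[q \cdot k] = 0, \qquad \operatorname{Var}(q \cdot k) = \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = d_k
$$

$$
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = 1
$$

Without the divisor, the typical size of a score would grow like √d_k, pushing the softmax toward a near one-hot output where gradients are very small.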
| Component | Purpose |
|-----------|---------|
| **Query** | The current token's "question" |
| **Key** | Each token's searchable identifier |
| **Value** | The actual information to retrieve |
| **Scaling (√d_k)** | Keeps dot products from growing with d_k, which would saturate the softmax and shrink gradients |
| **Softmax** | Converts scores into non-negative weights that sum to 1 |
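The formula and the components in the table can be sketched in a few lines of plain Python. This is an illustrative, unoptimized version with made-up toy numbers. In a real self-attention layer, Q, K, and V would be learned linear projections of the same input, and the work would be done with batched matrix operations.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors.

    Returns (outputs, weights); weights[i][j] is how much query i attends to key j.
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Scaled dot product of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights

# Toy example (made-up numbers): 3 tokens, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
out, w = attention(Q, K, V)
```

Each row of `w` sums to 1, so every output vector is a weighted average of the rows of `V`. Here the first query matches the first and third keys equally and the second key least.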
## Multi-Head Attention
Run multiple attention operations in parallel:
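```
head_i = Attention(Q × W_i^Q, K × W_i^K, V × W_i^V)
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) × W_O
Each head has its own projections, so heads can learn different relationships.
Illustrative examples (roles emerge from training; they are not assigned):
- Head 1: Syntactic relationships
- Head 2: Semantic relationships
- Head 3: Positional patterns
- ...
```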
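A rough sketch of the concatenate-and-project step, in plain Python: each head applies its own query, key, and value projections to the input, runs scaled dot-product attention, and the per-head outputs are concatenated and multiplied by W_O. The sizes, the random weights, and the helper names (`matmul`, `random_matrix`) are assumptions made for illustration. Real implementations use learned weights and batched tensor operations.

```python
import math
import random

def matmul(A, B):
    """Multiply matrix A (n x m) by matrix B (m x p), both given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    scale = math.sqrt(len(K[0]))
    K_T = [list(col) for col in zip(*K)]
    weights = [softmax([s / scale for s in row]) for row in matmul(Q, K_T)]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    """Stand-in for learned weights: small Gaussian noise."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one per head."""
    head_outputs = [
        attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
        for W_Q, W_K, W_V in heads
    ]
    # Concatenate the per-head outputs along the feature dimension.
    concat = [sum((h[i] for h in head_outputs), []) for i in range(len(X))]
    return matmul(concat, W_O)

rng = random.Random(0)
seq_len, d_model, num_heads = 4, 8, 2
d_head = d_model // num_heads  # each head works in a smaller subspace
X = random_matrix(seq_len, d_model, rng)
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3)) for _ in range(num_heads)]
W_O = random_matrix(num_heads * d_head, d_model, rng)
Y = multi_head_attention(X, heads, W_O)
```

Choosing `d_head = d_model // num_heads` keeps the total cost close to that of a single full-width head, which is the choice made in the original paper.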
## Key Components
| Component | Function |
|-----------|----------|
| **Embedding** | Convert tokens to vectors |
| **Positional Encoding** | Inject sequence position information |
| **Multi-Head Attention** | Learn multiple relationship types |
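| **Feed-Forward Network** | Apply the same small MLP to each position's attended representation |
| **Layer Normalization** | Stabilize training by normalizing each token's activations |
| **Residual Connections** | Let gradients bypass each sublayer, making deep stacks trainable |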
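Of the components above, positional encoding is the least self-explanatory: self-attention has no inherent notion of token order, so position has to be injected into the embeddings. Below is a minimal sketch of the fixed sinusoidal scheme from the original paper. Learned position embeddings are a common alternative, and `add_positions` is a helper name chosen here for illustration.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: sin on even feature indices, cos on odd ones."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # Wavelengths grow geometrically with the feature index.
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def add_positions(embeddings):
    """Add the encoding for position p to the embedding at position p."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(tok, enc)] for tok, enc in zip(embeddings, pe)]

pe = positional_encoding(seq_len=6, d_model=8)
```

Position 0 encodes to alternating 0 and 1 (sin 0 and cos 0), and every later position gets a distinct pattern of values in [-1, 1].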
## Transformer Variants
| Model | Type | Year | Innovation |
|-------|------|------|------------|
| **BERT** | Encoder-only | 2018 | Bidirectional pretraining |
| **GPT** | Decoder-only | 2018 | Autoregressive generation |
| **T5** | Encoder-decoder | 2019 | Text-to-text framework |
| **ViT** | Vision | 2020 | Images as patch sequences |
| **CLIP** | Multimodal | 2021 | Image-text alignment |
| **LLaMA** | Decoder-only | 2023 | Efficient open-weight LLM |
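The practical difference between the encoder-only and decoder-only rows is largely the attention mask. A bidirectional model such as BERT lets every position see every other position, while an autoregressive model such as GPT masks out future positions. A small sketch with uniform (all-zero) scores, chosen only to make the effect of the mask easy to read:

```python
import math

def attention_weights(scores, causal):
    """Row-wise softmax over raw scores; with causal=True, position i ignores j > i."""
    weights = []
    for i, row in enumerate(scores):
        # Masked entries get -inf, so they receive exactly zero weight.
        masked = [s if (not causal or j <= i) else -math.inf for j, s in enumerate(row)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0] * 4 for _ in range(4)]
bidirectional = attention_weights(scores, causal=False)  # encoder-style
causal = attention_weights(scores, causal=True)          # decoder-style
```

With the causal mask, the first position can attend only to itself, the second splits its weight over the first two positions, and so on. This is what lets a decoder be trained on all positions at once without seeing the tokens it is asked to predict.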
## Why Transformers Succeeded
| Advantage | Compared to RNNs |
|-----------|------------------|
| **Parallelization** | All positions computed simultaneously during training |
| **Long-range deps** | Direct attention to any position |
| **Scalability** | Scales to billions of parameters and beyond |
| **Transfer learning** | Pretrain once, fine-tune for tasks |
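The first two rows can be stated more precisely using the per-layer comparison reported in the original paper, where n is the sequence length and d is the representation dimension:

$$
\begin{array}{lccc}
 & \text{Complexity per layer} & \text{Sequential operations} & \text{Maximum path length} \\
\text{Self-attention} & O(n^2 \cdot d) & O(1) & O(1) \\
\text{Recurrent} & O(n \cdot d^2) & O(n) & O(n)
\end{array}
$$

The O(1) sequential operations are what allow parallel training, and the O(1) path length is what makes long-range dependencies easier to learn. The cost is the n² term, which makes very long sequences expensive.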
## Transformers Beyond NLP
| Domain | Application |
|--------|-------------|
| **Vision** | ViT, DINO, Swin Transformer |
| **Audio** | Whisper, AudioLM |
| **Video** | VideoMAE, Sora |
| **Proteins** | AlphaFold 2 |
| **Robotics** | RT-2, Gato |
| **Code** | Codex, StarCoder |
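What lets one architecture cover these domains is that the input only has to be expressed as a sequence of vectors. For vision, ViT does this by cutting the image into fixed-size patches and treating each flattened patch as a token. A minimal sketch for a single-channel image stored as nested lists; `patchify` is a name chosen here for illustration, and a real model would then apply a learned linear projection and add position embeddings.

```python
def patchify(image, patch_size):
    """Split an H x W image (list of rows) into flattened, non-overlapping patches.

    Patches are emitted in row-major order, so the result is a sequence
    of length (H // patch_size) * (W // patch_size).
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches

# A 4x4 "image" whose pixel values are 0..15, split into four 2x2 patches.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = patchify(image, patch_size=2)
```

The 4x4 image becomes a sequence of four tokens of length 4, which can be fed to the same attention layers used for text.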
## References
- Vaswani, A., et al. (2017). "Attention Is All You Need." Advances in Neural Information Processing Systems 30 (NeurIPS 2017). arXiv:1706.03762
- https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)
- https://jalammar.github.io/illustrated-transformer/
## Related
- [[Deep Learning]]
- [[Natural Language Processing (NLP)]]
- [[Large Language Models (LLMs)]]
- [[BERT]]
- [[GPT]]
- [[Attention Mechanism]]
- [[Neural Networks (NNs)]]
- [[Generative AI (Gen AI)]]