
# Diffusion Models

Diffusion Models are a class of [[Generative AI (Gen AI)]] models that learn to generate data by reversing a gradual noising process. The model is trained to denoise images step by step; at inference time, it starts from pure random noise and iteratively refines it into a coherent output. This approach powers state-of-the-art image generators like Stable Diffusion, DALL-E 3, and Midjourney.

Introduced in their modern form by [[Jonathan Ho]] et al. in 2020 ("Denoising Diffusion Probabilistic Models"), diffusion models achieved image quality surpassing GANs while being more stable to train and offering better mode coverage (diversity). Combining diffusion models with text conditioning (using [[CLIP]] or T5 embeddings) enabled the text-to-image revolution of 2022, fundamentally changing creative work.

## How Diffusion Works

### Forward Process (Training)

Gradually add Gaussian noise to images until they become pure noise:

```
x_0 (clean image) → x_1 → x_2 → ... → x_T (pure noise)
                  +ε    +ε    +ε
```

### Reverse Process (Generation)

Learn to reverse the noising, then generate by denoising from noise:

```
x_T (noise) → x_{T-1} → x_{T-2} → ... → x_0 (generated image)
            -ε̂        -ε̂        -ε̂   (predicted noise at each step)
```

## Visual Process

```
Forward (destroy information):
┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
│   🐱    │ → │  🐱░░   │ → │  ░░░░   │ → │  ░░░░░  │
│ (clear) │   │ (noisy) │   │(noisier)│   │ (noise) │
└─────────┘   └─────────┘   └─────────┘   └─────────┘
    x_0           x_1           x_2           x_T

Reverse (learn to denoise):
┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
│  ░░░░░  │ → │  ░░░░   │ → │  🐱░░   │ → │   🐱    │
│ (noise) │   │(cleaner)│   │(cleaner)│   │(output) │
└─────────┘   └─────────┘   └─────────┘   └─────────┘
    x_T         x_{T-1}         x_1           x_0
```

## Key Architectures

| Model | Year | Innovation |
|-------|------|------------|
| **DDPM** | 2020 | Modern diffusion formulation |
| **DDIM** | 2020 | Faster sampling (fewer steps) |
| **Guided Diffusion** | 2021 | Classifier guidance |
| **GLIDE** | 2021 | Text-guided generation (OpenAI) |
| **Stable Diffusion** | 2022 | Latent space diffusion |
| **DALL-E 2** | 2022 | CLIP + diffusion |
| **DALL-E 3** | 2023 | Better prompt following |
| **Sora** | 2024 | Video diffusion |

## Latent Diffusion (Stable Diffusion)

Perform diffusion in a compressed latent space for efficiency:

```
Image (512×512×3) → Encoder → Latent (64×64×4) → Diffusion → Decoder → Image
                     (VAE)                         (U-Net)     (VAE)
```

Benefits:

- Much faster than pixel-space diffusion
- Lower memory requirements
- Enables high-resolution generation

## Text Conditioning

How text prompts guide generation:

```
"A cat wearing a hat"
        ↓
Text Encoder (CLIP/T5)
        ↓
Text Embeddings
        ↓
Cross-Attention in U-Net
        ↓
Guides denoising toward prompt
```

## Classifier-Free Guidance

Balance between prompt-following and diversity:

```
output = unconditional_output + scale × (conditional_output - unconditional_output)

scale = 1:  Conditional model output (no extra guidance)
scale > 1:  Stronger prompt adherence (typical: 7-15)
scale >> 1: Over-saturated, less diverse
```

## Diffusion vs GANs

| Aspect | Diffusion | GANs |
|--------|-----------|------|
| Training stability | Very stable | Less stable; mode collapse risk |
| Sample diversity | High | Can be limited |
| Sample quality | Excellent | Excellent |
| Generation speed | Slower (many steps) | Fast (single pass) |
| Control | Excellent | Harder |
| Likelihood | Tractable lower bound (ELBO) | No explicit likelihood |

## Notable Systems

| System | Organization | Notable Features |
|--------|--------------|------------------|
| **Stable Diffusion** | Stability AI | Openly released weights, latent diffusion |
| **DALL-E 3** | OpenAI | Superior prompt understanding |
| **Midjourney** | Midjourney | Artistic quality, community |
| **Imagen** | Google | High photorealism |
| **Sora** | OpenAI | Video generation |
| **Flux** | Black Forest Labs | Open weights, quality |

## Applications

- **Text-to-image**: Generate images from descriptions
- **Image-to-image**: Transform existing images
- **Inpainting**: Fill in missing regions
- **Outpainting**: Extend image boundaries
- **Super-resolution**: Upscale images
- **Video generation**: Sora, Runway Gen-2
- **3D generation**: DreamFusion, Magic3D
- **Audio**: AudioLDM, Riffusion

## References

- Ho et al. (2020). "Denoising Diffusion Probabilistic Models"
- Rombach et al. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models"
- https://en.wikipedia.org/wiki/Diffusion_model

## Related

- [[Generative AI (Gen AI)]]
- [[Deep Learning]]
- [[Stable Diffusion]]
- [[DALL-E]]
- [[GANs]]
- [[CLIP]]
- [[Image Generation]]
- [[Computer Vision]]
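
## Example: Forward Noising in Code

The forward process described under "How Diffusion Works" can be sketched in a few lines. This is a minimal illustration, not a reference implementation. It assumes the closed-form sampling rule and the linear noise schedule (β from 1e-4 to 0.02 over 1000 steps) from Ho et al. (2020), neither of which is spelled out in this note. The function names `linear_beta_schedule`, `cumulative_alphas`, and `add_noise` are hypothetical, and a flat list of floats stands in for an image.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, evenly spaced (assumed DDPM-style defaults)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for s = 1..t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Jump from x_0 straight to x_t (t is 1-indexed).

    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    Returns (x_t, eps) so a model could be trained to predict eps.
    """
    a_bar = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [
        math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
        for x, e in zip(x0, eps)
    ]
    return x_t, eps


if __name__ == "__main__":
    rng = random.Random(0)
    alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
    x0 = [1.0, -1.0, 0.5, 0.0]  # toy "image" with four pixels
    for t in (1, 250, 1000):
        x_t, _ = add_noise(x0, t, alpha_bars, rng)
        # The signal weight sqrt(alpha_bar_t) shrinks toward 0 as t grows.
        print(t, round(math.sqrt(alpha_bars[t - 1]), 4), x_t)
```

As `t` approaches `T`, the signal weight shrinks toward zero, so `x_T` is almost entirely noise, which matches the x_0 → x_T diagram. A denoising network would be trained to recover `eps` from `x_t` and `t`.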
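
## Example: Classifier-Free Guidance in Code

The guidance formula from the "Classifier-Free Guidance" section is an elementwise extrapolation from the unconditional prediction toward the conditional one. The sketch below applies it to plain Python lists. The function name `classifier_free_guidance` and the toy values are assumptions for illustration; in a real system the two inputs would be the denoiser's noise predictions with and without the prompt.

```python
def classifier_free_guidance(uncond, cond, scale):
    """Return uncond + scale * (cond - uncond), elementwise.

    uncond: prediction without the prompt (hypothetical toy values here)
    cond:   prediction with the prompt
    scale:  guidance scale; 1.0 reproduces cond, larger values push further
    """
    if len(uncond) != len(cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


if __name__ == "__main__":
    uncond = [0.0, 1.0, -1.0]
    cond = [1.0, 1.0, 0.0]
    for scale in (0.0, 1.0, 7.5):
        print(scale, classifier_free_guidance(uncond, cond, scale))
```

With `scale = 0.0` the result is the unconditional prediction, and with `scale = 1.0` it is the conditional one. Larger scales overshoot past the conditional prediction, which is the mechanism behind stronger prompt adherence and, at extreme values, over-saturated outputs.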