# Gemma 4

Gemma 4 is the fourth generation of Google's [[Gemma]] open-weight model family. Released April 2, 2026, it is purpose-built for advanced reasoning and agentic workflows. It is licensed under Apache 2.0.

## Model variants

| Model | Total Params | Active Params | Layers | Context | Architecture | Modalities |
|-------|-------------|---------------|--------|---------|-------------|------------|
| E2B | 5.1B (2.3B effective) | 2.3B | 35 | 128K | Dense + PLE | Text, image, audio |
| E4B | 8B (4.5B effective) | 4.5B | 42 | 128K | Dense + PLE | Text, image, audio |
| 26B A4B | 25.2B | 3.8B | 30 | 256K | [[AI Mixture of Experts (MoE)]] (8 active / 128 total experts) | Text, image |
| 31B | 30.7B | 30.7B | 60 | 256K | Dense | Text, image |

The "E" prefix stands for "effective parameters": these models use Per-Layer Embeddings (PLE) to maximize efficiency for on-device use. The 26B MoE variant must load all 25.2B parameters into memory even though it activates only 3.8B per token.

## Architecture

- **Attention**: hybrid mechanism that interleaves local sliding-window attention with full global attention. Sliding window sizes: 512 tokens (E2B/E4B), 1024 tokens (26B A4B, 31B).
- **Vocabulary**: 262K tokens across all variants.
- **Vision encoders**: ~150M parameters (E2B/E4B), ~550M parameters (26B A4B, 31B). Variable image resolution via configurable token budgets (70, 140, 280, 560, 1120).
- **Audio encoders**: ~300M parameters, on E2B/E4B only; they support ASR and speech-to-translated-text on clips of up to 30 seconds.
- **Video**: supported via frame sequences, up to 60 seconds.

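
To make the hybrid attention scheme concrete, here is a minimal pure-Python sketch of the two mask types and of how layers might alternate between them. It is an illustration, not Gemma 4's implementation: the function names are hypothetical, the `global_every` ratio is an assumed knob (the interleaving ratio is not given in this note), and counting the current token as part of the window is an assumed convention.

```python
def causal_mask(seq_len):
    """Global causal attention: query i may attend to every key j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def sliding_window_mask(seq_len, window):
    """Local causal attention: query i may attend only to the `window` most
    recent keys, itself included (an assumed convention)."""
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


def layer_kinds(num_layers, global_every):
    """Label each layer as local or global. `global_every` is a hypothetical
    knob; the real interleaving ratio is not specified in this note."""
    return ["global" if (layer + 1) % global_every == 0 else "local"
            for layer in range(num_layers)]


def attended_pairs(mask):
    """Count the (query, key) pairs a mask allows: a rough proxy for cost."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    seq_len, window = 8, 4  # toy sizes; the windows above are 512 or 1024 tokens
    print(layer_kinds(6, global_every=3))
    print("global pairs:", attended_pairs(causal_mask(seq_len)))
    print("local pairs: ", attended_pairs(sliding_window_mask(seq_len, window)))
```

Local layers keep per-token attention cost bounded by the window size, while the interleaved global layers preserve access to the full context.
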
## Key features

- **[[AI Multimodal]]**: text, image, audio (small models), video across all variants
- **Built-in reasoning**: configurable thinking mode via `<|think|>` token for step-by-step reasoning
- **Native function calling**: structured tool use for [[Agentic Era|agentic]] workflows
- **System role support**: native `system` role (new in Gemma 4)
- **Multilingual**: 35+ languages out-of-box, trained on 140+ languages
- **Long context**: 128K (small) to 256K (medium) token windows

## What's new vs Gemma 3

- Audio modality on small models (E2B, E4B)
- Built-in reasoning / thinking mode
- Native system role support
- 256K context window (up from 128K max)
- [[AI Mixture of Experts (MoE)|Mixture-of-Experts]] variant (26B A4B)
- Per-Layer Embeddings (PLE) for efficient small models
- Significantly improved coding and math benchmarks

## Benchmarks (instruction-tuned)

| Benchmark | 31B | 26B A4B | E4B | E2B |
|-----------|-----|---------|-----|-----|
| MMLU Pro | 85.2% | 82.6% | 69.4% | 60.0% |
| AIME 2026 | 89.2% | 88.3% | 42.5% | 37.5% |
| LiveCodeBench v6 | 80.0% | 77.1% | 52.0% | 44.0% |
| GPQA Diamond | 84.3% | 82.3% | 58.6% | 43.4% |
| MMMLU | 88.4% | 86.3% | 76.6% | 67.4% |

## Memory requirements

| Model | BF16 | SFP8 | Q4_0 |
|-------|------|------|------|
| E2B | 9.6 GB | 4.6 GB | 3.2 GB |
| E4B | 15 GB | 7.5 GB | 5 GB |
| 31B | 58.3 GB | 30.4 GB | 17.4 GB |
| 26B A4B | 48 GB | 25 GB | 15.6 GB |

## Multi-Token Prediction (MTP) drafters: May 2026 update

On 2026-05-05 Google released a companion line of small autoregressive **[[AI Multi-Token Prediction Drafters|drafter]]** models for the Gemma 4 family, alongside a Multi-Token Prediction (MTP) head. These enable [[Speculative Decoding]] at inference time: the drafter predicts several tokens ahead, the target Gemma 4 model verifies them in parallel, and accepted tokens roll out without waiting for token-by-token decoding. Reported speedups: **up to 3× without quality degradation**.
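
The draft-then-verify loop described above can be sketched with two toy next-token functions. This is generic greedy speculative decoding under stated assumptions, not the Gemma 4 drafter architecture: `target_next`, `draft_next`, and `speculative_decode` are hypothetical names, the "models" are arithmetic stand-ins, and verification runs as a sequential loop here, whereas a real system scores all draft positions in one parallel pass of the target model.

```python
def target_next(seq):
    """Toy stand-in for the large target model's greedy next token."""
    return (sum(seq) * 31 + len(seq)) % 50


def draft_next(seq):
    """Toy drafter: agrees with the target except at every fifth position."""
    guess = target_next(seq)
    return (guess + 1) % 50 if len(seq) % 5 == 0 else guess


def greedy_decode(next_fn, prompt, max_new_tokens):
    """Plain token-by-token decoding, used as the reference output."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_fn(tokens))
    return tokens


def speculative_decode(target, drafter, prompt, max_new_tokens, k=4):
    """Return (tokens, rounds); each round is one verification pass of the target."""
    tokens = list(prompt)
    final_len = len(prompt) + max_new_tokens
    rounds = 0
    while len(tokens) < final_len:
        rounds += 1
        # 1. The drafter proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(drafter(tokens + draft))
        # 2. The target checks each proposal against its own greedy choice.
        accepted = []
        for i in range(k):
            expected = target(tokens + draft[:i])
            accepted.append(expected)
            if draft[i] != expected:
                break  # first mismatch: keep the target's token, drop the rest
        else:
            # Every draft token matched, so the target's pass yields one more.
            accepted.append(target(tokens + draft))
        tokens.extend(accepted)
    return tokens[:final_len], rounds


if __name__ == "__main__":
    out, rounds = speculative_decode(target_next, draft_next, [1, 2, 3], 20)
    print(out == greedy_decode(target_next, [1, 2, 3], 20), rounds)
```

Because every emitted token is either confirmed or supplied by the target, the output matches plain greedy decoding exactly; the gain comes from needing fewer target passes than tokens generated.
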
On Apple Silicon, mixture-of-experts variants at batch sizes 4–8 reach ~2.2× decoding speedups.

The drafters introduce three architectural enhancements that distinguish them from generic speculative-decoding setups:

- **Target activations sharing.** The drafter consumes the final-layer activations of the target model (concatenated with its embeddings) on round 1, then reuses its own activations on subsequent rounds.
- **KV cache sharing.** The drafter cross-attends to the target model's KV cache instead of building its own, so no prompt tokens are re-processed.
- **Efficient embedder.** The LM head uses sparse decoding via clustered token lookup; the drafter only computes logits for the most likely cluster, not the full 262K-token vocabulary.

For the broader concept (not Gemma-specific), see [[AI Multi-Token Prediction Drafters]].

The drafters are available under Apache 2.0 on Hugging Face and Kaggle, and are supported in [[Transformers]], [[MLX]], [[vLLM]], [[SGLang]], and [[Ollama]].

## Run locally

Via [[Ollama]]:

```sh
ollama run gemma4
ollama run gemma4:e4b
ollama run gemma4:27b
```

## Access

- [[Google AI Studio]] (free)
- [[Google AI Edge Gallery]] (mobile app)
- Vertex AI
- Hugging Face: `google/gemma-4-27b-it`, `google/gemma-4-e4b-it`
- [[Ollama]]: `gemma4`
- ...

## References

- Google blog: https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/
- Multi-Token Prediction announcement (2026-05-05): https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/
- Drafter explainer (Google Gemma on X): https://x.com/googlegemma/status/2051694045869879749
- Model card: https://ai.google.dev/gemma/docs/core/model_card_4
- Gemma docs: https://ai.google.dev/gemma/docs/core

## Related

- [[Gemma]]
- [[Gemini]]
- [[Google AI Edge Gallery]]
- [[Large Language Models (LLMs)]]
- [[Dense AI Models]]
- [[Sparse AI Models]]
- [[AI Mixture of Experts (MoE)]]
- [[AI Multimodal]]
- [[Ollama]]
- [[Google AI Studio]]
- [[Agentic Era]]
- [[AI Multi-Token Prediction Drafters]]
- [[Speculative Decoding]]
- [[AI Inference]]
- [[MLX]]
- [[vLLM]]
- [[SGLang]]
- [[Transformers]]