# Gemma 4
Gemma 4 is the fourth generation of Google's [[Gemma]] open-weight model family. Released on April 2, 2026, it is purpose-built for advanced reasoning and agentic workflows and is licensed under Apache 2.0.
## Model variants
| Model | Total Params | Active Params | Layers | Context | Architecture | Modalities |
|-------|-------------|---------------|--------|---------|-------------|------------|
| E2B | 5.1B (2.3B effective) | 2.3B | 35 | 128K | Dense + PLE | Text, image, audio |
| E4B | 8B (4.5B effective) | 4.5B | 42 | 128K | Dense + PLE | Text, image, audio |
| 26B A4B | 25.2B | 3.8B | 30 | 256K | [[AI Mixture of Experts (MoE)]] (8 active / 128 total experts) | Text, image |
| 31B | 30.7B | 30.7B | 60 | 256K | Dense | Text, image |
The "E" prefix stands for "effective": E2B and E4B use Per-Layer Embeddings (PLE), so their effective parameter count is well below their total count, which improves efficiency for on-device use. The 26B A4B MoE variant must load all 25.2B parameters into memory even though it activates only 3.8B per token.
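The gap between resident and active parameters can be made concrete with a back-of-the-envelope calculation. The sketch below uses the parameter counts from the variants table and assumes a weight-only footprint of `params × bits / 8` bytes; the helper name `weight_memory_gb` is illustrative, and the result ignores KV cache and runtime overhead, so it is not a substitute for the measured values in the memory requirements table.

```python
# Parameter counts (in billions) taken from the variants table.
TOTAL_PARAMS_B = 25.2   # 26B A4B: all experts must be resident in memory
ACTIVE_PARAMS_B = 3.8   # parameters actually used for each token


def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough weight-only footprint in GB (10^9 bytes).

    Assumption: ignores KV cache, activations, and runtime overhead.
    """
    return params_billions * bits_per_weight / 8


active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
resident_16bit_gb = weight_memory_gb(TOTAL_PARAMS_B, 16)
active_only_16bit_gb = weight_memory_gb(ACTIVE_PARAMS_B, 16)  # hypothetical comparison

print(f"active fraction per token: {active_fraction:.1%}")
print(f"weights resident at 16 bits: ~{resident_16bit_gb:.1f} GB")
print(f"if only active weights had to be loaded: ~{active_only_16bit_gb:.1f} GB")
```

In other words, the MoE design cuts per-token compute to roughly 15% of the weights, but memory still scales with the total parameter count.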
## Architecture
The models use a hybrid attention mechanism that interleaves local sliding-window attention with full global attention. Sliding window sizes are 512 tokens (E2B, E4B) and 1024 tokens (26B A4B, 31B). The vocabulary is 262K tokens across all variants.
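A minimal sketch of the difference between the two attention patterns, using plain Python lists as boolean masks. Two things here are assumptions rather than documented Gemma 4 behaviour: the window is taken to include the current token, and the `layer_kinds` interleaving ratio is purely hypothetical, since the ratio of local to global layers is not stated above.

```python
def causal_mask(n: int) -> list[list[bool]]:
    """Full global attention: position i may attend to every j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def sliding_window_mask(n: int, window: int) -> list[list[bool]]:
    """Local attention: position i attends to the last `window` positions,
    itself included (assumed convention)."""
    return [[i - window < j <= i for j in range(n)] for i in range(n)]


def layer_kinds(num_layers: int, global_every: int) -> list[str]:
    """Hypothetical interleaving: every `global_every`-th layer is global."""
    return ["global" if (k + 1) % global_every == 0 else "local"
            for k in range(num_layers)]


def attended(mask: list[list[bool]], i: int) -> list[int]:
    """Positions that query position i is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


n = 8
local = sliding_window_mask(n, window=3)
full = causal_mask(n)

print(attended(local, 6))              # [4, 5, 6]
print(attended(full, 6))               # [0, 1, 2, 3, 4, 5, 6]
print(layer_kinds(6, global_every=3))  # ['local', 'local', 'global', 'local', 'local', 'global']
```

Local layers keep per-token attention cost bounded by the window size, while the interleaved global layers still let information travel across the whole context.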
Vision encoders have ~150M parameters (E2B, E4B) or ~550M parameters (26B A4B, 31B). Image resolution is variable and controlled by a configurable token budget (70, 140, 280, 560, or 1120). Audio encoders (~300M parameters) are present on E2B and E4B only and support ASR and speech-to-translated-text for audio up to 30 seconds. Video is supported as frame sequences, up to 60 seconds.
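As a rough illustration of how the token budgets trade image detail against context usage, the sketch below picks the largest budget that keeps a batch of images within a token allowance. It assumes each image (or video frame) consumes exactly one budget's worth of tokens; the helper name, the 1 frame-per-second sampling rate, and the allowances are all hypothetical.

```python
IMAGE_TOKEN_BUDGETS = (70, 140, 280, 560, 1120)  # budgets listed above


def largest_budget_that_fits(num_images: int, token_allowance: int):
    """Return the largest per-image budget whose total stays within the
    allowance, or None if even the smallest budget does not fit."""
    fitting = [b for b in IMAGE_TOKEN_BUDGETS if b * num_images <= token_allowance]
    return max(fitting) if fitting else None


# Hypothetical scenario: a 60-second clip sampled at 1 frame per second.
frames = 60 * 1

print(largest_budget_that_fits(frames, 32_000))  # 280
print(largest_budget_that_fits(frames, 4_000))   # None
print(largest_budget_that_fits(1, 2_000))        # 1120
```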
## Key features
- **[[AI Multimodal]]**: text, image, and video across all variants; audio on the small models (E2B, E4B) only
- **Built-in reasoning**: configurable thinking mode via `<|think|>` token for step-by-step reasoning
- **Native function calling**: structured tool use for [[Agentic Era|agentic]] workflows
- **System role support**: native `system` role (new in Gemma 4)
- **Multilingual**: 35+ languages out of the box, trained on 140+ languages
- **Long context**: 128K (E2B, E4B) to 256K (26B A4B, 31B) token windows
## What's new vs Gemma 3
- Audio modality on small models (E2B, E4B)
- Built-in reasoning / thinking mode
- Native system role support
- 256K context window (up from 128K max)
- [[AI Mixture of Experts (MoE)|Mixture-of-Experts]] variant (26B A4B)
- Per-Layer Embeddings (PLE) for efficient small models
- Significantly improved coding and math benchmarks
## Benchmarks (instruction-tuned)
| Benchmark | 31B | 26B A4B | E4B | E2B |
|-----------|-----|---------|-----|-----|
| MMLU Pro | 85.2% | 82.6% | 69.4% | 60.0% |
| AIME 2026 | 89.2% | 88.3% | 42.5% | 37.5% |
| LiveCodeBench v6 | 80.0% | 77.1% | 52.0% | 44.0% |
| GPQA Diamond | 84.3% | 82.3% | 58.6% | 43.4% |
| MMMLU | 88.4% | 86.3% | 76.6% | 67.4% |
## Memory requirements
| Model | BF16 | SFP8 | Q4_0 |
|-------|------|------|------|
| E2B | 9.6 GB | 4.6 GB | 3.2 GB |
| E4B | 15 GB | 7.5 GB | 5 GB |
| 26B A4B | 48 GB | 25 GB | 15.6 GB |
| 31B | 58.3 GB | 30.4 GB | 17.4 GB |
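The table can be turned into a small lookup for choosing a variant and format under a memory budget. This is a sketch only: the figures are copied from the table above, while the function name and the `headroom_gb` allowance (for KV cache and runtime overhead) are illustrative assumptions.

```python
# Approximate memory (GB) copied from the table above.
MEMORY_GB = {
    "E2B":     {"BF16": 9.6,  "SFP8": 4.6,  "Q4_0": 3.2},
    "E4B":     {"BF16": 15.0, "SFP8": 7.5,  "Q4_0": 5.0},
    "26B A4B": {"BF16": 48.0, "SFP8": 25.0, "Q4_0": 15.6},
    "31B":     {"BF16": 58.3, "SFP8": 30.4, "Q4_0": 17.4},
}
PRECISION_ORDER = ("BF16", "SFP8", "Q4_0")  # highest precision first


def options_that_fit(budget_gb: float, headroom_gb: float = 0.0):
    """For each model, return the highest-precision format that fits
    within the budget after reserving `headroom_gb` (hypothetical allowance)."""
    result = []
    for model, sizes in MEMORY_GB.items():
        for fmt in PRECISION_ORDER:
            if sizes[fmt] + headroom_gb <= budget_gb:
                result.append((model, fmt))
                break  # keep only the best format per model
    return result


# Example: a 24 GB device with 4 GB reserved for cache and overhead.
print(options_that_fit(24.0, headroom_gb=4.0))
```

With these numbers, a 24 GB device with 4 GB reserved can hold E2B and E4B in BF16 but needs Q4_0 for both larger variants.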
## Multi-Token Prediction (MTP) drafters — May 2026 update
On 2026-05-05 Google released a companion line of small autoregressive **[[AI Multi-Token Prediction Drafters|drafter]]** models for the Gemma 4 family, alongside a Multi-Token Prediction (MTP) head. These enable [[Speculative Decoding]] at inference time — the drafter predicts several tokens ahead, the target Gemma 4 model verifies them in parallel, and accepted tokens roll out without waiting for token-by-token decoding.
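The draft-and-verify loop described above can be sketched with toy models. This is a generic greedy-verification variant of speculative decoding, not Gemma 4's implementation: the "models" are hypothetical functions mapping a context to the next token id, and the sequential verification loop only mimics what a real target model computes in a single batched forward pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # toy "model": context -> next token id


def speculative_step(target: NextToken, drafter: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Run one draft-and-verify round and return the tokens to append."""
    # 1. The drafter proposes k tokens autoregressively.
    draft, ctx = [], list(context)
    for _ in range(k):
        tok = drafter(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2. The target checks each drafted position in order.
    accepted, ctx = [], list(context)
    for tok in draft:
        expected = target(ctx)
        if tok != expected:
            accepted.append(expected)  # target's token replaces the first mismatch
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft was accepted: the verification pass yields one bonus token.
    accepted.append(target(ctx))
    return accepted


def target(ctx: List[int]) -> int:
    """Toy target model: always counts upward by one."""
    return ctx[-1] + 1


def drafter(ctx: List[int]) -> int:
    """Toy drafter: agrees with the target except after multiples of 4."""
    return ctx[-1] + 2 if ctx[-1] % 4 == 0 else ctx[-1] + 1


print(speculative_step(target, drafter, [1], k=4))  # [2, 3, 4, 5]
```

Each round emits at least one token chosen by the target, so the output matches what greedy decoding with the target alone would produce; the gain comes from accepting several tokens per target pass when the drafter is usually right. Sampling-based decoding needs a probabilistic acceptance rule instead of the exact-match check used here.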
Reported speedups reach **up to 3× without quality degradation**. On Apple Silicon, the reported decoding speedup for mixture-of-experts variants at batch sizes 4–8 is ~2.2×.
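To see why speedups of this size are plausible, a standard simplified model of speculative decoding helps. The sketch below assumes each drafted token is accepted independently with probability `alpha`, that `k` tokens are drafted per round, and that one drafter step costs a fraction `c` of a target step. All three values are hypothetical and are not measurements of Gemma 4.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens emitted per target verification pass, assuming
    independent per-token acceptance with probability alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha: float, k: int, c: float) -> float:
    """Expected tokens per round divided by the round's cost, measured in
    target steps: k drafter steps at relative cost c plus one target pass."""
    return expected_tokens_per_round(alpha, k) / (k * c + 1)


# Hypothetical values: 80% acceptance, 4 drafted tokens, drafter at 5% of target cost.
print(round(expected_tokens_per_round(0.8, 4), 3))
print(round(estimated_speedup(0.8, 4, 0.05), 2))
```

The model makes the two levers visible: a higher acceptance rate raises the number of tokens per round, and a cheaper drafter lowers the cost of each round.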
The drafters introduce three architectural enhancements that distinguish them from generic speculative-decoding setups:
- **Target activations sharing.** The drafter consumes the final-layer activations of the target model (concatenated with its embeddings) on round 1, then reuses its own activations on subsequent rounds.
- **KV cache sharing.** The drafter cross-attends to the target model's KV cache instead of building its own — no redundant prompt re-processing.
- **Efficient embedder.** The LM Head uses sparse decoding via clustered token lookup; the drafter only computes logits for the most likely cluster, not the full 262K-token vocabulary.
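The clustered lookup in the last bullet can be illustrated with a toy example. How the drafter actually selects a cluster is not described above, so the centroid dot-product scoring used here is an assumption, and all names and vectors are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def clustered_argmax(hidden, centroids, clusters, token_embeddings):
    """Score cluster centroids first, then compute logits only for the
    tokens inside the best-scoring cluster."""
    best_cluster = max(range(len(centroids)),
                       key=lambda c: dot(hidden, centroids[c]))
    logits = {tok: dot(hidden, token_embeddings[tok])
              for tok in clusters[best_cluster]}
    best_token = max(logits, key=logits.get)
    return best_token, len(logits)


# Toy 2-D vocabulary of 6 tokens grouped into 3 clusters.
token_embeddings = {
    0: (1.0, 0.0), 1: (0.9, 0.1),
    2: (0.0, 1.0), 3: (0.1, 0.9),
    4: (-1.0, 0.0), 5: (-0.9, -0.1),
}
clusters = [[0, 1], [2, 3], [4, 5]]
centroids = [(0.95, 0.05), (0.05, 0.95), (-0.95, -0.05)]

token, scored = clustered_argmax((0.2, 1.0), centroids, clusters, token_embeddings)
print(token, scored)  # 2 2
```

Here only 2 of the 6 token logits are computed, plus 3 centroid scores. The saving grows with vocabulary size, at the cost of missing the true best token whenever it falls outside the selected cluster.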
For the broader concept (not Gemma-specific), see [[AI Multi-Token Prediction Drafters]].
Available under Apache 2.0 on Hugging Face and Kaggle; supported in [[Transformers]], [[MLX]], [[vLLM]], [[SGLang]], and [[Ollama]].
## Run locally
Via [[Ollama]]:
```sh
ollama run gemma4        # default tag
ollama run gemma4:e4b    # E4B variant
ollama run gemma4:31b    # 31B dense variant
```
## Access
- [[Google AI Studio]] (free)
- [[Google AI Edge Gallery]] (mobile app)
- Vertex AI
- Hugging Face: `google/gemma-4-31b-it`, `google/gemma-4-e4b-it`
- [[Ollama]]: `gemma4`
- ...
## References
- Google blog: https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/
- Multi-Token Prediction announcement (2026-05-05): https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/
- Drafter explainer (Google Gemma on X): https://x.com/googlegemma/status/2051694045869879749
- Model card: https://ai.google.dev/gemma/docs/core/model_card_4
- Gemma docs: https://ai.google.dev/gemma/docs/core
## Related
- [[Gemma]]
- [[Gemini]]
- [[Google AI Edge Gallery]]
- [[Large Language Models (LLMs)]]
- [[Dense AI Models]]
- [[Sparse AI Models]]
- [[AI Mixture of Experts (MoE)]]
- [[AI Multimodal]]
- [[Ollama]]
- [[Google AI Studio]]
- [[Agentic Era]]
- [[AI Multi-Token Prediction Drafters]]
- [[Speculative Decoding]]
- [[AI Inference]]
- [[MLX]]
- [[vLLM]]
- [[SGLang]]
- [[Transformers]]