# AI Speculative Decoding
Speculative decoding is an inference-time technique for accelerating autoregressive [[Large Language Models (LLMs)|LLMs]] without changing their outputs. A small **draft model** predicts several tokens ahead; the larger **target model** verifies all of them in a single forward pass; tokens the target agrees with are accepted; the first disagreement is replaced with the target's own choice.
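Net effect: one target forward pass can emit several tokens instead of one. In the best case, a pass over *N* drafted tokens yields *N* + 1 tokens (all *N* drafts accepted, plus one token from the target itself); in the worst case it still yields one. When the draft model agrees often, latency drops by 2–3× without quality loss, because the target's verification step is the correctness gate.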
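The per-pass yield can be estimated with a small calculation. This is a minimal sketch under a simplifying assumption (each drafted token is accepted independently with the same probability, as in the analysis of Leviathan et al.); the function and parameter names are illustrative, not from any library.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target forward pass.

    alpha: ASSUMED per-token acceptance probability (i.i.d. simplification).
    gamma: number of tokens the drafter proposes per step.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft accepted, plus the one token the target adds itself.
        return float(gamma + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^gamma
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
```

With `alpha = 0.8` and `gamma = 4` this gives about 3.36 tokens per target pass; with `alpha = 0` it falls back to exactly 1, which is plain autoregressive decoding.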
The technique was introduced in 2022 by Leviathan et al. ("Fast Inference from Transformers via Speculative Decoding") and developed in parallel at DeepMind (Chen et al., "Accelerating Large Language Model Decoding with Speculative Sampling", 2023). It was mainstream by 2024 and, in 2026, was standardized as a release artifact for open-weight models with Google's [[Gemma 4]] [[AI Multi-Token Prediction Drafters|drafters]].
## The bottleneck
Modern LLM inference on consumer hardware is **memory-bandwidth bound**. Generating each token requires streaming the model's parameters from memory to the compute units, yet that pass produces only one output token. Compute is underused; the memory bus is the bottleneck. Verifying *N* drafted tokens in a single forward pass amortizes that parameter transfer across up to *N* + 1 output tokens, the same way batching amortizes one load of data across many operations.
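A back-of-the-envelope sketch of that bottleneck, assuming every weight is read once per forward pass and ignoring KV-cache reads and compute time. The hardware figures in the usage note are hypothetical round numbers, and the function names are illustrative.

```python
def decode_ms_per_pass(n_params: float, bytes_per_param: float,
                       bandwidth_gb_s: float) -> float:
    """Lower bound on forward-pass latency when weight streaming dominates.

    ASSUMPTION: every parameter is read from memory exactly once per pass;
    KV-cache traffic and compute time are ignored.
    """
    model_bytes = n_params * bytes_per_param
    seconds = model_bytes / (bandwidth_gb_s * 1e9)
    return seconds * 1e3


def amortized_ms_per_token(pass_ms: float, tokens_per_pass: float) -> float:
    """Per-token latency when one pass yields several accepted tokens."""
    if tokens_per_pass <= 0:
        raise ValueError("tokens_per_pass must be positive")
    return pass_ms / tokens_per_pass
```

For a hypothetical 8-billion-parameter model in 16-bit weights on a 400 GB/s memory bus, `decode_ms_per_pass(8e9, 2, 400)` is about 40 ms. If each pass yields 3 tokens on average, `amortized_ms_per_token` brings that down to roughly 13 ms per token (before counting the drafter's own cost).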
## The accept/reject loop
```
1. Draft N tokens autoregressively (cheap, small model)
2. Target model runs ONE forward pass over the prompt + drafted tokens
3. Walk left-to-right. Accept drafted tokens until the target disagrees.
4. Replace the first rejected token with the target's own choice.
   If all N are accepted, the same pass supplies one extra token.
5. Resume from there. Drafter starts fresh on the new context.
```
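The steps above can be sketched for the greedy (argmax) case as follows. This is a toy illustration, not a production implementation: the "models" are hypothetical plain functions from a token context to the next token, and the target is called once per position here, whereas a real transformer returns all of those predictions from one forward pass.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[int]], int]  # hypothetical toy model interface


def speculative_greedy(target: NextToken, draft: NextToken,
                       prompt: Sequence[int], max_new: int,
                       n_draft: int = 4) -> List[int]:
    out = list(prompt)
    goal = len(out) + max_new
    while len(out) < goal:
        # 1. Draft n_draft tokens autoregressively with the cheap model.
        drafted: List[int] = []
        for _ in range(n_draft):
            drafted.append(draft(out + drafted))
        # 2. Target's choice at each of the n_draft + 1 positions.
        #    (One forward pass in a real model; one call per position here.)
        choices = [target(out + drafted[:i]) for i in range(n_draft + 1)]
        # 3. Accept drafted tokens until the first disagreement.
        n_ok = 0
        while n_ok < n_draft and drafted[n_ok] == choices[n_ok]:
            n_ok += 1
        # 4. Keep the accepted prefix, then the target's own token: a
        #    correction on mismatch, or a bonus token if all were accepted.
        out.extend(drafted[:n_ok])
        out.append(choices[n_ok])
        # 5. The next loop iteration drafts from the new context.
    return out[:goal]


def target_only_greedy(target: NextToken, prompt: Sequence[int],
                       max_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def toy_target(ctx: Sequence[int]) -> int:
    return (3 * ctx[-1] + len(ctx)) % 11


def toy_draft(ctx: Sequence[int]) -> int:
    # Agrees with the target except when the context length is a multiple of 3.
    guess = toy_target(ctx)
    return guess if len(ctx) % 3 else (guess + 1) % 11
```

Calling `speculative_greedy(toy_target, toy_draft, [1, 2], 20)` returns the same sequence as `target_only_greedy(toy_target, [1, 2], 20)`: the drafter only changes how many target passes are needed, not which tokens come out.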
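With the right acceptance rule this is mathematically equivalent to sampling from the target model alone: under greedy decoding the output is token-for-token identical, and under sampling the output distribution is unchanged. The procedure never produces an output the target could not have produced on its own.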
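For the sampling case, the equivalence can be checked exactly for a single position. The sketch below uses the acceptance rule described in the Leviathan et al. and Chen et al. papers (accept a drafted token `x` with probability `min(1, p(x)/q(x))`, otherwise resample from the normalized positive part of `p - q`); the function name and the example distributions are illustrative.

```python
from fractions import Fraction
from typing import List


def speculative_output_distribution(p: List[Fraction],
                                    q: List[Fraction]) -> List[Fraction]:
    """Exact distribution of the emitted token for one speculative step.

    p: target distribution over the vocabulary.
    q: draft distribution over the same vocabulary.
    """
    # Mass reaching token x via an accepted draft:
    # q(x) * min(1, p(x) / q(x)) = min(p(x), q(x))
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_prob = 1 - sum(accepted)
    # On rejection, resample from the positive part of (p - q), normalized.
    residual = [max(pi - qi, Fraction(0)) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0:
        # p and q are identical, so a draft is never rejected.
        return accepted
    return [a + reject_prob * r / total for a, r in zip(accepted, residual)]
```

With `p = [1/2, 3/10, 1/5]` and `q = [1/5, 1/5, 3/5]` (as `Fraction` values), the function returns exactly `p`: the accepted mass `min(p, q)` and the resampled mass `p - min(p, q)` add back up to the target distribution.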
## What makes it work or fail
**Works well when:**
- Memory-bandwidth-bound regime (single user, batch 1, consumer GPU / Apple Silicon)
- Predictable text — code, structured output, common phrases — where a small model agrees often with the large one
- Draft and target share a tokenizer (otherwise vocabulary translation eats the savings)
**Fails to help when:**
- Compute-bound regime (large batch on datacenter GPUs already amortizes memory transfers)
- Highly creative or long-tail outputs (drafter rarely agrees; speedup approaches 1×)
- Draft model is too slow itself (the time spent drafting each step exceeds the target time saved by the tokens that get accepted)
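These conditions can be folded into one rough estimate. The sketch below follows the expected-improvement formula from Leviathan et al. under two labeled assumptions: drafted tokens are accepted independently with probability `alpha`, and verifying all drafted positions costs the same as one ordinary target step (the memory-bound regime). The names are illustrative.

```python
def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimated wall-clock speedup over plain autoregressive decoding.

    alpha:      ASSUMED i.i.d. per-token acceptance probability.
    gamma:      tokens drafted per step.
    cost_ratio: time of one draft step divided by time of one target step.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0 or cost_ratio < 0:
        raise ValueError("gamma and cost_ratio must be non-negative")
    if alpha == 1.0:
        tokens = float(gamma + 1)
    else:
        tokens = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
    # Each step costs gamma draft steps plus one target pass,
    # measured in units of one target pass.
    step_cost = gamma * cost_ratio + 1.0
    return tokens / step_cost
```

A fast, well-matched drafter (`alpha = 0.8`, `gamma = 4`, `cost_ratio = 0.05`) gives an estimate of about 2.8×. A slow, poorly matched one (`alpha = 0.2`, `gamma = 4`, `cost_ratio = 0.3`) gives about 0.57×, meaning speculative decoding would make generation slower than running the target alone.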
## Variants
- **Vanilla speculative decoding** — paired draft/target from any compatible small/large model.
- **Self-speculative decoding** — using earlier layers of the same model as the draft head.
- **Medusa** — adding extra prediction heads to the target model itself.
- **Co-designed drafters** — small models trained jointly with the target, sharing its KV cache and activations. The 2026 frontier; see [[AI Multi-Token Prediction Drafters]].
## Generic vs co-designed drafters
The 2022–2025 era treated draft models as *opportunistic*: pair any small open-weight model with any big one, hope they agree. The 2026 turn — exemplified by [[Gemma 4]] — is *co-designed* drafters that share the target's KV cache, consume its activations, and use a sparse LM head matched to the target's vocabulary. The drafter becomes a parasitic head on the target rather than an independent model. Speedups go from 1.5–2× to 2.5–3× on the same hardware. See [[AI Multi-Token Prediction Drafters]] for the architectural details.
## Distinct from training-time MTP
Some models (DeepSeek-V3, others) include a multi-token prediction *training objective* — predicting multiple tokens during training, even if inference still emits one at a time. That's a different lever: it improves token-level loss and can be repurposed at inference, but it's not the same as a co-designed drafter shipped as a separate inference-time artifact. Don't conflate the two.
## References
- Original paper: Leviathan, Kalman, Matias — "Fast Inference from Transformers via Speculative Decoding" (2022) — <https://arxiv.org/abs/2211.17192>
- DeepMind parallel work: Chen et al. — "Accelerating Large Language Model Decoding with Speculative Sampling" (2023) — <https://arxiv.org/abs/2302.01318>
- Google on Gemma 4 drafters (2026-05-05): <https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/>
## Related
- [[AI Multi-Token Prediction Drafters]]
- [[Gemma 4]]
- [[AI Inference]]
- [[AI KV Cache]]
- [[Large Language Models (LLMs)]]
- [[Knowledge Distillation]]
- [[AI Open Weight Models]]
- [[Transformers]]
- [[vLLM]]
- [[SGLang]]
- [[MLX]]
- [[Ollama]]