AI Speculative Decoding - DeveloPassion

# AI Speculative Decoding

Speculative decoding is an inference-time technique for accelerating autoregressive [[Large Language Models (LLMs)|LLMs]] without changing their outputs. A small **draft model** predicts several tokens ahead, and the larger **target model** verifies all of them in a single forward pass. Tokens the target agrees with are accepted; the first disagreement is replaced with the target's own choice.

Net effect: one target forward pass can yield up to *N* + 1 tokens (the *N* accepted drafts plus the target's own next token) instead of one pass per token. When the draft model agrees often, generation runs 2–3× faster without quality loss, because the target's verification step is the correctness gate.

The technique was introduced in 2022 (Leviathan et al., "Fast Inference from Transformers via Speculative Decoding") and developed in parallel at DeepMind (Chen et al., 2023). It was mainstream by 2024 and, in 2026, was standardized as a release artifact for open-weight models with Google's [[Gemma 4]] [[AI Multi-Token Prediction Drafters|drafters]].

## The bottleneck

Modern LLM inference on consumer hardware is **memory-bandwidth bound**. Each generated token requires moving the model's parameters from memory to the compute units, only to produce one output. Compute is underused; the memory bus is the bottleneck. Verifying *N* drafted tokens in a single forward pass amortizes the parameter transfer across *N* outputs, the same idea as batching work over data that has already been loaded.

## The accept/reject loop

```
1. Draft N tokens autoregressively (cheap, small model)
2. Target model runs ONE forward pass over the prompt + drafted tokens
3. Walk left-to-right. Accept tokens until the target disagrees.
4. The first rejected token is replaced by the target's own choice.
5. Resume from there. Drafter starts fresh on the new context.
```

With greedy decoding, the result is token-for-token identical to running the target model alone. With sampling, "disagreement" is decided by a rejection-sampling rule: accept a drafted token x with probability min(1, p(x)/q(x)), where p and q are the target and draft probabilities, and otherwise resample from the normalized residual max(0, p − q). That rule keeps the output distribution mathematically equivalent to sampling from the target alone. Either way, the procedure never produces an output the target wouldn't have produced on its own.
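The accept/reject loop above can be sketched in a few lines of Python. This is a minimal illustration of the greedy variant only, not a production implementation: `draft_model` and `target_model` are hypothetical toy functions that map a context to a single next token, and the target is called once per position for readability, whereas a real system would obtain all of those predictions from one batched forward pass.

```python
from typing import Callable, List

Token = int
# Hypothetical stand-in for a model: maps a context to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def speculative_step(context: List[Token], draft: NextToken,
                     target: NextToken, n_draft: int) -> List[Token]:
    """Run one draft/verify round and return the newly accepted tokens."""
    # 1. Draft n_draft tokens autoregressively with the cheap model.
    drafted: List[Token] = []
    for _ in range(n_draft):
        drafted.append(draft(context + drafted))

    # 2. Target's greedy choice at every drafted position, plus one extra.
    #    (Assumption: a real target returns all of these in ONE forward pass.)
    target_preds = [target(context + drafted[:i]) for i in range(n_draft + 1)]

    # 3. Walk left to right, accepting drafts until the target disagrees.
    accepted: List[Token] = []
    for i, tok in enumerate(drafted):
        if tok != target_preds[i]:
            break
        accepted.append(tok)

    # 4. Append the target's own token: the correction at the first
    #    disagreement, or a bonus token if every draft was accepted.
    accepted.append(target_preds[len(accepted)])
    return accepted


def generate(prompt: List[Token], draft: NextToken, target: NextToken,
             max_new_tokens: int, n_draft: int = 4) -> List[Token]:
    """Repeat draft/verify rounds until max_new_tokens have been produced."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        # 5. Resume from the new context; the drafter starts fresh.
        out.extend(speculative_step(out, draft, target, n_draft))
    return out[:len(prompt) + max_new_tokens]


# Toy models (illustrative only): the target counts upward modulo 10,
# and the draft makes the same choice except right after a 4.
def target_model(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def draft_model(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


print(speculative_step([0], draft_model, target_model, 4))  # [1, 2, 3, 4, 5]
print(speculative_step([3], draft_model, target_model, 4))  # [4, 5]
```

In the first call every draft is accepted, so one round yields five tokens (four drafts plus the target's bonus token). In the second, the draft goes wrong after the first token, so the round yields only two tokens, but the output still matches what the target alone would have produced.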
## What makes it work or fail

**Works well when:**

- Memory-bandwidth-bound regime (single user, batch 1, consumer GPU / Apple Silicon)
- Predictable text (code, structured output, common phrases) where a small model agrees often with the large one
- Draft and target share a tokenizer (otherwise vocabulary translation eats the savings)

**Fails to help when:**

- Compute-bound regime (large batch on datacenter GPUs already amortizes memory transfers)
- Highly creative or long-tail outputs (drafter rarely agrees; speedup approaches 1×)
- Draft model is too slow itself (drafting cost per step outweighs the time saved by the accepted tokens)

## Variants

- **Vanilla speculative decoding**: a paired draft/target built from any compatible small/large model.
- **Self-speculative decoding**: using earlier layers of the same model as the draft head.
- **Medusa**: adding extra prediction heads to the target model itself.
- **Co-designed drafters**: small models trained jointly with the target, sharing its KV cache and activations. The 2026 frontier; see [[AI Multi-Token Prediction Drafters]].

## Generic vs co-designed drafters

The 2022–2025 era treated draft models as *opportunistic*: pair any small open-weight model with any big one and hope they agree. The 2026 turn, exemplified by [[Gemma 4]], is toward *co-designed* drafters that share the target's KV cache, consume its activations, and use a sparse LM head matched to the target's vocabulary. The drafter becomes a parasitic head on the target rather than an independent model. Speedups go from 1.5–2× to 2.5–3× on the same hardware. See [[AI Multi-Token Prediction Drafters]] for the architectural details.

## Distinct from training-time MTP

Some models (DeepSeek-V3, others) include a multi-token prediction (MTP) *training objective*: predicting multiple tokens during training, even if inference still emits one at a time. That's a different lever: it improves token-level loss and can be repurposed at inference, but it's not the same as a co-designed drafter shipped as a separate inference-time artifact. Don't conflate the two.

## References

- Original paper: Leviathan, Kalman, Matias, "Fast Inference from Transformers via Speculative Decoding" (2022), <https://arxiv.org/abs/2211.17192>
- DeepMind parallel work: Chen et al., "Accelerating Large Language Model Decoding with Speculative Sampling" (2023), <https://arxiv.org/abs/2302.01318>
- Google on Gemma 4 drafters (2026-05-05): <https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/>

## Related

- [[AI Multi-Token Prediction Drafters]]
- [[Gemma 4]]
- [[AI Inference]]
- [[AI KV Cache]]
- [[Large Language Models (LLMs)]]
- [[Knowledge Distillation]]
- [[AI Open Weight Models]]
- [[Transformers]]
- [[vLLM]]
- [[SGLang]]
- [[MLX]]
- [[Ollama]]