AI Multi-Token Prediction Drafters

# Multi-token prediction drafters

A **multi-token prediction (MTP) drafter** is a small, fast companion model paired with a larger "target" [[Large Language Models (LLMs)|LLM]] to accelerate inference. The drafter speculatively predicts several tokens ahead; the target verifies them in parallel; accepted tokens are emitted without paying the per-token decoding cost the target would otherwise incur.

It is the architectural shape of [[Speculative Decoding]] taken seriously: instead of bolting any small model onto any big model, the drafter is **co-designed** with the target, sharing its KV cache, consuming its activations, and using a compressed embedder. The result is a 2–3× decoding speedup when the drafter frequently agrees with the target; quality loss is negligible because the target verifies every drafted token.

## The bottleneck it attacks

Standard autoregressive LLM inference is **memory-bandwidth bound** on consumer hardware. Each generated token requires moving the model's parameters from memory to the compute units, then producing exactly one token of output. The compute is underused; the bus is the bottleneck.

A target model verifying *N* drafted tokens in a single forward pass amortizes that memory transfer across *N* outputs. If the drafter is fast and frequently agrees with the target, you get latency close to that of a small model with the quality of the large one.

## How a drafter is co-designed

Three architectural moves separate a co-designed drafter from a generic small model used for speculative decoding:

- **Target activation sharing.** The drafter consumes the *final-layer activations* of the target model on round 1, concatenated with its own embeddings, instead of reprocessing the prompt from scratch. The expensive prompt encoding the target already paid for is reused.
- **KV cache sharing.** The drafter cross-attends to the target's KV cache rather than building its own. KV cache memory is one of the largest costs in modern inference; sharing it is the only way to keep the drafter small without losing context.
- **Efficient embedder / sparse LM head.** Modern vocabularies are huge (Gemma 4 has 262K tokens). A naive LM head computes logits over the full vocabulary every step. A clustered sparse head identifies the most likely token cluster first, then computes logits only inside it: a classic two-stage retrieval applied to decoding.

These choices turn the drafter from a separate model into a *parasitic* head on the target: tiny, KV-cache-sharing, and trained jointly to mirror the target's distribution where it matters.

## The accept/reject loop

For each step:

1. Drafter predicts *N* future tokens autoregressively.
2. Target model verifies all *N* in a single forward pass.
3. Walk the predictions left-to-right. Accept tokens until the target disagrees.
4. The *first* rejected token is replaced with the target's own choice for that position.
5. Resume from there. Drafter starts fresh on the new context.

If the drafter is good, *N* tokens land per target forward pass. If the drafter is bad, you waste that round's drafting work and get only the target's single corrected token, but you never produce wrong output: the target's verification step is the correctness gate.

## Why this matters in 2026

Until 2026, speculative decoding was largely a "use whatever small model you have" technique: bolt-on, opportunistic. Two things changed:

- **Long contexts pushed KV cache to dominate inference cost.** Sharing the cache went from "nice to have" to "the only way to make this work."
- **Frontier labs started shipping drafters as a release artifact alongside the main model.** [[Gemma 4]] was the first major open-weight family to do this at scale (May 2026), with drafters published on Hugging Face and Kaggle and supported across [[Transformers]], [[MLX]], [[vLLM]], [[SGLang]], and [[Ollama]] from day one. Reported speedups: up to 3× without quality degradation; ~2.2× on Apple Silicon for MoE variants with batch sizes 4–8.

The pattern is likely to spread.
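
The accept/reject loop described above can be sketched in a few lines. This is a minimal illustration, not any library's implementation, and it rests on several assumptions: a "model" is a hypothetical callable that maps a token context to its greedy next token, verification is an exact token match (real systems may use probabilistic acceptance), and the target's single parallel forward pass is simulated with a Python loop and counted as one pass. The names `speculative_decode`, `greedy_decode`, and the toy models are invented for this example.

```python
from typing import Callable, List, Tuple

Token = int
# Hypothetical interface: a model maps a token context to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def greedy_decode(target: NextToken, prompt: List[Token], n: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(target(out))
    return out[len(prompt):]


def speculative_decode(
    target: NextToken,
    drafter: NextToken,
    prompt: List[Token],
    max_new_tokens: int,
    n_draft: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy draft-and-verify loop. Returns (new tokens, target passes)."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < max_new_tokens:
        # Step 1: the drafter proposes n_draft tokens autoregressively.
        draft: List[Token] = []
        for _ in range(n_draft):
            draft.append(drafter(out + draft))

        # Step 2: the target scores every drafted position. A real system does
        # this in one batched forward pass; here it is a loop counted as one pass.
        target_passes += 1
        verified = [target(out + draft[:i]) for i in range(n_draft)]

        # Steps 3-4: accept left to right; at the first mismatch, keep the
        # target's token for that position and discard the rest of the draft.
        for proposed, expected in zip(draft, verified):
            out.append(expected)
            if proposed != expected:
                break
        # Step 5: the next round drafts from the updated context.

    # The last round may overshoot the budget, so trim to max_new_tokens.
    return out[len(prompt):len(prompt) + max_new_tokens], target_passes


# Toy models for illustration only: the "target" counts upward modulo 10.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def agreeing_drafter(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10  # always matches the toy target


def disagreeing_drafter(ctx: List[Token]) -> Token:
    return (ctx[-1] + 2) % 10  # never matches the toy target
```

With these toy models, 8 new tokens and `n_draft=4`, the agreeing drafter needs 2 target passes and the disagreeing drafter needs 8. Both produce exactly the tokens that plain greedy decoding with the target would, which is the correctness-gate property in miniature.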
If you ship a model and don't ship a drafter, your inference latency story is incomplete on consumer hardware.

## Where drafters help most

- **Memory-bandwidth-bound regimes.** Single-user, batch size 1, consumer GPU or Apple Silicon. The exact regime where most local-LLM users live.
- **Predictable text.** Common patterns, code, structured output, repeated phrases: places where a small model is likely to agree with the target.

## Where they help less

- **Compute-bound regimes.** Large batch sizes on datacenter GPUs already saturate the compute path; the memory transfer is amortized across users, not tokens. Drafters don't move the needle as much.
- **Highly creative or long-tail outputs.** When the drafter rarely agrees with the target, you pay drafter cost on every step without harvesting many accepted tokens. Net speedup approaches 1×.

## Distinct from non-drafter MTP

Some recent models (DeepSeek-V3, others) include an MTP *training objective*: the model is trained to predict multiple tokens at once during training, even if inference still emits one at a time. That's a different lever: it improves token-level loss and can be repurposed at inference, but it's not the same architectural artifact as a co-designed drafter that shares the target's KV cache. Don't conflate them.

## References

- Google, "Multi-Token Prediction in Gemma 4": <https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/>
- Drafter explainer (Google Gemma on X, 2026-05-05): <https://x.com/googlegemma/status/2051694045869879749>
- Original speculative decoding paper (Leviathan et al., 2023): <https://arxiv.org/abs/2211.17192>

## Related

- [[Speculative Decoding]]
- [[Gemma 4]]
- [[AI Inference]]
- [[Large Language Models (LLMs)]]
- [[AI Open Weight Models]]
- [[Transformers]]
- [[MLX]]
- [[vLLM]]
- [[SGLang]]
- [[Ollama]]