
# DeepSeek v4

Fourth-generation flagship release from [[Deepseek]] (April 24, 2026). Two open-weight variants, V4-Pro (1.6T total / 49B active parameters) and V4-Flash (284B total / 13B active), both built on a [[AI Mixture of Experts (MoE)|MoE]] architecture, ship with a 1M-token [[Context Window]] by default and fold what was the separate R reasoning line into a single model with switchable Thinking / Non-Thinking modes. V4-Pro is the largest [[AI Open Weight Models|open weights]] model released to date.

## What's actually new

- **DeepSeek Sparse Attention (DSA) + token-wise compression.** The headline architectural innovation; a content-based variant of [[AI Sparse Attention]]. V4-Pro uses ~27% of the single-token FLOPs and ~10% of the [[AI KV Cache|KV cache]] size of [[DeepSeek V3|DeepSeek V3.2]] at the same context length. Against vanilla full attention the gap is far larger (early reader notes on the paper estimate ~1% of native attention FLOPs and KV size, with throughput improvements on the order of ~50× still to be independently validated). This is an efficiency-first release, not a scale-first one.
- **KV cache footprint that fits on commodity hardware.** A full 1M-token context fits in roughly 5.7 GB of KV cache at FP8. For comparison, a Llama-3-405B-class native-attention model would need on the order of ~500 GB to hold the same context. That is what makes 1M-token inference economically real, not just paper-feasible; practitioners report running V4-Flash fully in GPU RAM at 1M context on setups that previously had to spill V3.2 into system memory at 256k.
- **Reasoning is no longer a separate model.** The R series is folded into V4 (see [[AI Reasoning Models]]). Both Pro and Flash expose a `reasoning_effort`-style toggle.
- **Bitwise batch-invariant, deterministic kernels.** Same input → same output across batch sizes. Most frontier labs trade reproducibility for throughput; DeepSeek deliberately doesn't.
- **API surface compatibility.** Native support for both the OpenAI ChatCompletions and [[Anthropic]] API formats out of the box, lowering migration friction.

## Pricing (per million tokens, input / output)

| Model | Input | Output |
|---|---|---|
| DeepSeek V4-Flash | $0.14 | $0.28 |
| DeepSeek V4-Pro | $1.74 | $3.48 |
| Claude Opus (ref.) | $5 | $25 |
| GPT-5.5 (ref.) | $5 | $30 |

V4-Pro is the cheapest of the larger frontier models by a wide margin; V4-Flash undercuts even OpenAI's cheapest tier. DeepSeek has signalled further reductions once Huawei Ascend deployment lands in mid-2026.

## Performance positioning

V4-Pro rivals top closed-source frontier models and beats all current open models on Math / STEM / Coding benchmarks while preserving stronger world knowledge than other open releases. Independent assessments ([[Simon Willison]], HN practitioners, PicoCreator's reading notes on the paper) consistently place it "between Sonnet and Opus" in feel; ~3–6 months behind absolute SOTA, close enough that the price gap dominates the decision in most agentic / batch workloads. V4-Flash's reasoning capability is reported to closely approach Pro for a fraction of the cost.

**Token-economy caveat.** The headline per-token price is the wrong number on its own. On the Artificial Analysis intelligence index, V4-Pro spends ~190M tokens to complete the suite (and [[Kimi K2.6]] ~170M) versus ~45M for GPT-5.5 (high). The 5–15× per-token advantage shrinks (but does not disappear) once you account for verbosity on hard reasoning tasks; the cheaper-per-token model can occasionally cost roughly the same in dollars on the worst cases. The current discount on the official DeepSeek API also makes early comparisons rosier than the steady-state pricing will be; the open-weights release means alternative hosts ([[OpenRouter]], Fireworks, etc.) can fill the gap when official capacity is throttled.

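
The token-economy caveat is plain arithmetic, and a small sketch makes the shrinkage concrete. It uses the output prices from the pricing table and the approximate token counts quoted for the Artificial Analysis suite. One simplifying assumption is not from the source: every token in the run is billed at the output rate. Real bills mix input, cached, and output tokens, so treat the result as illustrative only. The function and dictionary names are made up for this example.

```python
# Illustrative only: assumes every suite token is billed at the output rate.

# Output prices in USD per 1M tokens, taken from the pricing table.
OUTPUT_PRICE_PER_M = {"deepseek-v4-pro": 3.48, "gpt-5.5": 30.00}

# Approximate tokens (in millions) spent on the Artificial Analysis suite.
SUITE_TOKENS_M = {"deepseek-v4-pro": 190, "gpt-5.5": 45}


def suite_cost_usd(model: str) -> float:
    """Dollar cost of the whole suite under the all-output-tokens assumption."""
    return SUITE_TOKENS_M[model] * OUTPUT_PRICE_PER_M[model]


def per_token_advantage(cheap: str, pricey: str) -> float:
    """How many times cheaper one output token is on the cheaper model."""
    return OUTPUT_PRICE_PER_M[pricey] / OUTPUT_PRICE_PER_M[cheap]


def effective_advantage(cheap: str, pricey: str) -> float:
    """How many times cheaper the full suite is once verbosity is included."""
    return suite_cost_usd(pricey) / suite_cost_usd(cheap)


if __name__ == "__main__":
    cheap, pricey = "deepseek-v4-pro", "gpt-5.5"
    print(f"per-token advantage: {per_token_advantage(cheap, pricey):.1f}x")
    print(f"suite-level advantage: {effective_advantage(cheap, pricey):.1f}x")
```

Under that assumption the per-token advantage of roughly 8.6× drops to roughly 2× for the whole suite, which is the "shrinks but does not disappear" effect described above.
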
## Why this matters

DeepSeek v4 is the clearest signal yet that the frontier is bifurcating along a cost / quality plane rather than a single capability axis. A 6-month-behind, 5-to-15× cheaper open model is the right tool for almost everything that isn't the absolute hardest reasoning step. The DSA + KV-cache reduction also makes ultra-long-context inference economically realistic, not just technically possible: the [[AI Inference]] cost curve just shifted.

Early practitioner reports back this up. A non-trivial TypeScript codebase audit (multi-file traversal, type analysis, refactor proposal across two prompts) ran end-to-end on V4-Pro for $0.09; the same task is reported to have cost on the order of $9–$13 on Claude Opus before recent price hikes. A full day of refactor work (many subagents, thousands of changed lines) totalled under $1. The cost ratio collapses on the workloads where verbosity bites (see token caveat above), but on the long tail of "good enough" engineering work it is roughly two orders of magnitude.

The real constraint, on day one, is operational: V4-Pro is hit hard with timeouts and rate limits at launch (including via [[OpenRouter]] at peak hours), so V4-Flash, or a third-party host, is the more reliable choice for iterative agent loops until capacity catches up.

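
One way to live with the launch-day timeouts and rate limits in an agent loop is an ordered fallback: try the preferred model, back off briefly on a transient failure, then move on to the next option. The sketch below is a minimal, hypothetical illustration of that pattern. `TransientHostError`, `call_with_fallback`, and the two stub backends are invented for the example and do not correspond to any real client library; in practice each backend would wrap whichever API client you already use.

```python
import time
from typing import Callable, Optional, Sequence, Tuple


class TransientHostError(Exception):
    """Hypothetical stand-in for a timeout or rate-limit response from a host."""


# A backend is a label plus any callable that takes a prompt and returns text.
Backend = Tuple[str, Callable[[str], str]]


def call_with_fallback(
    prompt: str,
    backends: Sequence[Backend],
    attempts_per_backend: int = 2,
    base_delay_s: float = 0.5,
) -> Tuple[str, str]:
    """Try each backend in order; return (backend_name, reply) from the first success."""
    last_error: Optional[Exception] = None
    for name, send in backends:
        for attempt in range(attempts_per_backend):
            try:
                return name, send(prompt)
            except TransientHostError as exc:
                last_error = exc
                # Exponential backoff after every transient failure,
                # including the last one before moving to the next backend.
                time.sleep(base_delay_s * (2 ** attempt))
    raise RuntimeError("all backends failed") from last_error


# Stub backends so the sketch runs without any network access.
def overloaded_pro(prompt: str) -> str:
    raise TransientHostError("rate limited")


def healthy_flash(prompt: str) -> str:
    return f"flash answer to: {prompt}"


if __name__ == "__main__":
    used, reply = call_with_fallback(
        "summarise this diff",
        [("v4-pro", overloaded_pro), ("v4-flash", healthy_flash)],
        base_delay_s=0.0,
    )
    print(used, reply)
```

Keeping the backends as plain callables is a deliberate choice here: the ordering (Pro first, Flash or a third-party host second) stays a one-line configuration decision that can be reversed once capacity catches up.
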
## References

- Official announcement: https://api-docs.deepseek.com/news/news260424
- Announcement post (X): https://x.com/deepseek_ai/status/2047516922263285776
- Model collection: https://huggingface.co/collections/deepseek-ai/deepseek-v4
- Technical report: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf
- Simon Willison's writeup: https://simonwillison.net/2026/Apr/24/deepseek-v4/
- PicoCreator's raw reading notes on the V4 paper (X): https://x.com/picocreator/status/2047625988125954386
- Hacker News, launch-day discussions: https://news.ycombinator.com/item?id=47884971 and https://news.ycombinator.com/item?id=47885014
- Hacker News, V4 in practice (cost, token economy, local deployment): https://news.ycombinator.com/item?id=47977026
- Artificial Analysis pages: https://artificialanalysis.ai/models/deepseek-v4-pro and https://artificialanalysis.ai/models/deepseek-v4-flash

## Related

- [[Deepseek]]
- [[Large Language Models (LLMs)]]
- [[AI Mixture of Experts (MoE)]]
- [[AI Open Weight Models]]
- [[AI KV Cache]]
- [[AI Inference]]
- [[Context Window]]
- [[Sparse AI Models]]
- [[Dense AI Models]]
- [[Chain-of-Thought (CoT) prompting]]
- [[HuggingFace]]
- [[Claude]]
- [[ChatGPT]]
- [[Anthropic]]
- [[OpenAI]]
- [[Mistral Small 4]]
- [[Kimi K2.6]]
- [[GPT-5]]
- [[OpenRouter]]
- [[OpenCode]]
- [[Simon Willison]]