
# DeepSeek V3

Third-generation flagship from [[Deepseek]], released December 2024 (V3) and incrementally upgraded through 2025 (V3.1 → V3.2). This is the release that put DeepSeek on the global frontier-model map: a 671B-total / 37B-active [[AI Mixture of Experts (MoE)|MoE]] [[AI Open Weight Models|open-weights]] model that matched or beat the leading closed Western models on most benchmarks at a tiny fraction of the reported training cost. It is the direct predecessor of, and the efficiency baseline for, [[DeepSeek v4]].

## Architecture

- MoE: 671B total parameters, 37B active per token (256 routed experts with 8 active per token, plus 1 shared expert).
- Multi-head Latent Attention (MLA): a low-rank compression of the attention keys and values that cuts KV cache size dramatically compared with vanilla multi-head attention.
- Auxiliary-loss-free load balancing: drops the auxiliary balancing loss used in prior MoE designs without degrading expert utilisation.
- Multi-Token Prediction (MTP) training objective: predicts multiple future tokens per step for better sample efficiency.
- 128K [[Context Window]].

## Why it mattered

- **Training cost shock.** DeepSeek reported training V3 in roughly 2.788M H800 GPU-hours, far below what Western labs were reported to be spending on comparable models. Whether or not the headline number is fully comparable, it forced a market-wide rethink of what frontier-model training actually has to cost.
- **Open-weights at the frontier.** The first time an open-weights model was credibly competitive with GPT-4-class closed models on broad benchmarks rather than narrow ones.
- **R1 lineage.** V3 became the base for the R1 reasoning line, which later folded back into [[DeepSeek v4]]'s unified Thinking / Non-Thinking modes (see [[AI Reasoning Models]]).
- **Architectural toolkit.** MLA, auxiliary-loss-free balancing, and MTP became standard reference techniques for subsequent open MoE releases.
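
The headline figures in this note (active-parameter fraction, GPU-hours, MLA's KV cache saving) can be sanity-checked with simple arithmetic. The sketch below is illustrative only: the parameter counts and GPU-hours come from this note, while the GPU-hour price and the attention dimensions (`n_heads`, `head_dim`, `latent_dim`, `rope_dim`) are assumptions chosen for illustration and are not stated here. The function names are hypothetical.

```python
from dataclasses import dataclass

# Figures taken from this note.
TOTAL_PARAMS_B = 671    # total parameters, in billions
ACTIVE_PARAMS_B = 37    # parameters active per token, in billions
GPU_HOURS = 2.788e6     # reported H800 GPU-hours

# ASSUMPTION: rental price per H800 GPU-hour in USD (not stated in this note).
ASSUMED_USD_PER_GPU_HOUR = 2.0


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the model's parameters that run for a single token."""
    return active_b / total_b


def training_cost_usd(gpu_hours: float, usd_per_gpu_hour: float) -> float:
    """Compute-only cost estimate: GPU-hours times an hourly rate."""
    return gpu_hours * usd_per_gpu_hour


@dataclass
class AttentionShape:
    # ASSUMPTION: illustrative dimensions, not taken from this note.
    n_heads: int = 128      # attention heads
    head_dim: int = 128     # per-head key/value width
    latent_dim: int = 512   # width of the compressed KV latent in MLA
    rope_dim: int = 64      # extra positional key cached alongside the latent


def mha_cache_elems(shape: AttentionShape) -> int:
    """Cached elements per token per layer with vanilla multi-head attention:
    one key and one value vector for every head."""
    return 2 * shape.n_heads * shape.head_dim


def mla_cache_elems(shape: AttentionShape) -> int:
    """Cached elements per token per layer with a latent-compressed cache:
    one shared latent vector plus one shared positional key."""
    return shape.latent_dim + shape.rope_dim


if __name__ == "__main__":
    shape = AttentionShape()
    print(f"active fraction: {active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B):.1%}")
    print(f"compute cost: ${training_cost_usd(GPU_HOURS, ASSUMED_USD_PER_GPU_HOUR):,.0f}")
    print(f"KV cache ratio (MHA / MLA): {mha_cache_elems(shape) / mla_cache_elems(shape):.1f}x")
```

Under these assumptions, roughly 5.5% of the parameters are active per token, and the latent cache is about 57 times smaller per token per layer than a full multi-head cache. The cost figure covers GPU rental for the reported hours only; it scales linearly with whatever hourly rate is assumed.
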
## V3.1 / V3.2

Iterative improvements through 2025 added longer-context behaviour, better tool use, and the first form of [[AI Sparse Attention]] in the family (DSA's predecessor), which V3.2 used to bring inference cost down. V3.2 is the specific reference point against which DeepSeek measures V4's efficiency claims: V4-Pro at ~27% of the single-token FLOPs and ~10% of the [[AI KV Cache|KV cache]] size of V3.2.

## License and availability

- Open weights. The code repository is MIT-licensed; the original V3 weights were released under DeepSeek's own Model License, which permits commercial use. Training code was not released; weights and inference code are open.
- API on platform.deepseek.com.
- Distributed via [[HuggingFace]].

## References

- https://github.com/deepseek-ai/DeepSeek-V3
- https://huggingface.co/deepseek-ai/DeepSeek-V3
- Technical report: https://arxiv.org/abs/2412.19437

## Related

- [[Deepseek]]
- [[DeepSeek v4]]
- [[Large Language Models (LLMs)]]
- [[AI Mixture of Experts (MoE)]]
- [[AI Open Weight Models]]
- [[AI KV Cache]]
- [[AI Sparse Attention]]
- [[AI Reasoning Models]]
- [[Context Window]]
- [[HuggingFace]]