# DeepSeek V3
Third-generation flagship from [[Deepseek]], released in December 2024 (V3) and incrementally upgraded through 2025 (V3.1 → V3.2). It is the release that put DeepSeek on the global frontier-model map: a 671B-total / 37B-active [[AI Mixture of Experts (MoE)|MoE]] [[AI Open Weight Models|open-weights]] model that, per DeepSeek's technical report, performed on par with leading closed models such as GPT-4o and Claude 3.5 Sonnet across many standard benchmarks, at a small fraction of the training cost usually attributed to such models. It is the direct predecessor of, and the efficiency baseline for, [[DeepSeek v4]].
## Architecture
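- MoE: 671B total parameters, 37B active per token. Each MoE layer has 256 routed experts plus 1 shared expert; every token passes through the shared expert and its top 8 routed experts.
- Multi-head Latent Attention (MLA): a low-rank joint compression of attention keys and values. The KV cache stores one small latent vector per token instead of full per-head keys and values, which cuts cache size sharply versus vanilla multi-head attention.
- Auxiliary-loss-free load balancing: instead of relying on the auxiliary balancing loss used in prior MoE designs, a per-expert bias on the routing scores is adjusted during training to keep expert load even. The technical report keeps only a very small sequence-wise balance loss as a safeguard.
- Multi-Token Prediction (MTP) training objective: predicts additional future tokens at each position rather than only the next one, which densifies the training signal. The MTP module can also be reused for speculative decoding at inference.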
- 128K [[Context Window]].
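
The routing and load-balancing bullets above can be sketched in a few lines. This is a minimal illustration, not DeepSeek's implementation: the function names `route` and `update_bias` and the `step` size are hypothetical, and the sketch leaves out the shared expert, node-limited routing, and the small sequence-wise loss. The key idea it shows is that the bias only influences which experts are selected, while the gate weights still come from the unbiased scores.

```python
def route(scores, bias, k=8):
    """Select k experts by (score + bias); weight them by the unbiased scores.

    scores: per-expert affinity for one token (e.g. sigmoid outputs in (0, 1))
    bias:   per-expert balancing bias, used for selection only
    """
    ranked = sorted(range(len(scores)),
                    key=lambda i: scores[i] + bias[i],
                    reverse=True)
    chosen = ranked[:k]
    total = sum(scores[i] for i in chosen)
    # Normalise the original scores over the chosen experts to get gate weights.
    return {i: scores[i] / total for i in chosen}


def update_bias(bias, load, step=0.001):
    """After a training step, nudge each expert's bias toward balanced load.

    load: number of tokens routed to each expert during the step.
    Overloaded experts become less likely to be picked, underloaded ones more.
    """
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - step)
        elif n < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias
```

For V3's shape, `scores` and `bias` would have 256 entries and `k` would be 8. Because balancing acts through the selection bias rather than through a loss term, it adds no gradient that competes with the language-modelling objective.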
## Why it mattered
- **Training cost shock.** DeepSeek reported training V3 in roughly 2.788M H800 GPU-hours, which the technical report prices at about $5.6M assuming $2 per GPU-hour, far below the budgets commonly attributed to Western frontier labs. The figure covers only the final training run, not prior research, ablations, or hardware, so it is not fully comparable; even so, it forced a market-wide rethink of what frontier-model training actually has to cost.
- **Open weights at the frontier.** V3 was widely regarded as the first open-weights model credibly competitive with GPT-4-class closed models across broad benchmarks rather than narrow ones.
- **R1 lineage.** V3 became the base for the R1 reasoning line, which later folded back into [[DeepSeek v4]]'s unified Thinking / Non-Thinking modes (see [[AI Reasoning Models]]).
- **Architectural toolkit.** MLA, auxiliary-loss-free balancing, and MTP became standard reference techniques for subsequent open MoE releases.
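
The headline cost figure is simple arithmetic on the reported GPU-hours. The sketch below reproduces it using the stage breakdown and the $2 per GPU-hour rental rate stated in the technical report; the variable and function names are illustrative only, and the result says nothing about research, ablation, or hardware costs.

```python
# H800 GPU-hours per stage, as reported in the DeepSeek-V3 technical report.
GPU_HOURS = {
    "pre_training": 2_664_000,
    "context_extension": 119_000,
    "post_training": 5_000,
}
PRICE_PER_GPU_HOUR_USD = 2.0  # rental rate assumed by the report, not a measured cost


def training_cost_usd(gpu_hours, price_per_hour):
    """Cost of a run priced purely as rented GPU time."""
    return gpu_hours * price_per_hour


total_hours = sum(GPU_HOURS.values())
cost = training_cost_usd(total_hours, PRICE_PER_GPU_HOUR_USD)
active_share = 37 / 671  # active vs total parameters, in billions

print(f"total GPU-hours: {total_hours / 1e6:.3f}M")   # 2.788M
print(f"estimated cost:  ${cost / 1e6:.3f}M")         # $5.576M
print(f"active params:   {active_share:.1%}")         # 5.5%
```

Changing `PRICE_PER_GPU_HOUR_USD` shows how sensitive the headline number is to the assumed rental rate.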
## V3.1 / V3.2
Iterative improvements through 2025 added longer-context behaviour, better tool use, and the first form of [[AI Sparse Attention]] in the family (DSA's predecessor), which V3.2 used to bring inference cost down. V3.2 is the specific reference point against which DeepSeek measures V4's efficiency claims: V4-Pro is stated to need ~27% of the single-token FLOPs and ~10% of the [[AI KV Cache|KV cache]] size of V3.2.
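
To give the ~10% KV-cache figure a rough scale, the sketch below sizes an MLA cache from V3's published configuration values (61 layers, a 512-wide compressed KV latent plus a 64-wide decoupled RoPE key, 128 attention heads) and applies the quoted ratio. Several things here are assumptions: using V3's dimensions as a stand-in for V3.2, bf16 storage, reading 128K as 131,072 tokens, and the vanilla multi-head baseline with 128-dimensional heads, which is hypothetical and included only for comparison. Any extra state kept by the sparse-attention mechanism is ignored.

```python
# Values from DeepSeek-V3's published model configuration.
N_LAYERS = 61
KV_LATENT_DIM = 512    # compressed KV latent cached per token per layer
ROPE_KEY_DIM = 64      # decoupled RoPE key cached alongside the latent
N_HEADS = 128

# Assumptions made for this sketch only.
BYTES_PER_ELEM = 2             # bf16 storage
CONTEXT_TOKENS = 128 * 1024    # "128K" read as 131,072 tokens
MHA_HEAD_DIM = 128             # hypothetical vanilla-MHA baseline
QUOTED_V4_RATIO = 0.10         # the ~10% figure quoted above


def cache_gib(elems_per_token_layer, n_layers, n_tokens, bytes_per_elem):
    """Total KV cache size in GiB for one sequence."""
    return elems_per_token_layer * n_layers * n_tokens * bytes_per_elem / 2**30


mla_elems = KV_LATENT_DIM + ROPE_KEY_DIM    # 576 per token per layer
mha_elems = 2 * N_HEADS * MHA_HEAD_DIM      # keys + values for every head

mla_gib = cache_gib(mla_elems, N_LAYERS, CONTEXT_TOKENS, BYTES_PER_ELEM)
mha_gib = cache_gib(mha_elems, N_LAYERS, CONTEXT_TOKENS, BYTES_PER_ELEM)

print(f"MLA cache at full context:  {mla_gib:.2f} GiB")            # 8.58 GiB
print(f"vanilla MHA baseline:       {mha_gib:.0f} GiB")            # 488 GiB
print(f"MLA reduction vs baseline:  {mha_elems / mla_elems:.1f}x")  # 56.9x
print(f"10% of the MLA figure:      {mla_gib * QUOTED_V4_RATIO:.2f} GiB")
```

The numbers are per sequence, so a serving system holding many long sequences at once multiplies them accordingly.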
## License and availability
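- Open weights. The code repository is MIT-licensed; the original V3 weights were released under the separate DeepSeek Model License, which permits commercial use, and later checkpoints (from V3-0324) moved the weights to MIT. Training code was not released.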
- API on platform.deepseek.com.
- Distributed via [[HuggingFace]].
## References
- https://github.com/deepseek-ai/DeepSeek-V3
- https://huggingface.co/deepseek-ai/DeepSeek-V3
- Technical report: https://arxiv.org/abs/2412.19437
## Related
- [[Deepseek]]
- [[DeepSeek v4]]
- [[Large Language Models (LLMs)]]
- [[AI Mixture of Experts (MoE)]]
- [[AI Open Weight Models]]
- [[AI KV Cache]]
- [[AI Sparse Attention]]
- [[AI Reasoning Models]]
- [[Context Window]]
- [[HuggingFace]]