# Dense AI Models
Dense models activate all parameters for every input token. Every layer and every weight participates in each forward pass. This is the standard architecture for most [[Neural Networks (NNs)]] and [[Large Language Models (LLMs)]].
Advantages: dense models are simpler to train, easier to optimize and quantize, and behave more predictably than sparse models. They have no routing overhead and no expert load-balancing challenges.
Disadvantage: per-token compute scales linearly with total parameter count. A 70B dense model uses all 70B parameters for every token, which makes inference expensive at large scales. This is the fundamental trade-off that [[Sparse AI Models]] (e.g., [[AI Mixture of Experts (MoE)]]) address by activating only a subset of parameters per token.
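The trade-off can be made concrete with a rough estimate. The sketch below assumes the common rule of thumb of about 2 FLOPs (one multiply, one add) per active parameter per token in the forward pass, and it ignores attention cost over the context. The function names and the sparse configuration (6B shared weights, 8 experts of 8B each, 2 experts routed per token) are hypothetical and chosen only so that both models have the same total size.

```python
def forward_flops_per_token(active_params: float) -> float:
    """Rule-of-thumb forward cost: one multiply and one add per active weight."""
    return 2.0 * active_params


def moe_active_params(shared_params: float, params_per_expert: float,
                      experts_per_token: int) -> float:
    """Parameters touched per token in a hypothetical MoE:
    the shared weights plus only the experts the router selects."""
    return shared_params + params_per_expert * experts_per_token


# Dense: all 70B parameters are active for every token.
dense_total = 70e9
dense_flops = forward_flops_per_token(dense_total)

# Hypothetical sparse model (illustrative numbers, not a real model).
shared, per_expert, num_experts, top_k = 6e9, 8e9, 8, 2
moe_total = shared + per_expert * num_experts
moe_active = moe_active_params(shared, per_expert, top_k)
moe_flops = forward_flops_per_token(moe_active)

print(f"dense: {dense_total / 1e9:.0f}B total, "
      f"{dense_flops / 1e9:.0f} GFLOPs/token")
print(f"MoE:   {moe_total / 1e9:.0f}B total, {moe_active / 1e9:.0f}B active, "
      f"{moe_flops / 1e9:.0f} GFLOPs/token")
```

Under these assumptions both models store 70B parameters, but the dense model spends roughly 140 GFLOPs per token while the sparse one spends roughly 44, because only 22B of its parameters are active for any given token.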
## Examples
| Model | Parameters | Type |
|-------|-----------|------|
| [[GPT4\|GPT-4]] (architecture undisclosed; unconfirmed reports describe an MoE rather than a dense model) | ~1.8T (rumored) | Proprietary |
| [[Claude]] (Anthropic) | Undisclosed | Proprietary |
| [[Gemma 4]] 31B | 30.7B | Open-weight |
| [[Gemma]] 3 27B | 27B | Open-weight |
| Llama 3.1 405B | 405B | Open-weight |
| [[Gemma 4]] E2B/E4B | 5.1B / 8B | Open-weight (PLE) |
| [[Granite 4.1]] 3B / 8B / 30B | 3B / 8B / 30B | Open-weight (IBM, [[Apache 2.0 License\|Apache 2.0]]) |
## References
- https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)
## Related
- [[Sparse AI Models]]
- [[AI Mixture of Experts (MoE)]]
- [[Large Language Models (LLMs)]]
- [[Transformers]]
- [[Deep Learning]]
- [[AI Scaling Laws]]