# Dense AI Models

Dense models activate all parameters for every input token: every layer and every weight participates in each forward pass. This is the standard architecture for most [[Neural Networks (NNs)]] and [[Large Language Models (LLMs)]].

Advantages: simpler to train, easier to optimize and quantize, and more predictable in behavior. There is no routing overhead and no load-balancing challenge.

Disadvantage: per-token compute scales linearly with parameter count. A 70B dense model uses all 70B parameters for every token, making inference expensive at large scale. This is the fundamental trade-off that [[Sparse AI Models]] (e.g., [[AI Mixture of Experts (MoE)]]) address.

## Examples

| Model | Parameters | Type |
|-------|-----------|------|
| [[GPT4\|GPT-4]] (rumored dense variant) | ~1.8T | Proprietary |
| [[Claude]] (Anthropic) | Undisclosed | Proprietary |
| [[Gemma 4]] 31B | 30.7B | Open-weight |
| [[Gemma]] 3 27B | 27B | Open-weight |
| LLaMA 3.1 405B | 405B | Open-weight |
| [[Gemma 4]] E2B/E4B | 5.1B / 8B | Open-weight (PLE) |
| [[Granite 4.1]] 3B / 8B / 30B | 3B / 8B / 30B | Open-weight (IBM, [[Apache 2.0 License\|Apache 2.0]]) |

## References

- https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)

## Related

- [[Sparse AI Models]]
- [[AI Mixture of Experts (MoE)]]
- [[Large Language Models (LLMs)]]
- [[Transformers]]
- [[Deep Learning]]
- [[AI Scaling Laws]]
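
## Worked example: per-token compute

The linear scaling described under "Disadvantage" can be illustrated with a rough estimate. The sketch below assumes the common rule of thumb of about 2 FLOPs (one multiply, one add) per active weight per token in the forward pass, and it ignores attention cost over the context. The function names and the chosen sizes are illustrative, not taken from any particular library.

```python
def dense_active_params(total_params: float) -> float:
    """In a dense model every parameter is active for every token."""
    return total_params


def forward_flops_per_token(active_params: float) -> float:
    """Rough forward-pass cost: ~2 FLOPs per active weight (assumed rule of thumb)."""
    return 2.0 * active_params


# Illustrative dense model sizes, in parameters.
dense_sizes = {"27B": 27e9, "70B": 70e9, "405B": 405e9}

for name, total in dense_sizes.items():
    flops = forward_flops_per_token(dense_active_params(total))
    print(f"{name} dense: ~{flops / 1e9:.0f} GFLOPs per token")
```

Under this approximation, doubling the parameter count doubles the per-token cost, which is the trade-off that sparse architectures try to avoid by activating only a subset of weights.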