# Token Budget
A token budget is a practical strategy for allocating a finite [[Context Window]]. Because the window has a hard limit measured in tokens, every piece of context competes for space: system instructions, conversation history, retrieved documents, tool definitions, tool outputs, and the model's response all draw from the same pool.
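
The shared-pool arithmetic can be sketched as a simple ledger. This is a minimal illustration, not a real accounting of any particular model: the window limit and every per-component token count below are made-up numbers, and `Component` and `remaining_tokens` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    tokens: int  # token count for this component (illustrative values only)


def remaining_tokens(window_limit: int, components: list[Component]) -> int:
    """Return the tokens left after every component is counted.

    A negative result means the context is over budget.
    """
    used = sum(c.tokens for c in components)
    return window_limit - used


# Hypothetical window size and component sizes, chosen for the example.
WINDOW_LIMIT = 8_192

components = [
    Component("system_instructions", 1_200),
    Component("conversation_history", 3_500),
    Component("retrieved_documents", 2_000),
    Component("tool_definitions", 800),
    Component("tool_outputs", 500),
    Component("reserved_response", 1_000),  # space held back for the model's reply
]

headroom = remaining_tokens(WINDOW_LIMIT, components)
print(headroom)  # -808: the components overshoot the window by 808 tokens
```

Counting the response as its own component makes the trade-off explicit: every token given to history or retrieved documents is one the reply cannot use.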
Token budgeting is how practitioners reason about the constraint that [[AI context is finite with diminishing returns]]. The question isn't "how much can I fit?" but "what's the highest-value use of each token?" A well-budgeted context allocates more tokens to high-signal information and aggressively compresses or defers low-signal information.
In practice, token budgeting involves:
- **Prioritization**: deciding what must always be present (core instructions, identity) vs. what can be loaded on demand ([[Prompt Lazy Loading AI Design Pattern (PLL)]])
- **Measurement**: tracking how many tokens each context component consumes
- **Compression**: using [[Context Compression]] techniques when components exceed their allocation
- **Pruning**: part of [[Context Hygiene]]; removing entries that no longer justify their token cost
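
The steps above can be sketched as a small packing routine. This is one possible approach under stated assumptions, not a canonical algorithm: `estimate_tokens` uses a rough 4-characters-per-token heuristic in place of a real tokenizer, and `Entry`, `fit_to_budget`, the priorities, and the budget are all hypothetical.

```python
from dataclasses import dataclass


def estimate_tokens(text: str) -> int:
    """Measurement: rough estimate assuming ~4 characters per token.

    This is a heuristic stand-in; a real tokenizer gives exact counts.
    """
    return max(1, len(text) // 4)


@dataclass
class Entry:
    name: str
    text: str
    priority: int           # higher means more valuable per token
    required: bool = False  # must always be present (e.g. core instructions)


def fit_to_budget(entries: list[Entry], budget: int):
    """Keep required entries, then fill the rest of the budget by priority.

    Returns (kept names, deferred names, tokens left over).
    Deferred entries are candidates for compression or on-demand loading.
    """
    kept, deferred = [], []
    remaining = budget
    # Prioritization: required entries first, then highest priority first.
    ordered = sorted(entries, key=lambda e: (not e.required, -e.priority))
    for entry in ordered:
        cost = estimate_tokens(entry.text)
        if entry.required:
            if cost > remaining:
                raise ValueError(f"required entry {entry.name!r} exceeds the budget")
            kept.append(entry.name)
            remaining -= cost
        elif cost <= remaining:
            kept.append(entry.name)
            remaining -= cost
        else:
            # Pruning: this entry no longer justifies its token cost.
            deferred.append(entry.name)
    return kept, deferred, remaining


entries = [
    Entry("core_instructions", "x" * 400, priority=10, required=True),  # ~100 tokens
    Entry("recent_turns", "x" * 800, priority=8),                       # ~200 tokens
    Entry("old_turns", "x" * 2000, priority=2),                         # ~500 tokens
    Entry("retrieved_doc", "x" * 1200, priority=6),                     # ~300 tokens
]

kept, deferred, left = fit_to_budget(entries, budget=700)
print(kept)      # ['core_instructions', 'recent_turns', 'retrieved_doc']
print(deferred)  # ['old_turns']
print(left)      # 100
```

A greedy pass like this is simple but not optimal: a large high-priority entry can crowd out several smaller ones that together carry more value. The sketch leaves compression out; a fuller version might shrink a deferred entry and retry it before dropping it.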
The token budget is the hard constraint that makes [[Context Engineering]] an optimization problem rather than a wish list.
## Related
- [[Context Window]]
- [[AI context is finite with diminishing returns]]
- [[Context Engineering]]
- [[Context Compression]]
- [[Context Hygiene]]
- [[Context Bloat]]
- [[Prompt Lazy Loading AI Design Pattern (PLL)]]
- [[Progressive Disclosure]]
- [[Large Language Models (LLMs)]]
- [[Time to First Token (TTFT)]]
- [[LLM Streaming]]
- [[AI Inference]]