# Token Budget

A token budget is the practical allocation strategy for how to use a finite [[Context Window]]. Since the window has a hard limit (measured in tokens), every piece of context competes for space: system instructions, conversation history, retrieved documents, tool definitions, tool outputs, and the model's response all draw from the same pool.

Token budgeting is how practitioners reason about the constraint that [[AI context is finite with diminishing returns]]. The question isn't "how much can I fit?" but "what's the highest-value use of each token?" A well-budgeted context allocates more tokens to high-signal information and aggressively compresses or defers low-signal information.

In practice, token budgeting involves:

- **Prioritization**: deciding what must always be present (core instructions, identity) vs. what can be loaded on demand ([[Prompt Lazy Loading AI Design Pattern (PLL)]])
- **Measurement**: tracking how many tokens each context component consumes
- **Compression**: using [[Context Compression]] techniques when components exceed their allocation
- **Pruning**: part of [[Context Hygiene]]; removing entries that no longer justify their token cost

The token budget is the hard constraint that makes [[Context Engineering]] an optimization problem rather than a wish list.

## Related

- [[Context Window]]
- [[AI context is finite with diminishing returns]]
- [[Context Engineering]]
- [[Context Compression]]
- [[Context Hygiene]]
- [[Context Bloat]]
- [[Prompt Lazy Loading AI Design Pattern (PLL)]]
- [[Progressive Disclosure]]
- [[Large Language Models (LLMs)]]
- [[Time to First Token (TTFT)]]
- [[LLM Streaming]]
- [[AI Inference]]
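
## Example

The prioritization, measurement, and pruning steps described in this note can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: `estimate_tokens` is a crude stand-in of about 4 characters per token (a real system would use the model's own tokenizer), and `Component`, `plan_context`, the `priority` scale, and the window sizes are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (assumption, not a real tokenizer)."""
    return max(1, (len(text) + 3) // 4)


@dataclass
class Component:
    name: str
    text: str
    priority: int            # higher means more signal per token (hypothetical scale)
    required: bool = False   # must always be present, e.g. core instructions


def plan_context(components, window, reserve_for_response):
    """Return (kept_names, dropped_names, tokens_used) for one context window."""
    # The model's response draws from the same pool, so reserve space for it first.
    budget = window - reserve_for_response

    # Measurement: required components are counted before anything else.
    required = [c for c in components if c.required]
    used = sum(estimate_tokens(c.text) for c in required)
    if used > budget:
        raise ValueError("required components alone exceed the budget")

    kept = [c.name for c in required]
    dropped = []

    # Prioritization: consider optional components from highest to lowest priority.
    optional = sorted(
        (c for c in components if not c.required),
        key=lambda c: c.priority,
        reverse=True,
    )
    for c in optional:
        cost = estimate_tokens(c.text)
        if used + cost <= budget:
            kept.append(c.name)
            used += cost
        else:
            # Pruning: this entry does not fit; it could be compressed or deferred instead.
            dropped.append(c.name)
    return kept, dropped, used


components = [
    Component("system", "x" * 400, priority=10, required=True),
    Component("history", "x" * 2000, priority=5),
    Component("docs", "x" * 1200, priority=7),
    Component("tool_output", "x" * 800, priority=3),
]
kept, dropped, used = plan_context(components, window=1000, reserve_for_response=200)
print(kept, dropped, used)
```

With these made-up sizes, the conversation history is the component that gets dropped: it is the largest optional entry and does not fit once the higher-priority documents are in. In a fuller design, a dropped component would be a candidate for [[Context Compression]] or on-demand loading rather than outright removal.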