# AI Tokenization
The process of converting text into sequences of integer token IDs that [[Large Language Models (LLMs)]] can process. Subword tokenization algorithms (BPE, WordPiece, and Unigram, the latter two often used through libraries such as SentencePiece) split text into chunks that balance vocabulary size with representation efficiency.
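To make the subword idea concrete, here is a minimal sketch of BPE-style merge learning in plain Python. It is a toy under stated assumptions: it works on whole words split into characters, ignores byte-level handling and pre-tokenization, and the function names (`learn_bpe_merges`, `apply_merge`, `tokenize`) are made up for this note rather than taken from any library.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each word starts as a tuple of single characters, weighted by frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the new merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[apply_merge(symbols, best)] += freq
        vocab = new_vocab
    return merges


def tokenize(word, merges):
    """Split a word into subword tokens by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


merges = learn_bpe_merges(["low"] * 5, num_merges=10)
print(tokenize("low", merges))  # frequent word: collapses toward one token
print(tokenize("lot", merges))  # partially covered word
print(tokenize("xyz", merges))  # unseen word: falls back to characters
```

Frequent strings end up as single tokens, while unseen strings fall back to smaller pieces, which is the trade-off between vocabulary size and sequence length that the algorithms above tune.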
Token count determines cost, [[Context Window]] usage, and processing time. Different models use different tokenizers, so the same text may consume different token counts across models. A rough English average is ~4 characters per token, but this varies significantly for code, non-Latin scripts, and specialized terminology.
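The ~4 characters per token rule of thumb can be turned into a quick estimator. This is a rough sketch, not a tokenizer: `estimate_tokens` and `estimate_cost` are hypothetical helpers, and the price used in the example is a placeholder rather than any provider's real rate. For exact counts, use the tokenizer that belongs to the target model.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (heuristic for English prose)."""
    return math.ceil(len(text) / chars_per_token)


def estimate_cost(tokens: int, price_per_million_tokens: float) -> float:
    """Cost of `tokens` at a given price per one million tokens."""
    return tokens * price_per_million_tokens / 1_000_000


prompt = "Summarize the following report in three bullet points."
tokens = estimate_tokens(prompt)
# 3.0 is a placeholder price, chosen only for illustration.
print(tokens, estimate_cost(tokens, price_per_million_tokens=3.0))
```

For code or non-Latin text, a lower `chars_per_token` value is a safer assumption, since those inputs tend to tokenize less efficiently.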
Tokenization has downstream consequences. Rare words are split into more tokens, which costs more and consumes more context. Text in non-Latin scripts often needs more tokens than English for equivalent meaning, typically because tokenizer vocabularies are trained on English-heavy data. This creates a structural bias in both cost and capability.
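One contributing factor is easy to see without any tokenizer. Byte-level tokenizers start from UTF-8 bytes, and non-Latin characters take two to four bytes each, so such text begins with more base units before any merges apply. The sketch below only counts characters and bytes; it does not predict token counts for a particular model, and `utf8_stats` is a hypothetical helper.

```python
samples = {
    "English": "hello",
    "Russian": "привет",
    "Japanese": "こんにちは",
}


def utf8_stats(text):
    """Return (character count, UTF-8 byte count) for a string."""
    return len(text), len(text.encode("utf-8"))


for language, word in samples.items():
    chars, byte_count = utf8_stats(word)
    print(f"{language}: {chars} characters, {byte_count} UTF-8 bytes")
```

The three greetings have similar character counts, but the Russian and Japanese ones occupy two to three times as many bytes. How much of that gap survives into token counts depends on how well the vocabulary covers each script.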
The [[Token Budget]] is the practical constraint that tokenization imposes. Every input and output token counts against it, making efficient tokenization a first-order concern for both cost and capability. Input [[Embeddings]] are looked up per token, meaning tokenization choices directly affect the model's internal representations.
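The budget accounting can be sketched in a few lines, assuming input and output tokens share a single limit, as described above. The function names and the numbers in the example are illustrative assumptions, not values from any specific model.

```python
def remaining_output_budget(context_window: int, input_tokens: int) -> int:
    """Tokens left for output once the input is counted; never negative."""
    return max(context_window - input_tokens, 0)


def fits_budget(context_window: int, input_tokens: int, max_output_tokens: int) -> bool:
    """True if the input plus the reserved output fits within the window."""
    return input_tokens + max_output_tokens <= context_window


# Illustrative numbers only: an 8,000-token window and a 6,500-token prompt.
print(remaining_output_budget(8_000, 6_500))
print(fits_budget(8_000, 6_500, max_output_tokens=2_000))
```

A check like this is typically run before sending a request, so that an oversized prompt is trimmed or summarized rather than truncated mid-output.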
## Related
- [[Large Language Models (LLMs)]]
- [[Context Window]]
- [[Token Budget]]
- [[Embeddings]]
- [[Headroom]]