# AI Tokenization

The process of converting text into numerical tokens that [[Large Language Models (LLMs)]] can process. Subword tokenization methods (BPE, WordPiece, and the unigram model, as implemented in tools such as SentencePiece) split text into chunks that balance vocabulary size with representation efficiency.

Token count determines cost, [[Context Window]] usage, and processing time. Different models use different tokenizers, so the same text may consume different token counts across models. A rough English average is ~4 characters per token, but this varies significantly for code, non-Latin scripts, and specialized terminology.

Tokenization has downstream consequences. Rare words get split into more tokens, which costs more and consumes more context. Text in non-Latin scripts often needs more tokens to express equivalent meaning. This creates a structural bias in both cost and capability.

The [[Token Budget]] is the practical constraint that tokenization imposes. Every input and output token counts against it, making efficient tokenization a first-order concern for both cost and capability.

[[Embeddings]] are computed per token, meaning tokenization choices directly affect the model's internal representations.

## References

## Related

- [[Large Language Models (LLMs)]]
- [[Context Window]]
- [[Token Budget]]
- [[Embeddings]]
- [[Headroom]]
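
## Example: toy BPE

The claim that rare words split into more tokens can be illustrated with a minimal sketch of byte-pair encoding. This is a toy version written for this note, not the tokenizer of any particular model: the corpus, its word frequencies, and the helper names `train_bpe`, `apply_merge`, and `tokenize` are all assumptions made for illustration, and real tokenizers add details such as byte-level fallback and word-boundary markers.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a {word: frequency} dict (toy BPE)."""
    # Each word starts as a sequence of single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the most frequent pair everywhere it occurs.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {tuple(apply_merge(list(s), best)): f for s, f in vocab.items()}
    return merges


def tokenize(word, merges):
    """Split a word into subword tokens by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


# Hypothetical training corpus: frequent words dominate the merge table.
corpus = {"token": 50, "tokens": 30, "the": 40, "context": 20}
merges = train_bpe(corpus, num_merges=10)

print(tokenize("token", merges))      # ['token']
print(tokenize("tokenizer", merges))  # ['token', 'i', 'z', 'e', 'r']
print(tokenize("zyzzyva", merges))    # seven single-character tokens
```

The frequent word collapses into a single token, while the unseen word falls back to one token per character, which is the cost and context penalty described above in miniature.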