# Time to First Token (TTFT)

The latency between sending a prompt to an LLM and receiving the first token of its response. It is the headline UX metric for streaming chatbots and assistants: what users actually feel as "speed". It is distinct from total generation time and from inter-token latency.

## Why It Matters

When responses [[LLM Streaming|stream]] token-by-token, users perceive responsiveness from the moment text starts appearing. A 30-second response that begins streaming after 200 ms feels instant. The same response, batched and delivered after 30 seconds, feels broken.

TTFT is the LLM equivalent of "time to interactive" in web performance.

## What Determines TTFT

| Component | Typical contribution |
|---|---|
| Network round-trip (cloud) | 50-200 ms |
| Queue / batching wait | 0-2000 ms (variable) |
| **Prefill** (processing the prompt tokens) | Scales with prompt length |
| First-token sampling | Negligible |

The dominant cost on long prompts is prefill: the model must compute attention over every input token before it can generate the first output. Doubling prompt length roughly doubles prefill time, and therefore TTFT once prefill outweighs the fixed network and queue costs.

## Companion Metrics

- **TPOT** (Time Per Output Token) / **ITL** (Inter-Token Latency): pace of streamed output after the first token
- **Total latency**: TTFT + (output tokens × ITL)
- **Throughput** (tokens/sec): server-side aggregate metric

## Optimization Levers

For application owners:

- **Shorter prompts** → faster prefill (drop unused context)
- **Prompt caching** → reuse the KV cache already computed for a shared prompt prefix (Anthropic, OpenAI, vLLM)
- **Geographic routing** → minimize network RTT
- **Smaller models** → faster prefill per token

For inference engine builders:

- Continuous batching (vLLM, TGI)
- Speculative decoding (helps TPOT more than TTFT)
- Prefill-decode disaggregation
- KV cache offload / reuse

## On-Device Implications

[[On-Device Machine Learning]] eliminates the network component entirely, which can drop TTFT to single-digit milliseconds for short prompts.
This is a major UX advantage of [[Browser-Provided Language Models]] like [[Gemini Nano]] over cloud APIs.

## References

- https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices

## Related

- [[LLM Streaming]]
- [[Large Language Models (LLMs)]]
- [[AI Inference]]
- [[Token Budget]]
- [[Prompt API]]
- [[On-Device Machine Learning]]
- [[Browser-Provided Language Models]]
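
## Measuring TTFT (Sketch)

The metrics under Companion Metrics can all be read off the arrival times of streamed tokens. The following is a minimal sketch, not a reference implementation: `time_stream`, `StreamTiming`, `estimate_total_latency`, and `fake_stream` are hypothetical names introduced here for illustration, and `fake_stream` merely simulates a streaming client with `time.sleep`. It assumes the stream is lazy, so the request is only issued once iteration begins.

```python
import time
from typing import Callable, Iterable, Iterator, List, NamedTuple


class StreamTiming(NamedTuple):
    ttft: float      # seconds from request start to the first token
    mean_itl: float  # mean gap between consecutive tokens, in seconds
    total: float     # seconds from request start to the last token
    tokens: int      # number of tokens received


def time_stream(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamTiming:
    """Consume a token stream and record when each token arrives.

    Assumes the stream is lazy: the request is issued when iteration begins.
    """
    start = clock()
    arrivals: List[float] = []
    for _token in stream:
        arrivals.append(clock())
    if not arrivals:
        raise ValueError("stream produced no tokens")
    # Gaps between consecutive arrivals are the inter-token latencies.
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return StreamTiming(arrivals[0] - start, mean_itl, arrivals[-1] - start, len(arrivals))


def estimate_total_latency(ttft: float, itl: float, output_tokens: int) -> float:
    """Total latency approximated as TTFT + (output tokens x ITL)."""
    return ttft + output_tokens * itl


def fake_stream(n_tokens: int, first_delay: float, gap: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM client."""
    time.sleep(first_delay)  # simulated network + queue + prefill
    yield "tok0"
    for i in range(1, n_tokens):
        time.sleep(gap)      # simulated decode step
        yield f"tok{i}"


if __name__ == "__main__":
    timing = time_stream(fake_stream(n_tokens=20, first_delay=0.05, gap=0.01))
    print(f"TTFT {timing.ttft * 1000:.0f} ms, mean ITL {timing.mean_itl * 1000:.1f} ms")
    estimate = estimate_total_latency(timing.ttft, timing.mean_itl, timing.tokens)
    print(f"estimated total {estimate:.3f} s, measured {timing.total:.3f} s")
```

To time a real client, pass its token iterator to `time_stream` in place of `fake_stream`. The `clock` parameter exists so the timing logic can be tested against a deterministic clock instead of wall time.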