# LLM Streaming

Token-by-token delivery of output from an [[Large Language Models (LLMs)|LLM]] as the model generates it, rather than after the full response is complete. It is the standard UX pattern for chatbots and assistants: the user sees text appear in real time instead of waiting 5-30 seconds for completion.

## Why It Exists

LLM inference is autoregressive: each token depends on all previous tokens, so generation is inherently sequential. Without streaming, the user waits for the entire sequence; with streaming, they see progress immediately.

Streaming reduces *perceived* latency dramatically even though *total* latency is unchanged.

## How It Works

The model emits one token at a time. Streaming APIs deliver each token as soon as it's generated:

- HTTP: Server-Sent Events (SSE) or chunked transfer encoding
- WebSocket: full-duplex token streaming
- In-process (browser, local): async iterators / `ReadableStream`

Example (W3C [[Prompt API]]):

```js
// Assumes `session` is an existing Prompt API session and
// `render` is your own function that updates the UI.
let output = "";
const stream = session.promptStreaming("Tell me a story");
for await (const chunk of stream) {
  output += chunk;   // accumulate the text received so far
  render(output);    // re-render on every chunk
}
```

## UX Implications

- **Time to First Token (TTFT)** becomes the key latency metric, not total generation time
- Users tolerate longer total responses if streaming feels responsive
- Cancellation matters: users want to interrupt long generations (abort signals)
- Markdown rendering needs to handle partial output gracefully

## Trade-offs

**Pros:**

- Better UX (perceived speed)
- Earlier feedback (the user can cancel if the response is going wrong)
- Lower memory pressure for very long outputs

**Cons:**

- Harder to apply post-processing (formatting, validation) since output is incremental
- Some [[LLM Structured Outputs]] modes can't stream cleanly (full JSON validation needs the whole document)
- Streaming + tool calling is tricky: the runtime must detect when output stops being prose and starts being a tool call

## Where It Shows Up

| API | How |
|---|---|
| OpenAI API | `stream: true` parameter, SSE |
| Anthropic Claude API | `stream: true`, SSE |
| W3C [[Prompt API]] | `promptStreaming()` returns async iterable |
| [[Gemini Nano]] | Streamed via Prompt API |
| Local runtimes | Native (llama.cpp, Ollama, vLLM) |

## References

- https://github.com/webmachinelearning/prompt-api

## Related

- [[Large Language Models (LLMs)]]
- [[LLM Tool Calling]]
- [[LLM Structured Outputs]]
- [[Prompt API]]
- [[Gemini Nano]]
- [[AI Inference]]
- [[Browser-Provided Language Models]]
- [[Time to First Token (TTFT)]]
- [[Token Budget]]
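
## Sketch: Parsing an SSE Token Stream

The SSE delivery mentioned under "How It Works" can be illustrated with a small incremental parser. This is a minimal sketch, not any provider's actual client: `createSseParser` and `collectText` are hypothetical names, it assumes LF-only line endings, and it treats each `data:` payload as plain text with a `[DONE]` sentinel marking the end (real APIs typically send JSON payloads). The point it shows is that network chunks do not align with event boundaries, so the parser has to buffer partial events.

```javascript
// Hypothetical helper: incrementally parses Server-Sent Events text.
// Feed it raw text chunks as they arrive; it returns the `data:` payloads
// of every event completed so far and buffers any trailing partial event.
// Assumption: line endings are "\n" only (the SSE format also allows CR/CRLF).
function createSseParser() {
  let buffer = "";
  return function feed(chunk) {
    buffer += chunk;
    const payloads = [];
    let boundary;
    // A blank line ("\n\n") terminates one event.
    while ((boundary = buffer.indexOf("\n\n")) !== -1) {
      const rawEvent = buffer.slice(0, boundary);
      buffer = buffer.slice(boundary + 2);
      const data = rawEvent
        .split("\n")
        .filter((line) => line.startsWith("data:"))
        .map((line) => line.slice(5).replace(/^ /, "")) // drop one leading space
        .join("\n");
      if (data !== "") payloads.push(data);
    }
    return payloads;
  };
}

// Hypothetical consumer: concatenates text payloads until the sentinel.
// Assumption: the stream ends with a "[DONE]" payload.
function collectText(chunks) {
  const feed = createSseParser();
  let output = "";
  for (const chunk of chunks) {
    for (const payload of feed(chunk)) {
      if (payload === "[DONE]") return output;
      output += payload;
    }
  }
  return output;
}

// Simulated network chunks; the second event is split across two chunks.
const text = collectText(["data: Hel\n\nda", "ta: lo\n\n", "data: [DONE]\n\n"]);
console.log(text); // prints "Hello"
```

In a real client the chunks would likely come from a `fetch` response body decoded with `TextDecoder`, and each payload would be parsed (for example with `JSON.parse`) before its text is appended and rendered.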