# LLM Streaming
Token-by-token delivery of output from an [[Large Language Models (LLMs)|LLM]] as the model generates it, rather than after the full response is complete. This is the standard UX pattern for chatbots and assistants: the user sees text appearing in real time instead of waiting 5-30 seconds for completion.
## Why It Exists
LLM inference is autoregressive: each token depends on all previous tokens. Generation is inherently sequential. Without streaming, the user waits for the entire sequence; with streaming, they see progress immediately.
Streaming cuts *perceived* latency from the full generation time to roughly the time to first token, even though *total* latency is unchanged.
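
The gap between perceived and total latency can be illustrated with a simulated generator. This is a minimal sketch rather than a real model: `fakeGenerate`, `measureLatency`, and the per-token delay are hypothetical stand-ins introduced only for illustration.

```js
// Hypothetical stand-in for a model: yields one token every `delayMs`.
async function* fakeGenerate(tokens, delayMs) {
  for (const token of tokens) {
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    yield token;
  }
}

// Consume a token stream and record when the first and last tokens arrive.
async function measureLatency(stream) {
  const start = performance.now();
  let firstTokenMs = null;
  let text = "";
  for await (const token of stream) {
    if (firstTokenMs === null) firstTokenMs = performance.now() - start;
    text += token;
  }
  return { text, firstTokenMs, totalMs: performance.now() - start };
}
```

With 20 tokens at 50 ms each, a streaming consumer would see the first token after roughly 50 ms, while a non-streaming consumer would see nothing until roughly 1000 ms. The total time is about the same in both cases.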
## How It Works
The model emits one token at a time. Streaming APIs deliver each token, or a small chunk of tokens, as soon as it is generated:
- HTTP: Server-Sent Events (SSE) or chunked transfer encoding
- WebSocket: full-duplex token streaming
- In-process (browser, local): async iterators / `ReadableStream`
Example (W3C [[Prompt API]]):
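```js
// Assumes `session` is an existing Prompt API session and `render` is
// the app's own function for updating the UI.
let output = "";
const stream = session.promptStreaming("Tell me a story");
for await (const chunk of stream) {
  output += chunk; // append each chunk as it arrives
  render(output);  // re-render the accumulated text
}
```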
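
For the HTTP case, the client has to reassemble events from network chunks that can split anywhere, including mid-line. The following is a simplified sketch of an incremental SSE parser, not a full implementation of the event-stream format: it handles only `data:` fields and `\n` line endings, and the name `parseSSE` is hypothetical.

```js
// Incrementally parse Server-Sent Events from an async iterable of text chunks.
// Simplified: only `data:` fields and "\n" line endings are handled;
// `event:`, `id:`, `retry:`, and comment lines are ignored.
async function* parseSSE(chunks) {
  let buffer = "";
  for await (const chunk of chunks) {
    buffer += chunk;
    let boundary;
    // A blank line ("\n\n") terminates one event.
    while ((boundary = buffer.indexOf("\n\n")) !== -1) {
      const rawEvent = buffer.slice(0, boundary);
      buffer = buffer.slice(boundary + 2);
      const data = rawEvent
        .split("\n")
        .filter((line) => line.startsWith("data:"))
        .map((line) => line.slice(5).replace(/^ /, "")) // drop one leading space
        .join("\n");
      if (data !== "") yield data; // events without data are not dispatched
    }
  }
}
```

In environments where `ReadableStream` supports async iteration, the chunks would typically come from something like `response.body.pipeThrough(new TextDecoderStream())`. What each `data` payload contains (plain text, JSON, an end-of-stream sentinel) depends on the provider.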
## UX Implications
- **Time to First Token (TTFT)** becomes the key latency metric, not total generation time
- Users tolerate longer total responses if streaming feels responsive
- Cancellation matters: users want to interrupt long generations (abort signals)
- Markdown rendering needs to handle partial output gracefully
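
Cancellation is usually wired through an `AbortSignal`. The sketch below shows the pattern against a simulated generator; `cancellableGenerate` and `streamWithCancel` are hypothetical names, and a real client would instead pass the signal to the underlying API (for example, the `signal` option of `fetch`).

```js
// Hypothetical stand-in for a model: yields tokens until done or aborted.
async function* cancellableGenerate(tokens, delayMs, signal) {
  for (const token of tokens) {
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    if (signal.aborted) return; // stop generating once the caller cancels
    yield token;
  }
}

// Stream into `onChunk`; the loop ends early if `signal` is aborted.
async function streamWithCancel(tokens, signal, onChunk) {
  let output = "";
  for await (const token of cancellableGenerate(tokens, 10, signal)) {
    output += token;
    onChunk(output);
  }
  return { output, cancelled: signal.aborted };
}
```

A "Stop" button would call `controller.abort()` on the `AbortController` whose `signal` was passed in. The text received before the abort stays available in `output`.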
## Trade-offs
**Pros:**
- Better UX (perceived speed)
- Earlier feedback (user can cancel if going wrong)
- Lower memory pressure for very long outputs, since the consumer can process chunks incrementally instead of buffering the whole response
**Cons:**
- Harder to apply post-processing (formatting, validation) since output is incremental
- Some [[LLM Structured Outputs]] modes can't stream cleanly (full JSON validation needs the whole document)
- Combining streaming with tool calling is tricky: the runtime must detect when output stops being prose and starts being a tool call
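
The structured-output limitation can be seen with plain `JSON.parse`: a partially streamed object does not parse until its last chunk arrives. This is a minimal sketch under that assumption; `collectJSON` and `isCompleteJSON` are hypothetical helpers, and `chunks` stands for any async iterable of string fragments.

```js
// Accumulate streamed JSON text and return the parsed value once complete.
async function collectJSON(chunks, onPartial = () => {}) {
  let buffer = "";
  for await (const chunk of chunks) {
    buffer += chunk;
    onPartial(buffer); // raw text can still be shown as progress
  }
  return JSON.parse(buffer); // throws SyntaxError if the final text is invalid
}

// Returns true only when the text so far parses as a JSON document.
function isCompleteJSON(text) {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}
```

For a top-level object or array, no proper prefix parses, so validation has to wait for the final chunk. Showing typed partial results earlier requires a tolerant incremental parser, which this sketch does not attempt.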
## Where It Shows Up
| API | How |
|---|---|
| OpenAI API | `stream: true` parameter, SSE |
| Anthropic Claude API | `stream: true`, SSE |
| W3C [[Prompt API]] | `promptStreaming()` returns a `ReadableStream`, consumable as an async iterable |
| [[Gemini Nano]] | Streamed via Prompt API |
| Local runtimes | Built-in streaming support (llama.cpp, Ollama, vLLM) |
## References
- https://github.com/webmachinelearning/prompt-api
## Related
- [[Large Language Models (LLMs)]]
- [[LLM Tool Calling]]
- [[LLM Structured Outputs]]
- [[Prompt API]]
- [[Gemini Nano]]
- [[AI Inference]]
- [[Browser-Provided Language Models]]
- [[Time to First Token (TTFT)]]
- [[Token Budget]]