# AI Reasoning Models
[[Large Language Models (LLMs)|LLMs]] that spend explicit inference-time compute generating an internal chain of intermediate steps before producing the final answer. Where a classic LLM emits the answer in a single forward pass, a reasoning model first emits a private (or visible) "thinking" trace — drafts, branches, self-corrections — then summarises it into the user-facing response. The headline mental model: **trade more tokens at inference for better answers, instead of more parameters at training**.
## How they differ from regular LLMs
- **Inference-time compute scaling.** Quality improves with the length of the thinking trace, not just the size of the model — the inference-time analogue of [[AI Scaling Laws|scaling laws]].
- **RL on reasoning traces.** They're typically post-trained with reinforcement learning on verifiable problems (math, code, logic) where correctness can be checked automatically.
- **Visible vs hidden thinking.** Some models expose the trace (DeepSeek R1, open Qwen reasoners); others hide or summarise it (OpenAI o-series).
- **Latency and cost.** A single answer can burn tens of thousands of "thinking" tokens. Cost-per-correct-answer matters more than cost-per-token.
This is the productionised cousin of [[Chain-of-Thought (CoT) prompting]]: instead of relying on the user's prompt to elicit step-by-step thinking, the model is trained to do it natively.
## The convergence — separate model → unified mode
The first generation shipped reasoning as a **separate model**: OpenAI o1 / o3 alongside GPT-4o, DeepSeek R1 alongside V3, QwQ alongside Qwen. Users had to pick the right model up front.
The second generation collapses this into **a single model with a switchable thinking mode**:
- Anthropic [[Claude]] — extended thinking toggle on Opus / Sonnet.
- OpenAI GPT-5 / GPT-5.5 — unified, with a `reasoning_effort` parameter.
- [[DeepSeek v4]] — folds the R line into V4 with Thinking / Non-Thinking modes.
- [[Mistral Small 4]] — `reasoning_effort` from `none` to `high`.
The pattern is now industry-default: one base model, dial reasoning effort up or down per request. Operationally it's much simpler — one routing decision, one billing tier, one memory of the conversation.
## Where they earn their keep
- Math, formal proofs, and competitive programming.
- Multi-step debugging and refactoring across many files.
- Strategic / planning tasks where wrong-direction-early compounds.
- Tasks with verifiable answers (the regime RL post-training is built for).
## Where they don't
- Latency-sensitive interactive UX — the thinking trace destroys time-to-first-token.
- Open-ended creative writing — reasoning models often over-correct toward "safe" outputs.
- Cheap classification, extraction, summarisation — a non-thinking pass is faster and just as accurate.
- Anything where the user pays per-token without per-answer value being measured.
The right default is **non-thinking by default, thinking when verifiability or planning depth justifies the cost**. Pretending every request needs reasoning is a fast path to a bloated [[AI Cost Management|AI bill]].
## References
-
## Related
- [[Large Language Models (LLMs)]]
- [[Chain-of-Thought (CoT) prompting]]
- [[AI Scaling Laws]]
- [[AI Cost Management]]
- [[AI Inference]]
- [[AI Frontier Model]]
- [[DeepSeek v4]]
- [[Claude]]
- [[Mistral Small 4]]