# AI Reasoning Models

[[Large Language Models (LLMs)|LLMs]] that spend explicit inference-time compute generating an internal chain of intermediate steps before producing the final answer. Where a classic LLM begins emitting the answer straight away, a reasoning model first emits a private (or visible) "thinking" trace (drafts, branches, self-corrections), then summarises it into the user-facing response.

The headline mental model: **trade more tokens at inference for better answers, instead of more parameters at training**.

## How they differ from regular LLMs

- **Inference-time compute scaling.** Quality generally improves with the length of the thinking trace, not just the size of the model. This is the inference-time analogue of [[AI Scaling Laws|scaling laws]].
- **RL on reasoning traces.** They're typically post-trained with reinforcement learning on verifiable problems (math, code, logic) where correctness can be checked automatically.
- **Visible vs hidden thinking.** Some models expose the trace (DeepSeek R1, open Qwen reasoners); others hide or summarise it (OpenAI o-series).
- **Latency and cost.** A single answer can burn tens of thousands of "thinking" tokens. Cost-per-correct-answer matters more than cost-per-token.

This is the productionised cousin of [[Chain-of-Thought (CoT) prompting]]: instead of relying on the user's prompt to elicit step-by-step thinking, the model is trained to do it natively.

## The convergence: separate model → unified mode

The first generation shipped reasoning as a **separate model**: OpenAI o1 / o3 alongside GPT-4o, DeepSeek R1 alongside V3, QwQ alongside Qwen. Users had to pick the right model up front.

The second generation collapses this into **a single model with a switchable thinking mode**:

- Anthropic [[Claude]]: extended thinking toggle on Opus / Sonnet.
- OpenAI GPT-5 / GPT-5.5: unified, with a `reasoning_effort` parameter.
- [[DeepSeek v4]]: folds the R line into V4 with Thinking / Non-Thinking modes.
- [[Mistral Small 4]]: `reasoning_effort` from `none` to `high`.

The pattern is now industry-default: one base model, dial reasoning effort up or down per request. Operationally it's much simpler, with one routing decision, one billing tier, and one conversation history.

## Where they earn their keep

- Math, formal proofs, and competitive programming.
- Multi-step debugging and refactoring across many files.
- Strategic / planning tasks where an early wrong direction compounds.
- Tasks with verifiable answers (the regime RL post-training is built for).

## Where they don't

- Latency-sensitive interactive UX: the thinking trace delays the first user-visible token.
- Open-ended creative writing: reasoning models often over-correct toward "safe" outputs.
- Cheap classification, extraction, summarisation: a non-thinking pass is faster and usually just as accurate.
- Anything where the user pays per token and per-answer value is never measured.

The right policy is **non-thinking by default, thinking when verifiability or planning depth justifies the cost**. Pretending every request needs reasoning is a fast path to a bloated [[AI Cost Management|AI bill]].

## Related

- [[Large Language Models (LLMs)]]
- [[Chain-of-Thought (CoT) prompting]]
- [[AI Scaling Laws]]
- [[AI Cost Management]]
- [[AI Inference]]
- [[AI Frontier Model]]
- [[DeepSeek v4]]
- [[Claude]]
- [[Mistral Small 4]]
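
## Worked example: cost per correct answer

A minimal sketch of the cost-per-correct-answer comparison from the "Latency and cost" point. Every price, token count, and accuracy figure below is a made-up illustrative assumption, not a measurement of any real model, and `cost_per_correct_answer` is a hypothetical helper. It also assumes failed answers can be detected and retried independently, so that on average `1 / accuracy` answers are paid for per correct one.

```python
def cost_per_correct_answer(tokens_per_answer: int,
                            price_per_million_tokens: float,
                            accuracy: float) -> float:
    """Expected spend to obtain one correct answer.

    accuracy is the fraction of answers that are correct (0 < accuracy <= 1);
    on average 1 / accuracy answers are generated per correct one.
    """
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    cost_per_answer = tokens_per_answer * price_per_million_tokens / 1_000_000
    return cost_per_answer / accuracy


# Hypothetical numbers for one hard, verifiable task (illustrative only).
non_thinking = cost_per_correct_answer(
    tokens_per_answer=500, price_per_million_tokens=10.0, accuracy=0.05)
thinking = cost_per_correct_answer(
    tokens_per_answer=5_000, price_per_million_tokens=10.0, accuracy=0.80)

print(f"non-thinking: ${non_thinking:.4f} per correct answer")
print(f"thinking:     ${thinking:.4f} per correct answer")
```

Under these assumed numbers the thinking pass costs ten times more per answer yet less per correct answer. When both modes have similar accuracy, as on easy classification or extraction, the same formula favours the cheaper non-thinking pass.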