# AI Verifiability
Verifiability is the property of a domain or task whose outputs can be checked against ground truth cheaply and unambiguously. It is the single most predictive variable for whether an [[Large Language Models (LLMs)|LLM]] will perform reliably on a class of work. High-verifiability domains are LLM-friendly. Low-verifiability domains are LLM-hostile.
## The Spectrum
| Verifiability | Examples | LLM Reliability |
|---|---|---|
| Hard-checkable | code execution, unit tests, formal proofs, math, SQL | High |
| Soft-checkable | translation against reference, structured extraction | Medium |
| Subjective | writing quality, taste, ethics, strategic judgment | Low |
| Unverifiable in context | "is this commonsense advice correct?", "should I move?" | Lowest |
## Why It Matters
Verifiability shapes three things at once:
1. **Training**: the [[Reinforcement Learning From Human Feedback (RLHF)|RL]] loop needs a reward signal. Verifiable tasks generate cheap, abundant, high-quality signal. Unverifiable tasks need expensive human raters, whose judgments often disagree with one another.
2. **Inference-time correction**: in verifiable domains, the model can run its output, observe the failure, and retry. This is the entire premise of [[Agentic Engineering]] (write code, run tests, fix). In unverifiable domains, the model has no error signal and confidently outputs the first plausible answer.
3. **Trust**: users learn quickly to distrust LLMs in unverifiable domains because they cannot tell when the model is wrong; the failure surface is invisible.
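
The inference-time correction loop in point 2 can be sketched in a few lines. This is a minimal illustration, not a real agent: `fake_model` is a hypothetical stand-in for an LLM call, the task (`square`) and the test cases are invented for the example, and `exec` is used only for brevity (untrusted code would need a sandbox).

```python
from typing import Callable, Optional

def check_candidate(source: str, tests: list[tuple[int, int]]) -> Optional[str]:
    """Return None if the candidate passes every test, else an error message."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # illustration only; sandbox untrusted code
        fn = namespace["square"]
        for arg, expected in tests:
            got = fn(arg)
            if got != expected:
                return f"square({arg}) returned {got}, expected {expected}"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None

def solve_with_retries(
    generate: Callable[[Optional[str]], str],
    tests: list[tuple[int, int]],
    max_attempts: int = 3,
) -> tuple[Optional[str], int]:
    """Generate, verify, and feed the failure back until the tests pass."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)
        feedback = check_candidate(candidate, tests)
        if feedback is None:
            return candidate, attempt
    return None, max_attempts

def fake_model(feedback: Optional[str]) -> str:
    # Hypothetical stand-in for an LLM: wrong at first, corrected after feedback.
    if feedback is None:
        return "def square(x):\n    return x + x"
    return "def square(x):\n    return x * x"

TESTS = [(2, 4), (3, 9)]
solution, attempts = solve_with_retries(fake_model, TESTS)
```

The first candidate happens to pass `square(2)` and fails only on `square(3)`, which is the point: the error message is the signal that drives the retry. In an unverifiable domain there is nothing to put in `check_candidate`, so the loop has no way to run.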
## Engineering Implication
Design LLM-powered systems so the work happens in verifiable domains whenever possible. This is the deeper reason why coding agents work, why structured output extraction works, and why SQL generation works, while "advise me on a life decision" does not.
When the underlying task is unverifiable, build a verifiable wrapper:
- Force structured output that can be schema-validated.
- Decompose into sub-tasks where each step can be checked.
- Add human-in-the-loop checkpoints at the unverifiable boundaries.
- Use a second LLM as an evaluator only when its judgments are themselves grounded in verifiable criteria.
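
As one possible shape for the first bullet, the sketch below validates a model reply against a small hand-written schema using only the standard library. The field names (`recommendation`, `confidence`, `assumptions`) and the helper `validate_reply` are hypothetical, chosen for illustration. Note what the wrapper does and does not buy: it verifies the form of the answer, not whether the advice is right.

```python
import json

# Hypothetical schema for an advice-style reply: field name -> required type.
REQUIRED_FIELDS = {
    "recommendation": str,
    "confidence": float,
    "assumptions": list,
}

def validate_reply(raw: str) -> list[str]:
    """Return a list of schema problems; an empty list means the reply is valid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]

    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        elif not isinstance(data[name], expected_type):
            problems.append(f"{name} must be {expected_type.__name__}")

    # Range check applies only when confidence is present and well-typed.
    confidence = data.get("confidence")
    if isinstance(confidence, float) and not 0.0 <= confidence <= 1.0:
        problems.append("confidence must be between 0 and 1")
    return problems

good = '{"recommendation": "wait six months", "confidence": 0.6, "assumptions": ["stable income"]}'
bad = '{"recommendation": "move", "confidence": 1.5}'
```

Here `validate_reply(good)` returns an empty list, while `validate_reply(bad)` reports the missing `assumptions` field and the out-of-range confidence. The returned problems can be fed back to the model as a retry prompt, and the listed assumptions give a human reviewer something concrete to check at the unverifiable boundary.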
## The Capability Ceiling
Verifiability is one of the two primary causes of jagged LLM capability. The other is economic: frontier labs invest training compute where the total addressable market (TAM) justifies it. Together they produce a capability landscape that looks like uneven terrain rather than a smooth frontier. See [[AI Verifiability as a Capability Ceiling]] for the full argument.
## Related
- [[AI Verifiability as a Capability Ceiling]]
- [[Andrej Karpathy]]
- [[Large Language Models (LLMs)]]
- [[Agentic Engineering]]
- [[Reinforcement Learning From Human Feedback (RLHF)]]
- [[AI Reasoning Models]]
- [[AI and Trust]]
- [[Menugen Architecture Pattern]]
- [[LLM Knowledge Bases Over Unstructured Data]]