# AI Verifiability as a Capability Ceiling

The jaggedness of [[Large Language Models (LLMs)|LLM]] capabilities (the same model that coherently refactors a 100,000-line codebase will also confidently tell you to walk to a car wash to wash your car) is explained by *four* compounding factors, not one. [[Andrej Karpathy]] originally framed this as a [[AI Verifiability|verifiability]] question. At Sequoia AI Ascent 2026 he made the formula explicit:

> Capability ≈ **Verifiability × Training Attention × Data Coverage × Economic Value**

You are either inside the circuits the frontier labs chose to package into RL, or you are off-roading in the jungle with a machete.

## The Four Factors

### 1. Verifiability (the technical lever)

Tasks with cheap, unambiguous correctness checks generate strong RL signal. Tasks without them do not. The lab can pour compute into a verifiable domain (coding, math, structured extraction) and watch capability climb predictably. In unverifiable domains, throwing compute at the problem returns diminishing slop. This is the substrate that makes RLVR (Reinforcement Learning with Verifiable Rewards) work at all. See [[AI Verifiability]] for the full treatment.

### 2. Training Attention (the lab-focus lever)

Even where verifiability exists, the lab must choose to invest. Capability follows where the post-training team wires up reward environments, runs RL loops, and tunes evaluations. Verifiability is the *necessary* condition; training attention is what converts it into realized capability. A verifiable domain that no lab targeted is a domain where the model is competent but not exceptional.

### 3. Data Coverage (the pretraining lever)

If a domain isn't represented in the pretraining mixture, no amount of post-training fixes it. Coverage decides whether the latent capability exists for RL to unlock.
Karpathy's `microGPT` example is illustrative: models "hate" simplifying it because the simplification pattern is rare in the corpus, so even capable models drift back toward verbose, idiomatic code.

### 4. Economic Value (the market lever)

Frontier labs allocate training resources by expected revenue and total addressable market (TAM). The training data distribution is therefore not a neutral sampling of human knowledge; it is a *strategic asset selection*. Domains with high enterprise willingness-to-pay (coding, customer support, sales enablement, data analysis, legal drafting) get aggressive RL investment. Long-tail domains do not.

This means capability is not driven by what is possible in principle for the model architecture; it is driven by what is *profitable* to train.

## Why This Produces Jaggedness

Multiplying the factors creates discontinuous capability cliffs between tasks that *look* adjacent to humans:

- **All four high**: capability is on rails. Coding, code review, SQL, structured extraction. These tasks feel like flying.
- **High verifiability, low training attention**: capability is competent but not state of the art. Niche scientific domains with verifiable answers but small markets that no lab prioritized.
- **High verifiability + economic value, low data coverage**: the lab wants to ship it but the underlying knowledge isn't in the pretraining mixture. Capability lags despite investment.
- **Low verifiability + high economic value**: capability is *deceptively confident*. Strategic advice, content writing, customer-facing prose. The model is fluent, the lab pushed RLHF on it, but the underlying signal is noisy.
- **All four low**: capability is wild. Everyday commonsense reasoning, hyperlocal knowledge, unfashionable subjects. Off-roading with a machete.

This is why a single model can be simultaneously superhuman at one task and embarrassingly bad at the adjacent task. The two tasks may differ only in their position on the four-factor grid.
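
The multiplicative reading of the formula can be made concrete with a toy calculation. The sketch below is not from Karpathy's talk: the 0-to-1 scale, the `Task` class, the `rank` helper, and every score are hypothetical, chosen only to show how a product punishes one weak factor far more than an average does. The resulting ordering is not an empirical claim about any model.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Task:
    """A task scored on the four factors (hypothetical scores on a 0..1 scale)."""
    name: str
    verifiability: float
    training_attention: float
    data_coverage: float
    economic_value: float

    def factors(self) -> tuple[float, ...]:
        return (self.verifiability, self.training_attention,
                self.data_coverage, self.economic_value)

    def capability(self) -> float:
        # Multiplicative reading of the formula: one weak factor drags the product down.
        return prod(self.factors())

    def average(self) -> float:
        # Additive baseline, included only for comparison.
        return sum(self.factors()) / len(self.factors())


# Illustrative scores only; none of these numbers come from the source.
TASKS = [
    Task("SQL generation", 0.9, 0.9, 0.9, 0.9),            # all four high
    Task("strategic advice", 0.2, 0.9, 0.9, 0.9),          # low verifiability only
    Task("niche verifiable science", 0.9, 0.2, 0.6, 0.2),  # verifiable, but untargeted
    Task("hyperlocal knowledge", 0.2, 0.2, 0.2, 0.2),      # all four low
]


def rank(tasks: list[Task]) -> list[Task]:
    """Sort tasks from highest to lowest multiplicative capability score."""
    return sorted(tasks, key=lambda t: t.capability(), reverse=True)


for task in rank(TASKS):
    print(f"{task.name:26s} product={task.capability():.4f} average={task.average():.3f}")
```

In this toy setup, dropping a single factor from 0.9 to 0.2 moves the average only from 0.9 to 0.725, but cuts the product from about 0.66 to about 0.15. That gap between "looks similar on average" and "behaves very differently" is the cliff this section describes.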

## Practical Consequences

- **Choose your application domain by the grid**, not by the demo. A demo always shows the on-rails case.
- **Wrap unverifiable tasks in verifiable scaffolds**; convert prose questions to structured ones, add evaluators, force citations.
- **Be wary of capability transfer claims**; the model's strength on one task tells you almost nothing about the adjacent task if they are on different sides of the grid.
- **Watch the lab roadmap**, not just the benchmarks. New RL investment in a domain shifts that domain from off-road to on-rails within a release cycle.
- **Build moats in low-TAM, low-verifiability niches**; the frontier labs will not enter them, and the model alone is not enough.

## Karpathy's Framing

Karpathy himself notes that this framework is incomplete. The verifiability + economics story is the best current explanation for jaggedness, but it doesn't account for everything. The honest answer is that building an accurate mental model of LLM capabilities is itself an ongoing engineering discipline. The jaggedness is not a bug to be fixed in the next release; it is a structural feature of how frontier capabilities get manufactured.

## Source

- [[Andrej Karpathy]] expanded this argument publicly in May 2026, building on his earlier verifiability writing.

## Related

- [[AI Verifiability]]
- [[Andrej Karpathy]]
- [[Large Language Models (LLMs)]]
- [[Reinforcement Learning From Human Feedback (RLHF)]]
- [[AI Reasoning Models]]
- [[Agentic Engineering]]
- [[Agent-Native Product Decomposition]]
- [[Menugen Architecture Pattern]]
- [[LLM Knowledge Bases Over Unstructured Data]]
- [[Markdown-based Installation (MD Scripts)]]