Loop Engineering - DeveloPassion

# Loop Engineering

## What it is

Loop engineering is the discipline of designing the system that drives an autonomous agent, instead of prompting it step by step. [[Simon Willison]]'s definition of an LLM agent is sharp: "something that runs tools in a loop to achieve a goal." The art, he says, is carefully designing the tools and the loop. Loop engineering is where that art lives.

The shift is about what you spend your time on. Prompt engineering asks: what do I say to the model right now? Loop engineering asks: what is the trigger, what are the tools, what counts as done, and what stops the agent if it isn't? You author the system once. The agent runs.

This sits at the top of a layered stack. [[Prompt Engineering]] sits inside [[Context Engineering]], which sits inside [[Harness Engineering]], which sits inside loop engineering. You still write prompts; you still curate context. Loop engineering is the layer that puts it all in motion.

## Why it matters now

The bottleneck moved. For years, the constraint in AI-assisted work was the model itself: too short a context window, too many reasoning failures, too much recovery needed from the human. That changed around 2025-2026. METR measured it concretely: Claude Opus 4.6 now has a 50% success rate on tasks that take a human roughly 12 hours, up from roughly 1 hour 40 minutes a year earlier. The model can run long. It can recover from its own mistakes. The question is no longer "can the model do this?" but "have I designed the loop well enough that it will?"

The most provocative data point on this comes from Terminal-Bench 2.0. The same model swings 30-50 percentage points on benchmark performance depending on which harness is running it. Claude Code vs. OpenHands vs. a homegrown loop, same underlying model, radically different results. When someone tells you "model X is best for agents," the right question is: which harness?

[[Boris Cherny]], who built Claude Code: "I don't prompt Claude anymore. My job is to create loops."
[[Peter Steinberger]], who built OpenClaw, makes the same argument: stop prompting coding agents; design the loops that prompt them. [[Jensen Huang]] said something similar: "Nobody writes prompts anymore. The new job is to write and handle loops." It was Cherny's and Steinberger's posts going viral that pushed the phrase into the mainstream; by late June 2026 even [[Andrew Ng]] was writing about his own loops in The Batch.

That is not hype. That is where the leverage is.

## How it works

A well-designed loop has six parts:

**Trigger**: what starts the loop. This can be a file change, a cron job, a human command, an API event, or the completion of another loop. The trigger determines scope and timing.

**Tools**: what the agent can call. Shell commands, file operations, APIs, search, sub-agents. Tool design matters enormously. Willison's framing: it's not just about designing the loop, it's about carefully designing the tools the loop runs. Dangerous or overly broad tools in the loop are a safety problem, not just a design smell.

**Goal**: a clear, verifiable target state. "Make the tests pass and the type-checker clean" is a goal. "Help me with the codebase" is not. The goal needs to be testable by something other than the agent's own judgment.

**Verifier**: a deterministic check that runs when the agent reports completion. This is the most important part of the loop and the most skipped. The agent stops when it *feels* done. The verifier checks whether it actually is (see [[AI Verifiability]]; this is also why verifiable domains get the biggest agent gains). [[Claude Code Hooks]] give you exactly this: a hook that intercepts the agent's exit signal, runs real completion criteria (tests green, coverage threshold met, type-check clean), and reinjects the goal if the criteria are not met. Trust the verifier, never the agent's self-report.

**Stopping condition**: separate from the verifier.
This fires when the loop should stop regardless of completion: max iterations reached, cost ceiling hit, a specific error type detected, a human-review flag raised. Without an explicit stopping condition, a loop that hits a bad state can spiral expensively.

**Memory**: the durable spine outside any conversation (see [[AI Agent Memory]]). Every run starts from scratch unless the loop records what happened: run history, lessons learned, current state. Without memory, the loop keeps paying tokens to rediscover the same problems. [[Addy Osmani]] said it best: "The agent forgets, the repo doesn't." A markdown file is enough. Anthropic's own guidance for long-running agents says the same: give the agent a place to write notes.

The agent then runs: act, observe, decide, repeat. ReAct (Yao et al., 2022) formalized this structure. Reflexion (Shinn et al., 2023) added self-correction via verbal feedback. The lineage is well-established.

A widely shared Google paper on loop engineering shows the cycle in its purest form, applied to compiler optimization: the LLM proposes a code transformation, the compiler runs it and reports back (valid? faster? by how much?), the model reads that feedback and adjusts its next move, and the cycle repeats until it stops finding improvements. The agent gets better purely from grounded feedback inside its own context window. No fine-tuning. Just a tight loop with a source of truth at the bottom.

### Loops nest

[[Andrew Ng]] frames product building as three loops running at different speeds, each feeding the next:

- **Agentic coding loop** (minutes): spec and evals in, working code out. The agent writes, tests its own work, and iterates until the code meets the spec. Ng built a typing app for his daughter this way; the agent worked for about an hour, checking its output in a web browser several times, without needing him once.
- **Developer feedback loop** (tens of minutes to hours): the developer reviews the product and steers.
A year ago developers were the QA function for their agents. Now that agents test their own code, the human role moved up to product decisions: which features, where the UI fails, what the spec should say.
- **External feedback loop** (hours to weeks): friends, alpha testers, A/B tests. Slow, but this is what feeds the developer's vision, which drives the spec, which drives the coding agent.

The inner loop is fast and cheap. The outer loops are slow and carry the judgment. Loop engineering is designing all three and knowing which one you're standing in.

### Three shapes of loop

[[Matt Van Horn]]'s distinction, and the one almost everyone trips on. Three shapes, three different jobs:

- **Goal**: run until a verifiable condition is true, then stop. "Fix it until the tests pass." A separate model checks completion after every turn.
- **Interval loop**: repeat on a timer while you're present. "Every 5 minutes, check the deploy."
- **Routine**: run on a schedule while you're gone. "Every night, review my open PRs."

Getting the shape right matters because tooling maps to it directly ([[Claude Code]]: `/goal`, `/loop`, `/schedule`). Pick the wrong shape and the loop either never stops or never starts.

## Recommendations

**Write the verifier before the loop.** If you cannot write a deterministic check for "done," your goal is not specific enough. Tighten the goal first.

**Read about premature stopping before you ship anything.** This is the central failure mode: the agent halts when its subjective confidence is high, not when the task is actually finished. Every serious loop needs a verifier that can say no.

**Build loops out of battle-tested skills.** Austin Marchese calls this skill-driven loop development, and it's the right order of operations: never wire a loop around instructions you haven't already run by hand and refined. [[AI Agent Skills]] are the natural building blocks here; a skill you've battle-tested knows how you want the task done.
A loop built on top of it inherits that. A loop built on a vague prompt inherits the vagueness, then repeats it autonomously.

**Roll out in phases: report, assist, then unattended.** Week one, the loop only reports what it would do. Then it proposes fixes you approve. Only after that does it run unattended. The cobusgreyling/loop-engineering repo bakes this L1 → L2 → L3 progression into every pattern it ships. Same idea at the micro level: keep the first runs of any new loop in training mode, pausing at each step for your approval, until you've seen it do what you actually meant.

**Scale with subagents in isolated worktrees.** The main loop decomposes the task, spawns [[AI Subagents]] in isolated worktrees (each with its own context window, model tier, and permissions), collects results, and decides what to do next. This protects your main context window and lets you route cheap subtasks to cheaper models.

**Scope permissions and sandbox before you run.** [[AI Agent Permissions]] are part of the loop design, not an afterthought. Willison's warning: "An AI agent is an LLM wrecking its environment in a loop." YOLO mode (auto-approve on all shell commands) is where real productivity is and also where the real danger is: bad shell commands, secret exfiltration, the machine used as a proxy for attacks. Define what the loop can touch before you start, not after something goes wrong.

**Route models by subtask, not uniformly.** A frontier model for planning and reasoning; a cheaper, faster model for mechanical subtasks. The harness manages this. The goal is accuracy per dollar, not the best model everywhere.

**Put a different model family in the checker seat.** An agent grading its own homework will delete the failing test and call it done. A separate verifier model helps; a verifier from a *different* model family helps more, because it doesn't share the worker's blind spots.
The Clodex pattern (Codex reviewing Claude's pull requests before merge, capped at 5 iterations) is the cleanest example: two model families have to agree before code lands.

**Watch cost.** Errors compound in loops. A bad state in iteration 3 becomes a worse state in iteration 8 if the verifier doesn't catch it. And the bills are not theoretical: Uber capped its engineers at $1,500 per AI tool per month after burning through its annual AI budget in four months, and one Reddit user torched around $6,000 overnight with a single unbounded command. Every goal gets a budget; every loop gets a cap. Set the ceiling before you walk away, not after the invoice arrives. Log everything.

## Tips and tricks

**Run the four-condition test before building any loop.** Does the task repeat? Is there a clear definition of done? Can you afford the tokens if it wanders? Does the loop have the tools to verify its own work? Four yeses make a loop candidate. Anything else stays a prompt.

**Stop hooks are the most underused primitive.** [[Claude Code Hooks]] let you intercept exits deterministically. A stop hook that rejects agent self-reports and runs your own checks is not a nice-to-have; it's what makes the loop trustworthy.

**The harness outweighs the model.** Terminal-Bench 2.0 showed 30-50 point swings. Design the harness first; choose the model second.

**Loops only pay off with a strict validation gate.** Without one, you get an agent agreeing with itself on repeat. That is not autonomous work; that is expensive noise. Or, as one practitioner put it during the June 2026 wave: a loop that cannot tell good output from bad just automates being wrong, faster.

**One green run is luck. A streak is reliability.** Don't stop at the first clean pass. The quality-streak pattern only declares victory after N consecutive clean runs, and any new failure resets the count. This respects how flaky "it works" really is.
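
The quality-streak pattern described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not an implementation from any of the cited sources: `run_until_streak`, `run_once`, `required_streak`, and `max_runs` are hypothetical names, and `run_once` stands in for whatever deterministic check you actually trust (a test run, a type-check, a lint pass).

```python
from typing import Callable


def run_until_streak(
    run_once: Callable[[], bool],
    required_streak: int = 3,
    max_runs: int = 20,
) -> bool:
    """Return True once `run_once` passes `required_streak` times in a row.

    Any failure resets the streak. `max_runs` is a hard cap (an assumed
    default) so a flaky check cannot keep the loop spinning forever.
    """
    streak = 0
    for _ in range(max_runs):
        if run_once():
            streak += 1
            if streak >= required_streak:
                return True
        else:
            streak = 0  # a new failure resets the count
    return False


# Simulated verifier results: one lucky pass, a failure, then three clean runs.
results = iter([True, False, True, True, True])
print(run_until_streak(lambda: next(results), required_streak=3, max_runs=5))
```

In this simulated run the first green result does not count as success; only the three consecutive passes at the end do. The `max_runs` cap doubles as a stopping condition, so the streak check never becomes an unbounded loop of its own.
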

**Add anti-spin stops.** Most loops never ask whether they are actually making progress; they retry the same broken approach, or quietly edit the test until it passes. No-progress detection, retry caps, and flip-flop detection (the loop alternating between two approaches) are cheap to add and catch exactly this.

**If human review is your bottleneck, a loop just floods the queue.** Measure where the actual constraint is before adding automation. Loops move throughput; they do not improve quality gates.

**Do not let permission creep happen.** Scope at design time. Once an agent has broad shell access, you are trusting every tool call it makes, forever.

**Context discipline matters inside the loop.** Longer context windows per subagent are not free. [[Context Engineering]] (what you put in, what you leave out, what you refresh) is still a skill, just applied at the loop level instead of the prompt level.

**"Taste" is a context advantage, not magic.** [[Andrew Ng]]'s reframing of why humans stay in the loop: for nearly every product, the human knows far more about the users and the operating context than the AI does. Human-in-the-loop is how that knowledge gets injected into the system. So long as you know something the agent doesn't, your review step isn't ceremony; it's the highest-bandwidth input the loop has. You move up a loop, not out of the loop.

**A loop is not automatically a control system.** The sharpest pushback on the trend came from the control-theory crowd: if one stochastic component generates output and another stochastic component reviews it, you may just have a faster stochastic loop with better branding. Recursion with a dashboard. Deterministic anchors, measurable error signals, bounded failure modes, and an accountable human owner are what turn a loop into control. Vitalii Oborskyi's summary under Ng's post nails it: the industry is not short of loops; it is short of control.
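
As one concrete instance of the deterministic anchors and bounded failure modes mentioned above, the anti-spin stops from earlier in this list (no-progress detection, retry caps, flip-flop detection) might look like the Python sketch below. Everything here is an illustrative assumption: `AntiSpinGuard`, its thresholds, and the idea of labelling each iteration with an `approach` string are hypothetical, and `score` is taken to be any deterministic error signal where lower is better, such as the number of failing tests.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AntiSpinGuard:
    """Decides when a loop should halt because it is no longer progressing.

    All names and default thresholds are illustrative assumptions.
    """
    max_iterations: int = 10   # retry cap
    max_stalled: int = 3       # iterations allowed without improvement
    best_score: Optional[float] = None
    stalled: int = 0
    iterations: int = 0
    history: List[str] = field(default_factory=list)

    def check(self, score: float, approach: str) -> Optional[str]:
        """Record one iteration; return a stop reason, or None to continue."""
        self.iterations += 1
        self.history.append(approach)

        # No-progress detection: only a strictly better score counts.
        if self.best_score is None or score < self.best_score:
            self.best_score = score
            self.stalled = 0
        else:
            self.stalled += 1

        if self.iterations >= self.max_iterations:
            return "retry cap reached"
        if self.stalled >= self.max_stalled:
            return "no progress"

        # Flip-flop detection: the last four approaches alternate A, B, A, B.
        last = self.history[-4:]
        if len(last) == 4 and last[0] == last[2] and last[1] == last[3] and last[0] != last[1]:
            return "flip-flop between two approaches"
        return None


# Simulated run: failing-test count stops improving while the agent
# alternates between two hypothetical patches.
guard = AntiSpinGuard(max_iterations=10, max_stalled=3)
for score, approach in [(5, "patch-a"), (4, "patch-b"), (4, "patch-a"), (4, "patch-b")]:
    reason = guard.check(score, approach)
    if reason:
        print(f"stopping after {guard.iterations} iterations: {reason}")
        break
```

The guard knows nothing about the agent or the model. It only sees a measurable error signal and a label per iteration, which is what keeps the stop decision deterministic rather than a matter of the agent's own judgment.
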

**The "Brute Squad" framing is useful.** Sourcegraph described agentic coding as brute-force autonomous agents. That is an honest description. Loops are not elegant; they are persistent. The value is iteration speed, not elegance.

**plentysun's pattern (Claude Code features):** context discipline plus hooks that force steps. The hook is not optional scaffolding; it is the loop's backbone.

## References

### Foundational

- Simon Willison, "Designing agentic loops" (2025-09-30): https://simonwillison.net/2025/Sep/30/designing-agentic-loops/
- Anthropic, "Building Effective Agents" (2024-12): https://www.anthropic.com/news/building-effective-agents
- Anthropic, "Effective harnesses for long-running agents": https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents
- Anthropic, "Effective context engineering for AI agents" (HN discussion): https://news.ycombinator.com/item?id=45418251
- ReAct (Yao et al., 2022): https://arxiv.org/abs/2210.03629
- Reflexion (Shinn et al., 2023): https://arxiv.org/abs/2303.11366

### 2026 commentary

- Andrew Ng, "3 key loops for building 0-to-1 products", The Batch issue 359 (2026-06-26): https://www.deeplearning.ai/the-batch/issue-359 (LinkedIn version: https://www.linkedin.com/posts/andrewyng_loop-engineering-is-a-hot-buzzphrase-after-share-7477753882505338880-dBJ-/)
- Matt Van Horn, "WTF Is a Loop? Part 2: The 15 Loops People Are Actually Running" (2026-06-20): https://www.linkedin.com/pulse/wtf-loop-part-2-15-loops-people-actually-running-steal-matt-van-horn-xgkkc/
- Austin Marchese, "Stop Prompting Claude. Start Loop Engineering." (YouTube, 2026-06-19): https://www.youtube.com/watch?v=YAS4ojuhbW4
- Movez on X (summary of the Google 19-page loop engineering PDF: act → observe → learn → repeat): https://x.com/0xMovez/status/2069500921382326531
- Kent C.
Dodds on X ("I've been loop engineering for months", with video): https://x.com/kentcdodds/status/2069510257525874923
- Data Science Dojo, "Agentic Loops: From ReAct to Loop Engineering (2026 Guide)": https://datasciencedojo.com/blog/agentic-loops-explained-from-react-to-loop-engineering-2026-guide/
- bdtechtalks, "Demystifying loop engineering" (2026-06-22): https://bdtechtalks.com/2026/06/22/ai-loop-engineering/
- Requesty, "Loop Engineering: How to Build AI Agent Loops That Run Themselves": https://www.requesty.ai/blog/loop-engineering-how-to-build-ai-agent-loops-that-run-themselves
- Augment Code, "Agentic Design Patterns (2026 Pattern Catalog)": https://www.augmentcode.com/guides/agentic-design-patterns
- Sourcegraph, "The Brute Squad": https://sourcegraph.com (Readwise highlight)

### Curated lists

- cobusgreyling/loop-engineering (7 production patterns, starters, and the loop-audit / loop-init / loop-cost / loop-sync / loop-context CLIs; five building blocks + memory): https://github.com/cobusgreyling/loop-engineering
- serenakeyitan/awesome-agent-loops: https://github.com/serenakeyitan/awesome-agent-loops
- Picrew/awesome-agent-harness: https://github.com/Picrew/awesome-agent-harness
- RyanAlberts/best-of-Agent-Harnesses: https://github.com/RyanAlberts/best-of-Agent-Harnesses
- ai-boost/awesome-harness-engineering: https://github.com/ai-boost/awesome-harness-engineering

### Tools and repos

- earendil-works/pi: https://github.com/earendil-works/pi
- snarktank/ralph (the [[Ralph Loop]]): https://github.com/snarktank/ralph
- the-open-engine/zeroshot: https://github.com/the-open-engine/zeroshot

### Talks and threads

- Louis Bouchard, "Loop Engineering Explained" (YouTube): https://www.youtube.com/watch?v=NjXIIH9vcv0
- HN: "The unreasonable effectiveness of an LLM agent loop with tool use": https://news.ycombinator.com/item?id=43998472
- HN: "Designing agentic loops": https://news.ycombinator.com/item?id=45426680
- Paweł Huryn on X (Cherny /
Huang quotes): https://x.com/PawelHuryn/status/2069315068664197315
- Graham Neubig on X (his agent loop): https://x.com/gneubig/status/2064011013637234728

## Related

[[Harness Engineering]] · [[Agentic loops]] · [[How Coding Agents Work]] · [[AI Agent Harnesses (MoC)]] · [[Agentic Engineering]] · [[AI Guardrails]] · [[Claude Code]] · [[Feedback Loop]] · [[Levels of AI use]] · [[AI Agent Skills]] · [[AI Skill Best Practices]] · [[AI Verifiability as a Capability Ceiling]] · [[Ralph Loop]] · [[AI Agent Memory]] · [[AI Agent Orchestration]] · [[Claude Code Hooks]] · [[Goal Engineering]] · [[Comprehension Debt]]