# Webwright

Open-source web-agent framework from [[Microsoft Research]] + University of Hong Kong (Yadong Lu, Lingrui Xu, Chao Huang, Ahmed Awadallah; published 2026-05-04). The thesis sits in the subtitle of the paper: *a terminal is all you need for web agents*. Instead of predicting one click at a time inside a persistent browser session, the agent writes [[Playwright]] and bash scripts in a terminal sandbox, and treats the workspace (not the browser) as the source of truth. MIT-licensed, ~1K lines of core code.

## The argument

Engineered web-agent harnesses (DOM selectors, coordinate prediction, fixed step-by-step loops) become *less* useful as models get better at code and reasoning. They constrain a capability the model already has. Webwright drops the harness: give the agent a shell, let it write code, observe output, iterate.

Three things it claims to fix:

- **Fragility**: pixel-level interactions break across UI variations
- **Inefficiency**: step-by-step click prediction needs many turns for tasks that could be a 20-line script
- **Non-reusability**: solutions can't be packaged or shared; every task starts from zero

## Architecture

Three modules, ~1,000 lines total:

1. **Runner**: sends task context + workspace state to the model
2. **Model endpoint**: returns thinking blocks and shell commands (typically Playwright-backed Python)
3. **Environment**: executes commands, returns terminal output, screenshots, logs, or errors

Loop: send context → emit bash → return observations → refine or finish.

Code split: core agent loop ~450 lines, Playwright environment ~570 lines, CLI ~150 lines.
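The send context → emit bash → return observations loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Webwright's actual code: `Observation`, `run_in_environment`, `run_task`, and `scripted_model` are hypothetical names, and the model is a stub that replays a fixed plan instead of calling an LLM.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Observation:
    """What the environment reports back after one command (hypothetical type)."""
    command: str
    returncode: int
    output: str


def run_in_environment(command: str, workdir: str = ".", timeout: float = 30.0) -> Observation:
    """Environment: execute one shell command and capture what the terminal printed."""
    proc = subprocess.run(
        command, shell=True, cwd=workdir, capture_output=True, text=True, timeout=timeout
    )
    return Observation(command, proc.returncode, (proc.stdout + proc.stderr).strip())


# A "model" here is any callable mapping (task, history) to the next shell
# command, or None once it considers the task finished.
Model = Callable[[str, list[Observation]], Optional[str]]


def run_task(task: str, model: Model, max_steps: int = 10) -> list[Observation]:
    """Runner: send context, execute the emitted command, feed the result back."""
    history: list[Observation] = []
    for _ in range(max_steps):
        command = model(task, history)
        if command is None:  # the model decided to finish
            break
        history.append(run_in_environment(command))
    return history


def scripted_model(task: str, history: list[Observation]) -> Optional[str]:
    """Stand-in for a real LLM endpoint: replays a fixed plan, one command per step."""
    plan = ["echo step-one", "echo step-two"]
    return plan[len(history)] if len(history) < len(plan) else None


if __name__ == "__main__":
    for obs in run_task("demo task", scripted_model):
        print(obs.returncode, obs.output)
```

Swapping `scripted_model` for a function that calls a real model endpoint would leave the loop itself unchanged, which is the appeal of keeping the runner this small.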
## Key mechanisms

- **Code as the action space**: Playwright scripts, not coordinate guesses or DOM selectors
- **Disposable browsers**: spawn and discard sessions; no state entanglement across tasks
- **Multi-step composition**: chain many web interactions inside one model step
- **Workspace persistence**: scripts, logs, and screenshots survive after the agent exits, inspectable and reusable
- **Self-verification**: the agent generates a final script + self-reflection config, reruns it in a fresh folder, and judges pass/fail before marking the task done
- **Context compaction**: long trajectories get summarised while concrete artifacts (files, screenshots) are kept verbatim

## Benchmark results (paper)

- **Online-Mind2Web** (300 live tasks, 136 sites): 86.7% with GPT-5.4 + Webwright at N=100 steps. Easy 96.2%, Medium 88.1%, Hard 76.6%. [[Claude Opus 4.7]] hits 84.7% overall but is stronger on Hard (80.5%).
- **Odysseys** (200 long-horizon tasks, ~76 avg steps): 60.1% with GPT-5.4. That is a 15.6 pp lift (35.1% relative) over the prior SOTA (Opus 4.6 at 44.5%) and a 26.6 pp lift (79.4% relative) over base GPT-5.4 (33.5%).
- **Small model lift**: [[Qwen]]-3.5-9B reaches 66.2% on the Online-Mind2Web hard split when augmented with Webwright's tool selection. The lift is not limited to frontier models; cheap models benefit too.
- **Cost**: $2.37/task with GPT-5.4 vs $6.09/task with Opus 4.7 (Opus uses fewer steps but is priced higher). The first 50 steps deliver ~82% accuracy; the next 50 add only 3-4 pp, a clear case of diminishing returns.

## Reusability angle

A completed Webwright script can be packaged as a reusable CLI with arguments and shared across harnesses ([[Codex CLI]], [[Claude Code]], [[OpenClaw]], [[Hermes Agent]]) via the shared `skills/webwright/` skill manifest (compatible with the [[AI Agent Skills]] / `agentskills.io` standard). That makes Webwright not just an agent but a *script forge*: the artifact, not the trajectory, is what gets reused.
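As a rough sketch of what "packaged as a reusable CLI with arguments" can look like, the example below wraps a solved task behind `argparse`. Everything in it is an assumption made for illustration: the tool name `flight-search`, the flag names, and the `search_flights` stub, which stands in for the Playwright script the agent would have written. It says nothing about the actual layout of the `skills/webwright/` manifest.

```python
import argparse
import json
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Expose the task's variable parts as command-line arguments."""
    parser = argparse.ArgumentParser(
        prog="flight-search",  # hypothetical tool name
        description="Re-run a previously solved flight-search task with new inputs.",
    )
    parser.add_argument("--origin", required=True, help="IATA code, e.g. SEA")
    parser.add_argument("--destination", required=True, help="IATA code, e.g. JFK")
    parser.add_argument("--depart", required=True, help="Departure date, YYYY-MM-DD")
    parser.add_argument("--return-date", dest="return_date", default=None,
                        help="Optional return date, YYYY-MM-DD")
    return parser


def search_flights(origin: str, destination: str, depart: str,
                   return_date: Optional[str] = None) -> dict:
    """Placeholder for the browser automation the agent originally wrote.

    A real tool would drive Playwright here; this stub only echoes its
    inputs so the packaging pattern can be shown without a browser.
    """
    return {
        "origin": origin,
        "destination": destination,
        "depart": depart,
        "return_date": return_date,
        "results": [],  # a real script would fill this from the page
    }


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Parse arguments, run the task, print JSON so other tools can consume it."""
    args = build_parser().parse_args(argv)
    result = search_flights(args.origin, args.destination, args.depart, args.return_date)
    print(json.dumps(result))
    return 0


if __name__ == "__main__":
    # Demo invocation with explicit arguments; a real entry point would call
    # main() with no arguments so that argparse reads sys.argv.
    main(["--origin", "SEA", "--destination", "JFK", "--depart", "2026-08-15"])
```

Printing JSON to stdout is one reasonable choice here because it lets a host agent or a shell pipeline consume the result without knowing how the script obtained it.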
## Install & quickstart

Two usage paths: a standalone CLI (Webwright drives its own LLM loop) or a Claude Code / Codex plugin (the host agent drives the loop natively, no extra API key).

Requirements: Python 3.10+, an isolated virtual environment (e.g. `uv venv`, `python -m venv`, `conda`), and an API key for the chosen backend if running the standalone CLI. The repo must be cloned locally for either path (`pip install -e .` is an editable install against the source tree).

```bash
git clone https://github.com/microsoft/Webwright.git
cd Webwright
uv venv && source .venv/bin/activate   # or python -m venv .venv && source .venv/bin/activate
uv pip install -e .                    # or pip install -e .
```

Browser binary depends on usage mode:

- Standalone CLI → `playwright install chromium`
- Plugin (Claude Code / Codex) → `playwright install firefox`. The skill forbids Chromium because some sites block its TLS fingerprint with `ERR_HTTP2_PROTOCOL_ERROR`.

Standalone CLI run:

```bash
export OPENAI_API_KEY=...   # or ANTHROPIC_API_KEY / OPENROUTER_API_KEY
python -m webwright.run.cli \
  -c base.yaml -c model_openai.yaml \
  -t "Search for flights from SEA to JFK on 2026-08-15 to 2026-08-20" \
  --start-url https://www.google.com/flights \
  --task-id demo_openai \
  -o outputs/default
```

Plugin install (from inside Claude Code):

```
/plugin marketplace add microsoft/Webwright
/plugin install webwright@webwright
```

Then use `/webwright:run <task>` (one-shot) or `/webwright:craft <task>` (reusable CLI tool).

Backends (CLI mode): [[OpenAI]], [[Anthropic]], OpenRouter. Core deps: Playwright, httpx, Pydantic, Typer.

## Accessibility synergy (paper sidenote)

The authors argue web accessibility infrastructure (ARIA metadata, semantic page representations) now benefits both assistive technology *and* LLM agents, and that agents can in turn act as a "repair layer" for inaccessible pages. Worth tracking.

## Why it matters

Two claims worth keeping:

1. **Less harness, more model**: as base models improve, structural scaffolding that constrains them becomes a tax, not a help. Webwright is the cleanest demonstration of this trend in the web-agent space.
2. **Artifact-first agents**: when the output is a reusable script in a persistent workspace, the agent stops being a black box of clicks and becomes a contributor to a personal codebase. Fits with skill-based harnesses like [[Hermes Agent]] and [[Claude Code]].

## References

- Project page: https://microsoft.github.io/Webwright/
- GitHub: https://github.com/microsoft/Webwright
- Microsoft Research article: https://www.microsoft.com/en-us/research/articles/webwright-a-terminal-is-all-you-need-for-web-agents/

## Related

- [[Playwright]]
- [[Browser Use CLI]]
- [[Scrapling]]: stealth scraping contrast (when the target actively fights bots, evasion beats agency)
- [[Claude Computer use]]: pixel-level agent contrast (Webwright bets on code-as-action instead)
- [[AI Agent Skills]]
- [[Claude Code]]
- [[Codex CLI]]
- [[OpenClaw]]
- [[Hermes Agent]]
- [[OpenHands]]
- [[Claude Opus 4.7]]
- [[Qwen]]