# Webwright
Open-source web-agent framework from [[Microsoft Research]] + University of Hong Kong (Yadong Lu, Lingrui Xu, Chao Huang, Ahmed Awadallah; published 2026-05-04). The thesis sits in the subtitle of the paper: *a terminal is all you need for web agents*. Instead of predicting one click at a time inside a persistent browser session, the agent writes [[Playwright]] and bash scripts in a terminal sandbox, and treats the workspace (not the browser) as the source of truth. MIT-licensed, ~1K lines of core code.
## The argument
Engineered web-agent harnesses (DOM selectors, coordinate prediction, fixed step-by-step loops) become *less* useful as models get better at code and reasoning. They constrain a capability the model already has. Webwright drops the harness: give the agent a shell, let it write code, observe output, iterate. Three things it claims to fix:
- **Fragility**: pixel-level interactions break across UI variations
- **Inefficiency**: step-by-step click prediction needs many turns for tasks that could be a 20-line script
- **Non-reusability**: solutions can't be packaged or shared; every task starts from zero
## Architecture
Three modules, ~1,000 lines total:
1. **Runner** — sends task context + workspace state to the model
2. **Model endpoint** — returns thinking blocks and shell commands (typically Playwright-backed Python)
3. **Environment** — executes commands, returns terminal output, screenshots, logs, or errors
Loop: the runner sends context → the model emits bash → the environment returns observations → the model refines or finishes.
Code split: core agent loop ~450 lines, Playwright environment ~570 lines, CLI ~150 lines (about 1,170 in all, so "~1,000" is a round figure).
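The loop can be sketched in a few lines of Python. This is an illustration under assumptions, not Webwright's actual code: `ModelReply`, `Observation`, `run_in_shell`, and `agent_loop` are hypothetical names, and the model is any callable that maps the task plus the history so far to a reply.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class ModelReply:
    thinking: str            # free-form reasoning returned by the model
    command: Optional[str]   # shell command to run, or None when the model is done


@dataclass
class Observation:
    stdout: str
    stderr: str
    returncode: int


History = List[Tuple[ModelReply, Observation]]


def run_in_shell(command: str, cwd: str = ".", timeout: float = 60.0) -> Observation:
    """Environment stand-in: run one shell command and capture its output."""
    proc = subprocess.run(
        command, shell=True, cwd=cwd, capture_output=True, text=True, timeout=timeout
    )
    return Observation(proc.stdout, proc.stderr, proc.returncode)


def agent_loop(
    task: str,
    model: Callable[[str, History], ModelReply],
    environment: Callable[[str], Observation] = run_in_shell,
    max_steps: int = 100,
) -> History:
    """Runner stand-in: alternate model replies and observations until done."""
    history: History = []
    for _ in range(max_steps):
        reply = model(task, history)           # send context, get a command back
        if reply.command is None:              # the model signals it is finished
            break
        history.append((reply, environment(reply.command)))
    return history
```

Swapping `environment` for a fake callable makes the loop testable without a shell or a browser, which is one reason to keep the three roles separate.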
## Key mechanisms
- **Code as the action space** — Playwright scripts, not coordinate guesses or DOM selectors
- **Disposable browsers** — spawn and discard sessions; no state entanglement across tasks
- **Multi-step composition** — chain many web interactions inside one model step
- **Workspace persistence** — scripts, logs, screenshots survive after the agent exits, inspectable + reusable
- **Self-verification** — agent generates a final script + self-reflection config, reruns it in a fresh folder, and judges pass/fail before marking the task done
- **Context compaction** — long trajectories get summarised while concrete artifacts (files, screenshots) are kept verbatim
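A minimal sketch of the context-compaction idea, assuming a trajectory is a list of steps that each carry text plus artifact paths. `Step`, `compact`, and `default_summarize` are hypothetical names, and the summarizer is a placeholder for whatever the model would write; the paper's actual mechanism may differ.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    text: str                                            # reasoning plus terminal output
    artifacts: List[str] = field(default_factory=list)   # paths to files or screenshots


def default_summarize(steps: List[Step]) -> str:
    """Placeholder summarizer; a real agent would ask the model for this text."""
    return f"[summary of {len(steps)} earlier steps]"


def compact(
    steps: List[Step],
    keep_last: int = 2,
    summarize: Callable[[List[Step]], str] = default_summarize,
) -> List[Step]:
    """Collapse older steps into one summary step, keeping artifact paths verbatim."""
    if len(steps) <= keep_last:
        return list(steps)
    cut = len(steps) - keep_last
    old, recent = steps[:cut], steps[cut:]
    # Text is summarised, but every artifact path from the old steps survives.
    kept_artifacts = [path for step in old for path in step.artifacts]
    return [Step(summarize(old), kept_artifacts)] + recent
```

Only the prose shrinks: the summary step still points at every script, log, and screenshot, so the agent can reopen them from the workspace later.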
## Benchmark results (paper)
- **Online-Mind2Web** (300 live tasks, 136 sites): 86.7% with GPT-5.4 + Webwright at N=100 steps. Easy 96.2%, Medium 88.1%, Hard 76.6%. [[Claude Opus 4.7]] hits 84.7% overall but is stronger on Hard (80.5%).
- **Odysseys** (200 long-horizon tasks, ~76 avg steps): 60.1% with GPT-5.4, a 15.6 pp lift (35.1% relative) over the prior SOTA (Opus 4.6 at 44.5%) and a 79.4% relative improvement over base GPT-5.4 (33.5%).
- **Small model lift**: [[Qwen]]-3.5-9B reaches 66.2% on the Online-Mind2Web Hard split when augmented with Webwright's tool selection. The framework amplifies cheap models, not just frontier ones.
- **Cost**: $2.37/task with GPT-5.4 vs $6.09/task with Opus 4.7 (Opus uses fewer steps but is priced higher). First 50 steps deliver ~82% accuracy; the next 50 add only 3-4 pp — diminishing returns are clear.
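The Odysseys and cost numbers mix percentage points with relative percentages, so it helps to recompute them from the quoted scores. The snippet below uses only figures from this note; the helper names are illustrative.

```python
def pp_lift(new: float, old: float) -> float:
    """Absolute difference between two scores, in percentage points."""
    return new - old


def relative_lift(new: float, old: float) -> float:
    """Improvement of `new` over `old`, as a percentage of `old`."""
    return (new - old) / old * 100.0


ODYSSEYS_WEBWRIGHT = 60.1   # GPT-5.4 + Webwright
ODYSSEYS_PRIOR_SOTA = 44.5  # Opus 4.6
ODYSSEYS_BASE_GPT = 33.5    # base GPT-5.4

print(round(pp_lift(ODYSSEYS_WEBWRIGHT, ODYSSEYS_PRIOR_SOTA), 1))        # 15.6
print(round(relative_lift(ODYSSEYS_WEBWRIGHT, ODYSSEYS_PRIOR_SOTA), 1))  # 35.1
print(round(relative_lift(ODYSSEYS_WEBWRIGHT, ODYSSEYS_BASE_GPT), 1))    # 79.4

# Cost per task: Opus 4.7 relative to GPT-5.4
print(round(6.09 / 2.37, 2))                                             # 2.57
```

So against the prior SOTA the gain is 15.6 percentage points, which is 35.1% in relative terms, and Opus 4.7 costs roughly 2.6x as much per task as GPT-5.4.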
## Reusability angle
A completed Webwright script can be packaged as a reusable CLI with arguments and shared across harnesses — [[Codex CLI]], [[Claude Code]], [[OpenClaw]], [[Hermes Agent]] — via the shared `skills/webwright/` skill manifest (compatible with the [[AI Agent Skills]] / `agentskills.io` standard). That makes Webwright not just an agent but a *script forge*: the artifact, not the trajectory, is what gets reused.
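What "packaged as a reusable CLI with arguments" might look like, as a sketch only. The flag names and `search_flights` are hypothetical, the Playwright logic is replaced by a placeholder, and `argparse` is used here purely to keep the example dependency-free; it is not a claim about how Webwright generates its tools.

```python
import argparse
import json
import sys
from datetime import date
from typing import List, Optional


def search_flights(origin: str, destination: str, depart: date, return_date: date) -> dict:
    """Placeholder for the Playwright logic the agent would have written."""
    return {
        "origin": origin,
        "destination": destination,
        "depart": depart.isoformat(),
        "return": return_date.isoformat(),
        "results": [],  # a real script would fill this from the page
    }


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Reusable flight search (sketch)")
    parser.add_argument("--origin", required=True)
    parser.add_argument("--destination", required=True)
    # ISO dates are parsed up front so bad input fails before any browser work.
    parser.add_argument("--depart", required=True, type=date.fromisoformat)
    parser.add_argument("--return-date", required=True, type=date.fromisoformat)
    return parser


def main(argv: Optional[List[str]] = None) -> int:
    args = build_parser().parse_args(argv)
    if args.return_date < args.depart:
        print("return date precedes departure", file=sys.stderr)
        return 2
    result = search_flights(args.origin, args.destination, args.depart, args.return_date)
    print(json.dumps(result))
    return 0
```

Saved as a script, `main()` would be called from an `if __name__ == "__main__":` guard. The point is that what was hard-coded in one trajectory (airports, dates) becomes parameters, so the same artifact serves the next task.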
## Install & quickstart
Two usage paths: a standalone CLI (Webwright drives its own LLM loop) or a Claude Code / Codex plugin (the host agent drives the loop natively, no extra API key).
Requirements: Python 3.10+, an isolated virtual environment (e.g. `uv venv`, `python -m venv`, `conda`), and an API key for the chosen backend if running the standalone CLI. The repo must be cloned locally for either path (`pip install -e .` performs an editable install against the source tree).
```bash
git clone https://github.com/microsoft/Webwright.git
cd Webwright
uv venv && source .venv/bin/activate # or python -m venv .venv && source .venv/bin/activate
uv pip install -e . # or pip install -e .
```
Browser binary depends on usage mode:
- Standalone CLI → `playwright install chromium`
- Plugin (Claude Code / Codex) → `playwright install firefox`. The skill forbids Chromium because some sites reject its TLS fingerprint, which surfaces as `ERR_HTTP2_PROTOCOL_ERROR`.
Standalone CLI run:
```bash
export OPENAI_API_KEY=... # or ANTHROPIC_API_KEY / OPENROUTER_API_KEY
python -m webwright.run.cli \
-c base.yaml -c model_openai.yaml \
-t "Search for flights from SEA to JFK on 2026-08-15 to 2026-08-20" \
--start-url https://www.google.com/flights \
--task-id demo_openai \
-o outputs/default
```
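For running several tasks in a row, the same invocation can be assembled from Python. This is a hedged sketch: `build_cli_command` is a hypothetical helper, and the flags are copied from the command above rather than checked against the CLI's full option list.

```python
import shlex
from typing import List, Sequence


def build_cli_command(
    task: str,
    start_url: str,
    task_id: str,
    output_dir: str = "outputs/default",
    configs: Sequence[str] = ("base.yaml", "model_openai.yaml"),
) -> List[str]:
    """Build the argv list for one standalone Webwright run."""
    cmd = ["python", "-m", "webwright.run.cli"]
    for config in configs:
        cmd += ["-c", config]          # one -c flag per config file
    cmd += ["-t", task, "--start-url", start_url, "--task-id", task_id, "-o", output_dir]
    return cmd


demo = build_cli_command(
    task="Search for flights from SEA to JFK on 2026-08-15 to 2026-08-20",
    start_url="https://www.google.com/flights",
    task_id="demo_openai",
)
print(shlex.join(demo))  # shell-quoted form, handy for logging
```

Passing the list to `subprocess.run(demo, check=True)` avoids shell-quoting problems in task strings that contain spaces or quotes.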
Plugin install (from inside Claude Code):
```
/plugin marketplace add microsoft/Webwright
/plugin install webwright@webwright
```
Then use `/webwright:run <task>` (one-shot) or `/webwright:craft <task>` (reusable CLI tool).
Backends (CLI mode): [[OpenAI]], [[Anthropic]], OpenRouter. Core deps: Playwright, httpx, Pydantic, Typer.
## Accessibility synergy (paper sidenote)
The authors argue web accessibility infrastructure (ARIA metadata, semantic page representations) now benefits both assistive technology *and* LLM agents, and that agents can in turn act as a "repair layer" for inaccessible pages. Worth tracking.
## Why it matters
Two claims worth keeping:
1. **Less harness, more model**: as base models improve, structural scaffolding that constrains them becomes a tax, not a help. Webwright is the cleanest demonstration of this trend in the web-agent space.
2. **Artifact-first agents**: when the output is a reusable script in a persistent workspace, the agent stops being a black box of clicks and becomes a contributor to a personal codebase. Fits with skill-based harnesses like [[Hermes Agent]] and [[Claude Code]].
## References
- Project page: https://microsoft.github.io/Webwright/
- GitHub: https://github.com/microsoft/Webwright
- Microsoft Research article: https://www.microsoft.com/en-us/research/articles/webwright-a-terminal-is-all-you-need-for-web-agents/
## Related
- [[Playwright]]
- [[Browser Use CLI]]
- [[Scrapling]] — stealth scraping contrast (when the target actively fights bots, evasion beats agency)
- [[Claude Computer use]] — pixel-level agent contrast (Webwright bets on code-as-action instead)
- [[AI Agent Skills]]
- [[Claude Code]]
- [[Codex CLI]]
- [[OpenClaw]]
- [[Hermes Agent]]
- [[OpenHands]]
- [[Claude Opus 4.7]]
- [[Qwen]]