Microsoft Research Releases Webwright: A Terminal-Native Web Agent Framework That Scores 60.1% on Odysseys, Up from Base GPT-5.4’s 33.5%
Most web agents today drive a browser one action at a time. The model receives the current page state, as a screenshot or DOM text, and predicts the next click, keypress, or scroll. This action-at-a-time design made sense when language models had limited reasoning ability. As models have become more capable at writing and debugging code, that rigid loop has become a constraint rather than a helpful structure.
Microsoft Research’s AI Frontiers lab built a different approach. Their new open-source framework, Webwright, gives the agent a terminal instead of a stateful browser session. The agent writes Playwright code to control browsers, runs bash commands, inspects logs, and iteratively refines scripts. Playwright is an open-source browser automation library, also from Microsoft, that supports programmatic control of Chromium, Firefox, and WebKit browsers.
What Webwright Does Differently
Webwright separates the agent from the browser and treats the browser as something the agent can launch, inspect, and discard while developing a program. The persistent artifact is not the browser session but the code and logs in the local workspace.
This is the same model a developer uses when writing an RPA (Robotic Process Automation) script. Instead of manually clicking through a site each time, they write a script once. That script can be rerun, adapted, and shared. Webwright applies this to LLM-powered agents.
The system has three core components: a Runner, a Model Endpoint, and a terminal Environment. The runner is about 150 lines of code, the model interface about 550 lines, and the environment about 300 lines. There is no multi-agent orchestration or complex planning hierarchy, just a single agent loop.
All intermediate code, logs, screenshots, and results are saved in the workspace, making each run easy to inspect.

The Agent Loop
The Runner sends the current context to the model. The model returns a thinking block and a shell command. That command runs in the Environment, which returns terminal output, logs, screenshots, or error tracebacks. These observations go back into the context, and the loop continues.
Rather than issuing one primitive action at a time, a coding agent can naturally express multi-step interactions, such as selecting a date or filling out an entire form, as a compact program. Loops, functions, and abstractions allow the agent to generalize across similar tasks without repeatedly predicting similar sequences of low-level steps.
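The loop described above can be sketched in a few lines. This is a minimal illustration, not Webwright's actual runner: the reply shape (`thinking`, `command`, `done`), the function name `run_agent_loop`, and the stand-in `scripted_model` are all assumptions made for the example.

```python
import subprocess
from typing import Callable


def run_agent_loop(model: Callable[[list[str]], dict], task: str,
                   max_steps: int = 10) -> list[str]:
    """Ask the model for a shell command, run it, and feed the output back."""
    context = [f"TASK: {task}"]
    for _ in range(max_steps):
        # Hypothetical reply shape: {"thinking": str, "command": str, "done": bool}
        reply = model(context)
        if reply.get("done"):
            break
        result = subprocess.run(reply["command"], shell=True,
                                capture_output=True, text=True, timeout=60)
        # Terminal output and errors become the next observation.
        observation = result.stdout + result.stderr
        context.append(f"$ {reply['command']}\n{observation}")
    return context


def scripted_model(context: list[str]) -> dict:
    """Stand-in for a real model endpoint: one command, then done."""
    if len(context) == 1:
        return {"thinking": "try a command", "command": "echo hello", "done": False}
    return {"thinking": "finished", "command": "", "done": True}
```

Calling `run_agent_loop(scripted_model, "say hello")` returns the task line plus one command/observation entry. A real endpoint would replace `scripted_model`.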
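As a rough illustration of the "compact program" idea, the sketch below fills a whole form in one function call. It assumes `page` behaves like a Playwright sync-API `Page`, which provides `fill(selector, value)` and `click(selector)`. The selectors and the `RecordingPage` stand-in are hypothetical and exist only so the sketch runs without a browser.

```python
class RecordingPage:
    """Hypothetical stand-in that records calls instead of driving a browser."""

    def __init__(self) -> None:
        self.calls: list[tuple] = []

    def fill(self, selector: str, value: str) -> None:
        self.calls.append(("fill", selector, value))

    def click(self, selector: str) -> None:
        self.calls.append(("click", selector))


def fill_form(page, fields: dict[str, str], submit_selector: str) -> None:
    """Fill every field in one pass, then submit the form."""
    for selector, value in fields.items():
        page.fill(selector, value)
    page.click(submit_selector)
```

One call such as `fill_form(page, {"#from": "SEA", "#to": "JFK"}, "#search")` replaces three separately predicted actions, and the same function can be reused on any form with different selectors.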
Two Engineering Challenges
Premature ‘done’ and context explosion are the two core issues. With open-ended bash actions, the model must self-report completion and often claims success without actually finishing. The team added a gate: the agent must generate a self-reflection config, run a final script in a fresh folder with logs and screenshots, and pass its own self-reflection judgment, which outputs success or failure, before emitting done: true. Otherwise, the flag is dropped and the agent retries.
For context length, long coding trajectories quickly exceed context limits, so the team compacts the history every 20 steps into a single summary.
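The completion gate can be pictured as a small filter on the model's reply. The sketch below is an assumption about the general shape, not the project's code: `gate_done`, `run_final_script`, and `judge` are hypothetical names, and the reply is assumed to be a dict with a `done` key.

```python
from typing import Callable


def gate_done(reply: dict,
              run_final_script: Callable[[], dict],
              judge: Callable[[dict], bool]) -> dict:
    """Keep done=True only if a fresh final run passes the self-reflection check."""
    if not reply.get("done"):
        return reply
    # Evidence (logs, screenshots) from re-running the final script in a clean folder.
    evidence = run_final_script()
    if judge(evidence):
        return reply
    gated = dict(reply)
    gated["done"] = False  # drop the flag so the loop retries
    return gated
```

The point of the design is that the claim of success and the evidence for it come from separate steps, so a reply that sets `done` without a passing final run is sent back into the loop.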
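A minimal sketch of periodic compaction follows. Only the 20-step interval comes from the article. The class name, the `summarize` callback (which in practice would be a model call), and the choice to fold the previous summary into the new one are assumptions.

```python
from typing import Callable


class CompactingHistory:
    """Keeps recent steps verbatim and folds older ones into one summary."""

    def __init__(self, summarize: Callable[[str, list[str]], str], every: int = 20) -> None:
        self.summarize = summarize
        self.every = every
        self.summary = ""            # running summary of all compacted steps
        self.recent: list[str] = []  # steps since the last compaction

    def add(self, step: str) -> None:
        self.recent.append(step)
        if len(self.recent) == self.every:
            # Fold the previous summary and the last batch into a single summary.
            self.summary = self.summarize(self.summary, self.recent)
            self.recent = []

    def context(self) -> list[str]:
        """What would be sent to the model: one summary plus uncompacted steps."""
        return ([self.summary] if self.summary else []) + self.recent
```

With this shape the context never holds more than one summary plus 19 raw steps, regardless of how long the trajectory runs.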
Benchmark Results
Webwright was evaluated on two benchmarks: Online-Mind2Web and Odysseys.
Online-Mind2Web contains 300 tasks across 136 widely used websites and uses an automated LLM-as-a-Judge evaluation framework. With GPT-5.4, Webwright achieves 86.67% overall accuracy with a 100-step budget, the highest among all open-sourced harness recipes in the AutoEval category of the Online-Mind2Web benchmark. Claude Opus 4.7 reached 84.7% overall but performed better on hard tasks at N=100 steps: 80.5% versus 76.6% for GPT-5.4.
The team also reproduced a GPT-5.4 baseline in a standard screenshot-based agent setting, where the model predicts x,y coordinates for clicks and typing actions. Using the same underlying model, Webwright achieves substantial gains across all three difficulty categories, highlighting the benefit of the code-driven, terminal-based approach over step-by-step coordinate prediction.
Odysseys evaluates long-horizon browsing tasks spanning multiple websites. Tasks average 272.3 words of instructions. In the April 2026 leaderboard, the best-performing model was Opus 4.6, with a top score of 44.5. Webwright powered by GPT-5.4 reaches 60.1%, a 35.1% relative improvement over the previous state of the art. Compared to the base GPT-5.4 performance of 33.5%, this corresponds to a 79.4% relative improvement, or 26.6 absolute points.
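The relative and absolute figures above follow directly from the three reported scores. The short check below only restates that arithmetic, and the helper name `relative_gain` is ours.

```python
def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100


webwright, previous_best, base = 60.1, 44.5, 33.5  # scores quoted in the article

vs_previous_best = round(relative_gain(webwright, previous_best), 1)  # 35.1
vs_base = round(relative_gain(webwright, base), 1)                    # 79.4
absolute_points = round(webwright - base, 1)                          # 26.6
```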
Cost Analysis
Claude Opus 4.7 needs fewer steps to solve each task (mean 21.9 steps) than GPT-5.4 (mean 26.3 steps). However, Claude Opus 4.7 is priced considerably higher than GPT-5.4 ($5 vs. $2.50 per 1M input tokens, and $25 vs. $15.00 per 1M output tokens, April 2026), which makes its average per-task cost higher: $6.09 for Claude Opus 4.7 versus $2.37 for GPT-5.4. The first 50 steps deliver 82% accuracy, and the next 50 steps deliver 3 to 4 more points.
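To see how the per-token prices feed into a per-task cost, here is a small calculator. The prices are the ones quoted above. The model keys and the function name are hypothetical labels, and the article does not report per-task token counts, so any counts passed in are illustrative only.

```python
# USD per 1M (input, output) tokens, as quoted in the article for April 2026.
PRICES = {
    "gpt-5.4": (2.50, 15.00),
    "claude-opus-4.7": (5.00, 25.00),
}


def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one task given its total input and output token counts."""
    in_price, out_price = PRICES[model]
    return input_tokens / 1_000_000 * in_price + output_tokens / 1_000_000 * out_price
```

For identical token counts the Opus cost is always higher, since both of its prices are higher. A lower mean step count lowers the bill only to the extent that it also lowers the tokens consumed.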
Small Model Performance
The research team also tested Qwen3.5-9B on the hard split of Online-Mind2Web. When tasks are augmented with pre-built reusable tool scripts, Qwen3.5-9B achieves 66.2% on Online-Mind2Web websites with more than 5 tools. This shows that smaller, lower-cost models can handle complex web tasks when paired with a pre-built tool library.
Marktechpost’s Visual Explainer
Quick Start Guide
What Is Webwright?
Webwright is an open-source, terminal-native web agent framework from Microsoft Research. Instead of predicting one browser click at a time, the agent writes Playwright code, runs bash commands, and stores reusable scripts in a local workspace.
- ~1,000 lines of harness code across 3 modules, with no hidden orchestration
- Single agent loop: Runner, Model Endpoint, and terminal Environment
- 86.7% on Online-Mind2Web | 60.1% on Odysseys with GPT-5.4
- Backends: OpenAI, Anthropic, OpenRouter
- Scripts reusable in Claude Code, Codex, OpenClaw
GitHub repository: github.com/microsoft/Webwright
What You Need Before Installing
Confirm the following are ready before running any install commands.
- Python 3.10+: required minimum runtime
- Chromium: installed via Playwright in the next step
- API key: OpenAI, Anthropic, or OpenRouter
- Git: to clone the repository

    # Check your Python version
    python --version  # Must return Python 3.10 or higher

Clone and Install Webwright
Clone the repo, install in editable mode, then install Chromium for Playwright browser control.

    # 1. Clone the repository
    git clone https://github.com/microsoft/Webwright
    cd Webwright

    # 2. Install the package in editable mode
    pip install -e .

    # 3. Install Chromium for Playwright
    playwright install chromium

The -e flag means local source edits apply immediately without reinstalling.
Run Your First Web Task
Export your API key, then pass a task instruction and start URL to the CLI.

    # Export your key
    export OPENAI_API_KEY="sk-..."
    export ANTHROPIC_API_KEY="sk-ant-..."

    # Run a task
    python -m webwright.run.cli \
      -c base.yaml \
      -c model_openai.yaml \
      -t "Find cheapest economy flight SEA to JFK on 2026-05-15" \
      --start-url https://www.google.com/flights \
      --task-id demo_openai \
      -o outputs/default

| Flag | Description |
|---|---|
| -c | Config file from src/webwright/config/ — stackable |
| -t | Task instruction in plain English |
| --start-url | Initial URL for the browser session |
| --task-id | Output subfolder name |
| -o | Root output directory for logs and scripts |
Use Webwright as a Claude Code Skill
Webwright ships a built-in Claude Code skill. No separate LLM API key is needed beyond your Claude Code subscription. Claude Code reads PNG screenshots natively.

    # Project-scoped (inside this repo only)
    mkdir -p .claude/skills .claude/commands
    ln -s "$PWD/skills/webwright" .claude/skills/webwright
    ln -s "$PWD/skills/webwright/commands" .claude/commands/webwright

    # User-scoped (all projects)
    mkdir -p ~/.claude/skills ~/.claude/commands
    ln -s "$PWD/skills/webwright" ~/.claude/skills/webwright
    ln -s "$PWD/skills/webwright/commands" ~/.claude/commands/webwright

Restart Claude Code after installing, then use slash commands:

    # One-shot task
    /webwright:run search Google Flights SEA to JFK 2026-05-15

    # Reusable parameterized CLI tool
    /webwright:craft search a ticket from LAX to SFO depart June 7

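A "reusable parameterized CLI tool" can be pictured as a script whose task-specific values become command-line arguments. The sketch below is a hypothetical shape, not output produced by Webwright: the flag names and the returned dict are assumptions, and the browser automation itself is left out.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Expose the task's variable parts as CLI flags (hypothetical names)."""
    parser = argparse.ArgumentParser(description="Search flights (illustrative tool)")
    parser.add_argument("--origin", required=True, help="Origin airport code")
    parser.add_argument("--dest", required=True, help="Destination airport code")
    parser.add_argument("--date", required=True, help="Departure date, YYYY-MM-DD")
    return parser


def main(argv=None) -> dict:
    """Parse the flags; passing argv=None makes argparse read sys.argv."""
    args = build_parser().parse_args(argv)
    # A real tool would drive Playwright here. This sketch only returns the parameters.
    return {"origin": args.origin, "dest": args.dest, "date": args.date}
```

For example, `main(["--origin", "SEA", "--dest", "JFK", "--date", "2026-05-15"])` returns the three parsed values, and rerunning the tool for another route only changes the arguments, not the script.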
Key Takeaways
- Webwright uses a terminal loop where the agent writes and runs Playwright code instead of predicting one browser action at a time.
- GPT-5.4 reached 86.7% on Online-Mind2Web (100-step budget) and 60.1% on Odysseys, 26.6 points above the base GPT-5.4 score of 33.5%.
- The harness is ~1,000 lines across three modules with no multi-agent orchestration.
- Qwen3.5-9B reached 66.2% on the hard split of Online-Mind2Web when augmented with pre-built tool scripts.
- Task scripts are packaged as reusable CLIs, shareable across Claude Code, Codex, and OpenClaw.
The post Microsoft Research Releases Webwright: A Terminal-Native Web Agent Framework That Scores 60.1% on Odysseys, Up from Base GPT-5.4’s 33.5% appeared first on MarkTechPost.
