Microsoft Research Releases Webwright: A Terminal-Native Web Agent Framework That Scores 60.1% on Odysseys, Up from Base GPT-5.4’s 33.5%

Most web agents today drive a browser one action at a time. The model receives the current page state, as a screenshot or DOM text, and predicts the next click, keypress, or scroll. This action-at-a-time design made sense when language models had limited reasoning ability. As models have become more capable at writing and debugging code, that rigid loop has become a constraint rather than a structure that helps.

Microsoft Research’s AI Frontiers lab built a different approach. Their new open-source framework, Webwright, gives the agent a terminal instead of a stateful browser session. The agent writes Playwright code to control browsers, runs bash commands, inspects logs, and iteratively refines scripts. Playwright is an open-source browser automation library, also from Microsoft, that supports programmatic control of Chromium, Firefox, and WebKit browsers.

What Webwright Does Differently

Webwright separates the agent from the browser and treats the browser as something the agent can launch, inspect, and discard while developing a program. The persistent artifact is not the browser session but the code and logs in the local workspace.

This is the same model a developer uses when writing an RPA (Robotic Process Automation) script. Instead of manually clicking through a site each time, they write a script once. That script can be rerun, adapted, and shared. Webwright applies this to LLM-powered agents.

The system has three core components: a Runner, a Model Endpoint, and a terminal Environment. The runner is about 150 lines of code, the model interface about 550 lines, and the environment about 300 lines. There is no multi-agent orchestration or complex planning hierarchy, just a single agent loop.

All intermediate code, logs, screenshots, and results are saved in the workspace, making each run easy to inspect.

https://www.microsoft.com/en-us/research/articles/webwright-a-terminal-is-all-you-need-for-web-agents/

The Agent Loop

The Runner sends the current context to the model. The model returns a thinking block and a shell command. That command runs in the Environment, which returns terminal output, logs, screenshots, or error tracebacks. These observations go back into context, and the loop continues.
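The loop described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Webwright's actual Runner: the function names, the message roles, and the reply shape (a dict with "thinking", "command", and "done" keys) are all hypothetical, and the model is replaced by a scripted stand-in so the example runs without an API key.

```python
import subprocess


def run_shell(command, timeout=60):
    """Environment sketch: run one shell command, return stdout plus stderr."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr


def agent_loop(model, env, task, max_steps=100):
    """Send context to the model, run its command, feed the observation back."""
    context = [{"role": "task", "content": task}]
    for _ in range(max_steps):
        reply = model(context)  # assumed shape: {"thinking", "command", "done"}
        context.append({"role": "model", "content": reply})
        if reply.get("done"):
            break
        observation = env(reply["command"])
        context.append({"role": "observation", "content": observation})
    return context


def scripted_model(context):
    """Stand-in for a real model endpoint: one command, then done."""
    steps_taken = sum(1 for message in context if message["role"] == "model")
    if steps_taken == 0:
        return {"thinking": "inspect the workspace", "command": "echo hello", "done": False}
    return {"thinking": "task finished", "command": "", "done": True}


# Dry run with a fake environment; pass run_shell instead to execute for real.
trace = agent_loop(scripted_model, lambda command: "ran: " + command, "demo task")
```

Passing the environment in as a callable keeps the loop itself free of browser or shell details, which mirrors the article's point that the browser is just something the agent launches from the terminal.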

Rather than issuing one primitive action at a time, a coding agent can naturally express multi-step interactions, such as selecting a date or filling out an entire form, as a compact program. Loops, functions, and abstractions allow the agent to generalize across similar tasks without repeatedly predicting similar sequences of low-level steps.
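As a rough illustration of "an entire form as a compact program", the sketch below fills several fields in one loop. The selectors, values, and the fill_form helper are hypothetical; RecordingPage is a stand-in that only records calls, so the example runs without a browser.

```python
class RecordingPage:
    """Stand-in for a Playwright page that only records calls (dry run)."""

    def __init__(self):
        self.calls = []

    def fill(self, selector, value):
        self.calls.append(("fill", selector, value))

    def click(self, selector):
        self.calls.append(("click", selector))


def fill_form(page, fields, submit_selector):
    """Fill every field in one pass, then submit."""
    for selector, value in fields.items():
        page.fill(selector, value)
    page.click(submit_selector)


# Hypothetical selectors and values, chosen only for illustration.
page = RecordingPage()
fill_form(
    page,
    {"#origin": "SEA", "#destination": "JFK", "#date": "2026-05-15"},
    "#search",
)
print(page.calls)
```

With Playwright installed, a real page object (for example from sync_playwright(), chromium.launch(), and new_page()) offers fill(selector, value) and click(selector) methods, so the same helper could be pointed at a live browser. A step-by-step agent would need four separate predictions for what is a single function call here.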

Two Engineering Challenges

Premature ‘done’ and context explosion are the two core issues. With open-ended bash actions, the model must self-report completion and often claims success without actually finishing. The team added a gate: the agent must generate a self-reflection config, run a final script in a fresh folder with logs and screenshots, and pass its own self-reflection judgement that outputs success or failure before emitting done: true. Otherwise, the flag is dropped and it retries.
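The completion gate can be pictured as a small wrapper around the model's reply. The sketch below is an assumption about its general shape, not the framework's code: gated_done, run_final_script, and judge are hypothetical names, and the "evidence" dict stands in for the logs and screenshots of a fresh run.

```python
def gated_done(reply, run_final_script, judge):
    """Keep done=True only if a fresh run of the final script passes the judge."""
    if not reply.get("done"):
        return reply  # nothing to verify yet
    evidence = run_final_script()  # e.g. logs and screenshots from a fresh folder
    if judge(evidence):
        return reply
    # Gate failed: drop the flag so the agent loop continues and retries.
    return {**reply, "done": False}


def looks_successful(evidence):
    """Toy judge for the demo; a real one would be the self-reflection step."""
    return evidence["log"] == "ok"


claimed = {"command": "", "done": True}
passed = gated_done(claimed, lambda: {"log": "ok"}, looks_successful)
failed = gated_done(claimed, lambda: {"log": "traceback"}, looks_successful)
print(passed["done"], failed["done"])
```

The key design point is that the model's own claim is never trusted directly: the flag survives only when an independent rerun produces evidence the judge accepts.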

For context length, long coding trajectories quickly exceed context limits, so they compact history every 20 steps into a single summary.
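A minimal sketch of periodic compaction, assuming a step counter and a summarize callable (both hypothetical; in practice the summary would likely come from a model call rather than the toy function used here):

```python
def maybe_compact(history, step, summarize, every=20):
    """Every `every` steps, replace the whole history with one summary entry."""
    if step == 0 or step % every != 0:
        return history
    return [{"role": "summary", "content": summarize(history)}]


def count_entries(history):
    """Toy summarizer for the demo: just reports how much was compacted."""
    return f"{len(history)} entries compacted"


history = []
for step in range(1, 46):
    history.append({"role": "observation", "content": f"step {step}"})
    history = maybe_compact(history, step, count_entries)

print(len(history), history[0]["content"])
```

After 45 steps the history holds one summary plus the five most recent observations, so context size stays bounded no matter how long the trajectory runs.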

Benchmark Results

Webwright was evaluated on two benchmarks: Online-Mind2Web and Odysseys.

Online-Mind2Web contains 300 tasks across 136 widely used websites and uses an automated LLM-as-a-Judge evaluation framework. With a 100-step budget, GPT-5.4 achieves 86.67% overall accuracy, the highest among all open-sourced harness recipes in the AutoEval category of the Online-Mind2Web benchmark. Claude Opus 4.7 reached 84.7% overall but performed better on hard tasks at N=100 steps: 80.5% versus 76.6% for GPT-5.4.

They also reproduced a GPT-5.4 baseline in a traditional screenshot-based agent setting, where the model predicts x,y coordinates for clicks and typing actions. Using the same underlying model, Webwright achieves substantial gains across all three difficulty categories, highlighting the benefit of the code-driven terminal-based approach over step-by-step coordinate prediction.

Odysseys evaluates long-horizon browsing tasks spanning multiple websites. Tasks average 272.3 words of instructions. In the April 2026 leaderboard, the best-performing model was Opus 4.6, with a top score of 44.5. Webwright powered by GPT-5.4 reaches 60.1%, a 35.1% relative improvement over the previous state-of-the-art. Compared to the base GPT-5.4 performance of 33.5%, this corresponds to a 79.4% relative improvement, or 26.6 absolute points.
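The relative and absolute figures quoted for Odysseys follow directly from the three scores. A quick check, using only numbers stated in the article (the helper name relative_gain is just for illustration):

```python
def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100


webwright = 60.1   # Webwright with GPT-5.4
prev_best = 44.5   # Opus 4.6, April 2026 leaderboard
base = 33.5        # base GPT-5.4

print(round(relative_gain(webwright, prev_best), 1))  # 35.1
print(round(relative_gain(webwright, base), 1))       # 79.4
print(round(webwright - base, 1))                     # 26.6
```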

Cost Analysis

Claude Opus 4.7 needs fewer steps to solve each task (mean 21.9 steps) than GPT-5.4 (mean 26.3 steps). However, Claude Opus 4.7 is priced significantly higher than GPT-5.4 ($5 vs. $2.50 per 1M input tokens, and $25 vs. $15.00 per 1M output tokens, April 2026), which makes its average per-task cost higher: $6.09 versus $2.37 for GPT-5.4. The first 50 steps deliver 82% accuracy, and the next 50 steps deliver 3–4 more points.
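To see how per-token prices turn into a per-task cost, the sketch below applies the April 2026 prices quoted above to a made-up token count. The token numbers are hypothetical and chosen only to show the arithmetic; they are not measurements from the Webwright evaluation and do not reproduce the reported per-task costs.

```python
# (input price, output price) in USD per 1M tokens, as quoted in the article.
PRICES = {
    "gpt-5.4": (2.50, 15.00),
    "claude-opus-4.7": (5.00, 25.00),
}


def task_cost(model, input_tokens, output_tokens):
    """Cost in USD of one task given its total input and output tokens."""
    input_price, output_price = PRICES[model]
    return (
        input_tokens / 1_000_000 * input_price
        + output_tokens / 1_000_000 * output_price
    )


# Hypothetical token usage for a single task (illustration only).
gpt_cost = task_cost("gpt-5.4", input_tokens=500_000, output_tokens=20_000)
opus_cost = task_cost("claude-opus-4.7", input_tokens=500_000, output_tokens=20_000)
print(round(gpt_cost, 2), round(opus_cost, 2))
```

At equal token usage the price gap alone roughly doubles the bill, so a model would need to use far fewer tokens per task, not just modestly fewer steps, to close the difference.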

Small Model Performance

The research team also tested Qwen3.5-9B on the hard split of Online-Mind2Web. When tasks are augmented with pre-built reusable tool scripts, Qwen3.5-9B achieves 66.2% on Online-Mind2Web websites with more than 5 tools. This shows that smaller, lower-cost models can handle complex web tasks when paired with a pre-built tool library.

Marktechpost’s Visual Explainer

Webwright
Quick Start Guide

01 / 05 — Overview
What Is Webwright?
Webwright is an open-source, terminal-native web agent framework from Microsoft Research. Instead of predicting one browser click at a time, the agent writes Playwright code, runs bash commands, and stores reusable scripts in a local workspace.

~1,000 lines of harness code across 3 modules, no hidden orchestration
Single agent loop: Runner, Model Endpoint, and terminal Environment
86.7% on Online-Mind2Web | 60.1% on Odysseys with GPT-5.4
Backends: OpenAI, Anthropic, OpenRouter
Scripts reusable in Claude Code, Codex, OpenClaw

# GitHub repository
github.com/microsoft/Webwright

02 / 05 — Prerequisites
What You Need Before Installing
Confirm the following are ready before running any installation commands.

Python 3.10+: required minimum runtime
Chromium: installed via Playwright in the next step
API key: OpenAI, Anthropic, or OpenRouter
Git: to clone the repository

# Check your Python version
python --version
# Must return Python 3.10 or higher

03 / 05 — Installation
Clone and Install Webwright
Clone the repo, install in editable mode, then install Chromium for Playwright browser control.

# 1. Clone the repository
git clone https://github.com/microsoft/Webwright
cd Webwright

# 2. Install the package in editable mode
pip install -e .

# 3. Install Chromium for Playwright
playwright install chromium

The -e flag means local source edits apply immediately without reinstalling.

04 / 05 — Running a Task
Run Your First Web Task
Export the API key for your chosen backend, then pass a task instruction and start URL to the CLI.

# Export your key
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."

# Run a task
python -m webwright.run.cli \
  -c base.yaml -c model_openai.yaml \
  -t "Find cheapest economy flight SEA to JFK on 2026-05-15" \
  --start-url https://www.google.com/flights \
  --task-id demo_openai \
  -o outputs/default

Flag	Description
-c	Config file from src/webwright/config/ (stackable)
-t	Task instruction in plain English
--start-url	Initial URL for the browser session
--task-id	Output subfolder name
-o	Root output directory for logs and scripts
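For scripting several runs, the same invocation can be assembled as an argument list in Python. This is a sketch that uses only the flags listed in the table above; the build_command helper and its defaults are hypothetical, and nothing is executed until the list is handed to something like subprocess.run.

```python
import shlex


def build_command(task, start_url, task_id,
                  configs=("base.yaml", "model_openai.yaml"),
                  output_dir="outputs/default"):
    """Assemble the CLI call as an argv list, one -c per stacked config."""
    argv = ["python", "-m", "webwright.run.cli"]
    for config in configs:
        argv += ["-c", config]
    argv += [
        "-t", task,
        "--start-url", start_url,
        "--task-id", task_id,
        "-o", output_dir,
    ]
    return argv


argv = build_command(
    task="Find cheapest economy flight SEA to JFK on 2026-05-15",
    start_url="https://www.google.com/flights",
    task_id="demo_openai",
)
print(shlex.join(argv))  # shell-quoted form, handy for logging
```

Building the command as a list rather than a single string avoids quoting problems when the task instruction itself contains spaces or quotes.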

05 / 05 — Claude Code Integration
Use Webwright as a Claude Code Skill
Webwright ships a built-in Claude Code skill. No separate LLM API key is needed beyond your Claude Code subscription. Claude Code reads PNG screenshots natively.

# Project-scoped (inside this repo only)
mkdir -p .claude/skills .claude/commands
ln -s "$PWD/skills/webwright" .claude/skills/webwright
ln -s "$PWD/skills/webwright/commands" .claude/commands/webwright

# User-scoped (all projects)
mkdir -p ~/.claude/skills ~/.claude/commands
ln -s "$PWD/skills/webwright" ~/.claude/skills/webwright
ln -s "$PWD/skills/webwright/commands" ~/.claude/commands/webwright

Restart Claude Code after installing, then use slash commands:

# One-shot task
/webwright:run search Google Flights SEA to JFK 2026-05-15

# Reusable parameterized CLI tool
/webwright:craft search a ticket from LAX to SFO depart June 7

Key Takeaways

Webwright uses a terminal loop where the agent writes and runs Playwright code instead of predicting one browser action at a time.
GPT-5.4 reached 86.7% on Online-Mind2Web (100-step budget) and 60.1% on Odysseys, 26.6 points above the base GPT-5.4 score of 33.5%.
The harness is ~1,000 lines across three modules with no multi-agent orchestration.
Qwen3.5-9B reached 66.2% on the hard split of Online-Mind2Web when augmented with pre-built tool scripts.
Task scripts are packaged as reusable CLIs, shareable across Claude Code, Codex, and OpenClaw.

The post Microsoft Research Releases Webwright: A Terminal-Native Web Agent Framework That Scores 60.1% on Odysseys, Up from Base GPT-5.4’s 33.5% appeared first on MarkTechPost.
