
Meet Alibaba’s Page Agent: A JavaScript In-Page GUI Agent That Controls Web Interfaces With Natural Language Through the DOM

Most browser automation runs from the outside. Playwright, Puppeteer, Selenium, and browser-use all drive a browser from an external process. They read the page through screenshots or the Chrome DevTools Protocol.

Alibaba’s Page Agent takes the opposite path. The agent lives inside the webpage as plain JavaScript. It reads the live DOM as text and acts as the real user. No headless browser, no screenshots, no multimodal model.

The project is open-source under the MIT license. The codebase is TypeScript-first. It builds on browser-use, from which its DOM processing and prompt are derived.

TL;DR

  • Page Agent runs inside the web page as JavaScript, reading the live DOM as text, not screenshots.
  • DOM dehydration compresses the page into a FlatDomTree so smaller text models can act precisely.
  • It is model-agnostic through any OpenAI-compatible endpoint and ships under the MIT license.
  • Prompt-level security and single-page scope are real limits; keep server-side validation for risky actions.
  • Best fit: copilots and form-filling inside apps you own, not external or locked-down sites.

What is Page Agent?

Page Agent is a client-side library for adding agent behavior to a web app. You embed it, then issue commands in natural language. The agent finds elements, clicks buttons, and fills forms from inside the page.

Because it runs in the browser session, it inherits the user’s cookies, session, and authentication. There is no separate backend to write. The existing UI validation and security rules stay in place.

The design is model-agnostic. You bring your own large language model through any OpenAI-compatible endpoint. Only text is sent to the model, so a strong text model is enough.
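
To make "text only, any OpenAI-compatible endpoint" concrete, here is a minimal sketch of how such a request could be assembled. It is not Page Agent's actual code. The `buildChatRequest` helper, its parameters, and the system prompt wording are all hypothetical; only the `/chat/completions` route and the `model` plus `messages` body follow the widely used OpenAI chat format.

```typescript
// Hypothetical shapes for illustration; these are not Page Agent's own types.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Build a text-only request for an OpenAI-compatible chat completions route.
// The page state travels as plain text, so no image input is involved.
function buildChatRequest(
  baseUrl: string,
  apiKey: string,
  model: string,
  pageText: string,
  command: string
): ChatRequest {
  const messages: ChatMessage[] = [
    // Assumed prompt wording; the real prompt is not shown in this article.
    { role: "system", content: "You control a web page. Page state:\n" + pageText },
    { role: "user", content: command },
  ];
  return {
    // Strip trailing slashes so the joined URL has exactly one separator.
    url: baseUrl.replace(/\/+$/, "") + "/chat/completions",
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: "Bearer " + apiKey,
    },
    body: JSON.stringify({ model, messages }),
  };
}
```

Swapping providers in this sketch means changing only `baseUrl`, `apiKey`, and `model`; the message payload stays the same, which is the practical meaning of model-agnostic here.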

How DOM Dehydration Works

The core technique is what the team calls DOM dehydration. A modern page can hold thousands of nodes. Sending raw HTML to a model would be slow and expensive.

When a command arrives, the agent scans the Document Object Model. It identifies each interactive element, such as buttons, links, and input fields. Each element receives an index plus a role and a label.

The live DOM is converted into a FlatDomTree, a clean text map of what matters. Redundant markup is stripped out. The model reads this compact representation, not pixels.
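
The scan-and-flatten step described above can be sketched roughly as follows. This is an illustration of the idea, not the library's implementation: `SimpleNode`, `FlatEntry`, `ROLE_BY_TAG`, `dehydrate`, and `render` are hypothetical names, the set of interactive tags is an assumption, and a plain object tree stands in for the real DOM so the example runs outside a browser.

```typescript
// Minimal stand-in for a DOM node (assumption: the real agent walks the live DOM).
interface SimpleNode {
  tag: string;
  attrs?: Record<string, string>;
  text?: string;
  children?: SimpleNode[];
}

// One line of the flattened output: an index, a role, and a label.
interface FlatEntry {
  index: number;
  role: string;
  label: string;
}

// Tags treated as interactive in this sketch; the real list is likely broader.
const ROLE_BY_TAG: Record<string, string> = {
  button: "button",
  a: "link",
  input: "textbox",
  textarea: "textbox",
  select: "combobox",
};

// Collect visible text from a node and its descendants, collapsing whitespace.
function textOf(node: SimpleNode): string {
  const own = node.text ?? "";
  const nested = (node.children ?? []).map(textOf).join(" ");
  return (own + " " + nested).replace(/\s+/g, " ").trim();
}

// Walk the tree, keep only interactive elements, and number them in document order.
function dehydrate(root: SimpleNode): FlatEntry[] {
  const entries: FlatEntry[] = [];
  const visit = (node: SimpleNode): void => {
    const role = ROLE_BY_TAG[node.tag];
    if (role !== undefined) {
      const attrs = node.attrs ?? {};
      const label = attrs["aria-label"] ?? attrs["placeholder"] ?? textOf(node);
      entries.push({ index: entries.length, role, label });
      return; // Nested markup is already folded into the label.
    }
    (node.children ?? []).forEach(visit);
  };
  visit(root);
  return entries;
}

// Serialize the entries into the compact text a model would read.
function render(entries: FlatEntry[]): string {
  return entries.map((e) => `[${e.index}] ${e.role} "${e.label}"`).join("\n");
}
```

For a small login page with a heading, an email input, a sign-in button wrapped in layout markup, and a help link, `render(dehydrate(page))` would yield three short lines such as `[0] textbox "Email"`. The heading and wrapper elements disappear, and a model can refer to an element by its index when choosing an action.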

The interactive demo on this page mirrors this loop. Watch the “Dehydrated DOM” and “Action trace” panels update as commands run.