Alibaba Qwen Team Releases Mobile-Agent-v3 and GUI-Owl: Next-Generation Multi-Agent Framework for GUI Automation

Desk of contents

Introduction: The Rise of GUI Brokers

Trendy computing is dominated by graphical person interfaces throughout gadgets—cell, desktop, and net. Automating duties in these environments has historically been restricted to scripted macros or brittle, hand-engineered guidelines. Latest advances in vision-language fashions supply the tantalizing chance of brokers that may perceive screens, cause about duties, and execute actions similar to people. Nonetheless, most approaches have both relied on closed-source, black-box fashions or have struggled with generalizability, reasoning constancy, and cross-platform robustness.

A staff of researchers from Alibaba Qwen introduce GUI-Owl and Cell-Agent-v3 that these challenges head-on. GUI-Owl is a local, end-to-end multimodal agent mannequin, constructed on Qwen2.5-VL and extensively post-trained on large-scale, numerous GUI interplay information. It unifies notion, grounding, reasoning, planning, and motion execution inside a single coverage community, enabling strong cross-platform interplay and specific multi-turn reasoning. The Cell-Agent-v3 framework leverages GUI-Owl as a foundational module, orchestrating a number of specialised brokers (Supervisor, Employee, Reflector, Notetaker) to deal with advanced, long-horizon duties with dynamic planning, reflection, and reminiscence.

Structure and Core Capabilities

GUI-Owl: The Foundational Mannequin

GUI-Owl is designed from the bottom as much as deal with the heterogeneity and dynamism of real-world GUI environments. It’s initialized from Qwen2.5-VL, a state-of-the-art vision-language mannequin, however undergoes in depth further coaching on specialised GUI datasets. This contains grounding (finding UI components from pure language queries), job planning (breaking down advanced directions into actionable steps), and motion semantics (understanding how actions have an effect on the GUI state). The mannequin is fine-tuned through a mixture of supervised studying and reinforcement studying (RL), with a concentrate on aligning its choices with real-world job success.

Key Improvements in GUI-Owl:

Unified Coverage Community: In contrast to prior analysis that separates notion, planning, and execution into disjoint modules, GUI-Owl integrates these capabilities right into a single neural community. This permits for seamless multi-turn decision-making and specific intermediate reasoning—essential for dealing with the anomaly and variability of actual GUIs.
Scalable Coaching Infrastructure: The staff constructed a cloud-based digital surroundings spanning Android, Ubuntu, macOS, and Home windows. This “Self-Evolving GUI Trajectory Manufacturing” pipeline generates high-quality interplay information by having GUI-Owl and Cell-Agent-v3 work together with digital gadgets, then rigorously judging the correctness of trajectories. Profitable trajectories are used for additional coaching, making a virtuous cycle of enchancment.
Various Information Synthesis: To show the mannequin strong grounding and reasoning, the analysis staff make use of quite a lot of information synthesis methods: synthesizing UI factor grounding duties from accessibility timber and crawled screenshots, distilling job planning data from each historic trajectories and huge pretrained LLMs, and producing motion semantics information by having the mannequin predict the impact of actions given before-and-after screenshots.
Reinforcement Studying Alignment: GUI-Owl is additional refined through a scalable RL framework that helps absolutely asynchronous coaching and a novel “Trajectory-aware Relative Coverage Optimization” (TRPO). TRPO assigns credit score throughout lengthy, variable-length motion sequences—a important advance for GUI duties the place rewards are sparse and solely accessible upon job completion.

Cell-Agent-v3: Multi-Agent Coordination

Cell-Agent-v3 is a general-purpose agentic framework designed to sort out advanced, multi-step, and cross-application workflows. It breaks duties into subgoals, dynamically updates plans based mostly on execution suggestions, and maintains persistent contextual reminiscence. The framework coordinates 4 specialised brokers:

Supervisor Agent: Decomposes high-level directions into subgoals, dynamically updating the plan based mostly on outcomes and suggestions.
Employee Agent: Executes essentially the most related actionable subgoal given the present GUI state, prior suggestions, and gathered notes.
Reflector Agent: Evaluates the result of every motion, evaluating supposed and precise state transitions to generate diagnostic suggestions.
Notetaker Agent: Persists important data (e.g., codes, credentials) throughout software boundaries, enabling long-horizon duties.

Coaching and Information Pipeline

A significant bottleneck in GUI agent growth is the dearth of high-quality, scalable coaching information. Conventional approaches depend on costly guide annotation, which doesn’t scale to the range and dynamism of actual GUIs. The GUI-Owl staff addresses this with a self-evolving information manufacturing pipeline:

Question Era: For cell apps, a human-annotated directed acyclic graph (DAG) fashions lifelike navigation flows and slot-value pairs for person inputs. LLMs synthesize pure directions from these paths, that are additional refined and validated in opposition to actual app interfaces.
Trajectory Era: Given a question, GUI-Owl or Cell-Agent-v3 interacts with a digital surroundings to supply a trajectory—a sequence of actions and state transitions.
Trajectory Correctness Judgment: A two-level critic system evaluates every step (did the motion have the supposed impact?) and the general trajectory (did the duty succeed?). This makes use of each textual and multimodal reasoning, with consensus-based ultimate judgments.
Steering Synthesis: For difficult queries, the system synthesizes step-by-step steering from profitable (human or mannequin) trajectories, serving to the agent study from optimistic examples.
Iterative Coaching: Newly generated profitable trajectories are added to the coaching set, and the mannequin is retrained, closing the loop on self-improvement.

Benchmarking and Efficiency

GUI-Owl and Cell-Agent-v3 are rigorously evaluated throughout a collection of GUI automation benchmarks, overlaying grounding, single-step decision-making, query answering, and end-to-end job completion.

Grounding and UI Understanding

On grounding duties—finding UI components from pure language queries—GUI-Owl-7B outperforms all open-source fashions of comparable measurement, and GUI-Owl-32B surpasses even proprietary fashions like GPT-4o and Claude 3.7. For instance, on the MMBench-GUI L2 benchmark (overlaying Home windows, macOS, Linux, iOS, Android, and Internet), GUI-Owl-7B scores 80.49, whereas GUI-Owl-32B achieves 82.97, each properly forward of the competitors. On ScreenSpot Professional, which focuses on high-resolution, advanced interfaces, GUI-Owl-7B scores 54.9, considerably outperforming UI-TARS-72B and Qwen2.5-VL-72B. These outcomes exhibit that GUI-Owl’s grounding capabilities are each broad and deep, dealing with every thing from easy button clicks to fine-grained textual content localization.

Complete GUI Understanding and Single-Step Determination Making

MMBench-GUI L1 evaluates UI understanding and single-step decision-making via question-answering. Right here, GUI-Owl-7B scores 84.5 (straightforward), 86.9 (medium), and 90.9 (arduous), far outpacing all current fashions. This means not simply correct notion, however strong reasoning about interface states and actions. On Android Management, which focuses on single-step choices in pre-annotated contexts, GUI-Owl-7B achieves 72.8, the very best amongst 7B fashions, whereas GUI-Owl-32B reaches 76.6, surpassing even the biggest open and proprietary fashions.

Finish-to-Finish and Multi-Agent Capabilities

The true check of a GUI agent is its means to finish actual, multi-step duties in interactive environments. AndroidWorld and OSWorld are two such benchmarks, the place brokers should autonomously navigate apps and working methods to perform person directions. GUI-Owl-7B scores 66.4 on AndroidWorld and 34.9 on OSWorld, whereas Cell-Agent-v3 (with GUI-Owl as its core) achieves 73.3 and 37.7, respectively—a brand new state-of-the-art for open-source frameworks. The multi-agent design proves particularly efficient on long-horizon, error-prone duties, because the Reflector and Supervisor brokers allow dynamic replanning and restoration from errors.

Actual-World Integration

The analysis staff additionally evaluated GUI-Owl’s efficiency because the “mind” inside established agentic frameworks like Cell-Agent-E (Android) and Agent-S2 (desktop). Right here, GUI-Owl-32B achieves 62.1% success on AndroidWorld and 48.4% on a difficult subset of OSWorld, considerably outperforming all baselines. This underscores GUI-Owl’s sensible worth as a plug-and-play module for numerous agent methods.

Actual-World Deployment

GUI-Owl helps a wealthy, platform-specific motion area. On cell, this contains clicks, lengthy presses, swipes, textual content entry, system buttons (again, residence, and many others.), and software launching. On desktop, actions embody mouse strikes, clicks, drags, scrolls, keyboard enter, and application-specific instructions. Actions are translated into low-level machine instructions (ADB for Android, pyautogui for desktop), making the framework readily deployable in actual environments.

The agent’s reasoning and choice course of is clear: for every step, it observes the display screen, remembers compressed historical past, causes concerning the subsequent motion, summarizes its intent, and executes. This specific intermediate reasoning not solely improves robustness but in addition allows integration into bigger multi-agent methods, the place totally different “roles” (e.g., planner, executor, critic) can specialize and collaborate.

Conclusion: Towards Common-Function GUI Brokers

GUI-Owl and Cell-Agent-v3 characterize a serious leap towards general-purpose, autonomous GUI brokers. By unifying notion, grounding, reasoning, and motion right into a single mannequin, and by constructing a scalable, self-improving coaching pipeline, the analysis staff have achieved state-of-the-art efficiency throughout each cell and desktop environments, surpassing even the biggest proprietary fashions in key benchmarks.

Take a look at the PAPER and GITHUB PAGE. Be at liberty to take a look at our GitHub Page for Tutorials, Codes and Notebooks. Additionally, be happy to observe us on Twitter and don’t neglect to hitch our 100k+ ML SubReddit and Subscribe to our Newsletter.

The publish Alibaba Qwen Team Releases Mobile-Agent-v3 and GUI-Owl: Next-Generation Multi-Agent Framework for GUI Automation appeared first on MarkTechPost.

Alibaba Qwen Team Releases Mobile-Agent-v3 and GUI-Owl: Next-Generation Multi-Agent Framework for GUI Automation

Desk of contents

Introduction: The Rise of GUI Brokers