UltraCUA: A Foundation Computer-Use Agents Model that Bridges the Gap between General-Purpose GUI Agents and Specialized API-based Agents

Computer-use agents have been restricted to primitives. They click, they type, they scroll. Long action chains amplify grounding errors and waste steps. Apple researchers introduce UltraCUA, a foundation model that builds a hybrid action space, letting an agent interleave low-level GUI actions with high-level programmatic tool calls. The model chooses the cheaper and more reliable move at each step. The approach improves success and reduces steps on OSWorld, and transfers to WindowsAgentArena without Windows-specific training.

What does hybrid action change?
Hybrid action treats tools as first-class actions. A tool call encapsulates a multi-step operation as a single function with a clear signature and a docstring. A click or a key press still exists for cases where no programmatic path is available. The agent learns to alternate between both modes. The goal is to reduce cascading errors and to cut step counts. The research team positions this as a bridge between GUI-only CUAs and tool-centric agent frameworks.
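
To make the idea concrete, here is a minimal sketch of a hybrid action space in Python. It is not UltraCUA's implementation: the class names, the `set_font_size` tool, and the string results are hypothetical and only illustrate how GUI primitives and tool calls can share one action type.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Union


@dataclass
class Click:
    """Low-level GUI primitive: click at screen coordinates."""
    x: int
    y: int


@dataclass
class TypeText:
    """Low-level GUI primitive: type a string."""
    text: str


@dataclass
class ToolCall:
    """High-level action: call a named tool with keyword arguments."""
    name: str
    kwargs: dict = field(default_factory=dict)


# One action type covers both modes, so a policy can emit either at any step.
Action = Union[Click, TypeText, ToolCall]


def set_font_size(size: int) -> str:
    """Set the font size of the current selection (hypothetical tool)."""
    return f"font size set to {size}"


# Hypothetical registry mapping tool names to callables.
TOOLS: Dict[str, Callable[..., str]] = {"set_font_size": set_font_size}


def execute(action: Action) -> str:
    """Dispatch an action; a real agent would drive the OS instead of returning text."""
    if isinstance(action, ToolCall):
        return TOOLS[action.name](**action.kwargs)
    if isinstance(action, Click):
        return f"click at ({action.x}, {action.y})"
    return f"type {action.text!r}"


# A trajectory that interleaves primitives with a single tool call.
trajectory = [Click(120, 48), ToolCall("set_font_size", {"size": 14}), TypeText("Hello")]
log = [execute(a) for a in trajectory]
```

In this sketch the tool call replaces what would otherwise be several clicks through a menu, which is the step saving the hybrid design aims for.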

Scaled tool acquisition
UltraCUA builds its tool library with an automated pipeline. The system extracts keyboard shortcuts and commands from software documentation, integrates open source implementations from agent toolkits, and uses coding agents to synthesize new tools. Each tool is a callable interface that hides a long GUI sequence. The research team reports coverage across 10 desktop domains with 881 tools. The largest buckets include VS Code with 135 tools and LibreOffice Writer with 123 tools. Thunderbird and GIMP also have deep coverage.
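
As an illustration of the first source, the sketch below wraps a documented keyboard shortcut in a callable tool and renders its signature and docstring for a prompt. The helper names (`make_shortcut_tool`, `describe`) and the `press` callback are assumptions made for this example, not part of the UltraCUA pipeline.

```python
import inspect
from typing import Callable, List


def make_shortcut_tool(
    name: str,
    keys: List[str],
    doc: str,
    press: Callable[[List[str]], None],
) -> Callable[[], None]:
    """Wrap a keyboard shortcut in a zero-argument tool (hypothetical helper)."""

    def tool() -> None:
        # `press` stands in for whatever backend actually sends the key chord.
        press(keys)

    tool.__name__ = name
    tool.__doc__ = doc
    return tool


def describe(tool: Callable) -> str:
    """Render name, signature, and docstring as one line of prompt text."""
    return f"{tool.__name__}{inspect.signature(tool)}: {inspect.getdoc(tool)}"


# Record key chords in a list instead of sending them to a real desktop.
pressed: List[List[str]] = []
open_command_palette = make_shortcut_tool(
    "open_command_palette",
    ["ctrl", "shift", "p"],
    "Open the command palette.",
    pressed.append,
)

open_command_palette()
description = describe(open_command_palette)
```

Keeping the signature and docstring on the callable itself means the same object can be both executed and described to the model.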

Verifiable synthetic tasks and trajectories
Training requires grounded supervision and stable rewards. UltraCUA uses a dual synthetic engine. An evaluator-first pipeline composes atomic verifiers for browsers, files, images, and system state, then generates tasks that satisfy these checks. An instruction-first pipeline explores the OS and proposes context-aligned tasks, which are then verified. The result is 17,864 verifiable tasks across 10 domains such as Chrome, LibreOffice, GIMP, VS Code, system, Thunderbird, VLC, and multi-app workflows. Chrome has 2,826 tasks. The LibreOffice suite sums to 5,885 tasks. Multi-app tasks reach 2,113.
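
The evaluator-first idea can be sketched as small verifier functions that are composed into a task-level check. Everything here is an assumption for illustration: the state dictionary, the verifier names, and the paths are hypothetical and do not come from the paper.

```python
from typing import Callable, Dict, List

# Hypothetical snapshot of the environment after the agent finishes.
State = Dict[str, List[str]]
Verifier = Callable[[State], bool]


def file_exists(path: str) -> Verifier:
    """Atomic check: a file is present in the snapshot."""
    return lambda state: path in state.get("files", [])


def url_open(url: str) -> Verifier:
    """Atomic check: a browser tab with this URL is open."""
    return lambda state: url in state.get("open_urls", [])


def all_of(*checks: Verifier) -> Verifier:
    """Compose atomic verifiers; the task passes only if every check passes."""
    return lambda state: all(check(state) for check in checks)


# A task would then be generated to satisfy this composed check.
task_check = all_of(
    file_exists("/home/user/report.pdf"),
    url_open("https://example.com"),
)

satisfied: State = {
    "files": ["/home/user/report.pdf"],
    "open_urls": ["https://example.com"],
}
unsatisfied: State = {"files": [], "open_urls": ["https://example.com"]}

passed = task_check(satisfied)
failed = task_check(unsatisfied)
```

Because the check exists before the task does, every generated task comes with a programmatic success signal that can later serve as a reward.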

A multi-agent rollout produces successful hybrid trajectories. The planner uses OpenAI o3 for decision making. The grounder uses GTA1-7B for accurate visual localization. The rollout yields about 26.8K successful trajectories that show when to use a tool and when to act in the GUI. These trajectories are the core of the supervised phase.
Training Approach
Training has two stages. Stage 1 is supervised fine-tuning. The models train for 3 epochs at a learning rate of 2e-5 on the successful trajectories. Loss is applied turn-wise to avoid over-weighting early steps. Stage 2 is online reinforcement learning. The models train for 150 steps at a learning rate of 1e-6 on verified tasks that are sampled by difficulty. The policy optimization follows a GRPO variant with clip-higher, and removes KL regularization and format rewards. The reward combines a sparse task outcome with a tool-use term. Experiments use NVIDIA H100 GPUs. The context is kept near 32K by controlling the number of exposed tools.
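
The Stage 2 ingredients can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's code: the tool bonus weight, the rule that the bonus applies only on success, and the clipping bounds (0.2 below, 0.28 above) are placeholder choices. Only the overall shape follows the description above: a sparse outcome plus a tool-use term, group-relative advantages, asymmetric clipping, and no KL penalty.

```python
from statistics import mean, pstdev
from typing import List


def reward(task_success: bool, used_tool: bool, tool_bonus: float = 0.1) -> float:
    """Sparse outcome plus a tool-use term (weight and gating are assumptions)."""
    outcome = 1.0 if task_success else 0.0
    bonus = tool_bonus if (task_success and used_tool) else 0.0
    return outcome + bonus


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: normalize each reward within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_objective(
    ratio: float,
    advantage: float,
    eps_low: float = 0.2,
    eps_high: float = 0.28,
) -> float:
    """PPO-style surrogate with a wider upper clip bound and no KL term."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


# Four rollouts of the same task form one group.
rewards = [
    reward(True, True),
    reward(True, False),
    reward(False, True),
    reward(False, False),
]
advantages = group_advantages(rewards)
```

The wider upper bound lets the probability of a positively rewarded action grow further in one update than a symmetric clip would allow, while the lower bound still limits how fast an action can be suppressed.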
Results on OSWorld
UltraCUA improves success at both 7B and 32B scales. Under 15-step budgets, UltraCUA-32B reaches 41.0 percent success. OpenCUA-32B reaches 29.7 percent. The absolute gain is 11.3 points. UltraCUA-7B reaches 28.9 percent. UI-TARS-1.5-7B reaches 23.4 percent. Gains persist under 50-step budgets. A per-domain breakdown shows consistent lifts across Chrome, Writer, VS Code, and cross-application tasks. Average steps decrease relative to baselines. These shifts indicate better action selection rather than only more attempts.
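
The gaps quoted above can be recomputed directly from the reported success rates. The short sketch below does that arithmetic; the function names are illustrative, and the relative figures compare against the named models only, so they are a different quantity from any relative improvement measured over base models.

```python
def absolute_gain(new: float, baseline: float) -> float:
    """Difference in percentage points."""
    return new - baseline


def relative_gain(new: float, baseline: float) -> float:
    """Improvement as a percentage of the baseline score."""
    return (new - baseline) / baseline * 100.0


# 15-step OSWorld success rates quoted in the text.
gain_32b = absolute_gain(41.0, 29.7)  # UltraCUA-32B vs OpenCUA-32B
gain_7b = absolute_gain(28.9, 23.4)   # UltraCUA-7B vs UI-TARS-1.5-7B

rel_32b = relative_gain(41.0, 29.7)
rel_7b = relative_gain(28.9, 23.4)

print(f"32B: +{gain_32b:.1f} points ({rel_32b:.1f}% relative)")
print(f"7B:  +{gain_7b:.1f} points ({rel_7b:.1f}% relative)")
```

Running this gives the 11.3-point gain stated for the 32B pair and a 5.5-point gain for the 7B pair.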


Cross-platform transfer on WindowsAgentArena
UltraCUA trains only on Ubuntu-based OSWorld data. The model is then evaluated on WindowsAgentArena. UltraCUA-7B reaches 21.7 percent success. This exceeds UI-TARS-1.5-7B at 18.1 percent and a Qwen2 baseline trained with Windows data at 13.5 percent. The result suggests that hybrid action strategies learned on one platform transfer to other platforms. The paper highlights this as zero-shot platform generalization.

Key Takeaways
- UltraCUA formalizes a hybrid action space that lets a single agent alternate between GUI primitives and programmatic tool calls, which reduces long, error-prone action chains.
- The research team scales a reusable tool library through an automated pipeline and pairs it with a synthetic data engine, yielding more than 17,000 verifiable computer-use tasks for training and evaluation.
- Training follows a two-stage recipe, supervised fine-tuning on successful hybrid trajectories and then online reinforcement learning on verifiable tasks, which optimizes when to call tools versus act in the GUI.
- On OSWorld, UltraCUA reports an average 22 percent relative improvement over base models and 11 percent fewer steps, which indicates gains in reliability and efficiency.
- The 7B model reaches 21.7 percent success on WindowsAgentArena without Windows-specific training, which shows cross-platform generalization of the hybrid action policy.
Editorial Comments
UltraCUA moves computer-use agents from brittle primitive action chains to a hybrid action policy, integrating GUI primitives with programmatic tool calls, which reduces error propagation and step counts. It scales tools via an automated pipeline and pairs them with a synthetic data engine that yields more than 17,000 verifiable tasks, enabling supervised fine-tuning and online reinforcement learning on grounded signals. Reported results include a 22 percent relative improvement on OSWorld with 11 percent fewer steps, and 21.7 percent success on WindowsAgentArena without Windows-specific training, which indicates cross-platform transfer of the policy.
The post UltraCUA: A Foundation Computer-Use Agents Model that Bridges the Gap between General-Purpose GUI Agents and Specialized API-based Agents appeared first on MarkTechPost.