Meet OSGym: A New OS Infrastructure Framework That Manages 1,000+ Replicas at $0.23/Day for Computer Use Agent Research
Training AI agents that can actually use a computer (opening apps, clicking buttons, browsing the web, writing code) is one of the hardest infrastructure problems in modern AI. It's not a data problem. It's not a model problem. It's a plumbing problem.
You have to spin up hundreds, possibly thousands, of full operating system environments with real graphical user interfaces. Every one needs to run real software. Every one needs to handle unpredictable crashes. And you need all of them to run concurrently at a cost that doesn't bankrupt a university research lab.
That's the problem OSGym, a new research framework from a team at MIT, UIUC, CMU, USC, UVA, and UC Berkeley, is designed to solve.

What is a Computer Use Agent?
Before unpacking the infrastructure, it helps to understand what a computer use agent actually is. Unlike a chatbot that responds to text prompts, a computer use agent observes a screenshot of a desktop, decides what to do (click a button, type text, open a file), and executes that action through keyboard and mouse inputs. Think of it as an AI that can operate any software the way a human would.
Models like Anthropic's Claude Computer Use and OpenAI's Operator are early commercial examples. Research models like UI-TARS, Agent-S2, and CogAgent are pushing the boundaries further. But training any of these systems requires massive amounts of interaction data generated inside real OS environments, and that's where things get expensive and complicated fast.
The Core Problem: OS Sandboxes at Scale
A coding environment or a web browser sandbox is relatively lightweight to run. A full OS sandbox with a GUI is not. Each virtual machine needs its own bootable disk (around 24 GB), its own CPU and RAM allocation, and its own display stack. Multiply that by hundreds or thousands of parallel instances and you have a resource consumption problem that typical academic compute budgets simply cannot absorb.
On top of resource costs, there is the reliability problem. Software crashes. Browser sessions time out. Applications freeze. If your training pipeline doesn't handle these failures gracefully, one bad VM can stall an entire training batch.
OSGym tackles both problems with four distinct architectural optimizations.
Decentralized OS State Management
The first design choice concerns how the system manages the state of each OS replica: tracking whether it is healthy, what task it is running, and how to recover it if something goes wrong.
A naive approach uses a single centralized manager for all replicas. This is a classic single point of failure: as replica count grows into the thousands, the central manager becomes overwhelmed, latency increases, and one crash can halt the whole system. OSGym instead gives every OS replica its own dedicated state manager. Each state manager exposes public methods modeled after the OpenAI Gym API (reset, step, and shutdown) but handles its own health monitoring and crash recovery internally. A failure in one replica cannot propagate to any other.
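To make the idea concrete, here is a minimal sketch of a per-replica state manager. Only the reset/step/shutdown surface comes from the article; the class name, internals, and failure model are illustrative assumptions, not OSGym's actual implementation:

```python
import random

class ReplicaStateManager:
    """One manager per OS replica; failures are contained here.
    Illustrative sketch: recovery happens locally, so a crash in
    one replica never touches its siblings."""

    MAX_RECOVERIES = 3  # assumed value, for demonstration only

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.healthy = True

    def reset(self, task_config):
        """Restore the replica to a clean state for a new task."""
        self.healthy = True
        return {"screenshot": None, "task": task_config}

    def step(self, action):
        """Execute one keyboard/mouse action; recover locally on failure."""
        for _ in range(self.MAX_RECOVERIES):
            try:
                return self._execute(action)
            except RuntimeError:
                self._recover()          # restart only THIS replica
        self.healthy = False             # give up; siblings unaffected
        return {"error": f"replica {self.replica_id} failed"}

    def shutdown(self):
        self.healthy = False

    def _execute(self, action):
        # Stand-in for sending the action into the real OS replica.
        if random.random() < 0.05:
            raise RuntimeError("replica crashed")
        return {"observation": f"after {action}", "done": False}

    def _recover(self):
        # Stand-in for rebooting or reprovisioning the replica.
        self.healthy = True

# Each replica gets its own manager: no shared coordinator to overwhelm.
managers = [ReplicaStateManager(i) for i in range(4)]
obs = managers[0].reset({"app": "libreoffice"})
```

Because each manager owns its replica's lifecycle, the batch-level code can treat failures as local events rather than global ones.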
Hardware-Aware OS Replica Orchestration
Here's a non-obvious insight this research surfaces: when you run many OS replicas on a single server, the bottleneck depends on how many replicas you pack per machine. For a small number of replicas per server (low K), the system is CPU-bound, with replicas fighting over processor time. But as you pack more replicas per server (large K), the bottleneck shifts to RAM, and RAM is dramatically cheaper than CPU.
A 32 GB DDR4 RAM module typically costs 10–20% of what a 16-core CPU costs. OSGym runs replicas as Docker containers (using Docker images from OSWorld as a foundation) rather than full virtual machines to reduce per-replica overhead. By choosing servers with higher RAM capacity and running more replicas per machine, the daily cost drops from around $300 for 128 replicas at K=1 to roughly $30 at K=64, or roughly $0.234 per replica per day, a number that fits comfortably within many academic grant budgets.
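The per-replica economics reduce to quick back-of-the-envelope arithmetic. The $300 and $30 daily totals below are the figures reported in the article; everything else follows from division:

```python
replicas = 128

# Daily server cost at two packing densities (figures from the article).
cost_k1 = 300.0    # K=1: CPU-heavy servers, one replica each
cost_k64 = 30.0    # K=64: RAM-heavy servers, 64 replicas each

per_replica_k1 = cost_k1 / replicas    # dollars per replica per day
per_replica_k64 = cost_k64 / replicas

print(f"K=1:  ${per_replica_k1:.3f}/replica/day")
print(f"K=64: ${per_replica_k64:.3f}/replica/day")
print(f"Savings: {cost_k1 / cost_k64:.0f}x")
```

At K=64 this works out to $0.234 per replica per day, matching the headline figure.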
KVM Virtualization with Copy-on-Write Disk Management
The disk provisioning problem is solved with a filesystem technique called reflink copy-on-write (CoW). Normally, spinning up 128 VM instances would mean duplicating a 24 GB base image 128 times: over 3 TB of storage and 30 seconds of provisioning time per VM.
OSGym instead uses cp --reflink=always on XFS-formatted NVMe drives. Each per-VM disk image shares physical disk blocks with the base image and only allocates new blocks when the VM actually writes to them. The result: 128 VMs consume 366 GB of physical disk instead of 3.1 TB (an 88% reduction), and disk provisioning time drops from 30 seconds to 0.8 seconds per VM, a 37× speedup. Each VM still sees its full 24 GB logical disk with near-native CPU performance.
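A minimal sketch of that provisioning step, with all paths and sizes invented for the demo (a tiny stand-in file replaces the 24 GB base image). Note that this uses GNU coreutils cp on Linux, and --reflink=auto so the demo falls back to a plain copy on filesystems without CoW support; the article says OSGym uses --reflink=always on XFS, where a silent fallback cannot happen:

```python
import pathlib
import subprocess
import tempfile

tmp = pathlib.Path(tempfile.mkdtemp())

# Stand-in for the 24 GB base image (4 MB here so the demo is fast).
base = tmp / "base.img"
base.write_bytes(b"\x00" * (4 * 1024 * 1024))

# Provision per-VM disks. With reflink, each copy shares the base
# image's physical blocks until the VM writes to them, so provisioning
# is near-instant and consumes almost no extra physical disk.
for i in range(3):
    subprocess.run(
        ["cp", "--reflink=auto", str(base), str(tmp / f"vm-{i}.img")],
        check=True,
    )

disks = sorted(p.name for p in tmp.glob("vm-*.img"))
print(disks)  # ['vm-0.img', 'vm-1.img', 'vm-2.img']
```

Each copy reports the full logical size of the base image even when, on a CoW filesystem, it occupies almost no new physical blocks.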
Robust Container Pool with Multi-Layer Fault Recovery
OSGym maintains a pre-warmed runner pool (by default, 128 runners per executor node) initialized before training begins. Rather than creating and destroying VMs on demand, runners are recycled between tasks. Before each VM creation, OSGym reads /proc/meminfo and /proc/loadavg to verify the host can safely accommodate another instance, blocking creation if available memory falls below 10% of total or below 8 GB absolute. Each container is memory-limited to 6 GB to prevent over-provisioning under burst scenarios.
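The admission check can be illustrated by parsing /proc/meminfo text the way the article describes. The 10% and 8 GB thresholds come from the article; the function names and parsing code are an assumed sketch, not OSGym's source:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style 'Key:  value kB' lines into a dict of kB."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key] = int(parts[0])
    return fields

def can_create_replica(meminfo_text,
                       min_fraction=0.10,        # block below 10% available
                       min_bytes=8 * 1024**3):   # ...or below 8 GB absolute
    info = parse_meminfo(meminfo_text)
    total = info["MemTotal"] * 1024       # /proc/meminfo reports kB
    avail = info["MemAvailable"] * 1024
    # Creation is allowed only if BOTH thresholds are satisfied.
    return avail >= max(min_fraction * total, min_bytes)

# 256 GiB host with 32 GiB available: 12.5% free and above 8 GB -> allowed.
sample_ok = "MemTotal: 268435456 kB\nMemAvailable: 33554432 kB"
print(can_create_replica(sample_ok))  # True

# 16 GiB host with 4 GiB available: under the 8 GB floor -> blocked.
sample_low = "MemTotal: 16777216 kB\nMemAvailable: 4194304 kB"
print(can_create_replica(sample_low))  # False
```

In production the text would come from reading /proc/meminfo directly; a similar load check against /proc/loadavg would sit alongside it.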
The system also tunes Linux kernel parameters that would otherwise cause silent failures at high concurrency: for example, fs.aio-max-nr is raised from 65,536 to 1,048,576, and fs.inotify.max_user_instances from 128 to 8,192. Fault recovery operates at two levels. At the step level, each action gets up to 10 retries by default; at the task level, if a runner fails entirely, the task is automatically reassigned to a fresh runner.
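The two-level recovery policy amounts to nested retry loops. In this sketch the 10-retry step default is from the article, while the function names, the pool representation, and the toy failure model are invented for illustration:

```python
STEP_RETRIES = 10   # per-action retry budget (article default)

class RunnerFailed(Exception):
    """Raised when a runner exhausts its step-level retries."""

def run_step(runner, action, retries=STEP_RETRIES):
    """Step level: retry a single action on the same runner."""
    for _ in range(retries):
        try:
            return runner(action)
        except RuntimeError:
            continue                     # transient failure, try again
    raise RunnerFailed

def run_task(actions, runner_pool):
    """Task level: if a runner fails for good, reassign the whole
    task to a fresh pre-warmed runner from the pool."""
    while runner_pool:
        runner = runner_pool.pop(0)
        try:
            return [run_step(runner, a) for a in actions]
        except RunnerFailed:
            continue                     # recycle: next runner takes over
    raise RuntimeError("runner pool exhausted")

def broken(action):
    raise RuntimeError("runner crashed")

def healthy(action):
    return f"ok:{action}"

# The broken runner burns its retries; the task moves to the healthy one.
print(run_task(["click", "type"], [broken, healthy]))
```

The key property is that a permanently dead runner costs one reassignment, not a stalled batch.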
Unified Task Flow and Centralized Data Server
Two design elements are particularly important for developers integrating OSGym. Every task follows a four-phase unified execution flow (Configure, Reset, Operate, Evaluate) regardless of which software or domain is involved. This standardization makes it straightforward to add new task types without changing the surrounding infrastructure.
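A task plugin under that four-phase contract might look like the following sketch. The phase names come from the article; the base-class shape and the toy file-renaming task are invented for illustration:

```python
class Task:
    """Four-phase unified flow: Configure -> Reset -> Operate -> Evaluate.
    New task types implement these hooks; the infrastructure never changes."""

    def configure(self):          # install apps, load files, set up env
        raise NotImplementedError
    def reset(self):              # return the initial observation
        raise NotImplementedError
    def operate(self, agent):     # let the agent act in the environment
        raise NotImplementedError
    def evaluate(self):           # score the final state, return reward
        raise NotImplementedError

class RenameFileTask(Task):
    """Toy example: the agent must rename draft.txt to report.txt."""

    def configure(self):
        self.fs = {"draft.txt": "hello"}       # simulated filesystem

    def reset(self):
        return {"files": list(self.fs)}

    def operate(self, agent):
        old, new = agent(self.reset())         # agent proposes a rename
        self.fs[new] = self.fs.pop(old)

    def evaluate(self):
        return 1.0 if "report.txt" in self.fs else 0.0

task = RenameFileTask()
task.configure()
task.operate(lambda obs: ("draft.txt", "report.txt"))
print(task.evaluate())  # 1.0
```

Because the orchestration layer only ever calls these four hooks, adding support for a new application means writing one such class, not touching the replica machinery.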
Above the replica layer, a centralized data server Python class exposes a single-entry batched interface (__next__ and async_step) that hides all the complexity of state manager communication and queuing. The batched step method is asynchronous, meaning the training loop is not blocked while waiting for OS replicas to complete their actions.
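A stripped-down sketch of such a batched asynchronous interface follows. The method names __next__ and async_step are from the article; everything inside them, including the fake replicas, is assumed for illustration:

```python
import asyncio

class DataServer:
    """Single entry point that fans one batch of actions out to all
    replicas concurrently, so the caller waits for the batch as a
    whole rather than for each replica in sequence."""

    def __init__(self, replicas):
        self.replicas = replicas
        self._batch = None

    async def async_step(self, actions):
        # One coroutine per replica; gather overlaps their latencies.
        results = await asyncio.gather(
            *(r(a) for r, a in zip(self.replicas, actions))
        )
        self._batch = results
        return results

    def __next__(self):
        # Hand the latest completed batch to the training loop.
        if self._batch is None:
            raise StopIteration
        return self._batch

async def fake_replica(action, delay):
    await asyncio.sleep(delay)          # simulate OS-side latency
    return f"obs:{action}"

# Three replicas with different latencies; gather hides the variance.
replicas = [lambda a, d=d: fake_replica(a, d) for d in (0.03, 0.01, 0.02)]
server = DataServer(replicas)
batch = asyncio.run(server.async_step(["a", "b", "c"]))
print(batch)  # ['obs:a', 'obs:b', 'obs:c']
```

A real training loop would keep async_step in flight while consuming the previous batch via __next__, which is what keeps GPU-side training from stalling on OS-side latency.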
What the Numbers Look Like in Practice
Using 1,024 parallel OS replicas, the system collected trajectories across ten task categories (including LibreOffice Writer, Calc, and Impress, Chrome, Thunderbird, VLC, VS Code, GIMP, OS system configuration, and multi-app workflows) at roughly 1,420 trajectories per minute, versus the 115,654 seconds the same collection would take without parallelization. The complete dataset cost $43 in cloud compute.
The research team then used that data to fine-tune Qwen2.5-VL 32B via supervised fine-tuning, followed by reinforcement learning using a PPO-based semi-online asynchronous pipeline (200 steps, batch size 64, learning rate 1e-6). The resulting model achieved a 56.3% success rate on the OSWorld-Verified benchmark, competitive with existing methods for a 32B-parameter base model with no task-specific tuning.
Key Takeaways
- Training computer use agents is an infrastructure problem first: Full OS sandboxes with GUIs are far heavier than coding or browser environments; each VM needs ~24 GB of disk, dedicated CPU and RAM, and a display stack. Without careful optimization, scaling to hundreds of replicas is simply unaffordable for most academic labs.
- RAM is a better scaling lever than CPU: OSGym's hardware-aware orchestration shows that packing more replicas per server shifts the bottleneck from CPU to RAM, and RAM is 5–10× cheaper. This single insight cuts per-replica cost from ~$2.10/day to as little as $0.23/day.
- Copy-on-write disk management eliminates the storage wall: By using XFS reflink CoW (cp --reflink=always), OSGym reduces physical disk consumption by 88% and speeds up VM disk provisioning by 37×, turning a 3.1 TB, 30-second-per-VM problem into a 366 GB, 0.8-second one.
- Decentralized state management is the key to robustness at scale: Giving each OS replica its own dedicated state manager means failures stay isolated. Even starting from a fully crashed state, OSGym self-recovers all replicas within a short window, which is crucial for uninterrupted long-running training jobs.
- Academic-scale computer use agent research is now financially viable: With 1,024 replicas producing 1,420 trajectories per minute and a full dataset costing just $43 in cloud compute, OSGym brings the infrastructure cost of training general-purpose computer agents within reach of university research budgets.
Check out the Paper here.
The submit Meet OSGym: A New OS Infrastructure Framework That Manages 1,000+ Replicas at $0.23/Day for Computer Use Agent Research appeared first on MarkTechPost.
