
Meta AI Introduces DreamGym: A Textual Experience Synthesizer for Reinforcement Learning (RL) Agents

Reinforcement learning (RL) for large language model (LLM) agents looks attractive on paper, but in practice it breaks on cost, infrastructure, and reward noise. Training an agent that clicks through web pages or completes multi-step tool use can easily need tens of thousands of real interactions, each slow, brittle, and hard to reset. Meta's new framework DreamGym reframes that bottleneck as a modeling problem. Instead of running RL directly in environments such as WebShop, ALFWorld, and WebArena Lite, it learns a reasoning-based experience model that simulates them entirely in text.

https://arxiv.org/pdf/2511.03773

Why Real-Environment RL for Agents Does Not Scale

Current RL pipelines for agents face four coupled problems. Real rollouts are expensive, task diversity is limited, reward signals are unstable, and the infrastructure stack is complex. Web environments change often, rewards depend on fragile scrapers, and many actions are irreversible. Reset mechanisms and episode control are also hard to implement, so long-horizon tasks become noisy and sample inefficient.

Benchmarks split into two groups. WebShop and ALFWorld are RL-ready but costly, since they still need about 80 thousand real transitions to reach strong baselines with PPO or GRPO. WebArena Lite is not RL-ready at all, because resets and automated reward checks are unreliable, so online RL in the real environment is effectively infeasible.

DreamGym as a Reasoning-Based Simulator

DreamGym is built around three components: a reasoning-based experience model, an experience replay buffer, and an adaptive curriculum task generator. Together they define a synthetic Markov decision process where the environment lives as text.

The reasoning-based experience model M_exp operates in an abstract textual state space. States are compact descriptions of what matters for the task, for example cleaned page elements instead of raw HTML. On each step, the agent provides the current state, the action, the task instruction, and the interaction history. The system retrieves the top-k similar past transitions from the replay buffer, then uses chain-of-thought reasoning to produce a reasoning trace, a next state, and a reward.
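To make the data flow concrete, here is a minimal Python sketch of one synthetic step. The `llm` callable, the `Transition` container, the word-overlap retrieval, and all names are illustrative stand-ins for the paper's learned components, not its actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Transition:
    state: str
    action: str
    next_state: str
    reward: float

def retrieve_top_k(buffer, query, k=3):
    # Toy lexical similarity (word-set overlap); a real system would use
    # a learned embedding encoder, as the paper describes.
    q = set(query.split())
    scored = sorted(
        buffer,
        key=lambda t: -len(q & set((t.state + " " + t.action).split())),
    )
    return scored[:k]

def experience_model_step(llm, task, history, state, action, buffer, k=3):
    """One synthetic step: retrieve similar past transitions, then ask the
    model for a reasoning trace, a next state, and a reward."""
    exemplars = retrieve_top_k(buffer, state + " " + action, k)
    prompt = (
        f"Task: {task}\nHistory: {history}\nState: {state}\nAction: {action}\n"
        "Similar past transitions:\n"
        + "\n".join(
            f"- {t.state} --{t.action}--> {t.next_state} (r={t.reward})"
            for t in exemplars
        )
        + "\nThink step by step, then output the next state and reward."
    )
    reasoning, next_state, reward = llm(prompt)
    return reasoning, next_state, reward
```

In a real rollout, `llm` would be the fine-tuned experience model; here any callable returning a `(reasoning, next_state, reward)` triple fits the interface.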

Conceptually, you can view M_exp as an LLM world model for web and tool tasks, but defined purely over text. It is trained with supervised fine-tuning on offline trajectories, with a joint objective that learns to generate both the reasoning trace and the next state conditioned on that trace. This forces the model to encode causal structure, not just local text statistics.
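In schematic form, such a joint objective can be written as follows (our own notation, not the paper's exact formula): with context h (instruction and history), state s, action a, reasoning trace c, and targets (s', r),

```latex
\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{(h,\,s,\,a,\,c,\,s',\,r)}
\Big[\, \log p_\theta\!\left(c \mid h, s, a\right)
\;+\; \log p_\theta\!\left(s', r \mid c,\, h, s, a\right) \Big]
```

The second term conditions the next-state and reward prediction on the generated trace, which is what pushes the model toward causal structure rather than surface statistics.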


Replay Buffer as Grounding Memory

The experience replay buffer is initialized with offline real-environment data from WebShop, ALFWorld, and WebArena Lite. As DreamGym trains policies in the synthetic environment, it writes new trajectories back into that buffer. Each prediction step in M_exp uses an encoder to retrieve a small set of similar transitions from this memory and conditions on them when generating reasoning and next states.

This retrieval acts as grounding. It keeps synthetic transitions close to the empirical data distribution and reduces hallucinations in long rollouts. The research team showed that removing history or retrieval degrades the consistency, informativeness, and factuality of the generated states when judged by an external evaluator, and it also lowers downstream success rates on WebShop and WebArena Lite.

Curriculum from Reward Entropy

The curriculum task generator uses the same backbone as the experience model. It selects seed tasks whose outcomes under the current policy have high reward variance, which corresponds to intermediate-difficulty tasks that the agent sometimes solves and sometimes fails. For each such task, the model generates variations that preserve action types but change constraints, goals, or context.

The selection heuristic is based on reward entropy computed over batches of rollouts for each task. Tasks with non-zero variance and balanced success and failure are preferred. Ablations show that turning off this adaptive curriculum causes both WebShop and WebArena Lite performance to drop by around 6 percentage points and leads to early plateaus as the replay buffer saturates with easy, low-entropy trajectories.
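For binary rewards, the entropy heuristic reduces to the entropy of the per-task success rate, which peaks for tasks the agent solves about half the time. A minimal sketch, with hypothetical names and plain binary entropy standing in for the paper's exact scoring rule:

```python
import math

def binary_reward_entropy(rewards):
    """Entropy of the success rate over a batch of binary rewards.
    Maximal at p = 0.5, zero for always-solved or never-solved tasks."""
    if not rewards:
        return 0.0
    p = sum(rewards) / len(rewards)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def select_seed_tasks(rollout_rewards, top_n=2):
    """rollout_rewards: {task_id: [0/1 rewards from recent rollouts]}.
    Returns the tasks with the highest reward entropy, i.e. the
    intermediate-difficulty candidates for generating variations."""
    ranked = sorted(
        rollout_rewards,
        key=lambda t: -binary_reward_entropy(rollout_rewards[t]),
    )
    return ranked[:top_n]
```

Selected seeds would then be passed back to the generator backbone to produce task variations with the same action types but new constraints or goals.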


RL Inside DreamGym and Theoretical Guarantees

Inside DreamGym, the policy is trained with standard RL algorithms. The research team evaluates Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO). Rollouts alternate between the policy choosing actions and the experience model synthesizing next states and rewards. From the viewpoint of the RL code, this is just another environment interface.
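A minimal sketch of that interface, assuming a generic `experience_model` callable; the class name, signatures, and reset semantics are illustrative, not DreamGym's actual API:

```python
class DreamGymEnv:
    """Gym-style wrapper around a textual experience model.
    `experience_model` is any callable mapping
    (task, history, state, action) -> (next_state, reward, done)."""

    def __init__(self, experience_model, task, initial_state):
        self.model = experience_model
        self.task = task
        self.initial_state = initial_state
        self.history = []
        self.state = initial_state

    def reset(self):
        # Resetting is trivial here: no browser, no scraper, just text.
        self.history = []
        self.state = self.initial_state
        return self.state

    def step(self, action):
        next_state, reward, done = self.model(
            self.task, self.history, self.state, action
        )
        self.history.append((self.state, action))
        self.state = next_state
        return next_state, reward, done
```

A PPO or GRPO trainer can consume this exactly like a real environment, which is the point: the RL code does not need to know the transitions are synthesized.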

The research team also derives a trust-region-style improvement bound that links policy performance in the synthetic MDP and in the real environment. The bound contains error terms that depend on the reward prediction error and the divergence between real and synthetic transition distributions. As these errors shrink, improvement in DreamGym implies improvement in the underlying real task.
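Schematically, a bound of this shape can be written as (our notation, not the paper's exact statement):

```latex
J_{\mathrm{real}}(\pi) \;\ge\; J_{\mathrm{syn}}(\pi)
\;-\; C_r\,\epsilon_{\mathrm{reward}}
\;-\; C_d\,\mathbb{E}_{(s,a)}\!\left[
  D_{\mathrm{TV}}\!\big(P_{\mathrm{real}}(\cdot \mid s,a),\,
                        P_{\mathrm{syn}}(\cdot \mid s,a)\big)
\right]
```

As the reward prediction error and the transition divergence shrink, the synthetic return becomes a faithful lower-bound proxy for real performance.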

Experimental Results on WebShop, ALFWorld and WebArena Lite

DreamGym is tested with Llama-based and Qwen-based agents across WebShop, ALFWorld, and WebArena Lite. Results fall into three regimes.

First, in the RL-ready but costly environments WebShop and ALFWorld, agents trained with PPO or GRPO inside DreamGym, using only synthetic transitions, match the performance of PPO and GRPO baselines that use about 80 thousand real-environment interactions. This shows that reasoning-based experience synthesis can provide enough signal for stable policy improvement.

Second, in non-RL-ready environments such as WebArena Lite, DreamGym enables RL training that would otherwise be impractical. The framework achieves more than a 30 percent improvement in success rate over all baselines, including supervised fine-tuning and direct behavior cloning.

Third, in sim-to-real transfer, the DreamGym-S2R configuration first trains a policy entirely in the synthetic environment and then fine-tunes it with a small number of real rollouts. This setting yields more than 40 percent additional gain compared with training from scratch in the real environment, while using less than 10 percent of the real data and cutting total training cost to roughly between one third and one fifth of the baselines.


Key Takeaways

  1. DreamGym replaces fragile real-environment rollouts with a reasoning-based experience model that operates in an abstract textual state space, predicting the next state and reward from the history, the task, and retrieved similar transitions.
  2. The framework combines three components, a reasoning experience model, an experience replay buffer seeded with real trajectories, and a curriculum task generator that selects and varies tasks using a reward entropy heuristic, which together stabilize and diversify RL training.
  3. In WebShop and ALFWorld, which are RL-ready but expensive, agents trained with PPO or GRPO entirely inside DreamGym using synthetic interactions match the performance of PPO and GRPO baselines that use about 80,000 real-environment transitions.
  4. In WebArena Lite, which is not RL-ready, DreamGym enables online RL and achieves a more than 30 percent higher success rate than all non-RL baselines, including supervised fine-tuning and behavior cloning.
  5. In the sim-to-real configuration, policies pretrained in DreamGym and then fine-tuned with a small number of real rollouts achieve more than 40 percent additional improvement while using less than 10 percent of the real interaction budget and reducing total training cost to around one third to one fifth of standard RL.

Editorial Comments

DreamGym is an important step toward practical reinforcement learning for LLM agents because it reframes the environment as a reasoning-based experience model, grounded by an experience replay buffer and a reward-entropy-driven curriculum, rather than as a fragile browser stack. The reported gains on WebArena Lite, WebShop, and ALFWorld with PPO and GRPO suggest that synthetic experience plus sim-to-real adaptation can become a standard pattern for agent training at scale. Overall, DreamGym makes the experience model, not the policy, the main lever for scaling RL agents.


Check out the Full Paper.

The post Meta AI Introduces DreamGym: A Textual Experience Synthesizer for Reinforcement Learning (RL) Agents appeared first on MarkTechPost.
