
Meta AI’s ‘Early Experience’ Trains Language Agents without Rewards—and Outperforms Imitation Learning

How would your agent stack change if a policy could train purely on its own outcome-grounded rollouts, with no rewards and no extra demonstrations, and still beat imitation learning across eight benchmarks? Meta Superintelligence Labs proposes ‘Early Experience’, a reward-free training approach that improves policy learning in language agents without large human demonstration sets and without reinforcement learning (RL) in the main loop. The core idea is simple: let the agent branch from expert states, take its own actions, collect the resulting future states, and convert those consequences into supervision. The research team instantiates this with two concrete strategies, Implicit World Modeling (IWM) and Self-Reflection (SR), and reports consistent gains across eight environments and multiple base models.

https://arxiv.org/pdf/2510.08558

What does Early Experience change?

Traditional pipelines lean on imitation learning (IL) over expert trajectories, which is cheap to optimize but hard to scale and brittle out of distribution; reinforcement learning (RL) promises learning from experience but needs verifiable rewards and stable infrastructure, both of which are often missing in web and multi-tool settings. Early Experience sits between them: it is reward-free like imitation learning (IL), but the supervision is grounded in the consequences of the agent’s own actions, not just expert actions. In short, the agent proposes, acts, and learns from what actually happens next, with no reward function required.

  • Implicit World Modeling (IWM): Train the model to predict the next observation given the state and chosen action, tightening the agent’s internal model of environment dynamics and reducing off-policy drift.
  • Self-Reflection (SR): Present the expert action and alternative actions at the same state; have the model explain why the expert action is better using the observed outcomes, then fine-tune the policy on this contrastive signal.

Both strategies use the same budgets and decoding settings as IL; only the data source differs (agent-generated branches rather than additional expert trajectories).
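For concreteness, the sketch below shows how those two supervision formats could be serialized into supervised fine-tuning examples. It is a minimal sketch: the prompt templates, dictionary keys, and target layout are assumptions for illustration, not the paper’s exact formats.

```python
# Minimal sketch of turning agent-generated branches into supervised
# fine-tuning examples for the two Early Experience variants. The prompt
# templates and dictionary keys are illustrative assumptions, not the
# paper's exact formats.

def iwm_example(state: str, action: str, next_state: str) -> dict:
    """Implicit World Modeling: predict the next observation given the
    current state and the action the agent actually took."""
    return {
        "prompt": f"State:\n{state}\n\nAction taken:\n{action}\n\nPredict the next observation:",
        "target": next_state,
    }


def sr_example(state: str, expert_action: str,
               alternatives: list[tuple[str, str]], rationale: str) -> dict:
    """Self-Reflection: contrast the expert action with alternative actions
    and their observed outcomes; here the target is an outcome-grounded
    rationale followed by the expert action (the exact target layout is an
    assumption)."""
    alts = "\n".join(f"- Action: {a}\n  Observed outcome: {o}" for a, o in alternatives)
    return {
        "prompt": (
            f"State:\n{state}\n\nExpert action:\n{expert_action}\n\n"
            f"Alternative actions and their observed outcomes:\n{alts}\n\n"
            "Explain why the expert action is preferable, then give the action."
        ),
        "target": f"{rationale}\nFinal action: {expert_action}",
    }
```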


Understanding the Benchmarks

The research team evaluates on eight language-agent environments spanning web navigation, long-horizon planning, scientific/embodied tasks, and multi-domain API workflows, including WebShop (transactional shopping), TravelPlanner (constraint-rich planning), ScienceWorld, ALFWorld, Tau-Bench, and others. Early Experience yields average absolute gains of +9.6 in success rate and +9.4 out-of-domain (OOD) over IL across the full matrix of tasks and models. These gains persist when the same checkpoints are used to initialize RL (GRPO), improving post-RL ceilings by up to +6.4 compared with reinforcement learning (RL) started from imitation learning (IL).

Efficiency: less expert data, same optimization budget

A key practical win is demonstration efficiency. With a fixed optimization budget, Early Experience matches or beats IL using a fraction of the expert data. On WebShop, Early Experience with 1/8 of the demonstrations already exceeds IL trained on the full demo set; on ALFWorld, parity is reached at 1/2 of the demos. The advantage grows with more demonstrations, indicating that the agent-generated future states provide supervision signals that demonstrations alone don’t capture.

How is the data constructed?

The pipeline seeds from a limited set of expert rollouts to obtain representative states. At selected states, the agent proposes alternative actions, executes them, and records the resulting next observations; a minimal sketch of this loop appears after the list below.

  • For IWM, the training data are triplets ⟨state, action, next-state⟩ and the objective is next-state prediction.
  • For SR, the prompts include the expert action and several alternatives plus their observed outcomes; the model produces a grounded rationale explaining why the expert action is preferable, and this supervision is then used to improve the policy.
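The branching loop itself can be summarized as follows. This is a sketch under assumed interfaces: the environment’s reset_to/step methods and the policy’s propose_alternatives helper are hypothetical placeholders, not the paper’s actual API.

```python
# Sketch of the Early Experience data-collection loop: branch from states
# visited in expert rollouts, execute agent-proposed alternative actions,
# and record the resulting observations. The env.reset_to / env.step and
# policy.propose_alternatives interfaces are hypothetical placeholders.

def collect_branches(env, policy, expert_rollouts, k_alternatives=3):
    iwm_data, sr_data = [], []
    for rollout in expert_rollouts:                  # each rollout: list of (state, expert_action)
        for state, expert_action in rollout:
            outcomes = []
            for action in policy.propose_alternatives(state, k=k_alternatives):
                env.reset_to(state)                  # branch from the expert-visited state
                next_obs = env.step(action)          # execute the agent's own action
                iwm_data.append((state, action, next_obs))       # IWM triplet ⟨state, action, next-state⟩
                outcomes.append((action, next_obs))
            sr_data.append((state, expert_action, outcomes))     # SR: expert vs. alternatives + outcomes
    return iwm_data, sr_data
```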

Where does reinforcement learning (RL) fit?

Early Experience is not “RL without rewards.” It is a supervised recipe that uses agent-experienced outcomes as labels. In environments with verifiable rewards, the research team simply adds RL after Early Experience. Because the initialization is better than IL, the same RL schedule climbs higher and faster, with up to +6.4 final success over IL-initialized RL across tested domains. This positions Early Experience as a bridge: reward-free pre-training from consequences, followed (where possible) by standard reinforcement learning (RL).
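As an illustration of that second stage, here is a hedged sketch of starting GRPO from an Early Experience checkpoint rather than an IL checkpoint, assuming recent versions of Hugging Face TRL (GRPOTrainer/GRPOConfig); the checkpoint path, prompts, and reward check are placeholders, not the paper’s setup.

```python
# Stage two sketch: run GRPO from the Early Experience checkpoint instead of
# an IL checkpoint. Assumes recent Hugging Face TRL; the checkpoint path,
# prompts, and reward check are illustrative placeholders.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

prompts = Dataset.from_dict({
    "prompt": [
        "Find a red ceramic mug under $15 and purchase it.",
        "Plan a two-day trip to Chicago within a $500 budget.",
    ]
})

def task_reward(completions, **kwargs):
    # Placeholder verifiable reward: 1.0 if the rollout claims task completion.
    # A real setup would execute the episode and check environment success.
    return [1.0 if "task complete" in c.lower() else 0.0 for c in completions]

trainer = GRPOTrainer(
    model="path/to/early-experience-checkpoint",  # EE-initialized policy (placeholder path)
    reward_funcs=task_reward,
    args=GRPOConfig(output_dir="grpo-after-early-experience"),
    train_dataset=prompts,
)
trainer.train()
```

The only change relative to an IL-then-RL pipeline is the checkpoint passed to the trainer; the RL schedule itself stays the same, which is what makes the +6.4 ceiling comparison apples-to-apples.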

Key Takeaways

  • Reward-free training via agent-generated future states (not rewards), using Implicit World Modeling and Self-Reflection, outperforms imitation learning across eight environments.
  • Reported absolute gains over IL: +18.4 (WebShop), +15.0 (TravelPlanner), +13.3 (ScienceWorld) under matched budgets and settings.
  • Demo efficiency: exceeds IL on WebShop with 1/8 of the demonstrations; reaches parity on ALFWorld with 1/2 of them, at fixed optimization cost.
  • As an initializer, Early Experience boosts subsequent RL (GRPO) endpoints by up to +6.4 versus RL started from IL.
  • Validated on multiple backbone families (3B–8B) with consistent in-domain and out-of-domain improvements; positioned as a bridge between imitation learning (IL) and reinforcement learning (RL).

Editorial Comments

Early Experience is a pragmatic contribution: it replaces brittle rationale-only augmentation with outcome-grounded supervision that an agent can generate at scale, without reward functions. The two variants, Implicit World Modeling (next-observation prediction to anchor environment dynamics) and Self-Reflection (contrastive, outcome-verified rationales against expert actions), directly attack off-policy drift and long-horizon error accumulation, which explains the consistent gains over imitation learning across eight environments and the stronger RL ceilings when used as an initializer for GRPO. In web and tool-use settings where verifiable rewards are scarce, this reward-free supervision is the missing middle between IL and RL and is immediately actionable for production agent stacks.


Check out the Paper here. Feel free to take a look at our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! Are you on Telegram? Now you can join us on Telegram as well.

The post Meta AI’s ‘Early Experience’ Trains Language Agents without Rewards—and Outperforms Imitation Learning appeared first on MarkTechPost.
