Stanford Researchers Released AgentFlow: In-the-Flow Reinforcement Learning (RL) for Modular, Tool-Using AI Agents

TL;DR: AgentFlow is a trainable agent framework with four modules (Planner, Executor, Verifier, Generator) coordinated through an explicit memory and a toolset. The planner is optimized in the loop with a new on-policy method, Flow-GRPO, which broadcasts a trajectory-level outcome reward to every turn and applies token-level PPO-style updates with KL regularization and group-normalized advantages. On ten benchmarks, a 7B backbone tuned with Flow-GRPO reports gains of +14.9% (search), +14.0% (agentic), +14.5% (math), and +4.1% (science) over strong baselines.
What is AgentFlow?
AgentFlow formalizes multi-turn, tool-integrated reasoning as a Markov Decision Process (MDP). At every turn, the Planner proposes a sub-goal and selects a tool plus context; the Executor calls the tool; the Verifier signals whether to continue; and the Generator emits the final answer on termination. A structured, evolving memory records states, tool calls, and verification signals, constraining context growth and making trajectories auditable. Only the planner is trained; the other modules can be fixed engines.
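As a rough illustration of that loop, here is a minimal Python sketch; the module interfaces, names, and signatures below are assumptions made for illustration, not the repository's actual API.

```python
# A minimal sketch of the turn loop described above. Module interfaces, names,
# and signatures are illustrative assumptions, not the repository's actual API.
from dataclasses import dataclass, field

@dataclass
class Memory:
    # Structured, evolving record of states, tool calls, and verification signals.
    records: list = field(default_factory=list)

def run_agent(query, planner, executor, verifier, generator, tools, max_turns=10):
    memory = Memory()
    for turn in range(max_turns):
        # Planner (the only trained module) proposes a sub-goal and picks a tool plus context.
        subgoal, tool_name, context = planner(query, memory)
        # Executor calls the selected tool.
        result = executor(tools[tool_name], subgoal, context)
        # Verifier signals whether the trajectory should terminate.
        done = verifier(query, memory, result)
        memory.records.append(
            {"turn": turn, "subgoal": subgoal, "tool": tool_name,
             "result": result, "done": done}
        )
        if done:
            break
    # Generator emits the final answer from the accumulated memory.
    return generator(query, memory)
```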
The public implementation provides a modular toolkit (e.g., `base_generator`, `python_coder`, `google_search`, `wikipedia_search`, `web_search`) and ships quick-start scripts for inference, training, and benchmarking. The repository is MIT-licensed.

Training methodology: Flow-GRPO
Flow-GRPO (Flow-based Group Refined Policy Optimization) converts long-horizon, sparse-reward optimization into tractable single-turn updates (a code sketch of the resulting loss follows the list below):
- Final-outcome reward broadcast: a single, verifiable trajectory-level signal (LLM-as-judge correctness) is assigned to every turn, aligning local planning steps with global success.
- Token-level clipped objective: importance-weighted ratios are computed per token, with PPO-style clipping and a KL penalty against a reference policy to prevent drift.
- Group-normalized advantages: normalizing rewards across groups of on-policy rollouts reduces variance and stabilizes updates.
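Under these descriptions, a per-update loss might be sketched as follows; tensor shapes, hyperparameter values, and the KL estimator are illustrative assumptions rather than the paper's reference implementation.

```python
# A minimal PyTorch-style sketch of a Flow-GRPO-style loss, assuming per-token
# log-probs have already been gathered for the planner tokens of G rollouts of
# the same query. Shapes, hyperparameters, and the KL estimator are
# illustrative assumptions, not the paper's reference implementation.
import torch

def flow_grpo_loss(
    logp_new,         # [G, T] per-token log-probs under the current planner policy
    logp_old,         # [G, T] per-token log-probs under the rollout (behavior) policy
    logp_ref,         # [G, T] per-token log-probs under the frozen reference policy
    mask,             # [G, T] 1.0 for planner tokens, 0.0 for padding
    outcome_reward,   # [G]    single trajectory-level reward (e.g., LLM-judge correctness)
    clip_eps=0.2,
    kl_coef=0.01,
):
    # Group-normalized advantage: normalize the trajectory-level rewards within
    # the group of G on-policy rollouts.
    adv = (outcome_reward - outcome_reward.mean()) / (outcome_reward.std() + 1e-8)
    # Broadcast the single trajectory-level advantage to every token of every turn.
    adv_tok = adv[:, None].expand_as(logp_new)

    # Token-level importance ratios with PPO-style clipping.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv_tok
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv_tok
    policy_term = torch.minimum(unclipped, clipped)

    # Simple per-token KL estimate against the reference policy to prevent drift.
    kl_term = logp_new - logp_ref

    per_token = policy_term - kl_coef * kl_term
    # Maximize the objective -> minimize its negative, averaged over valid tokens.
    return -(per_token * mask).sum() / mask.sum().clamp(min=1.0)
```

In practice the group size, clipping range, and KL coefficient are training hyperparameters; the sketch only shows how a single outcome reward can drive token-level, PPO-style updates.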

Understanding the results and benchmarks
Benchmarks. The research team evaluates four task types: knowledge-intensive search (Bamboogle, 2Wiki, HotpotQA, Musique), agentic reasoning (GAIA textual split), math (AIME-24, AMC-23, Game of 24), and science (GPQA, MedQA). GAIA is a tooling-oriented benchmark for general assistants; the textual split excludes multimodal requirements.
Main numbers (7B backbone after Flow-GRPO). Average gains over strong baselines: +14.9% (search), +14.0% (agentic), +14.5% (math), +4.1% (science). The research team states that their 7B system surpasses GPT-4o on the reported suite. The project page also reports training effects such as improved planning quality, reduced tool-calling errors (by up to 28.4% on GAIA), and positive trends with larger turn budgets and model scale.
Ablations. Online Flow-GRPO improves performance by +17.2% over a frozen-planner baseline, while offline supervised fine-tuning of the planner degrades performance by 19.0% on their composite metric.

Key Takeaways
- Modular agent, planner-only training. AgentFlow structures an agent into Planner–Executor–Verifier–Generator with an explicit memory; only the Planner is trained in-loop.
- Flow-GRPO converts long-horizon RL into single-turn updates. A trajectory-level outcome reward is broadcast to every turn; updates use token-level PPO-style clipping with KL regularization and group-normalized advantages.
- Research-team-reported gains on 10 benchmarks. With a 7B backbone, AgentFlow reports average improvements of +14.9% (search), +14.0% (agentic/GAIA textual), +14.5% (math), and +4.1% (science) over strong baselines, and states that it surpasses GPT-4o on the same suite.
- Tool-use reliability improves. The research team reports reduced tool-calling errors (e.g., on GAIA) and better planning quality under larger turn budgets and model scale.
Editorial Comments
AgentFlow formalizes tool-using agents into four modules (planner, executor, verifier, generator) and trains only the planner in-loop via Flow-GRPO, which broadcasts a single trajectory-level reward to every turn and applies token-level PPO-style updates with KL control. Reported results on ten benchmarks show average gains of +14.9% (search), +14.0% (agentic/GAIA textual split), +14.5% (math), and +4.1% (science); the research team additionally states that the 7B system surpasses GPT-4o on this suite. The implementation, tools, and quick-start scripts are MIT-licensed in the GitHub repo.
Check out the Technical Paper, GitHub Page, and Project Page.