Weak-for-Strong (W4S): A Novel Reinforcement Learning Algorithm that Trains a Weak Meta-Agent to Design Agentic Workflows with Stronger LLMs

Researchers from Stanford, EPFL, and UNC introduce Weak-for-Strong Harnessing (W4S), a new reinforcement learning (RL) framework that trains a small meta-agent to design and refine code workflows that call a stronger executor model. The meta-agent does not fine-tune the strong model; it learns to orchestrate it. W4S formalizes workflow design as a multi-turn Markov decision process and trains the meta-agent with a method called Reinforcement Learning for Agentic Workflow Optimization (RLAO). The research team reports consistent gains across 11 benchmarks with a 7B meta-agent trained for about 1 GPU hour.

W4S operates in turns. The state contains the task instructions, the current workflow program, and feedback from prior executions. An action has two parts: an analysis of what to change, and new Python workflow code that implements those changes. The environment executes the code on validation items, returns accuracy and failure cases, and provides a new state for the next turn. The meta-agent can run a quick self-check on one sample; if errors arise it attempts up to 3 repairs, and if errors persist the action is skipped. This loop provides learning signal without touching the weights of the strong executor.
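A minimal sketch of that per-turn self-check, under stated assumptions: `run_workflow` and `repair` are hypothetical stand-ins (in W4S the repair step would be a meta-agent LLM call), not the paper's API.

```python
from typing import Optional

MAX_REPAIRS = 3  # per the paper: up to 3 repair attempts before skipping the action

def run_workflow(code: str, sample: dict) -> None:
    """Stand-in executor: compile and run candidate workflow code on one sample."""
    exec(compile(code, "<workflow>", "exec"), {"sample": sample})

def repair(code: str, error: str) -> str:
    """Stand-in for the meta-agent's repair step (an LLM call in W4S)."""
    return code  # a real implementation would prompt the meta-agent with the error

def self_check(code: str, sample: dict) -> Optional[str]:
    """Quick self-check on one sample; return runnable code, or None to skip."""
    for attempt in range(MAX_REPAIRS + 1):
        try:
            run_workflow(code, sample)
            return code  # no error raised, keep this action
        except Exception as err:
            if attempt == MAX_REPAIRS:
                return None  # errors persist after 3 repairs, skip the action
            code = repair(code, str(err))
```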

W4S runs as an iterative loop:
- Workflow generation: The weak meta-agent writes a new workflow that leverages the strong model, expressed as executable Python code.
- Execution and feedback: The strong model executes the workflow on validation samples, then returns accuracy and error cases as feedback.
- Refinement: The meta-agent uses the feedback to update the analysis and the workflow, then repeats the loop, as sketched below.
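A minimal sketch of this outer loop, assuming hypothetical `generate` and `execute` callables that wrap the meta-agent and the strong executor; the `State` container mirrors the MDP state described above but is not the paper's code.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class State:
    """Multi-turn MDP state: task instructions, current workflow, execution feedback."""
    task: str
    workflow_code: str = ""
    feedback: list = field(default_factory=list)

def optimize(task: str,
             generate: Callable[[State], tuple[str, str]],   # meta-agent call
             execute: Callable[[str], tuple[float, list]],   # strong executor call
             turns: int = 10) -> State:
    """Run the generate -> execute -> refine loop for a fixed number of turns."""
    state = State(task)
    for _ in range(turns):
        analysis, code = generate(state)     # 1. meta-agent writes new workflow code
        accuracy, failures = execute(code)   # 2. executor runs it on validation items
        state.workflow_code = code           # 3. feedback flows into the next state
        state.feedback.append(
            {"analysis": analysis, "accuracy": accuracy, "failures": failures}
        )
    return state
```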
Reinforcement Learning for Agentic Workflow Optimization (RLAO)
RLAO is an offline reinforcement learning procedure over multi-turn trajectories. At each iteration, the system samples several candidate actions, keeps the best-performing action to advance the state, and stores the others for training. The policy is optimized with reward-weighted regression. The reward is sparse and compares current validation accuracy to history: a higher weight is given when the new result beats the previous best, and a smaller weight when it beats the last iteration. This objective favors steady progress while controlling exploration cost.
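A hedged sketch of that reward scheme and the reward-weighted regression weighting; the 1.0/0.5 weight values and the exponential temperature `beta` are illustrative assumptions, not values from the paper.

```python
import math

def reward(acc: float, best_so_far: float, last_acc: float,
           w_best: float = 1.0, w_last: float = 0.5) -> float:
    """Sparse reward against history; the weight values are illustrative."""
    if acc > best_so_far:
        return w_best   # new result beats the previous best: larger weight
    if acc > last_acc:
        return w_last   # beats only the last iteration: smaller weight
    return 0.0          # otherwise no reward (sparse)

def rwr_weight(r: float, beta: float = 1.0) -> float:
    """Reward-weighted regression: each stored action's log-likelihood is
    scaled by exp(r / beta), so higher-reward actions are imitated more."""
    return math.exp(r / beta)
```

Because training runs over stored trajectories, reward-weighted regression amounts to weighted maximum likelihood, which is what makes the procedure offline.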

Understanding the Results
On HumanEval with GPT-4o-mini as executor, W4S achieves Pass@1 of 95.4, with about 33 minutes of workflow optimization, zero meta-agent API cost, an optimization execution cost of about $0.4, and about 2.7 minutes to execute the test set at about $0.5, for a total of about $0.9. Under the same executor, AFlow and ADAS trail this number. The reported average gains over the strongest automated baseline range from 2.9% to 24.6% across 11 benchmarks.
On math transfer, the meta-agent is trained on GSM Plus and MGSM with GPT-3.5-Turbo as executor, then evaluated on GSM8K, GSM Hard, and SVAMP. The paper reports 86.5 on GSM8K and 61.8 on GSM Hard, both above automated baselines. This indicates that the learned orchestration transfers to related tasks without retraining the executor.
Across seen tasks with GPT-4o-mini as executor, W4S surpasses training-free automated methods that do not learn a planner. The study also runs ablations where the meta-agent is trained by supervised fine-tuning rather than RLAO; the RLAO agent yields higher accuracy under the same compute budget. The research team also includes a GRPO baseline on a 7B weak model for GSM Hard, and W4S outperforms it under limited compute.
Iteration budgets matter. The research team sets W4S to about 10 optimization turns in the main tables, while AFlow runs about 20 turns and ADAS runs about 30 turns. Despite fewer turns, W4S achieves higher accuracy. This suggests that learned planning over code, combined with validation feedback, makes the search more sample-efficient.

Key Takeaways
- W4S trains a 7B weak meta-agent with RLAO to write Python workflows that harness stronger executors, modeled as a multi-turn MDP.
- On HumanEval with GPT-4o-mini as executor, W4S reaches Pass@1 of 95.4, with about 33 minutes of optimization and about $0.9 total cost, beating automated baselines under the same executor.
- Across 11 benchmarks, W4S improves over the strongest baseline by 2.9% to 24.6%, while avoiding fine-tuning of the strong model.
- The method runs an iterative loop: it generates a workflow, executes it on validation data, then refines it using feedback.
- ADAS and AFlow also program or search over code workflows; W4S differs by training a planner with offline reinforcement learning.
Editorial Comments
W4S targets orchestration, not model weights, and trains a 7B meta-agent to program workflows that call stronger executors. W4S formalizes workflow design as a multi-turn MDP and optimizes the planner with RLAO using offline trajectories and reward-weighted regression. Reported results show Pass@1 of 95.4 on HumanEval with GPT-4o-mini, average gains of 2.9% to 24.6% across 11 benchmarks, and about 1 GPU hour of training for the meta-agent. The framing compares cleanly with ADAS and AFlow, which search agent designs or code graphs, whereas W4S fixes the executor and learns the planner.
Check out the Technical Paper and GitHub Repo.