A New Agency-Focused Supervision Approach Scales Software AI Agents With Only 78 Examples

Do curated, tool-grounded demonstrations build stronger software agents than broad piles of generic instruction data? A team of researchers from Shanghai Jiao Tong University and SII Generative AI Research Lab (GAIR) proposes LIMI (“Less Is More for Agency”), a supervised fine-tuning method that turns a base model into a capable software/research agent using just 78 samples. LIMI scores a 73.5% average on AgencyBench (FTFC 71.7, RC@3 74.2, SR@3 74.6), beating strong baselines (GLM-4.5 45.1, Qwen3-235B-A22B 27.5, Kimi-K2 24.1, DeepSeek-V3.1 11.9) and even surpassing variants trained on 10,000 samples, with 128× less data.

What exactly is new?
- Agency Efficiency Principle: LIMI posits that agentic competence scales more with data quality and structure than with raw sample count. The research team fine-tunes GLM-4.5/GLM-4.5-Air on 78 long-horizon, tool-use trajectories (samples) and reports large gains on AgencyBench and generalization suites (TAU2-bench, EvalPlus-HE/MBPP, DS-1000, SciCode).
- Minimal but dense supervision. Each trajectory (~13k–152k tokens; ~42.4k avg.) captures a full multi-turn workflow: model reasoning, tool calls, and environment observations, collected in the SII-CLI execution environment (see the sketch after this list). Tasks span “vibe coding” (interactive software development) and research workflows (search, analysis, experiment design).
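The paper's exact data schema isn't reproduced in the article; the following is a minimal Python sketch, with all class and field names assumed for illustration, of what one such multi-turn trajectory record could look like when flattened into SFT training text:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Step:
    """One turn of the agent loop: reasoning, an optional tool call, its observation."""
    reasoning: str                      # the model's plan/analysis for this turn
    tool_call: Optional[str] = None     # e.g. a CLI command; None for pure-reasoning turns
    observation: Optional[str] = None   # environment output fed back to the model

@dataclass
class Trajectory:
    """A full multi-turn workflow (~13k-152k tokens end to end, ~42.4k on average)."""
    query: str                          # practitioner- or PR-derived task statement
    task_type: str                      # e.g. "vibe_coding" or "research_workflow"
    steps: List[Step] = field(default_factory=list)

    def to_sft_text(self) -> str:
        """Flatten the trajectory into a single training string for SFT."""
        parts = [f"USER: {self.query}"]
        for s in self.steps:
            parts.append(f"ASSISTANT (reasoning): {s.reasoning}")
            if s.tool_call:
                parts.append(f"ASSISTANT (tool): {s.tool_call}")
            if s.observation:
                parts.append(f"ENV: {s.observation}")
        return "\n".join(parts)
```

The key property this illustrates is density: a single sample supervises planning, tool orchestration, and reaction to environment feedback across dozens of turns, not just one input-output pair.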

How does it work?
- Base models: GLM-4.5 (355B) and GLM-4.5-Air (106B). Training uses the slime SFT framework with identical configs across comparisons (to isolate data effects).
- Data construction: 60 real queries from practitioners + 18 synthesized from high-star GitHub PRs (with tight QA by PhD annotators). For each query, LIMI logs the full agent trajectory through to successful completion inside SII-CLI.
- Evaluation: AgencyBench (R=3 rounds) with FTFC, SR@3, RC@3 (see the metric sketch after this list); plus generalization suites (TAU2-airline/retail Pass^4, EvalPlus HE/MBPP, DS-1000, SciCode).
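AgencyBench's scoring code isn't shown in the article, and the metric definitions below are assumptions: FTFC as first-turn functional completeness, SR@3 as full success within the R=3 rounds, and RC@3 as requirement completion by round 3. Under those assumptions, a minimal aggregation sketch:

```python
from statistics import mean

def agencybench_scores(results: list[dict]) -> dict:
    """Aggregate per-task outcomes into the three headline metrics.

    Each result dict is assumed to carry:
      first_turn_complete: bool   # every requirement met on the first attempt
      solved_by_round_3:   bool   # fully solved within the R=3 rounds
      req_completion_r3:   float  # fraction of requirements met after round 3
    """
    ftfc = mean(float(r["first_turn_complete"]) for r in results)  # First-Turn Functional Completeness
    sr3 = mean(float(r["solved_by_round_3"]) for r in results)     # Success Rate within 3 rounds
    rc3 = mean(r["req_completion_r3"] for r in results)            # Requirement Completion at round 3
    return {"FTFC": ftfc, "SR@3": sr3, "RC@3": rc3, "avg": mean([ftfc, sr3, rc3])}

# Sanity check against LIMI's reported numbers:
# mean([0.717, 0.746, 0.742]) ≈ 0.735, i.e. the 73.5% headline average.
```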

Results
- AgencyBench (avg): 73.5%. LIMI vs. GLM-4.5 (+28.4 pts); FTFC 71.7% vs 37.8%; SR@3 74.6% vs 47.4%.
- Data efficiency: LIMI (78 samples) outperforms GLM-4.5 trained on AFM-CodeAgent SFT (10,000 samples): 73.5% vs 47.8%, a 25.7-point absolute gain (+53.7% relative) with 128× less data (verified in the snippet after this list). Similar gaps hold vs AFM-WebAgent (7,610 samples) and CC-Bench-Traj (260 samples).
- Generalization: Across tool use, coding, and scientific computing, LIMI averages ~57%, exceeding GLM-4.5 and other baselines; without tool access, LIMI still leads slightly (50.0% vs 48.7% for GLM-4.5), indicating intrinsic gains beyond environment tooling.
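The efficiency claim itself reduces to simple arithmetic; a quick sanity check using only the numbers quoted above:

```python
limi_samples, baseline_samples = 78, 10_000
limi_score, baseline_score = 73.5, 47.8  # AgencyBench averages: LIMI vs AFM-CodeAgent SFT

print(f"data reduction: {baseline_samples / limi_samples:.0f}x")                # ~128x
print(f"absolute gain:  {limi_score - baseline_score:.1f} points")              # 25.7 points
print(f"relative gain:  {(limi_score - baseline_score) / baseline_score:.1%}")  # ~53.8%
# The "+53.7%" sometimes quoted for this comparison is the relative improvement
# over the 47.8% baseline (rounding differences aside), not an absolute point gap.
```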

Key Takeaways
- Data efficiency dominates scale. LIMI reaches a 73.5% average on AgencyBench using curated trajectories, surpassing GLM-4.5 (45.1%) and showing a 25.7-point (+53.7% relative) advantage over a 10k-sample SFT baseline, with 128× fewer samples.
- Trajectory quality, not bulk. Training data are long-horizon, tool-grounded workflows in collaborative software development and scientific research, collected via the SII-CLI execution stack referenced by the paper.
- Cross-metric gains. On AgencyBench, LIMI reports FTFC 71.7%, SR@3 74.6%, and a strong RC@3, with detailed tables showing large margins over baselines; the generalization suites (TAU2, EvalPlus-HE/MBPP, DS-1000, SciCode) average 57.2%.
- Works across scales. Fine-tuning both GLM-4.5 (355B) and GLM-4.5-Air (106B) yields large deltas over their bases, indicating the method is robust to model size.
Our Comments
The research team trains GLM-4.5 variants on 78 curated, long-horizon, tool-grounded trajectories captured in a CLI environment spanning software-engineering and research tasks. It reports a 73.5% average on AgencyBench across the FTFC, RC@3, and SR@3 metrics; baseline GLM-4.5 is reported at 45.1%. A comparison against a 10,000-sample AFM-CodeAgent SFT baseline shows 73.5% vs 47.8%, and tool-free evaluation indicates intrinsic gains (≈50.0% for LIMI vs 48.7% for GLM-4.5). Trajectories are multi-turn and token-dense, emphasizing planning, tool orchestration, and verification.
Check out the Paper, GitHub Page and Model Card on Hugging Face.