A New Agency-Focused Supervision Approach Scales Software AI Agents With Only 78 Examples

Do curated, tool-grounded demonstrations build stronger software agents than broad piles of generic instruction data? A team of researchers from Shanghai Jiao Tong University and SII Generative AI Research Lab (GAIR) proposes LIMI (“Less Is More for Agency”), a supervised fine-tuning method that turns a base model into a capable software/research agent using just 78 samples. LIMI scores a 73.5% average on AgencyBench (FTFC 71.7, RC@3 74.2, SR@3 74.6), beating strong baselines (GLM-4.5 45.1, Qwen3-235B-A22B 27.5, Kimi-K2 24.1, DeepSeek-V3.1 11.9) and even surpassing variants trained on 10,000 samples, with 128× less data.

What exactly is new?
- Agency Efficiency Principle: LIMI posits that agentic competence scales more with data quality and structure than with raw sample count. The research team fine-tunes GLM-4.5/GLM-4.5-Air on 78 long-horizon, tool-use trajectories (samples) and reports large gains on AgencyBench and generalization suites (TAU2-bench, EvalPlus-HE/MBPP, DS-1000, SciCode).
- Minimal but dense supervision. Each trajectory (~13k–152k tokens; ~42.4k avg.) captures a full multi-turn workflow: model reasoning, tool calls, and environment observations, collected in the SII-CLI execution environment (see the sketch after this list). Tasks span “vibe coding” (interactive software development) and research workflows (search, analysis, experiment design).
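The paper's exact data schema isn't reproduced in the article; the following is a minimal Python sketch, with all class and field names assumed for illustration, of what one such multi-turn trajectory record could look like when flattened into SFT training text:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Step:
    """One turn of the agent loop: reasoning, an optional tool call, its observation."""
    reasoning: str                      # the model's plan/analysis for this turn
    tool_call: Optional[str] = None     # e.g. a CLI command; None for pure-reasoning turns
    observation: Optional[str] = None   # environment output fed back to the model

@dataclass
class Trajectory:
    """A full multi-turn workflow (~13k-152k tokens end to end, ~42.4k on average)."""
    query: str                          # practitioner- or PR-derived task statement
    task_type: str                      # e.g. "vibe_coding" or "research_workflow"
    steps: List[Step] = field(default_factory=list)

    def to_sft_text(self) -> str:
        """Flatten the trajectory into a single training string for SFT."""
        parts = [f"USER: {self.query}"]
        for s in self.steps:
            parts.append(f"ASSISTANT (reasoning): {s.reasoning}")
            if s.tool_call:
                parts.append(f"ASSISTANT (tool): {s.tool_call}")
            if s.observation:
                parts.append(f"ENV: {s.observation}")
        return "\n".join(parts)
```

The key property this illustrates is density: a single sample supervises planning, tool orchestration, and reaction to environment feedback across dozens of turns, not just one input-output pair.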

How does it work?
- Base models: GLM-4.5 (355B) and GLM-4.5-Air (106B). Training uses the slime SFT framework with identical configs across comparisons (to isolate data effects).
- Data construction: 60 real queries from practitioners + 18 synthesized from high-star GitHub PRs (with tight QA by PhD annotators). For each query, LIMI logs the full agent trajectory through to successful completion inside SII-CLI.
- Evaluation: AgencyBench (R=3 rounds) with FTFC, SR@3, RC@3 (see the metric sketch after this list); plus generalization suites (TAU2-airline/retail Pass^4, EvalPlus HE/MBPP, DS-1000, SciCode).
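AgencyBench's scoring code isn't shown in the article, and the metric definitions below are assumptions: FTFC as first-turn functional completeness, SR@3 as full success within the R=3 rounds, and RC@3 as requirement completion by round 3. Under those assumptions, a minimal aggregation sketch:

```python
from statistics import mean

def agencybench_scores(results: list[dict]) -> dict:
    """Aggregate per-task outcomes into the three headline metrics.

    Each result dict is assumed to carry:
      first_turn_complete: bool   # every requirement met on the first attempt
      solved_by_round_3:   bool   # fully solved within the R=3 rounds
      req_completion_r3:   float  # fraction of requirements met after round 3
    """
    ftfc = mean(float(r["first_turn_complete"]) for r in results)  # First-Turn Functional Completeness
    sr3 = mean(float(r["solved_by_round_3"]) for r in results)     # Success Rate within 3 rounds
    rc3 = mean(r["req_completion_r3"] for r in results)            # Requirement Completion at round 3
    return {"FTFC": ftfc, "SR@3": sr3, "RC@3": rc3, "avg": mean([ftfc, sr3, rc3])}

# Sanity check against LIMI's reported numbers:
# mean([0.717, 0.746, 0.742]) ≈ 0.735, i.e. the 73.5% headline average.
```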

Results
- AgencyBench (avg): 73.5%. LIMI vs. GLM-4.5 (+28.4 pts); FTFC 71.7% vs 37.8%; SR@3 74.6% vs 47.4%.
- Data efficiency: LIMI (78 samples) outperforms GLM-4.5 trained on AFM-CodeAgent SFT (10,000 samples): 73.5% vs 47.8%, a 25.7-point absolute gain (+53.7% relative) with 128× less data (verified in the snippet after this list). Similar gaps hold vs AFM-WebAgent (7,610 samples) and CC-Bench-Traj (260 samples).
- Generalization: Across tool use, coding, and scientific computing, LIMI averages ~57%, exceeding GLM-4.5 and other baselines; without tool access, LIMI still leads slightly (50.0% vs 48.7% for GLM-4.5), indicating intrinsic gains beyond environment tooling.
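The efficiency claim itself reduces to simple arithmetic; a quick sanity check using only the numbers quoted above:

```python
limi_samples, baseline_samples = 78, 10_000
limi_score, baseline_score = 73.5, 47.8  # AgencyBench averages: LIMI vs AFM-CodeAgent SFT

print(f"data reduction: {baseline_samples / limi_samples:.0f}x")                # ~128x
print(f"absolute gain:  {limi_score - baseline_score:.1f} points")              # 25.7 points
print(f"relative gain:  {(limi_score - baseline_score) / baseline_score:.1%}")  # ~53.8%
# The "+53.7%" sometimes quoted for this comparison is the relative improvement
# over the 47.8% baseline (rounding differences aside), not an absolute point gap.
```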

Key Takeaways
- Data efficiency dominates scale. LIMI reaches a 73.5% average on AgencyBench using curated trajectories, surpassing GLM-4.5 (45.1%) and showing a 25.7-point (+53.7% relative) advantage over a 10k-sample SFT baseline, with 128× fewer samples.
- Trajectory quality, not bulk. Training data are long-horizon, tool-grounded workflows in collaborative software development and scientific research, collected via the SII-CLI execution stack referenced by the paper.
- Cross-metric gains. On AgencyBench, LIMI reports FTFC 71.7%, SR@3 74.6%, and a strong RC@3, with detailed tables showing large margins over baselines; the generalization suites (TAU2, EvalPlus-HE/MBPP, DS-1000, SciCode) average 57.2%.
- Works across scales. Fine-tuning both GLM-4.5 (355B) and GLM-4.5-Air (106B) yields large deltas over their bases, indicating the method is robust to model size.
Our Comments
The research team trains GLM-4.5 variants on 78 curated, long-horizon, tool-grounded trajectories captured in a CLI environment spanning software-engineering and research tasks. It reports a 73.5% average on AgencyBench across the FTFC, RC@3, and SR@3 metrics; baseline GLM-4.5 is reported at 45.1%. A comparison against a 10,000-sample AFM-CodeAgent SFT baseline shows 73.5% vs 47.8%, and tool-free evaluation indicates intrinsic gains (≈50.0% for LIMI vs 48.7% for GLM-4.5). Trajectories are multi-turn and token-dense, emphasizing planning, tool orchestration, and verification.
Check out the Paper, GitHub Page and Model Card on Hugging Face.