Hexo Labs Open-Sources SIA: A Self-Improving Agent That Updates Both the Harness and the Model Weights
Most AI agents stop improving once a human stops tuning them. The model is fixed. The scaffold around it is fixed. Hexo Labs wants to move both at once. It released SIA (Self-Improving AI) this week as an open-source framework under an MIT license.
The core claim of this research is narrow but concrete. SIA edits both the agent’s scaffold and the model’s weights within one self-improving loop.
What is SIA (Self-Improving AI)
SIA splits a task-specific agent into two parts. The first is the harness, also called the scaffold. That covers the system prompt, tool-dispatch logic, retry policy, and answer-extraction code. The second part is the model weights themselves.
Three LLM components drive the loop. A Meta-Agent writes the initial scaffold from a task specification and any reference code. A Task-Specific Agent runs the task and logs every step. A Feedback-Agent then reads that full trajectory and decides what to change.
That decision is the key idea. After each run, the Feedback-Agent picks one of two actions. It can rewrite the scaffold while the weights stay fixed. Or it can trigger a weight update while the scaffold stays fixed.
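The control flow described above can be sketched in a few lines. Everything here is hypothetical: the class and function names, the toy reward, and the `allow_weight_updates` flag (which models a harness-only run by turning a requested weight update into a scaffold rewrite) are illustrative assumptions, not the `sia-agent` API. In SIA the decision comes from an LLM reading the trajectory; here `decide` is a plain function.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Harness:
    """Hypothetical scaffold state: a prompt plus one tunable policy knob."""
    system_prompt: str
    max_retries: int = 1


@dataclass(frozen=True)
class Weights:
    """Stand-in for LoRA adapter state; 'version' counts weight updates."""
    version: int = 0


@dataclass
class Trajectory:
    steps: List[str]
    reward: float


def run_loop(harness: Harness, weights: Weights,
             run_task: Callable[[Harness, Weights], Trajectory],
             decide: Callable[[Trajectory], str],
             rewrite_harness: Callable[[Harness, Trajectory], Harness],
             update_weights: Callable[[Weights, Trajectory], Weights],
             iterations: int,
             allow_weight_updates: bool = True) -> Tuple[Harness, Weights, List[float]]:
    rewards = []
    for _ in range(iterations):
        traj = run_task(harness, weights)             # Task-Specific Agent: act and log each step
        rewards.append(traj.reward)
        action = decide(traj)                         # Feedback-Agent: read the full trajectory
        if action == "weights" and allow_weight_updates:
            weights = update_weights(weights, traj)   # scaffold stays fixed
        else:
            harness = rewrite_harness(harness, traj)  # weights stay fixed
    return harness, weights, rewards


# Toy stand-ins for the LLM-driven components (all hypothetical).
def toy_run(h: Harness, w: Weights) -> Trajectory:
    return Trajectory(steps=[f"retries={h.max_retries}", f"adapter=v{w.version}"],
                      reward=h.max_retries + 2 * w.version)


def toy_decide(traj: Trajectory) -> str:
    return "harness" if traj.reward < 3 else "weights"


def toy_rewrite(h: Harness, traj: Trajectory) -> Harness:
    return replace(h, max_retries=h.max_retries + 1)


def toy_update(w: Weights, traj: Trajectory) -> Weights:
    return replace(w, version=w.version + 1)
```

With these stand-ins, `run_loop(Harness("solve"), Weights(), toy_run, toy_decide, toy_rewrite, toy_update, 4)` records rewards `[1, 2, 3, 5]`: two scaffold rewrites, then two weight updates. Each iteration moves exactly one lever, mirroring the either/or choice the article describes.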
The base model is openai/gpt-oss-120b. Weight updates use LoRA (low-rank adaptation) adapters at rank 32. The Meta-Agent and Feedback-Agent both run on Claude Sonnet 4.6. Training runs on H100 GPUs through Modal, the team’s RL platform.
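LoRA keeps each adapted weight matrix frozen and learns a low-rank correction on top of it. Below is a minimal sketch of the standard formulation from the LoRA paper. It assumes the common `alpha / r` scaling; the article does not state alpha or which projections SIA adapts. The matrix sizes in the usage note are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w0: Matrix, a: Matrix, b: Matrix, x: List[float],
                 alpha: float, r: int) -> List[float]:
    """h = W0 x + (alpha / r) * B (A x).

    W0: d_out x d_in (frozen). A: r x d_in, B: d_out x r (trainable).
    """
    base = matvec(w0, x)
    delta = matvec(b, matvec(a, x))  # never materialises the d_out x d_in product B A
    scale = alpha / r
    return [h + scale * d for h, d in zip(base, delta)]


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    # A contributes r * d_in trainable parameters, B contributes d_out * r.
    return r * (d_in + d_out)
```

For a hypothetical 4096 × 4096 projection, `lora_param_count(4096, 4096, 32)` gives 262,144 trainable parameters against 16,777,216 in the frozen matrix, about 1.6%. B is conventionally initialised to zero, so the adapter starts as a no-op and the model initially behaves exactly like the base model.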
The research team labels its two operating points SIA-H and SIA-W+H. SIA-H uses harness updates only. SIA-W+H adds weight updates on top.

The Benchmark Case
The research team tested SIA on three deliberately different domains, and the pattern held across all three: weight updates added gains beyond what scaffold editing alone reached. “Initial” is the base model running through the Meta-Agent’s first scaffold, before any feedback.
| Task | Initial | Prev. SOTA | SIA-H (harness only) | SIA-W+H (harness + weights) |
|---|---|---|---|---|
| LawBench (top-1 acc) | 13.5% | 45.0% | 50.0% | 70.1% |
| AlphaEvolve TriMul (reward) | 0.105 | 1.292 | 0.120 | 1.475 |
| Denoising (mse_norm) | 0.048 | 0.240 | 0.241 | 0.289 |
On LawBench, the task is 191-class Chinese criminal-charge classification. Harness iteration built a TF-IDF plus LinearSVC pipeline and plateaued at 50.0%. Weight updates via PPO then pushed accuracy to 70.1%, a 20.1 percentage-point gain over the best harness-only result.
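A TF-IDF plus linear SVM pipeline is a classic text-classification baseline. In scikit-learn it would typically be a `TfidfVectorizer` feeding a `LinearSVC`. To stay dependency-free, the sketch below reimplements the idea with the standard library, under stated assumptions. It uses character-bigram tokens, since the article does not say how the harness segmented Chinese text. The classifier is a one-vs-rest linear SVM trained by plain subgradient descent on the L2-regularised hinge loss, not the liblinear solver `LinearSVC` uses. The class name, toy data and hyperparameters are illustrative only.

```python
import math
from collections import Counter, defaultdict


def char_bigrams(text):
    """Character bigrams: Chinese has no spaces, so this tokenisation is an assumption."""
    return [text[i:i + 2] for i in range(len(text) - 1)] or [text]


class TfidfLinearSVM:
    """One-vs-rest linear SVM (L2-regularised hinge loss, plain SGD) on TF-IDF features."""

    def __init__(self, lr=0.1, lam=1e-3, epochs=20):
        self.lr, self.lam, self.epochs = lr, lam, epochs

    def _vectorize(self, text):
        # Term counts weighted by smoothed IDF, then L2-normalised (sparse dict).
        counts = Counter(char_bigrams(text))
        vec = {t: c * self.idf[t] for t, c in counts.items() if t in self.idf}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        return {t: v / norm for t, v in vec.items()}

    def _score(self, c, x):
        return sum(self.w[c].get(t, 0.0) * v for t, v in x.items()) + self.b[c]

    def fit(self, texts, labels):
        n = len(texts)
        df = Counter(t for text in texts for t in set(char_bigrams(text)))
        # Smoothed IDF in the style of scikit-learn's default: ln((1 + n) / (1 + df)) + 1.
        self.idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
        xs = [self._vectorize(t) for t in texts]
        self.classes = sorted(set(labels))
        self.w = {c: defaultdict(float) for c in self.classes}
        self.b = {c: 0.0 for c in self.classes}
        for _ in range(self.epochs):
            for x, y in zip(xs, labels):
                for c in self.classes:
                    target = 1.0 if c == y else -1.0
                    violated = target * self._score(c, x) < 1.0  # hinge loss is active
                    for t in self.w[c]:                          # L2 shrinkage step
                        self.w[c][t] *= 1.0 - self.lr * self.lam
                    if violated:
                        for t, v in x.items():
                            self.w[c][t] += self.lr * target * v
                        self.b[c] += self.lr * target
        return self

    def predict(self, text):
        x = self._vectorize(text)
        return max(self.classes, key=lambda c: self._score(c, x))
```

Trained on four toy snippets, `TfidfLinearSVM().fit(["盗窃财物", "盗窃手机", "故意伤害他人", "故意伤害致死"], ["theft", "theft", "injury", "injury"])` labels `"盗窃钱包"` as `"theft"`. A linear model like this only sees surface n-gram overlap, which is consistent with the article's point that the harness plateaued until weight updates were added.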
The TriMul task asks for a custom CUDA kernel on an H100 GPU. The kernel computes a core operation in AlphaFold2’s Evoformer module, the triangle multiplicative update. Scaffold edits reached a 1.14× speedup over baseline. Weight updates then drove runtime from 12,483 to 1,017 microseconds, a 91.9% reduction from the harness-only peak.
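In AlphaFold2 the triangle multiplicative update acts on the N × N pair representation. Its expensive part is a contraction over a third residue k. The sketch below shows only that core for the “outgoing” variant. It omits the surrounding LayerNorm, linear projections and sigmoid gates, and assumes `a` and `b` are already-gated projections. It is a naive reference for what such a kernel computes, not Hexo Labs’ kernel.

```python
from typing import List

Tensor3 = List[List[List[float]]]


def trimul_outgoing_core(a: Tensor3, b: Tensor3) -> Tensor3:
    """Naive core of the 'outgoing' triangle multiplicative update.

    a, b: N x N x C pair-representation projections (assumed already gated).
    Returns x with x[i][j][c] = sum over k of a[i][k][c] * b[j][k][c].
    """
    n, channels = len(a), len(a[0][0])
    return [[[sum(a[i][k][c] * b[j][k][c] for k in range(n))
              for c in range(channels)]
             for j in range(n)]
            for i in range(n)]


a = [[[1.0], [2.0]], [[3.0], [4.0]]]  # N = 2, C = 1 toy input
print(trimul_outgoing_core(a, a))     # [[[5.0], [11.0]], [[11.0], [25.0]]]
```

The triple loop performs N³·C multiply-adds. For each channel c it is an ordinary matrix product of a[:, :, c] with the transpose of b[:, :, c], so the operation can be treated as a batched matrix multiplication. That structure is what an optimised kernel has to exploit.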
One honest caveat appears in the same chart. The coding agent Claude Code reached 1.50× on TriMul unaided, beating SIA-H’s 1.14×. SIA-W+H still led overall at 14.02×.
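The TriMul figures are internally consistent, as a quick calculation shows. The implied baseline runtime below is derived from the rounded 1.14× figure and is not reported in the article.

```python
harness_peak_us = 12_483  # best harness-only runtime (from the article)
weights_us = 1_017        # runtime after weight updates (from the article)

reduction = (harness_peak_us - weights_us) / harness_peak_us
speedup_vs_harness = harness_peak_us / weights_us

# Derived, not reported: baseline runtime implied by the rounded 1.14x harness speedup.
implied_baseline_us = harness_peak_us * 1.14
overall_speedup = implied_baseline_us / weights_us

print(f"reduction: {reduction:.1%}")                              # reduction: 91.9%
print(f"speedup over harness peak: {speedup_vs_harness:.2f}x")    # speedup over harness peak: 12.27x
print(f"implied overall speedup: {overall_speedup:.2f}x")         # implied overall speedup: 13.99x
```

The implied 13.99× sits just under the reported 14.02×. That small gap is what you would expect if the harness speedup was about 1.142× before being rounded to 1.14×.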
For denoising, the agent tunes MAGIC, a single-cell RNA imputation method. Harness sweeps over its hyperparameters settled at 0.241 mse_norm. The first weight-update checkpoint added a two-line step that no scaffold produced: it rounded imputed counts to non-negative integers, lifting the score to 0.289.
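The post-processing step is small enough to sketch directly. Raw scRNA-seq counts are non-negative integers, and this step maps imputed values back onto that range. The function below is a hypothetical plain-Python version; the article does not show the checkpoint’s actual two lines. With NumPy, the equivalent would be `np.rint(np.clip(x, 0, None))`.

```python
from typing import List


def round_to_counts(imputed: List[List[float]]) -> List[List[int]]:
    """Hypothetical post-processing: clip each imputed value at zero, then round to an integer count."""
    return [[round(max(0.0, v)) for v in row] for row in imputed]


# Rows are cells, columns are genes (toy values).
print(round_to_counts([[-0.3, 0.4, 1.6], [2.7, 3.2, -2.0]]))  # [[0, 0, 2], [3, 3, 0]]
```

Python’s `round` (like `np.rint`) rounds exact halves to the nearest even integer, so 0.5 becomes 0 and 1.5 becomes 2.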
How the Feedback-Agent Picks Its Move
SIA doesn’t run one fixed RL recipe. The Feedback-Agent selects a training algorithm based on the reward signal it observes.
On LawBench, the reward was a clean outcome-based scalar, so it used PPO with GAE. On TriMul, most kernels didn’t compile, so it used entropic advantage weighting, a method that up-weights rare high-reward rollouts. On denoising, it used GRPO, which eliminates the value network entirely.
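Each of the three advantage estimators named above fits in a few lines. GAE and GRPO below follow their standard textbook forms. The article does not define entropic advantage weighting, so the version here is an assumption chosen to match the description of up-weighting rare high-reward rollouts: exponential tilting, with weights proportional to exp(β·r) and rescaled to average 1. Function names and default hyperparameters are illustrative.

```python
import math
from typing import List


def gae(rewards: List[float], values: List[float],
        gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Generalised Advantage Estimation for one episode (the usual PPO advantage).

    values needs len(rewards) + 1 entries; use 0.0 as the last one for a terminal state.
    """
    adv = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        adv[t] = running
    return adv


def grpo_advantages(group_rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO: the group's own mean and std replace a learned value baseline."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in group_rewards) / n)
    return [(r - mean) / (std + eps) for r in group_rewards]


def entropic_weights(rewards: List[float], beta: float = 5.0) -> List[float]:
    """Assumed form: w_i proportional to exp(beta * r_i), rescaled to average 1."""
    top = max(rewards)
    exps = [math.exp(beta * (r - top)) for r in rewards]  # shift by max for stability
    mean_exp = sum(exps) / len(exps)
    return [e / mean_exp for e in exps]
```

For a group with rewards `[0, 0, 0, 1]` (say, one compiling kernel out of four), GRPO gives the success an advantage of about +1.73. `entropic_weights` with β = 5 gives it a weight of about 3.9 and each failure about 0.03, concentrating almost all of the update on the rare success.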
The research team also lists REINFORCE with KL-to-base, DPO, and best-of-N behavioural cloning. Each maps to a different reward shape and failure risk.
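The article names these three options without detailing them. For orientation, here are their textbook objectives in minimal form. The article does not say which reward shape SIA pairs with each one, so the sketch makes no claim about that mapping. Function names and the β defaults are illustrative.

```python
import math
from typing import Dict, List, Tuple


def kl_shaped_reward(reward: float, logp_policy: float, logp_base: float,
                     beta: float = 0.1) -> float:
    """REINFORCE with KL-to-base: subtract beta times a sampled KL estimate,
    log pi(y|x) - log pi_base(y|x), before the policy-gradient step."""
    return reward - beta * (logp_policy - logp_base)


def softplus(z: float) -> float:
    # Numerically stable log(1 + exp(z)).
    return z + math.log1p(math.exp(-z)) if z > 0 else math.log1p(math.exp(z))


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """DPO: -log sigmoid(beta * [(chosen log-ratio) - (rejected log-ratio)])."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return softplus(-margin)  # -log sigmoid(m) == softplus(-m)


def best_of_n_targets(samples: Dict[str, List[Tuple[str, float]]]) -> List[Tuple[str, str]]:
    """Best-of-N behavioural cloning: keep each prompt's top-reward completion as a fine-tuning target."""
    return [(prompt, max(cands, key=lambda c: c[1])[0]) for prompt, cands in samples.items()]
```

The KL term and DPO’s reference log-ratios both anchor the policy to the base model. Best-of-N cloning needs no gradient through the reward at all, only a way to rank samples.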
Strengths and What to Watch
Strengths:
- First system to edit both scaffold and weights in a single loop, per the authors’ comparison table.
- SIA-W+H beats the prior SOTA across three unrelated domains.
- Open source under MIT, installable as sia-agent, with four bundled tasks.
- Algorithm choice is conditioned on observed rewards, not a fixed schedule.
What to Watch:
- The paper reports three tasks; broader algorithm-selection results are deferred.
- Both levers optimise the same fixed verifier, risking coupled Goodhart effects.
- The authors warn that the joint fixed point may be fragile under perturbation.
Key Takeaways
- SIA is, by the authors’ account, the first self-improving loop that edits both an agent's scaffold and its model weights.
- A Feedback-Agent reads each run's full trajectory, then picks either a harness rewrite or a weight update.
- Combining both levers beat scaffold-only editing on all three tasks: LawBench, TriMul kernels, and scRNA-seq denoising.
- Harness edits add software-engineering hygiene; weight updates surface domain knowledge that no prompt reaches.
- Open source under MIT (github.com/hexo-ai/sia), built on gpt-oss-120b with LoRA rank 32.
Check out the Repo and Research Paper.
The post Hexo Labs Open-Sources SIA: A Self-Improving Agent That Updates Both the Harness and the Model Weights appeared first on MarkTechPost.
