
Hexo Labs Open-Sources SIA: A Self-Improving Agent That Updates Both the Harness and the Model Weights

Most AI agents stop improving once a human stops tuning them. The model is fixed. The scaffold around it is fixed. Hexo Labs wants to move both at once. It released SIA (Self-Improving AI) this week as an open-source framework under an MIT license.

The core claim of this research is narrow but concrete. SIA edits both the agent’s scaffold and the model’s weights inside one self-improving loop.

What is SIA (Self-Improving AI)

SIA splits a task-specific agent into two parts. The first is the harness, also called the scaffold. That covers the system prompt, tool-dispatch logic, retry policy, and answer-extraction code. The second part is the model weights themselves.
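To make the four harness pieces concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `Harness` class, the `CALL name: argument` tool syntax, and the `ANSWER:` pattern are hypothetical and are not taken from the SIA codebase. The `model` callable stands in for the weights, which the harness never touches.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Harness:
    """Hypothetical container for the four harness pieces (not SIA's API)."""
    system_prompt: str                        # system prompt
    tools: Dict[str, Callable[[str], str]]    # tool-dispatch table
    max_retries: int                          # retry policy
    answer_pattern: str                       # answer-extraction regex

    def extract_answer(self, text: str) -> Optional[str]:
        match = re.search(self.answer_pattern, text)
        return match.group(1) if match else None

    def run(self, model: Callable[[str], str], question: str) -> Optional[str]:
        prompt = f"{self.system_prompt}\n\n{question}"
        for _ in range(self.max_retries + 1):
            reply = model(prompt)
            # Assumed tool-call syntax: "CALL <tool name>: <argument>".
            call = re.match(r"CALL (\w+): (.*)", reply)
            if call and call.group(1) in self.tools:
                result = self.tools[call.group(1)](call.group(2))
                prompt += f"\n[tool result] {result}"
                continue
            answer = self.extract_answer(reply)
            if answer is not None:
                return answer
        return None  # retries exhausted without a parsable answer


def stub_model(prompt: str) -> str:
    """Toy stand-in for the model: calls a tool once, then answers."""
    if "[tool result]" in prompt:
        return "ANSWER: 4"
    return "CALL add: 2 2"


harness = Harness(
    system_prompt="Use tools when needed, then reply with 'ANSWER: <value>'.",
    tools={"add": lambda arg: str(sum(int(x) for x in arg.split()))},
    max_retries=2,
    answer_pattern=r"ANSWER:\s*(\S+)",
)
result = harness.run(stub_model, "What is 2 + 2?")
print(result)
```

In this framing, a harness update means changing any field of `Harness` or the body of `run`, while a weight update means swapping what `model` does.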

Three LLM components drive the loop. A Meta-Agent writes the initial scaffold from a task specification and any reference code. A Task-Specific Agent runs the task and logs every step. A Feedback-Agent then reads that full trajectory and decides what to change.

That decision is the key idea. After each run, the Feedback-Agent picks one of two actions. It can rewrite the scaffold while the weights stay fixed. Or it can trigger a weight update while the scaffold stays fixed.
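The control flow described here can be sketched in a few lines of Python. This is a toy under stated assumptions, not the SIA implementation: `AgentState`, `run_task`, `choose_action`, and `self_improve` are hypothetical names, and the alternating rule in `choose_action` merely stands in for the LLM-based Feedback-Agent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AgentState:
    scaffold_version: int = 0   # bumped by a harness update
    adapter_version: int = 0    # bumped by a weight update
    history: List[str] = field(default_factory=list)


def run_task(state: AgentState) -> List[str]:
    """Stand-in for the Task-Specific Agent: returns a logged trajectory."""
    return [f"ran with scaffold v{state.scaffold_version}, "
            f"adapter v{state.adapter_version}"]


def choose_action(trajectory: List[str], generation: int) -> str:
    """Stand-in for the Feedback-Agent.

    The real component is an LLM that reads the trajectory; this toy rule
    simply alternates so that both branches are exercised.
    """
    return "harness" if generation % 2 == 0 else "weights"


def self_improve(generations: int) -> AgentState:
    state = AgentState()
    for gen in range(generations):
        trajectory = run_task(state)
        action = choose_action(trajectory, gen)
        if action == "harness":
            state.scaffold_version += 1   # rewrite scaffold, weights fixed
        else:
            state.adapter_version += 1    # train weights, scaffold fixed
        state.history.append(action)
    return state


final_state = self_improve(5)
print(final_state.history)
```

Each generation changes exactly one of the two version counters, which mirrors the rule that one lever moves while the other is held fixed.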

The base model is openai/gpt-oss-120b. Weight updates use LoRA, a low-rank adapter, at rank 32. The Meta-Agent and Feedback-Agent both run on Claude Sonnet 4.6. Training runs on H100 GPUs through Modal, the team’s RL platform.
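As a rough illustration of why a rank-32 adapter is cheap to train, the sketch below counts trainable parameters for one weight matrix. LoRA freezes the original d_out × d_in matrix and learns two small factors, B (d_out × r) and A (r × d_in). The 4096 × 4096 layer size is an assumed example and is not a dimension taken from gpt-oss-120b.

```python
def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in a dense d_out x d_in weight matrix."""
    return d_in * d_out


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in LoRA factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumed, illustrative layer size (not gpt-oss-120b's real shape).
d_in, d_out, rank = 4096, 4096, 32

full = full_param_count(d_in, d_out)
lora = lora_param_count(d_in, d_out, rank)
fraction = lora / full

print(f"full: {full:,}  lora: {lora:,}  fraction trained: {fraction:.4%}")
```

For this assumed layer, the adapter trains 262,144 parameters against 16,777,216 in the frozen matrix, or 1/64 of the total.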

The research team labels its two operating points SIA-H and SIA-W+H. SIA-H uses harness updates only. SIA-W+H adds weight updates on top.

https://arxiv.org/pdf/2605.27276

The Benchmark Case

The research team tested SIA on three deliberately different domains. The pattern held across all three. Weight updates added gains beyond what scaffold editing alone reached. “Initial” is the base model run through the Meta-Agent’s first scaffold, before any feedback.

Task                          Initial   Prev. SOTA   SIA-H (harness only)   SIA-W+H (harness + weights)
LawBench (top-1 acc)          13.5%     45.0%        50.0%                  70.1%
AlphaEvolve TriMul (reward)   0.105     1.292        0.120                  1.475
Denoising (mse_norm)          0.048     0.240        0.241                  0.289

On LawBench, the task is 191-class Chinese criminal charge classification. Harness iteration built a TF-IDF plus LinearSVC pipeline and plateaued at 50.0%. Weight updates via PPO then pushed accuracy to 70.1%. That is a 20.1 percentage-point gain over the harness-only best.

The TriMul task asks for a custom CUDA kernel on an H100 GPU. The kernel computes a core operation in AlphaFold2’s Evoformer module. Scaffold edits reached a 1.14× speedup over baseline. Weight updates then drove runtime from 12,483 to 1,017 microseconds. That is a 91.9% reduction from the harness-only peak.

One honest caveat appears in the same chart. The coding agent Claude Code reached 1.50× on TriMul unaided, beating SIA-H’s 1.14×. SIA-W+H still led overall at 14.02×.
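The TriMul figures can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted in the text. The baseline runtime is not reported here, so `implied_baseline_us` is an inferred value that rests on the rounded 1.14× figure.

```python
harness_us = 12_483.0     # runtime after harness-only edits (microseconds)
weights_us = 1_017.0      # runtime after weight updates (microseconds)
harness_speedup = 1.14    # harness-only speedup over baseline, as reported

# Fractional runtime reduction from the harness-only result.
reduction = 1 - weights_us / harness_us

# Speedup of the weight-updated kernel relative to the harness-only kernel.
relative_speedup = harness_us / weights_us

# Inferred baseline runtime and the overall speedup it implies.
implied_baseline_us = harness_us * harness_speedup
implied_total_speedup = implied_baseline_us / weights_us

print(f"reduction: {reduction:.1%}")
print(f"speedup vs harness-only: {relative_speedup:.2f}x")
print(f"implied overall speedup: {implied_total_speedup:.2f}x")
```

The reduction comes out at 91.9%, matching the text. The implied overall speedup lands just under 14×, which is consistent with the reported 14.02× once rounding of the 1.14× figure is taken into account.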

For denoising, the agent tunes MAGIC, a single-cell RNA imputation method. Harness sweeps over its hyperparameters settled at 0.241 mse_norm. The first weight-update checkpoint added a two-line step that no scaffold produced. It rounded imputed counts to non-negative integers, lifting the score to 0.289.
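The post-processing step described above is small enough to sketch directly. This is an assumed reconstruction in plain Python, not the code the model actually emitted: `round_to_counts` is a hypothetical name, and the real step presumably operates on an array returned by MAGIC.

```python
from typing import List


def round_to_counts(imputed: List[List[float]]) -> List[List[int]]:
    """Round each imputed value to the nearest integer and clip negatives to zero.

    Python's round() uses round-half-to-even, so 2.5 becomes 2.
    """
    return [[max(0, round(value)) for value in row] for row in imputed]


# Toy imputed expression matrix (cells x genes), values invented for the demo.
imputed = [
    [0.2, 1.7, -0.4],
    [3.49, 2.6, 0.0],
]
counts = round_to_counts(imputed)
print(counts)
```

The step changes no hyperparameter of the imputation itself. It only constrains the output to the form that raw count data takes.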

How the Feedback-Agent Picks Its Move

SIA does not run one fixed RL recipe. The Feedback-Agent selects a training algorithm based on the reward signal it observes.

On LawBench, the reward was a clean outcome-based scalar, so it used PPO with GAE. On TriMul, most kernels did not compile, so it used entropic advantage weighting. That method up-weights rare high-reward rollouts. On denoising, it used GRPO, which eliminates the value network entirely.
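The three estimators differ mainly in how they turn rewards into a training signal. The sketch below shows textbook forms of GAE and of GRPO's group-relative advantage. The `entropic_weights` function is an assumption: it uses a softmax over rewards with a temperature `beta`, which is one common way to up-weight rare high-reward rollouts and may differ from the paper's exact formulation.

```python
import math
from statistics import mean, pstdev
from typing import List


def gae(rewards: List[float], values: List[float],
        gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Generalised Advantage Estimation.

    `values` needs one more entry than `rewards` (the bootstrap value).
    This is why PPO with GAE requires a learned value network.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def grpo_advantages(group_rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Group-relative advantage: normalise each reward within its own group.

    No value network is needed; the group mean acts as the baseline.
    """
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]


def entropic_weights(rewards: List[float], beta: float = 1.0) -> List[float]:
    """Assumed form: softmax(reward / beta), so rare high rewards dominate."""
    top = max(rewards)  # subtract the max for numerical stability
    exps = [math.exp((r - top) / beta) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


gae_out = gae([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
grpo_out = grpo_advantages([0.0, 0.0, 1.0, 1.0])
# Three failed rollouts (reward 0) and one rare success (reward 10).
entropic_out = entropic_weights([0.0, 0.0, 0.0, 10.0])
print(gae_out, grpo_out, entropic_out)
```

With three zero-reward rollouts and one success, nearly all of the entropic weight falls on the single success, which is the behaviour the TriMul setting calls for when most kernels fail to compile.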

The research team also lists REINFORCE with KL-to-base, DPO, and best-of-N behavioural cloning. Each maps to a different reward shape and failure risk.

Strengths and What to Watch

Strengths:

  • First system to edit both scaffold and weights in a single loop, per the authors’ comparison table.
  • Consistent gains over prior SOTA across three unrelated domains.
  • Open source under MIT, installable as sia-agent, with four bundled tasks.
  • Algorithm choice is conditioned on observed rewards, not a fixed schedule.

What to Watch:

  • The research reports three tasks; broader algorithm-selection results are deferred.
  • Both levers optimise the same fixed verifier, risking coupled Goodhart effects.
  • The researchers warn that the joint fixed point may be fragile under perturbation.

Marktechpost’s Visual Explainer

Hexo Labs · Open Source (MIT)
SIA: Self-Improving AI
Harness + Weight Updates

A self-improving loop that edits both an agent’s scaffold and its model weights, without further human tuning.

gpt-oss-120b
LoRA rank 32
3 benchmarks
Claude Sonnet 4.6 agents

The Gap

Two silos, working in isolation

Harness school

Edit the scaffold

A meta-agent rewrites prompts, tools, and retry logic. The model weights stay fixed.

Test-time training

Edit the weights

An RL pipeline updates the model on task feedback. The harness stays fixed.


SIA closes the gap by moving both levers inside one loop.

Anatomy

What SIA actually is

  • Harness (scaffold): the system prompt, tool-dispatch logic, retry policy, and answer-extraction code.
  • Weights: the model’s own parameters, adapted with LoRA at rank 32.
  • Three LLM components drive the loop: a Meta-Agent, a Task-Specific Agent, and a Feedback-Agent.

The Loop

One loop, two levers

After each run, the Feedback-Agent reads the full trajectory and picks one action.

Action A

Harness update

Rewrite the scaffold. Weights are held fixed.

Action B

Weight update

Train LoRA weights. The scaffold is held fixed.


The two levers interleave freely, not in locked sequential phases.

Evidence

Benchmark results

Task                          Initial   Prev. SOTA   SIA-H    SIA-W+H
LawBench (top-1 acc)          13.5%     45.0%        50.0%    70.1%
AlphaEvolve TriMul (reward)   0.105     1.292        0.120    1.475
Denoising (mse_norm)          0.048     0.240        0.241    0.289

SIA-W+H (harness + weights) beat SIA-H (harness only) on all three tasks.

Mechanism

How the Feedback-Agent picks its move

  • LawBench: a clean outcome-based reward, so it used PPO with GAE. Accuracy reached 70.1%.
  • TriMul: most kernels fail to compile, so it used entropic advantage weighting. Runtime hit 1,017 µs.
  • Denoising: it used GRPO, which eliminates the value network. Score rose to 0.289.
  • Also available: REINFORCE + KL-to-base, DPO, and best-of-N behavioural cloning.

RQ2

What each lever changes

Harness

Externalised changes

Software-engineering improvements: new tools, tighter parsers, retry logic.

Weights

Internalised knowledge

Domain knowledge no prompt reaches: H100 kernel patterns, an integer-rounding step.


The harness shapes how the agent searches; weight updates change what the model knows.

The Honest Read

Limitations to keep in view

  • Both levers optimise the same fixed verifier, risking a coupled co-evolutionary Goodhart effect.
  • Fixed points can look robust on the verifier yet remain fragile under perturbation.
  • The paper reports three tasks; broader algorithm-selection results are deferred.
  • A separate 350× superintelligence claim in launch coverage does not appear in the paper.

Get Started

Run it yourself

Open source under MIT at hexo-ai/sia. Built on gpt-oss-120b with LoRA rank 32.

# install the Claude backend
pip install 'sia-agent[claude]'
export ANTHROPIC_API_KEY="..."

# run 5 self-improvement generations on a bundled task
sia --task lawbench --max_gen 5 --run_id 1

Four bundled tasks ship in the box: gpqa, lawbench, longcot-chess, spaceship-titanic.



Source: Hebbar et al., SIA: Self Improving AI with Harness & Weight Updates (arXiv:2605.27276)
github.com/hexo-ai/sia

Key Takeaways

  • SIA is the first self-improving loop that edits both an agent's scaffold and its model weights.
  • A Feedback-Agent reads each run's full trajectory, then picks a harness rewrite or a weight update.
  • Combining both levers beat scaffold-only on all three tasks: LawBench, TriMul kernels, scRNA-seq denoising.
  • Harness edits add software-engineering hygiene; weight updates surface domain knowledge no prompt reaches.
  • Open source under MIT (hexo-ai/sia), built on gpt-oss-120b with LoRA rank 32.



The post Hexo Labs Open-Sources SIA: A Self-Improving Agent That Updates Both the Harness and the Model Weights appeared first on MarkTechPost.
