DeepReinforce Releases Ornith-1.0: An Open-Source Coding Model Family That Learns Its Own RL Scaffolds
DeepReinforce has launched Ornith-1.0, an open-source model family built for agentic coding. The lineup spans four sizes, from a 9B dense model to a 397B mixture-of-experts flagship. Every checkpoint ships under the MIT license on Hugging Face. The models are post-trained on top of pretrained Gemma 4 and Qwen 3.5.
Most coding agents pair a model with a fixed, human-designed harness. Ornith-1.0 instead learns to write its own. The DeepReinforce research team reports state-of-the-art results among open models of comparable size.
TL;DR
- Ornith-1.0 ships in 9B, 31B, 35B-MoE, and 397B-MoE sizes under MIT, built on Gemma 4 and Qwen 3.5.
- The model learns its own scaffold during RL, jointly optimizing the harness and the solution.
- Ornith-1.0-397B tops Claude Opus 4.7 on both headline benchmarks, but trails Opus 4.8 on both and the larger GLM-5.2-744B on Terminal-Bench 2.1.
- Three layers (fixed trust boundary, deterministic monitor, frozen LLM judge) guard against reward hacking.
What is Ornith-1.0?
Ornith-1.0 is a set of reasoning models tuned for coding agents. The variants are 9B Dense, 31B Dense, 35B MoE, and 397B MoE. The 35B model is mixture-of-experts and activates roughly 3B parameters per token. FP8 and GGUF builds are also published for faster local serving.
Each model is a reasoning model. Replies open with a <think> block before the final answer. The serving recipes enable a reasoning parser, so that trace returns in a separate reasoning_content field. The models also emit well-formed tool calls for agent loops.
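If a server runs without the reasoning parser, the trace would presumably stay inline in the reply text. The following is a minimal sketch of the split, assuming the trace is wrapped in a single leading <think>...</think> pair. The helper name split_reasoning is hypothetical and is not part of any Ornith or vLLM API.

```python
import re
from typing import Optional, Tuple

# Hypothetical helper: separates a leading <think>...</think> trace from the answer.
_THINK_RE = re.compile(r"^\s*<think>(.*?)</think>\s*", re.DOTALL)


def split_reasoning(raw_reply: str) -> Tuple[Optional[str], str]:
    """Return (reasoning, answer); reasoning is None when no trace is present."""
    match = _THINK_RE.match(raw_reply)
    if match is None:
        return None, raw_reply.strip()
    return match.group(1).strip(), raw_reply[match.end():].strip()


reasoning, answer = split_reasoning(
    "<think>Check divisors up to sqrt(n).</think>\nUse trial division."
)
print(reasoning)
print(answer)
```

With the reasoning parser enabled on the server, this client-side split should not be needed, since the two parts arrive in separate fields.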
Deployment is straightforward. The 9B model is about 19GB in bf16 and serves on a single 80GB GPU. Serving recipes target vLLM, SGLang, and Transformers. Each model exposes an OpenAI-compatible endpoint. Standard agent frameworks therefore work without code changes.
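The weight footprint follows from simple arithmetic: parameter count times bytes per parameter. The sketch below uses the nominal 9B figure, which gives 18GB in bf16. The reported figure of about 19GB suggests the checkpoint holds somewhat more than 9 billion parameters, though that is an inference rather than something the release states. The function name is illustrative, and the estimate ignores KV cache and activation memory.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Nominal parameter count; real checkpoints rarely match the headline size exactly.
NOMINAL_9B = 9e9

print(weight_memory_gb(NOMINAL_9B, 2))  # bf16 stores 2 bytes per parameter
print(weight_memory_gb(NOMINAL_9B, 1))  # FP8 stores 1 byte per parameter
```

Either way, the weights leave most of an 80GB card free for the KV cache, which matters at long context lengths.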
The Self-Scaffolding Idea
Most coding agents rely on a scaffold, also called a harness. A scaffold wraps the model with memory, tools, error handling, and orchestration logic. AI teams usually hand-design one scaffold per task class.
Ornith-1.0 treats the scaffold as a learnable object instead. During reinforcement learning, the scaffold co-evolves with the model's policy. Each RL step runs in two stages.
First, the model reads the task and its previous scaffold. It then proposes a refined scaffold. Second, it uses that scaffold and the task to generate a solution rollout. Reward from the rollout flows back to both stages.
So the model is optimized to author orchestration, not just answers. Over training, higher-reward scaffolds are mutated and selected automatically. Per-task strategies emerge without hand-engineered harness design.
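The two-stage step can be sketched in a few lines. This is a toy illustration under stated assumptions, not DeepReinforce's training code: the names two_stage_step, StageSample, and the toy_* stand-ins are all hypothetical, and plain functions stand in for the policy and the verifier.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StageSample:
    stage: str     # "scaffold" or "solution"
    text: str      # what the policy generated in that stage
    reward: float  # rollout reward credited to this stage


def two_stage_step(
    task: str,
    prev_scaffold: str,
    propose_scaffold: Callable[[str, str], str],
    solve: Callable[[str, str], str],
    score: Callable[[str, str], float],
) -> Tuple[str, List[StageSample]]:
    # Stage 1: refine the scaffold from the task and the previous scaffold.
    scaffold = propose_scaffold(task, prev_scaffold)
    # Stage 2: roll out a solution using the refined scaffold.
    solution = solve(task, scaffold)
    # One reward from the rollout is credited to both stages.
    reward = score(task, solution)
    samples = [
        StageSample("scaffold", scaffold, reward),
        StageSample("solution", solution, reward),
    ]
    return scaffold, samples


# Toy stand-ins for the policy and verifier (assumptions for illustration only).
def toy_propose(task: str, prev: str) -> str:
    return prev + " -> run tests"


def toy_solve(task: str, scaffold: str) -> str:
    return f"patch for {task} using [{scaffold}]"


def toy_score(task: str, solution: str) -> float:
    return 1.0 if "run tests" in solution else 0.0


new_scaffold, samples = two_stage_step(
    "fix bug", "read files", toy_propose, toy_solve, toy_score
)
print(new_scaffold)
print([(s.stage, s.reward) for s in samples])
```

Crediting the same reward to the scaffold tokens is what makes the scaffold itself a training target rather than fixed context.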
Training also runs asynchronously, using a pipeline-RL setup. A staleness weight downweights older, off-policy tokens and drops them past a threshold. The optimization uses a token-level GRPO objective.
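The release does not give the exact form of the staleness weight or the objective, so the sketch below makes explicit assumptions: geometric decay in the policy-version lag with a hard cutoff, group-standardized rewards as the advantage (the usual GRPO recipe), and averaging over all kept tokens in the group instead of per sequence. Clipping and any KL term are omitted for brevity. All names and default values are illustrative.

```python
from statistics import mean, pstdev
from typing import List


def staleness_weight(lag: int, decay: float = 0.5, max_lag: int = 3) -> float:
    """Assumed form: geometric decay in policy-version lag, zero past max_lag."""
    if lag > max_lag:
        return 0.0
    return decay ** lag


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: reward standardized within one task's rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_level_objective(
    ratios: List[List[float]],  # per-token importance ratios, one list per rollout
    rewards: List[float],       # one scalar reward per rollout
    lags: List[List[int]],      # per-token policy-version lag
) -> float:
    """Mean of weight * ratio * advantage over every kept token in the group."""
    advantages = group_advantages(rewards)
    total, kept = 0.0, 0
    for seq_ratios, adv, seq_lags in zip(ratios, advantages, lags):
        for ratio, lag in zip(seq_ratios, seq_lags):
            weight = staleness_weight(lag)
            if weight == 0.0:
                continue  # token is too stale, so it is dropped entirely
            total += weight * ratio * adv
            kept += 1
    return total / kept if kept else 0.0


print(staleness_weight(0), staleness_weight(2), staleness_weight(4))
print(token_level_objective([[1.0, 1.0], [1.0]], [1.0, 0.0], [[0, 5], [0]]))
```

Tracking lag per token rather than per rollout reflects that, in an asynchronous pipeline, a single rollout can span more than one policy version.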
Guarding Against Reward Hacking
Letting a model write its own scaffold invites reward hacking. A scaffold could read visible test files and hardcode expected outputs. It could also copy an oracle solution sitting in the environment. The DeepReinforce team describes three defense layers.
- The outer trust boundary is fixed and immutable. The environment, tool surface, and test isolation stay outside the model's reach. The model evolves only its inner policy scaffold.
- A deterministic monitor flags banned actions. Reading withheld paths or editing verification scripts earns zero reward. Those trajectories are excluded from the advantage computation.
- A frozen LLM judge acts as a veto. It sits on top of the verifier, not as the primary reward.
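The way the monitor and the judge could gate the verifier's reward can be sketched as follows. This is an illustration under assumptions, not the released implementation: the paths, the Trajectory type, and gated_reward are hypothetical, and the choice to keep judge-vetoed rollouts in the advantage computation is a guess, since the release only says that monitor-flagged trajectories are excluded.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical deny-lists; the release does not enumerate the banned actions.
WITHHELD_PREFIXES = ("/tests/hidden/", "/oracle/")
PROTECTED_FILES = ("/verify/run_checks.sh",)


@dataclass
class Trajectory:
    reads: List[str] = field(default_factory=list)
    writes: List[str] = field(default_factory=list)


def monitor_flags(traj: Trajectory) -> bool:
    """Deterministic check: any withheld read or verifier edit is a violation."""
    read_violation = any(p.startswith(WITHHELD_PREFIXES) for p in traj.reads)
    write_violation = any(p in PROTECTED_FILES for p in traj.writes)
    return read_violation or write_violation


def gated_reward(
    traj: Trajectory, verifier_passed: bool, judge_vetoes: bool
) -> Tuple[float, bool]:
    """Return (reward, include_in_advantage)."""
    if monitor_flags(traj):
        return 0.0, False  # zero reward and excluded from the advantage
    if judge_vetoes:
        return 0.0, True   # the veto can only cancel a pass, never grant one
    return (1.0 if verifier_passed else 0.0), True


clean = Trajectory(reads=["/repo/src/app.py"], writes=["/repo/src/app.py"])
cheat = Trajectory(reads=["/tests/hidden/test_app.py"])
print(gated_reward(clean, verifier_passed=True, judge_vetoes=False))
print(gated_reward(cheat, verifier_passed=True, judge_vetoes=False))
```

The verifier stays the only source of positive reward here, which matches the description of the judge as a veto and not a scorer.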
Benchmark
DeepReinforce reports its own numbers across several agentic coding benchmarks. They are self-reported and not independently verified. At flagship scale, Ornith-1.0-397B posts 77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified. On SWE-Bench Verified, that 82.4 trails only Claude Opus 4.8 (87.6) among the listed models. On Terminal-Bench 2.1, the picture is more mixed.
Ornith-1.0-397B beats Claude Opus 4.7 (70.3) on Terminal-Bench 2.1. But it trails Claude Opus 4.8 (85) and the larger GLM-5.2-744B (81.0). So the "state-of-the-art" claim is scoped to open models of comparable size.
The smaller models carry the efficiency case. The 35B model scores 64.2 on Terminal-Bench 2.1, above Qwen 3.5-397B's 53.5. The 9B model reaches 43.1 on Terminal-Bench 2.1 and 69.4 on SWE-Bench Verified.
| Benchmark | Ornith-1.0-397B | Qwen3.5-397B | Qwen3.7-Max | GLM-5.2-744B | Minimax-M3-428B | DeepSeek-V4-Pro-1.6T | Claude Opus 4.7 | Claude Opus 4.8 |
|---|---|---|---|---|---|---|---|---|
| Terminal-Bench 2.1 | 77.5 | 53.5 | 73.5 | 81.0 | 64 | 64 | 70.3 | 85 |
| SWE-Bench Verified | 82.4 | 76.4 | 80.4 | – | – | 80.6 | 80.8 | 87.6 |
| SWE-Bench Pro | 62.2 | 51.6 | 60.6 | 62.1 | 59 | 55.4 | 64.3 | 69.2 |
| SWE-Bench Multilingual | 78.9 | 69.3 | 78.3 | – | – | 76.2 | – | – |
| NL2Repo | 48.2 | 36.8 | 47.2 | 48.9 | 42.1 | – | – | 69.7 |
| ClawEval Avg | 77.1 | 70.7 | 65.2 | – | – | 75.8 | 78.2 | – |
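The ranking claims for the two headline benchmarks can be checked directly against the table. The short sketch below copies those two rows and lists which models score above the flagship; None marks a cell the vendor did not report, and the function name is illustrative.

```python
from typing import Dict, List, Optional

MODELS = [
    "Ornith-1.0-397B", "Qwen3.5-397B", "Qwen3.7-Max", "GLM-5.2-744B",
    "Minimax-M3-428B", "DeepSeek-V4-Pro-1.6T", "Claude Opus 4.7", "Claude Opus 4.8",
]

# Vendor-reported scores copied from the table; None means not reported.
SCORES: Dict[str, List[Optional[float]]] = {
    "Terminal-Bench 2.1": [77.5, 53.5, 73.5, 81.0, 64, 64, 70.3, 85],
    "SWE-Bench Verified": [82.4, 76.4, 80.4, None, None, 80.6, 80.8, 87.6],
}


def models_ahead(benchmark: str, model: str = "Ornith-1.0-397B") -> List[str]:
    """Names of models with a reported score strictly above the given model's."""
    row = dict(zip(MODELS, SCORES[benchmark]))
    own = row[model]
    return sorted(
        name for name, score in row.items() if score is not None and score > own
    )


for bench in SCORES:
    print(bench, models_ahead(bench))
```

Unreported cells are skipped, not treated as zero, so a model with no published score is never counted as beaten.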
Use Cases and a Quick Start
The models target terminal-native coding agents and repository-scale work. Practical fits include multi-file refactors, bug localization, and test-driven patches. The 9B model suits edge or single-GPU setups where latency and cost matter. The 397B model targets maximum accuracy on long, multi-step tasks.
For example, a developer can run the 9B model locally to triage a failing test suite. A platform team can self-host the 397B model for an internal coding agent.
Serving is a one-liner with vLLM:
vllm serve deepreinforce-ai/Ornith-1.0-9B \
  --served-model-name Ornith-1.0-9B \
  --max-model-len 262144 \
  --enable-auto-tool-choice --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3 \
  --trust-remote-code
Then call it with any OpenAI client:
from openai import OpenAI

# Point the client at the local vLLM server; the API key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Ornith-1.0-9B",
    messages=[{"role": "user", "content": "Write a Python is_prime(n)."}],
    temperature=0.6, top_p=0.95,
)
msg = resp.choices[0].message
print(getattr(msg, "reasoning_content", None))  # the <think> trace
print(msg.content)  # the final answer
The reasoning trace returns in reasoning_content, with the answer in content. Recommended sampling is temperature=0.6, top_p=0.95, top_k=20. The model also plugs into OpenHands, OpenClaw, and OpenCode.
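The recommended top_k is not a named argument of the OpenAI Python client's chat.completions.create, so it has to travel another way. A minimal sketch, assuming a vLLM server: vLLM's OpenAI-compatible endpoint accepts extra sampling parameters in the request body, which the client exposes through extra_body. The helper name build_request is hypothetical.

```python
from typing import Any, Dict


def build_request(prompt: str, model: str = "Ornith-1.0-9B") -> Dict[str, Any]:
    """Keyword arguments for chat.completions.create with the recommended sampling."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
        # top_k is outside the OpenAI schema, so it is sent via extra_body.
        "extra_body": {"top_k": 20},
    }


request = build_request("Write a Python is_prime(n).")
print(request["extra_body"])
```

With a server running, the dictionary unpacks straight into the call, as in client.chat.completions.create(**build_request("...")). Other serving stacks may expect a different mechanism for top_k.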
The post DeepReinforce Releases Ornith-1.0: An Open-Source Coding Model Family That Learns Its Own RL Scaffolds appeared first on MarkTechPost.
