
Meet Harness-1: A 20B Retrieval Subagent Trained With Reinforcement Learning Inside a Stateful Search Harness on gpt-oss-20b

Most search agents are trained as policies over a growing transcript. The model decides what to search. It must also remember what it saw, which evidence matters, and which claims it checked. A team of researchers from University of Illinois Urbana-Champaign, UC Berkeley, and Chroma argues this asks too much. Reinforcement learning ends up optimizing both search decisions and routine bookkeeping at once.

Their answer is Harness-1, a 20B retrieval subagent built on gpt-oss-20b. It was trained with reinforcement learning inside a stateful search harness. The harness holds the bookkeeping. The policy keeps the semantic decisions. The weights and harness code are publicly released.

https://arxiv.org/pdf/2606.02373

What is Harness-1 Actually

Harness-1 produces a ranked set of documents for a downstream answering model. It does not answer questions itself. It runs inside a state-machine harness centered on a per-episode WORKINGMEMORY.

Each turn works as a loop. The harness renders compact search state together with recent actions. The model emits one structured action. The harness executes it, updates state, and renders the next observation.
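The render, act, execute cycle can be sketched in a few lines. This is a minimal illustration, not the released harness code: the `WorkingMemory` fields, the `render` and `execute` helpers, the substring search, and the toy policy are all assumptions made for the example. Only the tool names `search_corpus` and `end_search` and the 40-turn cap come from the article.

```python
from dataclasses import dataclass, field


@dataclass
class WorkingMemory:
    """Hypothetical per-episode state; field names are illustrative."""
    candidates: dict = field(default_factory=dict)      # doc_id -> text
    recent_actions: list = field(default_factory=list)  # tool names, oldest first
    done: bool = False


def render(state, last_n=3):
    """Render a compact observation instead of replaying the full transcript."""
    return {
        "num_candidates": len(state.candidates),
        "recent_actions": state.recent_actions[-last_n:],
    }


def execute(state, action, corpus):
    """Apply one structured action and update the state in place."""
    if action["tool"] == "search_corpus":
        query = action["query"].lower()
        for doc_id, text in corpus.items():
            if query in text.lower():  # toy substring match, not a real retriever
                state.candidates[doc_id] = text
    elif action["tool"] == "end_search":
        state.done = True
    state.recent_actions.append(action["tool"])


def run_episode(policy, corpus, max_turns=40):
    """Loop: render state, let the policy pick one action, execute it."""
    state = WorkingMemory()
    for _ in range(max_turns):
        observation = render(state)
        action = policy(observation)
        execute(state, action, corpus)
        if state.done:
            break
    return state


def toy_policy(observation):
    """Stand-in for the model: search once, then stop."""
    if not observation["recent_actions"]:
        return {"tool": "search_corpus", "query": "merger"}
    return {"tool": "end_search"}


corpus = {"d1": "Merger announced in 2021.", "d2": "Quarterly earnings call."}
final_state = run_episode(toy_policy, corpus)
```

The point of the sketch is that the policy only ever sees the output of `render`, so everything it does not need to reason about can stay in the state object.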

The Stateful Harness: What Moves Out of the Policy

The research team calls its principle stateful cognitive offloading. The policy decides what to search, curate, and verify, and when to stop. The harness maintains the recoverable state around those decisions.

That state includes several pieces. A candidate pool holds compressed, deduplicated documents. An importance-tagged curated set is the final output, capped at 30 documents. Tags take four values: very_high, high, fair, or low. A full-text store keeps every retrieved chunk outside the prompt.

An evidence graph adds structure. A regex extractor scans each chunk for proper nouns, years, and dates. The harness then renders frequent entities, bridge documents, and singletons. Bridge documents contain two or more frequent entities. Singletons appear in a single document and suggest follow-up leads.
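A rough sketch of such an evidence graph follows. The regular expressions and the "appears in at least two documents" threshold for a frequent entity are assumptions for illustration; the article does not give the actual patterns or cutoffs.

```python
import re
from collections import defaultdict

# Assumed patterns: runs of capitalized words for proper nouns, 4-digit years.
PROPER_NOUN = re.compile(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b")
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def extract_entities(text):
    """Return the set of regex-matched entities in one chunk."""
    return set(PROPER_NOUN.findall(text)) | set(YEAR.findall(text))


def build_evidence_graph(docs, min_docs=2):
    """docs: doc_id -> text. Returns (frequent, bridges, singletons)."""
    entity_docs = defaultdict(set)
    for doc_id, text in docs.items():
        for entity in extract_entities(text):
            entity_docs[entity].add(doc_id)

    # Frequent entities appear in at least min_docs documents (assumed cutoff).
    frequent = {e for e, ids in entity_docs.items() if len(ids) >= min_docs}
    # Singletons appear in exactly one document.
    singletons = {e for e, ids in entity_docs.items() if len(ids) == 1}
    # Bridge documents contain two or more frequent entities.
    bridges = {
        doc_id
        for doc_id, text in docs.items()
        if len(extract_entities(text) & frequent) >= 2
    }
    return frequent, bridges, singletons


docs = {
    "d1": "Acme Corp hired Jane Doe in 2019.",
    "d2": "Jane Doe left Acme Corp for Globex.",
    "d3": "Globex reported results to Initech.",
}
frequent, bridges, singletons = build_evidence_graph(docs)
```

A pattern this simple also picks up ordinary capitalized sentence openers, which is the kind of noise the article later lists as a weakness of regex extraction compared with full entity linking.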

The policy works through eight tools. These are fan_out_search, search_corpus, grep_corpus, read_document, review_docs, curate, verify, and end_search. Search outputs are compressed with sentence-BM25, keeping the top four sentences. Two-level deduplication removes repeats by chunk ID and content fingerprint.
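The compression and deduplication steps might look roughly like the sketch below. It assumes a standard Okapi BM25 score with each sentence treated as a document, a naive punctuation-based sentence splitter, and a SHA-1 hash of normalized tokens as the content fingerprint. None of these details are confirmed by the article, and the function names are hypothetical.

```python
import hashlib
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def compress(query, chunk, top_k=4, k1=1.5, b=0.75):
    """Keep the top_k sentences by BM25 score, returned in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", chunk) if s.strip()]
    if len(sentences) <= top_k:
        return sentences
    docs = [tokenize(s) for s in sentences]
    n = len(docs)
    avg_len = (sum(len(d) for d in docs) / n) or 1.0
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))

    def score(doc):
        tf = Counter(doc)
        total = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            total += idf * tf[term] * (k1 + 1) / norm
        return total

    best = sorted(range(n), key=lambda i: score(docs[i]), reverse=True)[:top_k]
    return [sentences[i] for i in sorted(best)]


def fingerprint(text):
    """Content fingerprint over normalized tokens (assumed scheme)."""
    return hashlib.sha1(" ".join(tokenize(text)).encode("utf-8")).hexdigest()


def deduplicate(chunks):
    """chunks: list of (chunk_id, text). Drop repeats by ID, then by content."""
    seen_ids, seen_prints, kept = set(), set(), []
    for chunk_id, text in chunks:
        fp = fingerprint(text)
        if chunk_id in seen_ids or fp in seen_prints:
            continue
        seen_ids.add(chunk_id)
        seen_prints.add(fp)
        kept.append((chunk_id, text))
    return kept


chunk = (
    "Revenue growth was strong. The weather was mild. Revenue rose in Europe. "
    "Staff enjoyed lunch. Growth slowed in Asia. "
    "Management expects revenue growth next year."
)
kept_sentences = compress("revenue growth", chunk)
unique_chunks = deduplicate(
    [("c1", "Alpha beta."), ("c1", "Alpha beta."), ("c2", "alpha  BETA"), ("c3", "Gamma.")]
)
```

In the toy chunk, the two sentences that share no term with the query score zero and are the ones dropped.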

One design choice addresses cold starts. The first successful search auto-seeds the curated set with eight reranked results at fair importance. The policy then promotes strong documents and removes weak ones. This turns the task from building from scratch into refinement.
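One way to picture the warm start is a small container that seeds itself once and then only accepts edits. The class and method names below are hypothetical. The seed size of eight, the cap of 30, and the four importance tags follow the article.

```python
from dataclasses import dataclass, field

IMPORTANCE = ("low", "fair", "high", "very_high")  # ascending order
MAX_CURATED = 30
SEED_SIZE = 8


@dataclass
class CuratedSet:
    """Hypothetical curated set with one-time auto-seeding."""
    tags: dict = field(default_factory=dict)  # doc_id -> importance tag
    seeded: bool = False

    def seed(self, reranked_ids):
        """Warm-start from the first successful search; later calls are no-ops."""
        if self.seeded or not reranked_ids:
            return
        for doc_id in reranked_ids[:SEED_SIZE]:
            self.tags[doc_id] = "fair"
        self.seeded = True

    def set_importance(self, doc_id, importance):
        """Promote, demote, or add a document, respecting the size cap."""
        if importance not in IMPORTANCE:
            raise ValueError(f"unknown importance: {importance}")
        if doc_id not in self.tags and len(self.tags) >= MAX_CURATED:
            raise ValueError("curated set is full")
        self.tags[doc_id] = importance

    def remove(self, doc_id):
        self.tags.pop(doc_id, None)

    def ranked(self):
        """Documents ordered from most to least important."""
        return sorted(
            self.tags, key=lambda d: IMPORTANCE.index(self.tags[d]), reverse=True
        )


curated = CuratedSet()
curated.seed([f"d{i}" for i in range(10)])  # first successful search
curated.set_importance("d0", "very_high")    # promote a strong document
curated.remove("d7")                         # drop a weak one
curated.seed(["x"])                          # ignored: already seeded
```

After seeding, every policy action on the set is an edit to an existing list, which is the refinement framing the article describes.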

The research team names three requirements for a trainable harness. These are warm-started curation, compact derived-state rendering, and diversity-preserving incentives. Harness-1 implements all three.

How It is Trained

Training splits along the same line as the harness. Supervised fine-tuning teaches the model to operate the interface. Reinforcement learning improves search decisions over the maintained state.

A single teacher, GPT-5.4, runs live inside the full harness. After filtering, 899 trajectories remain for SFT. The model uses LoRA at rank 32 for three epochs. The step-550 checkpoint initializes RL.

RL uses on-policy CISPO with a 40-turn cap and terminal-only reward. It trains only on SEC queries. Groups with identical rewards are dropped from the gradient. Training ran on Tinker.
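The group-filtering step is easy to illustrate on its own. The sketch below assumes a group is the set of rollouts for one query and that advantages are mean-centered rewards within the group. That is a common group-relative setup, not something the article spells out. It does not implement CISPO itself, and the function name is hypothetical.

```python
from statistics import mean


def group_advantages(groups, tol=1e-8):
    """groups: query_id -> non-empty list of terminal rewards, one per rollout.

    Groups whose rewards are all identical are skipped: after mean-centering,
    every advantage would be zero, so they carry no learning signal.
    """
    advantages = {}
    for query_id, rewards in groups.items():
        if max(rewards) - min(rewards) <= tol:
            continue  # identical rewards: dropped from the gradient
        baseline = mean(rewards)
        advantages[query_id] = [r - baseline for r in rewards]
    return advantages


rollout_rewards = {
    "q1": [0.5, 0.5, 0.5],  # all rollouts scored the same
    "q2": [0.2, 0.6, 1.0],  # rewards differ, so the group is kept
}
advantages = group_advantages(rollout_rewards)
```

Under this assumption, dropping such groups changes nothing about the direction of the update. It only avoids spending compute on rollouts that contribute zero gradient.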

The reward separates discovery from selection. It also adds a tool-diversity bonus. Without that bonus, the agent collapses to repeated search. Curated recall then plateaus near 0.53. With the bonus, diversity stabilizes and recall reaches about 0.60.

The Benchmark Case

Harness-1 was evaluated on eight benchmarks spanning web, finance, patents, and multi-hop QA. The primary metric is curated recall: coverage of relevant documents in the final set. Trajectory recall counts evidence encountered anywhere in the episode.
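The difference between the two metrics is easiest to see on a toy episode. The sketch below assumes plain set-based recall over document IDs. The paper's exact scoring may differ, and the document IDs are made up.

```python
def recall(retrieved, relevant):
    """Fraction of relevant documents that appear in retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved)) / len(relevant)


relevant = {"d1", "d2", "d3", "d4"}         # gold evidence for the query
seen_in_episode = {"d1", "d2", "d3", "d9"}  # everything any tool call surfaced
curated = ["d1", "d2", "d9"]                # the final set handed downstream

trajectory_recall = recall(seen_in_episode, relevant)  # found 3 of 4
curated_recall = recall(curated, relevant)             # kept only 2 of 4
```

Here the agent saw d3 but failed to keep it, so trajectory recall exceeds curated recall. The gap between the two numbers measures evidence that was found and then lost at the selection step.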

Model                      Type         Avg Curated Recall   Avg Trajectory Recall
Harness-1 (20B)            Open small   0.730                0.807
Tongyi DeepResearch 30B    Open small   0.616                0.673
Context-1 (20B)            Open small   0.603                0.756
Search-R1 (32B)            Open small   0.289                0.289
GPT-OSS-20B                Open small   0.262                0.590
Qwen3 (32B)                Open small   0.216                0.446
Opus-4.6                   Frontier     0.764                0.794
GPT-5.4                    Frontier     0.709                0.752
Sonnet-4.6                 Frontier     0.688                0.725
Kimi-K2.5                  Frontier     0.647                0.794
GPT-OSS-120B               Frontier     0.496                0.769

Averages across eight benchmarks, from Figure 1 of the paper. Frontier models run as zero-shot retrievers under the Context-1 harness.

Harness-1 reaches 0.730 average curated recall. That beats the next open subagent, Tongyi DeepResearch 30B, by 11.4 points. Among the frontier searchers tested, only Opus-4.6 scores higher on average.

The transfer pattern is the clearest signal of the mechanism. SFT used four benchmark families; RL used only SEC. On those source-family tasks, Harness-1 gained 7.9 points over the nearest open baseline. On four held-out benchmarks, it gained 17.0 points. That is a 2.2x larger gain on tasks furthest from training data.
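The headline figures can be checked directly from the reported numbers. The snippet uses only values quoted in this article; the variable names are illustrative.

```python
harness_1_curated = 0.730
tongyi_curated = 0.616

# Lead over the next open subagent, in recall points (hundredths).
gap_points = round((harness_1_curated - tongyi_curated) * 100, 1)

# Held-out gain relative to source-family gain.
held_out_gain = 17.0
source_family_gain = 7.9
transfer_ratio = round(held_out_gain / source_family_gain, 1)
```

The unrounded ratio is about 2.15, which rounds to the 2.2x quoted above.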

Ablations support the harness claim. Disabling all harness mechanisms drops recall by 12.2% relative on BrowseComp+. The trained policy keeps searching but cannot rank what it sees.

Use Cases

The method targets evidence-seeking retrieval where documents support an answer. Several workflows fit this shape.

One is literature and patent review. The evidence graph and curated set help organize many sources. Another is financial-filing analysis. The SEC case study recovers an exact executive-transition date across multiple 8-Ks.

A third is multi-hop fact-checking. The fan_out_search and verify tools resolve ambiguous entities before committing. A fourth is modular RAG. The curated set feeds a frozen generator, and better sets yield higher answer accuracy.

Strengths and Weaknesses

Strengths

  • Highest average curated recall among the open models tested, and behind only Opus-4.6 overall.
  • Gains hold on held-out benchmarks, suggesting domain-general search operations.
  • Trained on 4,352 unique items, far fewer than several baselines.
  • Open checkpoint and harness code, servable with common runtimes.

Weaknesses

  • The evidence graph uses regex extraction, not full entity linking.
  • The verify tool is an LLM proxy that can err on ambiguous claims.
  • Sentence-BM25 compression may drop context tied to discourse structure.
  • The research team reports point estimates without full confidence intervals.

Key Takeaways

  • Harness-1 is a 20B search agent that moves search bookkeeping into the environment, leaving semantic decisions to the policy.
  • It hits 0.730 average curated recall across eight benchmarks, beating the next open subagent by 11.4 points.
  • Among the searchers tested, only Opus-4.6 scores higher on average curated recall.
  • Gains are largest on held-out benchmarks (+17.0 vs +7.9 points), suggesting the learned search operations transfer.
  • Weights and harness code are public, servable through vLLM, SGLang, or Transformers.

Marktechpost’s Visual Explainer

Stateful Search Agents
Research Guide

Harness-1: a 20B search agent with a stateful harness

A retrieval subagent trained with reinforcement learning inside a search harness that holds the bookkeeping.

20B · gpt-oss-20b base
UIUC · UC Berkeley · Chroma
arXiv:2606.02373
Open weights & code

The Core Idea

Split the work between policy and harness

Most search agents pack search decisions and routine bookkeeping into one growing transcript. Harness-1 separates the two. The paper calls this stateful cognitive offloading.

Policy decides
  • What to search
  • Which documents to keep
  • What claims to verify
  • When to stop

Harness maintains
  • Candidate pool
  • Curated evidence
  • Verification records
  • Context budget

Inside the Harness

Environment-side working memory

  • Candidate pool: compressed, deduplicated documents
  • Curated set: importance-tagged, capped at 30 (very_high / high / fair / low)
  • Evidence graph: entities, bridges, and singletons through regex extraction
  • Verification cache: maps each claim and document to a yes/no verdict
  • Full-text store: every retrieved chunk kept outside the prompt
  • Compression: sentence-BM25 keeps the top four sentences

Policy Actions

Eight tools edit the state

fan_out_search
search_corpus
grep_corpus
read_document
review_docs
curate
verify
end_search

The first successful search auto-seeds the curated set with eight reranked documents at fair importance. The policy then promotes strong documents and removes weak ones.

Training

SFT to operate the interface, RL to search

SFT: GPT-5.4 teacher inside the harness · 899 trajectories · LoRA rank 32 · step-550 checkpoint
RL: on-policy CISPO · SEC queries only · 40-turn cap · terminal reward · trained on Tinker
Data scale: 4,352 unique training items (899 SFT + 3,453 RL)

Three trainability requirements: warm-started curation, compact derived-state rendering, and diversity-preserving incentives.

Results

What the numbers show

0.730 average curated recall across eight benchmarks
+11.4 pts over the next open subagent, Tongyi DeepResearch 30B
Among the searchers tested, only Opus-4.6 scores higher on average
Transfer: +17.0 on held-out vs +7.9 on source-family (2.2x gap)
Ablation: removing all harness mechanisms drops recall 12.2% relative

Get Started

Run it yourself

Serve: vLLM, SGLang, or Transformers
Checkpoint: pat-jj/harness-1 (Hugging Face, 21B params, BF16)
Code: github.com/pat-jj/harness-1
Paper: arXiv:2606.02373

Harness-1 returns a curated set of documents for a downstream answering model. It does not answer questions itself.
