Meet Harness-1: A 20B Retrieval Subagent Trained With Reinforcement Learning Inside a Stateful Search Harness on gpt-oss-20b
Most search agents are trained as policies over a growing transcript. The model decides what to search. It must also remember what it saw, which evidence matters, and which claims it checked. A team of researchers from University of Illinois Urbana-Champaign, UC Berkeley, and Chroma argues this asks too much. Reinforcement learning ends up optimizing both search decisions and routine bookkeeping at once.
Their answer is Harness-1, a 20B retrieval subagent built on gpt-oss-20b. It was trained with reinforcement learning inside a stateful search harness. The harness holds the bookkeeping. The policy keeps the semantic decisions. The weights and harness code are publicly released.

What is Harness-1 Actually
Harness-1 produces a ranked set of documents for a downstream answering model. It does not answer questions itself. It runs inside a state-machine harness centered on a per-episode WORKINGMEMORY.
Each turn works as a loop. The harness renders compact search state along with recent actions. The model emits one structured action. The harness executes it, updates state, and renders the next observation.
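The turn loop can be sketched in a few lines of Python. This is a minimal illustration, not the released harness: `HarnessState`, `render`, `execute`, `run_episode`, and `toy_policy` are hypothetical names, only two of the tools are stubbed, and the 40-turn default borrows the cap reported for RL training.

```python
from dataclasses import dataclass, field


@dataclass
class HarnessState:
    # Hypothetical, heavily reduced stand-in for the per-episode state.
    recent_actions: list = field(default_factory=list)
    curated: list = field(default_factory=list)
    done: bool = False


def render(state, keep_last=3):
    """Build a compact observation: curated set plus the latest actions."""
    return {
        "curated": list(state.curated),
        "recent_actions": state.recent_actions[-keep_last:],
    }


def execute(state, action):
    """Apply one structured action and update the state in place."""
    name = action["tool"]
    if name == "curate":
        state.curated.append(action["doc_id"])
    elif name == "end_search":
        state.done = True
    state.recent_actions.append(name)


def run_episode(policy, max_turns=40):
    """Render -> act -> execute -> update, until end_search or the turn cap."""
    state = HarnessState()
    for _ in range(max_turns):
        observation = render(state)
        action = policy(observation)
        execute(state, action)
        if state.done:
            break
    return state.curated


def toy_policy(observation):
    # Illustrative policy: curate one document, then stop.
    if not observation["curated"]:
        return {"tool": "curate", "doc_id": "d1"}
    return {"tool": "end_search"}


print(run_episode(toy_policy))
```

The point of the shape is that the policy only ever sees the rendered observation; everything else lives in the harness state.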
The Stateful Harness: What Moves Out of the Policy
The research team calls its principle stateful cognitive offloading. The policy decides what to search, curate, and verify, and when to stop. The harness maintains the recoverable state around those decisions.
That state includes several pieces. A candidate pool holds compressed, deduplicated documents. An importance-tagged curated set is the final output, capped at 30 documents. Tags take four values: very_high, high, fair, or low. A full-text store keeps every retrieved chunk outside the prompt.
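As a rough picture of these pieces, the sketch below models the three stores as plain dictionaries. The class and method names are hypothetical, and rejecting new documents once the cap is reached is an assumption; the article does not say how the real harness handles overflow.

```python
from dataclasses import dataclass, field

TAGS = ("very_high", "high", "fair", "low")  # importance tags, strongest first
MAX_CURATED = 30                             # cap on the curated set


@dataclass
class WorkingMemory:
    candidate_pool: dict = field(default_factory=dict)  # doc_id -> compressed text
    curated: dict = field(default_factory=dict)         # doc_id -> importance tag
    full_text: dict = field(default_factory=dict)       # doc_id -> full chunk (kept out of the prompt)

    def add_candidate(self, doc_id, compressed, full):
        """Store a retrieved chunk once; repeats are ignored."""
        if doc_id not in self.candidate_pool:
            self.candidate_pool[doc_id] = compressed
            self.full_text[doc_id] = full

    def curate(self, doc_id, tag):
        """Tag a document; returns False if the cap blocks a new entry (assumed behavior)."""
        if tag not in TAGS:
            raise ValueError(f"unknown importance tag: {tag}")
        if doc_id not in self.curated and len(self.curated) >= MAX_CURATED:
            return False
        self.curated[doc_id] = tag
        return True

    def ranked(self):
        """Curated doc ids ordered from very_high down to low."""
        order = {tag: rank for rank, tag in enumerate(TAGS)}
        return sorted(self.curated, key=lambda d: order[self.curated[d]])
```

Re-tagging an already curated document is always allowed here, which matches the promote-and-demote style of curation the article describes.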
An evidence graph adds structure. A regex extractor scans each chunk for proper nouns, years, and dates. The harness then renders frequent entities, bridge documents, and singletons. Bridge documents contain two or more frequent entities. Singletons appear in a single document and suggest follow-up leads.
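A minimal sketch of this kind of graph follows. The regex and the "appears in at least two documents" threshold for a frequent entity are assumptions made for illustration; the actual patterns and thresholds used by Harness-1 are not given in the article.

```python
import re
from collections import defaultdict

# Assumed pattern: runs of capitalized words, or four-digit years (19xx/20xx).
ENTITY_RE = re.compile(r"\b(?:[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*|(?:19|20)\d{2})\b")


def build_evidence_graph(chunks, min_docs=2):
    """Return (frequent entities, bridge documents, singleton entities)."""
    entity_docs = defaultdict(set)
    for doc_id, text in chunks.items():
        for entity in ENTITY_RE.findall(text):
            entity_docs[entity].add(doc_id)

    # Frequent: seen in at least `min_docs` documents (assumed threshold).
    frequent = {e for e, docs in entity_docs.items() if len(docs) >= min_docs}
    # Singletons: seen in exactly one document.
    singletons = {e for e, docs in entity_docs.items() if len(docs) == 1}

    # Bridges: documents containing two or more frequent entities.
    frequent_per_doc = defaultdict(set)
    for entity in frequent:
        for doc_id in entity_docs[entity]:
            frequent_per_doc[doc_id].add(entity)
    bridges = {d for d, ents in frequent_per_doc.items() if len(ents) >= 2}
    return frequent, bridges, singletons


chunks = {
    "a": "in 2019 Acme hired Jane Doe",
    "b": "Acme filed again in 2019",
    "c": "later Globex acquired a rival",
}
frequent, bridges, singletons = build_evidence_graph(chunks)
```

A pattern this simple also picks up capitalized sentence openers, which is one concrete form of the gap between regex extraction and full entity linking.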
The policy works through eight tools. These are fan_out_search, search_corpus, grep_corpus, read_document, review_docs, curate, verify, and end_search. Search outputs are compressed with sentence-BM25, keeping the top four sentences. Two-level deduplication removes repeats by chunk ID and content fingerprint.
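Both steps can be approximated with the standard library. The sketch below scores each sentence of a chunk against the query with a textbook BM25 formula and keeps the top four in their original order, then drops repeats by chunk ID and by a hash of the normalized text. The tokenizer, the `k1` and `b` values, and the SHA-1 fingerprint are assumptions, not details from the Harness-1 release.

```python
import hashlib
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def compress(query, chunk, top_k=4, k1=1.5, b=0.75):
    """Keep the top_k sentences of `chunk` by BM25 score against `query`."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", chunk) if s.strip()]
    docs = [tokenize(s) for s in sentences]
    if not docs:
        return ""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))  # sentence frequency per term
    query_terms = set(tokenize(query))

    def score(doc):
        tf = Counter(doc)
        total = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            total += idf * tf[term] * (k1 + 1) / norm
        return total

    best = sorted(range(n), key=lambda i: score(docs[i]), reverse=True)[:top_k]
    return " ".join(sentences[i] for i in sorted(best))  # restore reading order


def fingerprint(text):
    """Content fingerprint over normalized tokens (assumed scheme)."""
    return hashlib.sha1(" ".join(tokenize(text)).encode("utf-8")).hexdigest()


def dedupe(results):
    """Two-level dedup: skip a result if its chunk ID or its fingerprint was seen."""
    seen_ids, seen_prints, kept = set(), set(), []
    for chunk_id, text in results:
        fp = fingerprint(text)
        if chunk_id in seen_ids or fp in seen_prints:
            continue
        seen_ids.add(chunk_id)
        seen_prints.add(fp)
        kept.append((chunk_id, text))
    return kept
```

Restoring reading order after ranking keeps the compressed text readable, though it cannot recover context from the sentences that were dropped.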
One design choice addresses cold starts. The first successful search auto-seeds the curated set with eight reranked results at fair importance. The policy then promotes strong documents and removes weak ones. This turns the task from building from scratch into refinement.
The research team names three requirements for a trainable harness. These are warm-started curation, compact derived-state rendering, and diversity-preserving incentives. Harness-1 implements all three.
How It is Trained
Training splits along the same line as the harness. Supervised fine-tuning teaches the model to operate the interface. Reinforcement learning improves search decisions over the maintained state.
A single teacher, GPT-5.4, runs live inside the full harness. After filtering, 899 trajectories remain for SFT. The model uses LoRA at rank 32 for three epochs. The step-550 checkpoint initializes RL.
RL uses on-policy CISPO with a 40-turn cap and terminal-only reward. It trains only on SEC queries. Groups with identical rewards are dropped from the gradient. Training ran on Tinker.
The reward separates discovery from selection. It also adds a tool-diversity bonus. Without that bonus, the agent collapses to repeated search. Curated recall then plateaus near 0.53. With the bonus, diversity stabilizes and recall reaches about 0.60.
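One way to see why identical-reward groups are dropped: with a group-relative baseline, every rollout in such a group has zero advantage and contributes no learning signal. The sketch below assumes a common mean-and-standard-deviation normalization per query group; the exact advantage formula used with CISPO here is not stated in the article, and `group_advantages` is a hypothetical helper.

```python
import statistics


def group_advantages(groups, eps=1e-6):
    """Map query id -> normalized advantages, skipping zero-variance groups."""
    kept = {}
    for query_id, rewards in groups.items():
        if max(rewards) - min(rewards) < eps:
            # All rollouts scored the same: advantages would all be zero.
            continue
        mean = statistics.fmean(rewards)
        std = statistics.pstdev(rewards)
        kept[query_id] = [(r - mean) / (std + eps) for r in rewards]
    return kept


groups = {
    "q1": [0.5, 0.5, 0.5],  # identical terminal rewards, dropped
    "q2": [0.2, 0.8],       # mixed outcomes, kept
}
advantages = group_advantages(groups)
```

Filtering these groups up front avoids spending gradient steps on batches that cannot move the policy.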
The Benchmark Case
Harness-1 was evaluated on eight benchmarks spanning web, finance, patents, and multi-hop QA. The primary metric is curated recall: coverage of relevant documents in the final set. Trajectory recall counts evidence encountered anywhere in the episode.
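The two metrics differ only in which document set is scored. A minimal sketch, with made-up document IDs and a hypothetical `recall` helper:

```python
def recall(retrieved, relevant):
    """Fraction of relevant documents that appear in `retrieved`."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved)) / len(relevant)


relevant = {"d1", "d2", "d3", "d4"}
seen_in_episode = ["d1", "d2", "d3", "d9"]  # everything the agent encountered
curated_set = ["d1", "d2"]                  # what it kept in the final output

trajectory_recall = recall(seen_in_episode, relevant)  # 0.75
curated_recall = recall(curated_set, relevant)         # 0.5
```

The gap between the two numbers measures evidence the agent found but failed to keep, which is why the table reports both columns.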
| Model | Type | Avg Curated Recall | Avg Trajectory Recall |
|---|---|---|---|
| Harness-1 (20B) | Open small | 0.730 | 0.807 |
| Tongyi DeepResearch 30B | Open small | 0.616 | 0.673 |
| Context-1 (20B) | Open small | 0.603 | 0.756 |
| Search-R1 (32B) | Open small | 0.289 | 0.289 |
| GPT-OSS-20B | Open small | 0.262 | 0.590 |
| Qwen3 (32B) | Open small | 0.216 | 0.446 |
| Opus-4.6 | Frontier | 0.764 | 0.794 |
| GPT-5.4 | Frontier | 0.709 | 0.752 |
| Sonnet-4.6 | Frontier | 0.688 | 0.725 |
| Kimi-K2.5 | Frontier | 0.647 | 0.794 |
| GPT-OSS-120B | Frontier | 0.496 | 0.769 |
Harness-1 reaches 0.730 average curated recall. That beats the next open subagent, Tongyi DeepResearch 30B, by 11.4 points. Among the frontier searchers tested, only Opus-4.6 scores higher on average.
The transfer pattern is the clearest signal of the mechanism. SFT used four benchmark families; RL used only SEC. On these source-family tasks, Harness-1 gained 7.9 points over the nearest open baseline. On four held-out benchmarks, it gained 17.0 points. That is a 2.2x larger gain on tasks furthest from training data.
Ablations support the harness claim. Disabling all harness mechanisms drops recall by 12.2% relative on BrowseComp+. The trained policy keeps searching but cannot rank what it sees.
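The headline figures follow directly from the reported numbers. The short check below uses only values from the table and the text; the variable names are illustrative.

```python
harness_1, tongyi = 0.730, 0.616  # average curated recall from the table

# Gap over the next open subagent, in percentage points.
gap_points = round((harness_1 - tongyi) * 100, 1)  # 11.4

# Held-out gain relative to source-family gain.
held_out_gain, source_family_gain = 17.0, 7.9
ratio = round(held_out_gain / source_family_gain, 1)  # 2.2
```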

Use Cases
The method targets evidence-seeking retrieval where documents support an answer. Several workflows fit this shape.
One is literature and patent review. The evidence graph and curated set help organize many sources. Another is financial-filing analysis. The SEC case study recovers an exact executive-transition date across multiple 8-Ks.
A third is multi-hop fact-checking. The fan_out_search and verify tools resolve ambiguous entities before committing. A fourth is modular RAG. The curated set feeds a frozen generator, and better sets yield higher answer accuracy.
Strengths and Weaknesses
Strengths
- Highest average curated recall among the open models tested, and behind only Opus-4.6 overall.
- Gains hold on held-out benchmarks, suggesting domain-general search operations.
- Trained on 4,352 unique items, far fewer than several baselines.
- Open checkpoint and harness code, servable with common runtimes.
Weaknesses
- The evidence graph uses regex extraction, not full entity linking.
- The verify tool is an LLM proxy that can err on ambiguous claims.
- Sentence-BM25 compression may drop context tied to discourse structure.
- The research team reports point estimates without full confidence intervals.
Key Takeaways
- Harness-1 is a 20B search agent that moves search bookkeeping into the environment, leaving semantic decisions to the policy.
- It hits 0.730 average curated recall across eight benchmarks, beating the next open subagent by 11.4 points.
- Among the searchers tested, only Opus-4.6 scores higher on average curated recall.
- Gains are largest on held-out benchmarks (+17.0 vs +7.9 points), suggesting the learned search operations transfer.
- Weights and harness code are public, servable via vLLM, SGLang, or Transformers.
The post Meet Harness-1: A 20B Retrieval Subagent Trained With Reinforcement Learning Inside a Stateful Search Harness on gpt-oss-20b appeared first on MarkTechPost.
