NVIDIA Releases Polar, a Token-Faithful Rollout Framework for GRPO Training Across Codex, Claude Code, and Qwen Code

Reinforcement studying for language brokers is rising extra complicated. Agents now handle multi-turn device use, long-running contexts, and multi-agent orchestration. The fundamental engineering problem is connecting present agent software program to coaching pipelines with out breaking how these instruments work.

NVIDIA’s analysis staff launched Polar, a rollout framework that lets researchers run reinforcement studying over any agent harness with out modifying that harness.

The Core Problem Polar Solves

An ‘agent harness’ is a device like Codex CLI, Claude Code, Qwen Code, or Pi. These harnesses handle system prompts, device formatting, context engineering, and how the agent submits patches. These particulars straight have an effect on agent habits at analysis time.

Traditional RL infrastructure requires harness logic to be rewritten behind a framework-owned setting API — usually env.init(), env.step(), env.reset() within the OpenAI Gym type. Every new harness requires new integration code. That integration also can lose execution particulars particular to the native harness path.

Polar’s key commentary is that each LLM-based agent should name a mannequin. That mannequin API boundary is a widespread interface outdoors the agent itself. Instead of integrating contained in the harness, Polar locations a proxy at that boundary.

How the Proxy Works

For every incoming mannequin request, the gateway proxy performs 4 steps:

Detect the supplier API — utilizing the request path and headers, it distinguishes Anthropic Messages, OpenAI Chat Completions, OpenAI Responses, and Google generateContent-style calls.
Normalize the request — converts roles, content material elements, device definitions, and technology parameters into the OpenAI Chat Completions form utilized by the native inference server.
Capture token-level knowledge — shops request messages, response messages, immediate token IDs, sampled response token IDs, end motive, and log chances.
Return the supplier form — transforms the response again into the schema the harness expects.

For streaming requests, Polar obtains a non-streaming upstream response and emits a artificial provider-shaped stream. This preserves compatibility with harnesses that count on server-sent occasions whereas guaranteeing full token seize.

The solely required change to an present harness is pointing its mannequin base URL on the gateway.

Architecture: Rollout Server and Gateway Nodes

Polar has two core elements:

The rollout server accepts a JobRequest and expands it into num_samples impartial periods. Each session carries a session ID, activity ID, timeout price range, runtime specification, agent specification, trajectory builder, evaluator, and callback URL. The server dispatches periods to gateway nodes and accepts callbacks when periods full.

Gateway nodes personal the lifecycle of every session — beginning the runtime, operating the harness, constructing trajectories, evaluating output, and teardown. The gateway additionally hosts the proxy endpoint for that session’s mannequin calls, preserving completion seize tied to the session registry.

Within every gateway, remoted employee swimming pools deal with INIT, RUNNING, and POSTRUN phases. A bounded READY buffer holds initialized runtimes till a run slot is obtainable. CPU-heavy runtime preparation and evaluator prewarm proceed off the crucial path, with out blocking energetic GPU-bound agent execution. If a harness instances out after mannequin calls have been captured, the gateway nonetheless enters POSTRUN so partial traces may be recovered.

Built-in evaluators embody a session-completion reward, a configurable test-on-output evaluator, and a SWE-Bench/SWE-Gym harness evaluator. Custom evaluators may be added by way of a registry interface.

Polar at the moment helps Docker and rootless Apptainer runtimes. Built-in harness shortcuts embody codex, claude_code, gemini_cli, qwen_code, opencode, and pi.

Trajectory Reconstruction: Per Request vs. Prefix Merging

After a session completes, Polar reconstructs trainable trajectories from captured mannequin calls.

Two methods can be found:

The per_request builder treats each mannequin name as one impartial hint. It is lossless per particular person name however fragments multi-turn periods. A single coding drawback can produce a whole lot of per-request traces, rising the burden on downstream trainers.

The prefix_merging builder reconstructs longer traces the place the harness session preserves append-only dialog histories. It partitions completions into ordered chains by verifying a strict token-prefix relation between adjoining completions. Sub-agents, context compaction boundaries, and parallel agent branches naturally type separate chains. Within every merged hint, solely sampled assistant tokens are marked trainable. Canonical interstitial tokens obtain a loss masks of zero.

Ablation Results

The analysis staff benchmarks each methods on the identical mannequin, {hardware}, and topology over three coaching steps.

Metric	`per_request`	`prefix_merging`
Trainer updates	1,185	218
Wall-clock time	189.5 min	35.2 min
Speedup	—	5.39×
Avg. rollout GPU utilization	20.4%	87.7%

SWE-Bench Verified Results

Training makes use of commonplace GRPO on the Qwen3.5-4B base mannequin. The dataset is SkyRL-v0-293-data SWE-Gym (293 duties, 1 epoch, rollout batch dimension 4, 16 samples per immediate) with the Slime coach. All experiments use prefix_merging for trajectory building.

Training Rollout Reward Progress (cross@1)

Harness	First 10 Steps	Last 10 Steps
Codex	9.5%	54.5%
Claude Code	28.8%	67.0%
Qwen Code	61.6%	66.0%
Pi	61.6%	76.2%

SWE-Bench Verified Final Scores

Harness	Base	Polar RL	Gain
Codex	3.8%	26.4%	+22.6 pts
Claude Code	29.8%	34.6%	+4.8 pts
Qwen Code	34.6%	35.2%	+0.6 pts
Pi	34.2%	40.4%	+6.2 pts

The largest acquire is underneath Codex. Codex presents an unfamiliar motion protocol and patch-submission type to a Qwen mannequin not initially educated on that harness. Polar attaches the reward sign to the precise sampled tokens flowing by way of the Codex execution path, so GRPO optimizes the habits the mannequin makes use of at analysis time. Under the native Qwen Code harness, the place the bottom mannequin is already well-aligned, Polar nonetheless delivers a 0.6 level acquire.

Offline SFT Data Generation

Polar also can function a distributed offline knowledge technology service with no modifications to the runtime. The analysis staff demonstrates this utilizing Qwen3.5-122B-A10B on an 8×H100 server (TP=8, max_model_len=32,768) with the pi harness towards 1,638 cases from seven SWE-Gym repositories.

A trajectory is accepted into the SFT corpus provided that the SWE-Bench analysis harness confirms the agent’s patch resolves each FAIL_TO_PASS take a look at and leaves each PASS_TO_PASS take a look at inexperienced.

Repository	Attempts	Accepted	Rate
getmoto/moto	343	184	53.6%
python/mypy	257	101	39.3%
conan-io/conan	71	27	38.0%
pydantic/pydantic	81	24	29.6%
iterative/dvc	219	45	20.5%
pandas-dev/pandas	477	98	19.7%
dask/dask	141	25	17.7%
Total	1,638	504	30.8%

The run value roughly 64 GPU-hours. Accepted trajectories common 104 messages per session and 51 assistant turns.

Framework Comparison

System	Async RL	Async Rollout Staging	Rollout as Service	Harness Agnostic
Polar	✓	✓	✓	✓
ProRL Agent	✓	✓	✓	✗
SkyRL-Agent	✓	✓	✗	partial
PRIME-RL	✓	✗	✗	✗
Agent Lightning	partial	✗	partial	partial
rLLM	partial	✗	✗	✗
OpenClaw-RL	✓	✗	✗	partial

Polar is the one system on this comparability with first-class assist throughout all 4 properties.

Strengths and Limitations

Strengths

No harness code modifications required — the proxy intercepts on the mannequin API boundary
Provider-agnostic: helps Anthropic, OpenAI Chat, OpenAI Responses, and Google API codecs natively
prefix_merging reduces coach updates from 1,185 to 218 and cuts wall-clock time 5.39×
Works for each on-line RL and offline SFT knowledge technology with the identical runtime
Harness-native RL delivers giant good points for unfamiliar execution paths — 22.6 pts on Codex
Partial traces are recovered when a harness instances out mid-session
Released as open supply underneath NeMo Gym

Limitations

Reward design, evaluator high quality, and distribution shift stay the researcher’s duty
Requires the harness to assist a configurable mannequin base URL
Token-level seize depends upon the serving stack supplying dependable token IDs and log chances
per_request technique produced reward hacking in experiments attributable to noisy credit score task on the session stage; session normalization and PRM-style credit score task are on the roadmap

Marktechpost’s Visual Explainer

NVIDIA Research
Polar — Agentic RL Framework

arXiv:2605.24220

NeMo Gym — May 2026

Polar: Agentic RL
on Any Harness

NVIDIA’s rollout framework trains LLM brokers by way of RL with out modifying their harnesses. A mannequin API proxy captures token-level interactions and reconstructs trainer-ready trajectories.

GRPO Training
Token-Faithful Trajectories
SWE-Bench Verified
Apache-2.0
NeMo Gym

01 — The Problem

Why RL Integration With Agent Harnesses Is Hard

Harnesses like Codex CLI, Claude Code, Qwen Code, and Pi handle system prompts, device formatting, and patch submission. Traditional RL requires rewriting this logic behind a framework-owned setting API.

Every new harness requires new integration code

Systems like SkyRL-Agent and PRIME-RL require brokers to evolve to RL infrastructure, not the opposite method round.

Integration loses native execution particulars

Rewriting a harness behind an env API can drop context insurance policies, device schemas, and orchestration logic that matter at eval time.

Polar’s key perception

Every LLM-based agent should name a mannequin. Polar locations a proxy at that API boundary as an alternative of integrating contained in the harness.

02 — The Proxy

How Polar Captures LLM Calls (4 Steps)

The solely change to an present harness is pointing its mannequin base URL on the gateway.

Detect the supplier API

Distinguishes Anthropic Messages, OpenAI Chat Completions, OpenAI Responses, and Google generateContent from request path and headers.

Normalize the request

Converts roles, content material elements, device definitions, and technology parameters into the OpenAI Chat Completions form for the native inference server.

Capture token-level knowledge

Stores request messages, response messages, immediate token IDs, sampled response token IDs, end motive, and log chances.

Return the supplier form

Transforms the response again into the schema the harness expects. Streaming requests obtain a artificial provider-shaped stream.

03 — Architecture

Rollout Server & Gateway Nodes

Rollout Server

Accepts a JobRequest, expands into num_samples periods. Each session carries session ID, activity ID, timeout, runtime spec, agent spec, trajectory builder, evaluator, and callback URL. Dispatches to gateways and tracks standing.

Gateway Nodes

Own the complete session lifecycle: begin runtime — run harness — construct trajectories — consider — teardown. Worker swimming pools INIT / READY / RUNNING / POSTRUN run in isolation. Times-out gracefully; partial traces are recovered.

Runtimes: Docker & rootless Apptainer

Built-in harnesses:

codex
claude_code
gemini_cli
qwen_code
opencode
pi

Built-in evaluators:

session-completion reward
test-on-output
SWE-Bench / SWE-Gym harness

04 — Trajectory Reconstruction

per_request vs. prefix_merging

per_request

Every mannequin name turns into one hint. Lossless per name however fragments multi-turn periods. One coding drawback can produce a whole lot of traces. Produces reward hacking at session stage attributable to noisy credit score task.

prefix_merging

Reconstructs longer traces by way of strict token-prefix relation. Sub-agents, context compaction, and parallel branches type separate chains. Only sampled tokens are trainable; interstitials are loss-masked to zero.

Ablation — similar mannequin, {hardware} & topology, 3 coaching steps

Metric	per_request	prefix_merging
Trainer updates	1,185	218
Wall-clock time	189.5 min	35.2 min
Speedup	—	5.39×
Avg. rollout GPU util.	20.4%	87.7%

05 — SWE-Bench Verified Results

GRPO on Qwen3.5-4B Across Four Harnesses

SkyRL-v0-293-data — 293 duties — 1 epoch — batch dimension 4 — 16 samples/immediate — Slime coach — prefix_merging

Harness	Base	Polar RL	Gain
Codex	3.8%	26.4%	+22.6 pts
Claude Code	29.8%	34.6%	+4.8 pts
Qwen Code	34.6%	35.2%	+0.6 pts
Pi	34.2%	40.4%	+6.2 pts

+22.6
pts acquire on Codex
(3.8% → 26.4%)

5.39×
sooner coaching with
prefix_merging

06 — Offline SFT Data Generation

Generating SFT Trajectories at Scale

Qwen3.5-122B-A10B — 8×H100 (TP=8, max_model_len=32,768) — pi harness — 1,638 cases — ~64 GPU-hours — Apache-2.0

Repository	Attempts	Accepted	Rate
getmoto/moto	343	184	53.6%
python/mypy	257	101	39.3%
conan-io/conan	71	27	38.0%
pydantic/pydantic	81	24	29.6%
iterative/dvc	219	45	20.5%
pandas-dev/pandas	477	98	19.7%
dask/dask	141	25	17.7%
Total	1,638	504	30.8%

Avg. 104 messages/session — 51 assistant turns — 90/10 prepare/take a look at cut up by repository

07 — Key Takeaways

What Engineers Should Know

Polar trains LLM brokers by way of a mannequin API proxy — no harness code modifications required.
Supports Anthropic Messages, OpenAI Chat Completions, OpenAI Responses, and Google generateContent APIs natively.
prefix_merging cuts coach updates from 1,185 to 218 and wall-clock time by 5.39× vs. per_request.
GRPO on Qwen3.5-4B improves SWE-Bench Verified by as much as 22.6 pts (Codex) throughout all 4 harnesses.
Works for on-line RL and offline SFT knowledge technology with the identical runtime — no orchestration modifications wanted.
Reward design, evaluator high quality, and distribution shift stay the researcher’s duty.
Code: github.com/NVIDIA-NeMo/ProRL-Agent-Server — registered as a NeMo Gym setting.

1 / 8

Marktechpost — AI Research, Simplified for Engineers
arXiv:2605.24220

Key Takeaways

Polar trains LLM brokers by way of a mannequin API proxy — no harness code modifications required
Supports Anthropic Messages, OpenAI Chat Completions, OpenAI Responses, and Google generateContent APIs
Using GRPO on Qwen3.5-4B, Polar improves SWE-Bench Verified by as much as 22.6 factors throughout 4 coding harnesses
prefix_merging trajectory reconstruction delivers a 5.39× wall-clock speedup over per_request
Generated 504 accepted SFT trajectories from 1,638 makes an attempt (30.8%) at ~64 GPU-hours; launched underneath Apache-2.0
Rewrites ProRL Agent; registered as a NeMo Gym setting

Check out the Paper and GitHub Repo. Also, be happy to comply with us on Twitter and don’t neglect to affix our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

Need to accomplice with us for selling your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar and so on.? Connect with us

The publish NVIDIA Releases Polar, a Token-Faithful Rollout Framework for GRPO Training Across Codex, Claude Code, and Qwen Code appeared first on MarkTechPost.

NVIDIA Releases Polar, a Token-Faithful Rollout Framework for GRPO Training Across Codex, Claude Code, and Qwen Code

The Core Problem Polar Solves

How the Proxy Works

Architecture: Rollout Server and Gateway Nodes