|

NVIDIA Releases Polar, a Token-Faithful Rollout Framework for GRPO Training Across Codex, Claude Code, and Qwen Code

Reinforcement studying for language brokers is rising extra complicated. Agents now handle multi-turn device use, long-running contexts, and multi-agent orchestration. The fundamental engineering problem is connecting present agent software program to coaching pipelines with out breaking how these instruments work.

NVIDIA’s analysis staff launched Polar, a rollout framework that lets researchers run reinforcement studying over any agent harness with out modifying that harness.

The Core Problem Polar Solves

An ‘agent harness’ is a device like Codex CLI, Claude Code, Qwen Code, or Pi. These harnesses handle system prompts, device formatting, context engineering, and how the agent submits patches. These particulars straight have an effect on agent habits at analysis time.

Traditional RL infrastructure requires harness logic to be rewritten behind a framework-owned setting API — usually env.init(), env.step(), env.reset() within the OpenAI Gym type. Every new harness requires new integration code. That integration also can lose execution particulars particular to the native harness path.

Polar’s key commentary is that each LLM-based agent should name a mannequin. That mannequin API boundary is a widespread interface outdoors the agent itself. Instead of integrating contained in the harness, Polar locations a proxy at that boundary.

How the Proxy Works

For every incoming mannequin request, the gateway proxy performs 4 steps:

  1. Detect the supplier API — utilizing the request path and headers, it distinguishes Anthropic Messages, OpenAI Chat Completions, OpenAI Responses, and Google generateContent-style calls.
  2. Normalize the request — converts roles, content material elements, device definitions, and technology parameters into the OpenAI Chat Completions form utilized by the native inference server.
  3. Capture token-level knowledge — shops request messages, response messages, immediate token IDs, sampled response token IDs, end motive, and log chances.
  4. Return the supplier form — transforms the response again into the schema the harness expects.

For streaming requests, Polar obtains a non-streaming upstream response and emits a artificial provider-shaped stream. This preserves compatibility with harnesses that count on server-sent occasions whereas guaranteeing full token seize.

The solely required change to an present harness is pointing its mannequin base URL on the gateway.

https://arxiv.org/pdf/2605.24220

Architecture: Rollout Server and Gateway Nodes

Polar has two core elements:

The rollout server accepts a JobRequest and expands it into num_samples impartial periods. Each session carries a session ID, activity ID, timeout price range, runtime specification, agent specification, trajectory builder, evaluator, and callback URL. The server dispatches periods to gateway nodes and accepts callbacks when periods full.

Gateway nodes personal the lifecycle of every session — beginning the runtime, operating the harness, constructing trajectories, evaluating output, and teardown. The gateway additionally hosts the proxy endpoint for that session’s mannequin calls, preserving completion seize tied to the session registry.

Within every gateway, remoted employee swimming pools deal with INIT, RUNNING, and POSTRUN phases. A bounded READY buffer holds initialized runtimes till a run slot is obtainable. CPU-heavy runtime preparation and evaluator prewarm proceed off the crucial path, with out blocking energetic GPU-bound agent execution. If a harness instances out after mannequin calls have been captured, the gateway nonetheless enters POSTRUN so partial traces may be recovered.

Built-in evaluators embody a session-completion reward, a configurable test-on-output evaluator, and a SWE-Bench/SWE-Gym harness evaluator. Custom evaluators may be added by way of a registry interface.

Polar at the moment helps Docker and rootless Apptainer runtimes. Built-in harness shortcuts embody codex, claude_code, gemini_cli, qwen_code, opencode, and pi.

Trajectory Reconstruction: Per Request vs. Prefix Merging

After a session completes, Polar reconstructs trainable trajectories from captured mannequin calls.

Two methods can be found:

The per_request builder treats each mannequin name as one impartial hint. It is lossless per particular person name however fragments multi-turn periods. A single coding drawback can produce a whole lot of per-request traces, rising the burden on downstream trainers.

The prefix_merging builder reconstructs longer traces the place the harness session preserves append-only dialog histories. It partitions completions into ordered chains by verifying a strict token-prefix relation between adjoining completions. Sub-agents, context compaction boundaries, and parallel agent branches naturally type separate chains. Within every merged hint, solely sampled assistant tokens are marked trainable. Canonical interstitial tokens obtain a loss masks of zero.

Ablation Results

The analysis staff benchmarks each methods on the identical mannequin, {hardware}, and topology over three coaching steps.

Metric per_request prefix_merging
Trainer updates 1,185 218
Wall-clock time 189.5 min 35.2 min
Speedup 5.39×
Avg. rollout GPU utilization 20.4% 87.7%

SWE-Bench Verified Results

Training makes use of commonplace GRPO on the Qwen3.5-4B base mannequin. The dataset is SkyRL-v0-293-data SWE-Gym (293 duties, 1 epoch, rollout batch dimension 4, 16 samples per immediate) with the Slime coach. All experiments use prefix_merging for trajectory building.

Training Rollout Reward Progress (cross@1)

Harness First 10 Steps Last 10 Steps
Codex 9.5% 54.5%
Claude Code 28.8% 67.0%
Qwen Code 61.6% 66.0%
Pi 61.6% 76.2%

SWE-Bench Verified Final Scores

Harness Base Polar RL Gain
Codex 3.8% 26.4% +22.6 pts
Claude Code 29.8% 34.6% +4.8 pts
Qwen Code 34.6% 35.2% +0.6 pts
Pi 34.2% 40.4% +6.2 pts

The largest acquire is underneath Codex. Codex presents an unfamiliar motion protocol and patch-submission type to a Qwen mannequin not initially educated on that harness. Polar attaches the reward sign to the precise sampled tokens flowing by way of the Codex execution path, so GRPO optimizes the habits the mannequin makes use of at analysis time. Under the native Qwen Code harness, the place the bottom mannequin is already well-aligned, Polar nonetheless delivers a 0.6 level acquire.

Offline SFT Data Generation

Polar also can function a distributed offline knowledge technology service with no modifications to the runtime. The analysis staff demonstrates this utilizing Qwen3.5-122B-A10B on an 8×H100 server (TP=8, max_model_len=32,768) with the pi harness towards 1,638 cases from seven SWE-Gym repositories.

A trajectory is accepted into the SFT corpus provided that the SWE-Bench analysis harness confirms the agent’s patch resolves each FAIL_TO_PASS take a look at and leaves each PASS_TO_PASS take a look at inexperienced.

Repository Attempts Accepted Rate
getmoto/moto 343 184 53.6%
python/mypy 257 101 39.3%
conan-io/conan 71 27 38.0%
pydantic/pydantic 81 24 29.6%
iterative/dvc 219 45 20.5%
pandas-dev/pandas 477 98 19.7%
dask/dask 141 25 17.7%
Total 1,638 504 30.8%

The run value roughly 64 GPU-hours. Accepted trajectories common 104 messages per session and 51 assistant turns.

Framework Comparison

System Async RL Async Rollout Staging Rollout as Service Harness Agnostic
Polar
ProRL Agent
SkyRL-Agent partial
PRIME-RL
Agent Lightning partial partial partial
rLLM partial
OpenClaw-RL partial

Polar is the one system on this comparability with first-class assist throughout all 4 properties.

Strengths and Limitations

Strengths

  • No harness code modifications required — the proxy intercepts on the mannequin API boundary
  • Provider-agnostic: helps Anthropic, OpenAI Chat, OpenAI Responses, and Google API codecs natively
  • prefix_merging reduces coach updates from 1,185 to 218 and cuts wall-clock time 5.39×
  • Works for each on-line RL and offline SFT knowledge technology with the identical runtime
  • Harness-native RL delivers giant good points for unfamiliar execution paths — 22.6 pts on Codex
  • Partial traces are recovered when a harness instances out mid-session
  • Released as open supply underneath NeMo Gym

Limitations

  • Reward design, evaluator high quality, and distribution shift stay the researcher’s duty
  • Requires the harness to assist a configurable mannequin base URL
  • Token-level seize depends upon the serving stack supplying dependable token IDs and log chances
  • per_request technique produced reward hacking in experiments attributable to noisy credit score task on the session stage; session normalization and PRM-style credit score task are on the roadmap

Marktechpost’s Visual Explainer

NVIDIA Research
Polar — Agentic RL Framework

arXiv:2605.24220

NeMo Gym — May 2026

Polar: Agentic RL
on Any Harness
NVIDIA’s rollout framework trains LLM brokers by way of RL with out modifying their harnesses. A mannequin API proxy captures token-level interactions and reconstructs trainer-ready trajectories.
GRPO Training
Token-Faithful Trajectories
SWE-Bench Verified
Apache-2.0
NeMo Gym

01 — The Problem

Why RL Integration With Agent Harnesses Is Hard
Harnesses like Codex CLI, Claude Code, Qwen Code, and Pi handle system prompts, device formatting, and patch submission. Traditional RL requires rewriting this logic behind a framework-owned setting API.
1
Every new harness requires new integration code
Systems like SkyRL-Agent and PRIME-RL require brokers to evolve to RL infrastructure, not the opposite method round.

2
Integration loses native execution particulars
Rewriting a harness behind an env API can drop context insurance policies, device schemas, and orchestration logic that matter at eval time.

3
Polar’s key perception
Every LLM-based agent should name a mannequin. Polar locations a proxy at that API boundary as an alternative of integrating contained in the harness.

02 — The Proxy

How Polar Captures LLM Calls (4 Steps)
The solely change to an present harness is pointing its mannequin base URL on the gateway.
1
Detect the supplier API
Distinguishes Anthropic Messages, OpenAI Chat Completions, OpenAI Responses, and Google generateContent from request path and headers.

2
Normalize the request
Converts roles, content material elements, device definitions, and technology parameters into the OpenAI Chat Completions form for the native inference server.

3
Capture token-level knowledge
Stores request messages, response messages, immediate token IDs, sampled response token IDs, end motive, and log chances.

4
Return the supplier form
Transforms the response again into the schema the harness expects. Streaming requests obtain a artificial provider-shaped stream.

03 — Architecture

Rollout Server & Gateway Nodes
Rollout Server
Accepts a JobRequest, expands into num_samples periods. Each session carries session ID, activity ID, timeout, runtime spec, agent spec, trajectory builder, evaluator, and callback URL. Dispatches to gateways and tracks standing.

Gateway Nodes
Own the complete session lifecycle: begin runtime — run harness — construct trajectories — consider — teardown. Worker swimming pools INIT / READY / RUNNING / POSTRUN run in isolation. Times-out gracefully; partial traces are recovered.

Runtimes: Docker & rootless Apptainer
Built-in harnesses:
codex
claude_code
gemini_cli
qwen_code
opencode
pi
Built-in evaluators:
session-completion reward
test-on-output
SWE-Bench / SWE-Gym harness

04 — Trajectory Reconstruction

per_request vs. prefix_merging
per_request
Every mannequin name turns into one hint. Lossless per name however fragments multi-turn periods. One coding drawback can produce a whole lot of traces. Produces reward hacking at session stage attributable to noisy credit score task.

prefix_merging
Reconstructs longer traces by way of strict token-prefix relation. Sub-agents, context compaction, and parallel branches type separate chains. Only sampled tokens are trainable; interstitials are loss-masked to zero.

Ablation — similar mannequin, {hardware} & topology, 3 coaching steps
Metric per_request prefix_merging
Trainer updates 1,185 218
Wall-clock time 189.5 min 35.2 min
Speedup 5.39×
Avg. rollout GPU util. 20.4% 87.7%

05 — SWE-Bench Verified Results

GRPO on Qwen3.5-4B Across Four Harnesses
SkyRL-v0-293-data — 293 duties — 1 epoch — batch dimension 4 — 16 samples/immediate — Slime coach — prefix_merging
Harness Base Polar RL Gain
Codex 3.8% 26.4% +22.6 pts
Claude Code 29.8% 34.6% +4.8 pts
Qwen Code 34.6% 35.2% +0.6 pts
Pi 34.2% 40.4% +6.2 pts
+22.6
pts acquire on Codex
(3.8% → 26.4%)
5.39×
sooner coaching with
prefix_merging

06 — Offline SFT Data Generation

Generating SFT Trajectories at Scale
Qwen3.5-122B-A10B — 8×H100 (TP=8, max_model_len=32,768) — pi harness — 1,638 cases — ~64 GPU-hours — Apache-2.0
Repository Attempts Accepted Rate
getmoto/moto 343 184 53.6%
python/mypy 257 101 39.3%
conan-io/conan 71 27 38.0%
pydantic/pydantic 81 24 29.6%
iterative/dvc 219 45 20.5%
pandas-dev/pandas 477 98 19.7%
dask/dask 141 25 17.7%
Total 1,638 504 30.8%
Avg. 104 messages/session — 51 assistant turns — 90/10 prepare/take a look at cut up by repository

07 — Key Takeaways

What Engineers Should Know
  • Polar trains LLM brokers by way of a mannequin API proxy — no harness code modifications required.
  • Supports Anthropic Messages, OpenAI Chat Completions, OpenAI Responses, and Google generateContent APIs natively.
  • prefix_merging cuts coach updates from 1,185 to 218 and wall-clock time by 5.39× vs. per_request.
  • GRPO on Qwen3.5-4B improves SWE-Bench Verified by as much as 22.6 pts (Codex) throughout all 4 harnesses.
  • Works for on-line RL and offline SFT knowledge technology with the identical runtime — no orchestration modifications wanted.
  • Reward design, evaluator high quality, and distribution shift stay the researcher’s duty.
  • Code: github.com/NVIDIA-NeMo/ProRL-Agent-Server — registered as a NeMo Gym setting.

1 / 8

Marktechpost — AI Research, Simplified for Engineers
arXiv:2605.24220

Key Takeaways

  • Polar trains LLM brokers by way of a mannequin API proxy — no harness code modifications required
  • Supports Anthropic Messages, OpenAI Chat Completions, OpenAI Responses, and Google generateContent APIs
  • Using GRPO on Qwen3.5-4B, Polar improves SWE-Bench Verified by as much as 22.6 factors throughout 4 coding harnesses
  • prefix_merging trajectory reconstruction delivers a 5.39× wall-clock speedup over per_request
  • Generated 504 accepted SFT trajectories from 1,638 makes an attempt (30.8%) at ~64 GPU-hours; launched underneath Apache-2.0
  • Rewrites ProRL Agent; registered as a NeMo Gym setting


Check out the Paper and GitHub RepoAlso, be happy to comply with us on Twitter and don’t neglect to affix our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

Need to accomplice with us for selling your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar and so on.? Connect with us

The publish NVIDIA Releases Polar, a Token-Faithful Rollout Framework for GRPO Training Across Codex, Claude Code, and Qwen Code appeared first on MarkTechPost.

Similar Posts