NVIDIA Releases Polar, a Token-Faithful Rollout Framework for GRPO Training Across Codex, Claude Code, and Qwen Code
Reinforcement studying for language brokers is rising extra complicated. Agents now handle multi-turn device use, long-running contexts, and multi-agent orchestration. The fundamental engineering problem is connecting present agent software program to coaching pipelines with out breaking how these instruments work.
NVIDIA’s analysis staff launched Polar, a rollout framework that lets researchers run reinforcement studying over any agent harness with out modifying that harness.
The Core Problem Polar Solves
An ‘agent harness’ is a device like Codex CLI, Claude Code, Qwen Code, or Pi. These harnesses handle system prompts, device formatting, context engineering, and how the agent submits patches. These particulars straight have an effect on agent habits at analysis time.
Traditional RL infrastructure requires harness logic to be rewritten behind a framework-owned setting API — usually env.init(), env.step(), env.reset() within the OpenAI Gym type. Every new harness requires new integration code. That integration also can lose execution particulars particular to the native harness path.
Polar’s key commentary is that each LLM-based agent should name a mannequin. That mannequin API boundary is a widespread interface outdoors the agent itself. Instead of integrating contained in the harness, Polar locations a proxy at that boundary.
How the Proxy Works
For every incoming mannequin request, the gateway proxy performs 4 steps:
- Detect the supplier API — utilizing the request path and headers, it distinguishes Anthropic Messages, OpenAI Chat Completions, OpenAI Responses, and Google generateContent-style calls.
- Normalize the request — converts roles, content material elements, device definitions, and technology parameters into the OpenAI Chat Completions form utilized by the native inference server.
- Capture token-level knowledge — shops request messages, response messages, immediate token IDs, sampled response token IDs, end motive, and log chances.
- Return the supplier form — transforms the response again into the schema the harness expects.
For streaming requests, Polar obtains a non-streaming upstream response and emits a artificial provider-shaped stream. This preserves compatibility with harnesses that count on server-sent occasions whereas guaranteeing full token seize.
The solely required change to an present harness is pointing its mannequin base URL on the gateway.

Architecture: Rollout Server and Gateway Nodes
Polar has two core elements:
The rollout server accepts a JobRequest and expands it into num_samples impartial periods. Each session carries a session ID, activity ID, timeout price range, runtime specification, agent specification, trajectory builder, evaluator, and callback URL. The server dispatches periods to gateway nodes and accepts callbacks when periods full.
Gateway nodes personal the lifecycle of every session — beginning the runtime, operating the harness, constructing trajectories, evaluating output, and teardown. The gateway additionally hosts the proxy endpoint for that session’s mannequin calls, preserving completion seize tied to the session registry.
Within every gateway, remoted employee swimming pools deal with INIT, RUNNING, and POSTRUN phases. A bounded READY buffer holds initialized runtimes till a run slot is obtainable. CPU-heavy runtime preparation and evaluator prewarm proceed off the crucial path, with out blocking energetic GPU-bound agent execution. If a harness instances out after mannequin calls have been captured, the gateway nonetheless enters POSTRUN so partial traces may be recovered.
Built-in evaluators embody a session-completion reward, a configurable test-on-output evaluator, and a SWE-Bench/SWE-Gym harness evaluator. Custom evaluators may be added by way of a registry interface.
Polar at the moment helps Docker and rootless Apptainer runtimes. Built-in harness shortcuts embody codex, claude_code, gemini_cli, qwen_code, opencode, and pi.
Trajectory Reconstruction: Per Request vs. Prefix Merging
After a session completes, Polar reconstructs trainable trajectories from captured mannequin calls.
Two methods can be found:
The per_request builder treats each mannequin name as one impartial hint. It is lossless per particular person name however fragments multi-turn periods. A single coding drawback can produce a whole lot of per-request traces, rising the burden on downstream trainers.
The prefix_merging builder reconstructs longer traces the place the harness session preserves append-only dialog histories. It partitions completions into ordered chains by verifying a strict token-prefix relation between adjoining completions. Sub-agents, context compaction boundaries, and parallel agent branches naturally type separate chains. Within every merged hint, solely sampled assistant tokens are marked trainable. Canonical interstitial tokens obtain a loss masks of zero.
Ablation Results
The analysis staff benchmarks each methods on the identical mannequin, {hardware}, and topology over three coaching steps.
| Metric | per_request |
prefix_merging |
|---|---|---|
| Trainer updates | 1,185 | 218 |
| Wall-clock time | 189.5 min | 35.2 min |
| Speedup | — | 5.39× |
| Avg. rollout GPU utilization | 20.4% | 87.7% |
SWE-Bench Verified Results
Training makes use of commonplace GRPO on the Qwen3.5-4B base mannequin. The dataset is SkyRL-v0-293-data SWE-Gym (293 duties, 1 epoch, rollout batch dimension 4, 16 samples per immediate) with the Slime coach. All experiments use prefix_merging for trajectory building.
Training Rollout Reward Progress (cross@1)
| Harness | First 10 Steps | Last 10 Steps |
|---|---|---|
| Codex | 9.5% | 54.5% |
| Claude Code | 28.8% | 67.0% |
| Qwen Code | 61.6% | 66.0% |
| Pi | 61.6% | 76.2% |
SWE-Bench Verified Final Scores
| Harness | Base | Polar RL | Gain |
|---|---|---|---|
| Codex | 3.8% | 26.4% | +22.6 pts |
| Claude Code | 29.8% | 34.6% | +4.8 pts |
| Qwen Code | 34.6% | 35.2% | +0.6 pts |
| Pi | 34.2% | 40.4% | +6.2 pts |
The largest acquire is underneath Codex. Codex presents an unfamiliar motion protocol and patch-submission type to a Qwen mannequin not initially educated on that harness. Polar attaches the reward sign to the precise sampled tokens flowing by way of the Codex execution path, so GRPO optimizes the habits the mannequin makes use of at analysis time. Under the native Qwen Code harness, the place the bottom mannequin is already well-aligned, Polar nonetheless delivers a 0.6 level acquire.
Offline SFT Data Generation
Polar also can function a distributed offline knowledge technology service with no modifications to the runtime. The analysis staff demonstrates this utilizing Qwen3.5-122B-A10B on an 8×H100 server (TP=8, max_model_len=32,768) with the pi harness towards 1,638 cases from seven SWE-Gym repositories.
A trajectory is accepted into the SFT corpus provided that the SWE-Bench analysis harness confirms the agent’s patch resolves each FAIL_TO_PASS take a look at and leaves each PASS_TO_PASS take a look at inexperienced.
| Repository | Attempts | Accepted | Rate |
|---|---|---|---|
| getmoto/moto | 343 | 184 | 53.6% |
| python/mypy | 257 | 101 | 39.3% |
| conan-io/conan | 71 | 27 | 38.0% |
| pydantic/pydantic | 81 | 24 | 29.6% |
| iterative/dvc | 219 | 45 | 20.5% |
| pandas-dev/pandas | 477 | 98 | 19.7% |
| dask/dask | 141 | 25 | 17.7% |
| Total | 1,638 | 504 | 30.8% |
The run value roughly 64 GPU-hours. Accepted trajectories common 104 messages per session and 51 assistant turns.
Framework Comparison
| System | Async RL | Async Rollout Staging | Rollout as Service | Harness Agnostic |
|---|---|---|---|---|
| Polar | ✓ | ✓ | ✓ | ✓ |
| ProRL Agent | ✓ | ✓ | ✓ | ✗ |
| SkyRL-Agent | ✓ | ✓ | ✗ | partial |
| PRIME-RL | ✓ | ✗ | ✗ | ✗ |
| Agent Lightning | partial | ✗ | partial | partial |
| rLLM | partial | ✗ | ✗ | ✗ |
| OpenClaw-RL | ✓ | ✗ | ✗ | partial |
Polar is the one system on this comparability with first-class assist throughout all 4 properties.
Strengths and Limitations
Strengths
- No harness code modifications required — the proxy intercepts on the mannequin API boundary
- Provider-agnostic: helps Anthropic, OpenAI Chat, OpenAI Responses, and Google API codecs natively
prefix_mergingreduces coach updates from 1,185 to 218 and cuts wall-clock time 5.39×- Works for each on-line RL and offline SFT knowledge technology with the identical runtime
- Harness-native RL delivers giant good points for unfamiliar execution paths — 22.6 pts on Codex
- Partial traces are recovered when a harness instances out mid-session
- Released as open supply underneath NeMo Gym
Limitations
- Reward design, evaluator high quality, and distribution shift stay the researcher’s duty
- Requires the harness to assist a configurable mannequin base URL
- Token-level seize depends upon the serving stack supplying dependable token IDs and log chances
per_requesttechnique produced reward hacking in experiments attributable to noisy credit score task on the session stage; session normalization and PRM-style credit score task are on the roadmap
Marktechpost’s Visual Explainer
Polar — Agentic RL Framework
arXiv:2605.24220
1 / 8
arXiv:2605.24220
Key Takeaways
- Polar trains LLM brokers by way of a mannequin API proxy — no harness code modifications required
- Supports Anthropic Messages, OpenAI Chat Completions, OpenAI Responses, and Google generateContent APIs
- Using GRPO on Qwen3.5-4B, Polar improves SWE-Bench Verified by as much as 22.6 factors throughout 4 coding harnesses
prefix_mergingtrajectory reconstruction delivers a 5.39× wall-clock speedup overper_request- Generated 504 accepted SFT trajectories from 1,638 makes an attempt (30.8%) at ~64 GPU-hours; launched underneath Apache-2.0
- Rewrites ProRL Agent; registered as a NeMo Gym setting
Check out the Paper and GitHub Repo. Also, be happy to comply with us on Twitter and don’t neglect to affix our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
Need to accomplice with us for selling your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar and so on.? Connect with us
The publish NVIDIA Releases Polar, a Token-Faithful Rollout Framework for GRPO Training Across Codex, Claude Code, and Qwen Code appeared first on MarkTechPost.
