NVIDIA Researchers Propose Reinforcement Learning Pretraining (RLP): Reinforcement as a Pretraining Objective for Building Reasoning During Pretraining

NVIDIA AI has introduced Reinforcement Learning Pretraining (RLP), a training objective that injects reinforcement learning into the pretraining stage rather than deferring it to post-training. The core idea is simple and testable: treat a short chain-of-thought (CoT) as an action sampled before next-token prediction and reward it by the information gain it provides on the observed next token, measured against a no-think EMA baseline. This produces a verifier-free, dense, position-wise reward that can be applied to ordinary text streams at pretraining scale.
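As a toy numeric illustration of that information-gain reward (the probabilities below are assumed for illustration, not taken from the paper): if a sampled thought raises the model's probability of the observed next token relative to the no-think baseline, the reward is positive; otherwise it is zero or negative.

```python
import math

# Toy illustration (numbers are assumed, not from the paper): the reward is the
# log-likelihood ratio between predicting the observed next token with a sampled
# thought versus predicting it with the no-think EMA baseline.
p_next_with_thought = 0.40   # hypothetical p_theta(x_t | x_<t, c_t)
p_next_no_think     = 0.10   # hypothetical p_phi(x_t | x_<t), EMA baseline

reward = math.log(p_next_with_thought) - math.log(p_next_no_think)
print(f"information-gain reward: {reward:.3f}")  # ~1.386 > 0: the thought helped

# If the thought does not improve prediction, the ratio is <= 1 and the reward
# is <= 0, so useless chains-of-thought are not reinforced.
```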

Mechanism: Information-Gain Rewards with an EMA Counterfactual
At each position t, RLP (1) samples a short chain-of-thought c_t from the policy π_θ(c_t ∣ x_<t), and then (2) scores the observed next token under p_θ(x_t ∣ x_<t, c_t). A slowly updated EMA teacher p_φ(x_t ∣ x_<t) provides the no-think counterfactual. The per-token reward is the log-likelihood ratio

r_t = log p_θ(x_t ∣ x_<t, c_t) − log p_φ(x_t ∣ x_<t),

which is positive exactly when the sampled thought makes the observed next token more likely than the no-think baseline.
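A minimal PyTorch-style sketch of this reward computation (function names, tensor shapes, and the EMA decay value are assumptions for illustration, not the paper's reference implementation):

```python
import torch
import torch.nn.functional as F

def rlp_token_rewards(student_logits, ema_logits, next_tokens):
    """Dense, position-wise information-gain rewards.

    student_logits: (B, T, V) logits from the policy conditioned on its sampled
                    thoughts, i.e. p_theta(x_t | x_<t, c_t)
    ema_logits:     (B, T, V) logits from the slowly updated EMA teacher without
                    a thought, i.e. p_phi(x_t | x_<t)
    next_tokens:    (B, T) observed next tokens x_t
    """
    logp_think = F.log_softmax(student_logits, dim=-1)
    logp_no_think = F.log_softmax(ema_logits, dim=-1)
    idx = next_tokens.unsqueeze(-1)
    # r_t = log p_theta(x_t | x_<t, c_t) - log p_phi(x_t | x_<t)
    r = logp_think.gather(-1, idx) - logp_no_think.gather(-1, idx)
    return r.squeeze(-1)  # (B, T): one reward at every token position

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    # Slowly track the online model's weights; the decay value is an assumption.
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)
```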
Why this matters technically: unlike prior “reinforcement pretraining” variants that rely on sparse, binary correctness signals or proxy filters, RLP’s dense, verifier-free reward attaches position-wise credit wherever thinking improves prediction, enabling updates at every token position in general web-scale corpora without external verifiers or curated answer keys.
Understanding the Results
Qwen3-1.7B-Base: Pretraining with RLP improved the overall math+science average by ~19% vs the base model and ~17% vs compute-matched continuous pretraining (CPT). After identical post-training (SFT + RLVR) across all variants, the RLP-initialized model retained a ~7–8% relative advantage, with the largest gains on reasoning-heavy benchmarks (AIME25, MMLU-Pro).
Nemotron-Nano-12B v2: Applying RLP to a 12B hybrid Mamba-Transformer checkpoint raised the overall average from 42.81% to 61.32% and delivered an absolute +23% gain on scientific reasoning, even though the RLP run used ~200B fewer tokens (training for 19.8T vs 20T tokens; RLP applied for 250M tokens). This highlights data efficiency and architecture-agnostic behavior.

RPT comparison: Under matched data and compute with Omni-MATH-style settings, RLP outperformed RPT on math, science, and overall averages, a result attributed to RLP’s continuous information-gain reward versus RPT’s sparse binary signal and entropy-filtered tokens.

Positioning vs. Post-Training RL and Data Curation
Reinforcement Learning Pretraining (RLP) is orthogonal to post-training pipelines (SFT, RLVR) and shows compounding improvements after standard alignment. Because the reward is computed from model log-evidence rather than external verifiers, it scales to domain-agnostic corpora (web crawl, academic text, textbooks) and SFT-style reasoning corpora, avoiding the brittleness of narrow curated datasets. In compute-matched comparisons (including CPT with 35× more tokens to match FLOPs), RLP still led on overall averages, suggesting the improvements derive from objective design, not budget.
Key Takeaways
- RLP makes reasoning a pretraining objective: sample a chain-of-thought before next-token prediction and reward it by the information gain over a no-think EMA baseline.
- Verifier-free, dense, position-wise signal: works on ordinary text streams without external graders, enabling scalable pretraining updates on every token.
- Qwen3-1.7B results: +19% vs base and +17% vs compute-matched CPT during pretraining; with identical SFT+RLVR, RLP retains ~7–8% gains (largest on AIME25, MMLU-Pro).
- Nemotron-Nano-12B v2: overall average rises 42.81% → 61.32% (+18.51 pp; ~35–43% rel.) and +23 points on scientific reasoning, using ~200B fewer NTP tokens.
- Training details that matter: update gradients only on thought tokens with a clipped surrogate and group-relative advantages; more rollouts (≈16) and longer thought lengths (≈2048) help; token-level KL anchoring offers no benefit (see the sketch after this list).
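A minimal sketch of what such an update could look like, under stated assumptions: the clipping range, the mean-centered group advantage, and the masking details follow common GRPO/PPO practice and are not the paper's reference code.

```python
import torch

def rlp_policy_loss(logp_new, logp_old, rewards, thought_mask, clip_eps=0.2):
    """Clipped surrogate applied only to thought tokens, with group-relative advantages.

    logp_new, logp_old: (G, L) per-token log-probs of G sampled thoughts under
                        the current and behavior policies
    rewards:            (G,) scalar information-gain reward per rollout
    thought_mask:       (G, L) 1.0 on thought tokens, 0.0 elsewhere
    clip_eps:           assumed PPO-style clipping range
    """
    # Group-relative advantage: center each rollout's reward on the group mean.
    adv = (rewards - rewards.mean()).unsqueeze(-1)                  # (G, 1)

    ratio = torch.exp(logp_new - logp_old)                          # (G, L)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv

    # Gradients flow only through thought tokens; all other positions are masked out.
    per_token = -torch.minimum(unclipped, clipped) * thought_mask
    return per_token.sum() / thought_mask.sum().clamp(min=1.0)
```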
Conclusion
RLP reframes pretraining to directly reward “think-before-predict” behavior using a verifier-free, information-gain signal, yielding durable reasoning gains that persist through identical SFT+RLVR and extend across architectures (Qwen3-1.7B, Nemotron-Nano-12B v2). The method’s objective, contrasting CoT-conditioned likelihood against a no-think EMA baseline, integrates cleanly into large-scale pipelines without curated verifiers, making it a practical upgrade to next-token pretraining rather than a post-training add-on.
Check out the Paper, Code and Project Page.