
QeRL: NVFP4-Quantized Reinforcement Learning (RL) Brings 32B LLM Training to a Single H100—While Improving Exploration

What would you build if you could run Reinforcement Learning (RL) post-training on a 32B LLM in 4-bit NVFP4, on a single H100, with BF16-level accuracy and 1.2–1.5× step speedups? NVIDIA researchers (with collaborators from MIT, HKU, and Tsinghua) have open-sourced QeRL (Quantization-enhanced Reinforcement Learning), a training framework that pushes RL post-training into 4-bit FP4 (NVFP4) while keeping gradient math in higher precision via LoRA. The research team reports >1.5× speedups in the rollout phase, ~1.8× end-to-end versus QLoRA in one setting, and the first demonstration of RL training for a 32B policy on a single H100-80GB GPU.

https://arxiv.org/pdf/2510.11696

What does QeRL change in the Reinforcement Learning (RL) loop?

Most RLHF/GRPO/DAPO pipelines spend the majority of wall-clock time in rollouts (token generation). QeRL shifts the policy's weight path to NVFP4 (FP4) with dual-level scaling and keeps logits/gradients in higher precision via LoRA, so backpropagation stays stable while the sampling path hits hardware-efficient FP4×BF16 kernels (Marlin). The result is faster prefill/decoding during rollouts without maintaining a separate full-precision policy.

Mechanically, the research team integrates Marlin-based FP4 kernels in both rollout and prefill, while LoRA limits the number of trainable parameters. This directly targets the stage that dominates RL cost and latency for long reasoning traces.
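To make the mechanism concrete, here is a minimal PyTorch sketch of a weight-only quantized linear layer paired with a higher-precision LoRA adapter. It illustrates the idea rather than the released implementation: the block-wise absmax scaling and rounded codes stand in for true NVFP4 packing, and the dequantize-then-matmul step stands in for the fused Marlin FP4×BF16 GEMM used during rollout and prefill.

```python
import torch
import torch.nn as nn

class FP4LoRALinear(nn.Module):
    """Frozen, block-scaled 4-bit-style base weight + trainable LoRA adapter (sketch)."""

    def __init__(self, in_features, out_features, rank=16, block=16):
        super().__init__()
        w = torch.randn(out_features, in_features)
        # Weight-only quantization (illustrative): per-block absmax scaling,
        # with codes clamped to the FP4 (E2M1) magnitude range [-6, 6].
        w_blocks = w.view(out_features, in_features // block, block)
        self.register_buffer("scale", w_blocks.abs().amax(-1, keepdim=True) / 6.0)
        self.register_buffer("codes", torch.clamp(torch.round(w_blocks / self.scale), -6, 6))
        # Trainable LoRA factors kept in higher precision; only these receive gradients.
        self.lora_a = nn.Parameter(0.01 * torch.randn(rank, in_features))
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x):
        # Simulated dequantization; a fused FP4xBF16 kernel (e.g. Marlin) would
        # consume the packed codes directly on GPU during rollout/prefill.
        w = (self.codes * self.scale).reshape(self.codes.shape[0], -1)
        base = x @ w.t()                                  # frozen quantized path
        lora = (x @ self.lora_a.t()) @ self.lora_b.t()    # trainable low-rank path
        return base + lora

layer = FP4LoRALinear(256, 256)
out = layer(torch.randn(4, 256))
print(out.shape)  # torch.Size([4, 256])
```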


Quantization as exploration, made schedulable

A core empirical finding: deterministic FP4 quantization raises policy entropy, flattening token distributions early in training and improving exploration versus 16-bit LoRA and NF4-based QLoRA baselines. To control that effect over time, QeRL introduces Adaptive Quantization Noise (AQN): channel-wise Gaussian perturbations mapped into LayerNorm scale parameters and annealed with an exponential schedule. This keeps kernel fusion intact (no extra weight tensors) while transitioning from exploration to exploitation.
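The scheduling idea fits in a short sketch. The Python below is our reading of AQN, not the released code, and the sigma_start/sigma_end values are assumed for illustration: a channel-wise Gaussian perturbation is folded into a LayerNorm/RMSNorm scale vector (so no extra weight tensor is materialized and kernel fusion is preserved), and its standard deviation decays on an exponential schedule.

```python
import torch

def aqn_sigma(step: int, total_steps: int, sigma_start: float = 1e-2,
              sigma_end: float = 1e-4) -> float:
    """Exponentially annealed noise scale, decaying from sigma_start to sigma_end."""
    ratio = sigma_end / sigma_start
    return sigma_start * ratio ** (step / max(total_steps - 1, 1))

def apply_aqn(norm_scale: torch.Tensor, step: int, total_steps: int) -> torch.Tensor:
    """Perturb a LayerNorm/RMSNorm scale vector with channel-wise Gaussian noise.

    Because the normalization scale multiplies every hidden channel before the
    next (quantized) linear layer, perturbing it injects structured channel-wise
    noise into that layer's effective weights without allocating a new tensor.
    """
    sigma = aqn_sigma(step, total_steps)
    return norm_scale + sigma * torch.randn_like(norm_scale)

# The noise shrinks as training moves from exploration to exploitation.
gamma = torch.ones(8)  # stand-in scale vector for an 8-channel norm layer
for step in (0, 500, 999):
    print(step, round(aqn_sigma(step, total_steps=1000), 6))
    _ = apply_aqn(gamma, step, total_steps=1000)
```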

In ablations, QeRL shows faster reward growth and higher final scores on math-reasoning tasks under both GRPO and DAPO, consistent with the hypothesis that structured noise in parameter space can be a useful exploration driver in RL, even though such noise is typically detrimental in supervised fine-tuning.

Reported results

On Qwen2.5 backbone models, the research team shows that NVFP4+LoRA outperforms vanilla LoRA and QLoRA in rollout throughput and overall training time, with >2× rollout throughput on 14B/32B models versus QLoRA and ~1.8× end-to-end versus QLoRA in a representative setup. They also demonstrate training a 32B policy with GRPO on a single H100-80GB, enabled by the lower memory footprint of weight-only FP4.

Accuracy is competitive with higher-precision baselines. For a 7B model, the research team reports GSM8K = 90.8% and MATH500 = 77.4%, surpassing 16-bit LoRA and QLoRA under their setup and matching full-parameter fine-tuning. Across broader math benchmarks (e.g., BigMath), QeRL maintains parity or an advantage, while converging faster thanks to improved exploration.


What this is (and isn't)

QeRL is weight-only FP4 with LoRA updates; it does not claim FP4 precision for logits or gradients. The benefits are concentrated in rollout/prefill throughput and memory footprint, with empirical evidence that quantization-induced entropy aids RL exploration when AQN modulates it over training. Generalization beyond math-reasoning tasks, or to safety/tool-use RL, depends on reward design and sequence lengths.

Key Takeaways

  • QeRL combines NVFP4 4-bit weight quantization with LoRA to accelerate the rollout phase and cut memory, enabling RL for a 32B LLM on a single H100-80GB.
  • Quantization acts as exploration: FP4 increases policy entropy, while Adaptive Quantization Noise (AQN) schedules channel-wise noise via LayerNorm scales.
  • Reported efficiency: >1.5× rollout speedups vs 16-bit LoRA and ~1.8× end-to-end vs QLoRA; >2× rollout throughput vs QLoRA on 14B/32B setups.
  • Accuracy holds: Qwen2.5-7B reaches 90.8% on GSM8K and 77.4% on MATH500, matching full-parameter fine-tuning under the paper's setup.
  • NVFP4 is a hardware-optimized 4-bit floating-point format with two-level scaling (FP8 E4M3 block scalers + an FP32 tensor scale), enabling efficient Marlin-based kernels; the arithmetic is illustrated in the sketch below.
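As a worked illustration of that two-level scaling (a plain-Python sketch of the format as described above, not an NVIDIA reference implementation), each block of elements stores FP4 (E2M1) codes together with an FP8 (E4M3) block scale, and the whole tensor carries a single FP32 scale; a stored value is recovered as sign × fp4_magnitude × block_scale × tensor_scale.

```python
# Non-negative magnitudes representable by FP4 (E2M1).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def dequantize_block(codes, signs, block_scale, tensor_scale):
    """Reconstruct real values from one NVFP4-style block.

    `block_scale` would be stored in FP8 (E4M3) and `tensor_scale` in FP32;
    here both are plain Python floats for clarity.
    """
    return [s * E2M1_MAGNITUDES[c] * block_scale * tensor_scale
            for c, s in zip(codes, signs)]

# Example with a 4-element block (NVFP4 groups 16 elements per block in practice).
codes = [7, 3, 1, 0]        # indices into the FP4 magnitude table
signs = [+1, -1, +1, +1]
print(dequantize_block(codes, signs, block_scale=0.25, tensor_scale=2.0))
# -> [3.0, -0.75, 0.25, 0.0]
```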

Editorial Comments

QeRL accelerates the RL rollout stage. It quantizes weights to NVFP4 and keeps updates and logits in higher precision using LoRA. It reports >1.5× rollout speedups and can train a 32B policy on a single H100-80GB GPU. It adds Adaptive Quantization Noise to make exploration a controlled signal during training. Results are shown primarily on math-reasoning tasks using GRPO and DAPO. The gains depend on NVFP4 kernel support such as Marlin.


Check out the Paper and the FULL CODES.

