Prime Intellect Releases prime-rl 0.6.0 to Train Trillion-Parameter MoE Models on Agentic RL Workloads

Prime Intellect has released prime-rl version 0.6.0. The framework targets reinforcement learning on trillion-parameter Mixture-of-Experts (MoE) models. It focuses on heavy agentic workloads, such as long-horizon software-engineering tasks.

The research team trained GLM-5 on SWE tasks at up to 131k sequence length. Step times stayed under 5 minutes. The batch size was 256 rollouts. The run used only 28 H200 nodes.

TL;DR

  • prime-rl 0.6.0 trains trillion-parameter MoE models on agentic RL workloads.
  • GLM-5 was trained on SWE tasks at 131k sequence length, with sub-5-minute steps, on 28 H200 nodes.
  • Asynchronous RL disaggregates trainer and inference so each can be optimized independently.
  • Inference uses FP8, Wide EP, P/D disaggregation, KV offloading, and router replay.
  • Training uses 3D parallelism (FSDP, EP, CP) plus block-scaled FP8.

What is prime-rl 0.6.0?

prime-rl is an open framework for asynchronous reinforcement learning. It post-trains large open-source models on agentic tasks. Version 0.6.0 extends this to trillion-parameter MoE scale.

The example model in the announcement is zai-org/GLM-5.1. The optimizations also apply to other large MoE models. Examples include moonshotai/Kimi-K2.7-Code and nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16.

A full GLM-5.1 run starts with one command on a Slurm cluster.

uv run rl @ examples/glm5_llmd/rl.toml --output-dir /shared/outputs/glm5-llmd

Role of asynchronous RL

Agentic tasks have long-tail outliers. Some coding rollouts run for hours. Waiting for them before every policy update would idle GPUs.

Asynchronous RL avoids this. The trainer and inference systems are disaggregated. They run and scale independently. The inference policy updates as soon as the optimizer step finishes.

There is one synchronization point: the policy update. prime-rl pushes new weights as soon as they exist. Already-dispatched rollouts keep their active prefix cache. So a single rollout may mix tokens from several policy versions.

New rollouts behave differently. They repopulate their own KV cache, even when prefixes match. A KV-cache salt forces this. Requests from too old a policy are dropped. The max_off_policy_steps value controls that threshold.
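
The two rules above can be pictured with a minimal Python sketch. It is not prime-rl code: the Rollout class, the cache_salt format, and the inclusive threshold are all assumptions made for illustration. Only the name max_off_policy_steps comes from the announcement.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    rollout_id: str
    policy_step: int  # policy version that started generating this rollout


def cache_salt(policy_step: int) -> str:
    # Hypothetical salt format: a distinct salt per policy version means a new
    # request never matches prefix-cache entries written under older weights.
    return f"policy-{policy_step}"


def filter_stale(rollouts, current_step, max_off_policy_steps):
    """Split rollouts into (kept, dropped) by how far behind their policy is.

    Assumption: a rollout is kept when its lag is <= max_off_policy_steps.
    """
    kept, dropped = [], []
    for r in rollouts:
        lag = current_step - r.policy_step
        (kept if lag <= max_off_policy_steps else dropped).append(r)
    return kept, dropped


rollouts = [Rollout("a", 10), Rollout("b", 7), Rollout("c", 3)]
kept, dropped = filter_stale(rollouts, current_step=10, max_off_policy_steps=4)
print([r.rollout_id for r in kept])     # ['a', 'b']
print([r.rollout_id for r in dropped])  # ['c']
```

A larger threshold keeps more long-running rollouts at the cost of training on older data; a smaller one discards more work but stays closer to on-policy.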

Inference optimizations

Inference is often the throughput bottleneck in an RL system. prime-rl optimizes for throughput while keeping latency bounded.

FP8 inference: Lower precision speeds up prefill and decode. prime-rl uses FP8 with DeepEP and DeepGEMM kernels.

Wide Expert Parallelism: Wide EP spreads experts across ≥32 GPUs. It pairs with a large data-parallel rank, for example 32. Each GPU holds separate experts and serves as an endpoint. Synchronization happens per layer, through dispatch and combine operations.
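
The dispatch and combine steps can be illustrated with a toy, single-process Python sketch. It assumes top-1 routing and contiguous expert placement (expert_id // experts_per_rank), and it replaces the real all2all communication and expert MLP with plain dictionaries and strings; none of these helper names come from prime-rl or DeepEP.

```python
def dispatch(token_expert_ids, experts_per_rank):
    """Group token indices by the rank that owns each token's expert."""
    per_rank = {}
    for token_idx, expert_id in enumerate(token_expert_ids):
        rank = expert_id // experts_per_rank  # assumed contiguous placement
        per_rank.setdefault(rank, []).append((token_idx, expert_id))
    return per_rank


def combine(per_rank_outputs, num_tokens):
    """Scatter expert outputs back into the original token order."""
    out = [None] * num_tokens
    for outputs in per_rank_outputs.values():
        for token_idx, value in outputs:
            out[token_idx] = value
    return out


token_expert_ids = [5, 0, 3, 6, 1]  # top-1 routing, for brevity
buckets = dispatch(token_expert_ids, experts_per_rank=2)

# Stand-in for the expert MLP: tag each token with the expert that handled it.
processed = {
    rank: [(t, f"tok{t}@expert{e}") for t, e in items]
    for rank, items in buckets.items()
}
combined = combine(processed, len(token_expert_ids))
print(combined)
```

Every MoE layer repeats this pair of steps, which is why the text describes synchronization as happening per layer.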

Prefill and Decode Disaggregation: Some model↔env pairs hit a 4:1 prefill:decode token ratio. Shared workers would inflate end-to-end latency. That reduces the benefits of PipelineRL. P/D disaggregation separates prefill and decode workers. Long tool outputs then stop throttling decode workers.

KV cache management: High concurrency needs large KV cache space. prime-rl supports tiered offloading to CPU and disk. vLLM native offloading creates one pool per worker. Mooncake Store instead pools RAM and disk across all nodes centrally.

Request routing: prime-rl ships a fork of vllm-router by default. It also supports the NVIDIA Dynamo router as a drop-in. Routers score workers using KV cache reuse, queue depth, and live load.
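
As a rough illustration of how such a score might combine the three signals, here is a hypothetical Python sketch. The Worker fields, the linear formula, and the weights are assumptions chosen for the example; they are not the scoring function of vllm-router or the NVIDIA Dynamo router.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    cached_prefix_tokens: int  # prompt tokens already in this worker's KV cache
    queue_depth: int           # requests waiting to start
    running_requests: int      # requests currently being served


def score(worker, prompt_tokens, w_cache=1.0, w_queue=0.1, w_load=0.05):
    # Hypothetical linear score: reward cache reuse, penalize queueing and load.
    reuse = min(worker.cached_prefix_tokens, prompt_tokens) / max(prompt_tokens, 1)
    return (w_cache * reuse
            - w_queue * worker.queue_depth
            - w_load * worker.running_requests)


def pick_worker(workers, prompt_tokens):
    return max(workers, key=lambda w: score(w, prompt_tokens))


workers = [
    Worker("w0", cached_prefix_tokens=8000, queue_depth=6, running_requests=12),
    Worker("w1", cached_prefix_tokens=2000, queue_depth=0, running_requests=2),
    Worker("w2", cached_prefix_tokens=0, queue_depth=0, running_requests=0),
]
print(pick_worker(workers, prompt_tokens=8000).name)  # w1
```

In this example the worker with the full prefix cached (w0) loses to a lightly loaded worker with partial reuse (w1), which shows the trade-off a router has to balance.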

Router replay (R3): Trainer↔inference mismatch silently kills training. Router replay captures inference routing decisions. It replays them directly on the trainer. This cuts KL mismatch by roughly an order of magnitude. Routed experts have shape [num_layers, top_k, seq_len]. This payload can grow to hundreds of GB. At scale, the data rate reaches tens of Gbps. So prime-rl treats it as an opaque payload. Optimized PyTorch operations handle the processing.
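
The idea can be sketched for a single token at a single MoE layer in plain Python. This is an illustrative assumption, not prime-rl's implementation: the route function is hypothetical, and the choice to recompute gate weights from the trainer's own logits and renormalize over the selected experts is one common convention that varies by model.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(logits, top_k, replay_ids=None):
    """Return (expert_ids, gate_weights) for one token at one MoE layer.

    Without replay_ids the trainer picks its own top-k experts. With
    replay_ids it reuses the experts chosen at inference time and only
    recomputes the gate weights from its own logits.
    """
    probs = softmax(logits)
    if replay_ids is None:
        ids = sorted(range(len(probs)), key=lambda i: -probs[i])[:top_k]
    else:
        ids = list(replay_ids)
    picked = [probs[i] for i in ids]
    norm = sum(picked)
    return ids, [p / norm for p in picked]


# Tiny numerical differences flip a near-tie between experts 1 and 2.
inference_logits = [2.00, 1.50, 1.49, 0.10]
trainer_logits = [2.00, 1.49, 1.50, 0.10]

inference_ids, _ = route(inference_logits, top_k=2)
free_ids, _ = route(trainer_logits, top_k=2)
replayed_ids, replayed_weights = route(trainer_logits, top_k=2, replay_ids=inference_ids)

print(inference_ids)  # [0, 1]
print(free_ids)       # [0, 2]
print(replayed_ids)   # [0, 1]
```

Without replay, the trainer would send this token through a different expert than the one that generated it, and that kind of divergence is what shows up as KL mismatch.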

Training optimizations

The trainer builds on torchtitan, a PyTorch-native training codebase. It relies on 3D parallelism: FSDP, CP, and EP. The GLM-5 case study uses all three.

| Strategy | What it shards | Primary use | Key detail |
|---|---|---|---|
| FSDP (FSDP2) | Parameters, gradients, optimizer states | Baseline memory amortization | Gathers weights on demand per layer via fully_shard |
| Expert Parallelism (EP) | Experts within a layer | Shrinks active layer memory | all2all dispatch/combine; torch-native or DeepEP |
| Context Parallelism (CP) | The sequence dimension | Long-context activation memory | Ulysses (default) or Ring Attention |

EP exists because layers stay huge after FSDP. With 78 layers and 800B params in float32, one layer's all-gather needs roughly 40GB. Overlapping the next layer's all-gather with compute keeps two layers resident, which pushes that near 80GB. Setting EP=8 dispatches tokens instead of gathering full experts. torch-native all2all is slightly faster within one node. DeepEP wins when EP spans multiple nodes.
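
The 40GB and 80GB figures follow from simple arithmetic, reproduced below in Python. The only assumption added here is that parameters are spread evenly across layers; the layer count, parameter count, and float32 width come from the text.

```python
TOTAL_PARAMS = 800e9  # from the text
NUM_LAYERS = 78       # from the text
BYTES_PER_PARAM = 4   # float32

# Assumption: parameters are spread evenly across layers.
layer_bytes = TOTAL_PARAMS / NUM_LAYERS * BYTES_PER_PARAM
one_layer_gb = layer_bytes / 1e9
two_layers_gb = 2 * one_layer_gb  # current layer plus one prefetched layer

print(f"one gathered layer: {one_layer_gb:.1f} GB")         # 41.0 GB
print(f"with one layer of overlap: {two_layers_gb:.1f} GB")  # 82.1 GB
```

Roughly 82GB of gathered weights alone would take up most of a GPU's memory, which is the pressure EP relieves by moving tokens to the experts rather than experts to the tokens.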

CP matters at 131k+ sequence length. There, activations dominate memory, not parameters. GLM-5 uses DSA (DeepSeek Sparse Attention), which neither Ulysses nor Ring Attention parallelizes directly. So prime-rl ships a custom context-parallel implementation for it.

FP8 training: prime-rl uses DeepGEMM block-scaled FP8, as proposed by DeepSeek V3. This rarely raises throughput, because of quantization overhead. Its real value is matching trainer and inference precision. That reduces KL mismatch and stabilizes training.
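
To give a feel for what "block-scaled" means, the Python sketch below computes one scale per block of values and compares it with a single per-tensor scale. It only computes scales and does not perform the FP8 cast or call DeepGEMM. The block length of 128 and the helper name block_scales are assumptions for illustration; 448 is the largest finite value in the FP8 E4M3 format.

```python
FP8_E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format
BLOCK = 128           # assumed block length


def block_scales(values, block=BLOCK):
    """One scale per block so that each block's max maps to FP8_E4M3_MAX."""
    scales = []
    for start in range(0, len(values), block):
        amax = max(abs(v) for v in values[start:start + block])
        scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
    return scales


# Two blocks of small values, with a single large outlier in the second block.
values = [0.01] * 128 + [0.02] * 127 + [500.0]

per_tensor_scale = max(abs(v) for v in values) / FP8_E4M3_MAX
per_block_scales = block_scales(values)

print(per_tensor_scale)
print(per_block_scales)
```

With a per-tensor scale, the outlier forces every small value to share one coarse scale. With block scaling, only the block containing the outlier pays that cost, and the first block keeps a scale suited to its own range.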

Use cases with examples

  • Long-horizon SWE agents: Train a model on real repository issues. Rollouts can span hundreds of turns and tool calls. P/D disaggregation keeps decode latency predictable here.
  • 1T-scale post-training on fewer nodes: The GLM-5 run fit on 28 H200 nodes. Wide EP and KV offloading raise concurrency and throughput.
  • Stable agentic RL at scale: Router replay and FP8 training both reduce trainer↔inference KL mismatch. Lower mismatch means steadier training.



The post Prime Intellect Releases prime-rl 0.6.0 to Train Trillion-Parameter MoE Models on Agentic RL Workloads appeared first on MarkTechPost.
