Prime Intellect Releases prime-rl 0.6.0 to Train Trillion-Parameter MoE Models on Agentic RL Workloads

Prime Intellect has launched prime-rl version 0.6.0. The framework targets reinforcement learning on trillion-parameter Mixture-of-Experts (MoE) models. It focuses on heavy agentic workloads, such as long-horizon software-engineering tasks.

The research team trained GLM-5 on SWE tasks at up to 131k sequence length. Step times stayed under 5 minutes. The batch size was 256 rollouts. The run used only 28 H200 nodes.

TL;DR

prime-rl 0.6.0 trains trillion-parameter MoE models on agentic RL workloads.
GLM-5 trained on SWE at 131k sequence length, sub-5-minute steps, 28 H200 nodes.
Asynchronous RL disaggregates trainer and inference for independent optimization.
Inference uses FP8, Wide EP, P/D disaggregation, KV offloading, and router replay.
Training uses 3-D parallelism (FSDP, EP, CP) plus block-scaled FP8.

What is prime-rl 0.6.0?

prime-rl is an open framework for asynchronous reinforcement learning. It post-trains large open-source models on agentic tasks. Version 0.6.0 extends this to trillion-parameter MoE scale.

The example model in the announcement is zai-org/GLM-5.1. The optimizations also apply to other large MoE models. Examples include moonshotai/Kimi-K2.7-Code and nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16.

A full GLM-5.1 run begins with one command on a Slurm cluster.

uv run rl @ examples/glm5_llmd/rl.toml --output-dir /shared/outputs/glm5-llmd

Role of asynchronous RL

Agentic tasks have long-tail outliers. Some coding rollouts run for hours. Waiting for them before each policy update would idle GPUs.

Asynchronous RL avoids this. The trainer and inference systems are disaggregated. They run and scale independently. The inference policy updates as soon as the optimizer step finishes.

There is one synchronization point: the policy update. prime-rl pushes new weights as soon as they exist. Already-dispatched rollouts keep their active prefix cache, so a single rollout may mix tokens from multiple policy versions.

New rollouts behave differently. They repopulate their own KV cache, even when prefixes match. A KV-cache salt forces this. Requests from too old a policy are dropped. The max_off_policy_steps value controls that threshold.
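
The staleness rule and the cache salt described above can be sketched as follows. This is a minimal illustration, not prime-rl's code: `Rollout`, `is_too_stale`, `cache_salt_for`, and `filter_rollouts` are hypothetical names, and only `max_off_policy_steps` comes from the announcement. Whether the threshold is inclusive is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """Hypothetical record of a rollout and the policy step it started under."""
    rollout_id: str
    start_policy_step: int


def is_too_stale(rollout: Rollout, current_step: int, max_off_policy_steps: int) -> bool:
    """True if the rollout began under a policy older than the allowed window."""
    return current_step - rollout.start_policy_step > max_off_policy_steps


def cache_salt_for(policy_step: int) -> str:
    """Per-policy-version salt (assumed scheme): a new salt means new requests
    cannot hit prefix-cache entries written under earlier weights."""
    return f"policy-step-{policy_step}"


def filter_rollouts(rollouts, current_step, max_off_policy_steps):
    """Keep only rollouts that are still within the off-policy window."""
    return [r for r in rollouts
            if not is_too_stale(r, current_step, max_off_policy_steps)]
```

With `current_step=10` and `max_off_policy_steps=3`, a rollout that started at step 7 is kept and one that started at step 2 is dropped under this inclusive reading.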

Inference optimizations

Inference is usually the throughput bottleneck in an RL system. prime-rl optimizes for throughput while keeping latency bounded.

FP8 inference: Lower precision speeds up prefill and decode. prime-rl uses FP8 with DeepEP and DeepGEMM kernels.

Wide Expert Parallelism: Wide EP spreads experts across ≥32 GPUs. It pairs with a large data-parallel rank, for example 32. Each GPU holds separate experts and serves as an endpoint. Synchronization happens per layer, through dispatch and combine operations.
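
The per-layer dispatch and combine pattern can be illustrated with a toy, single-process simulation. Everything here is a simplification chosen for the sketch: tokens are scalars, experts are placed contiguously on ranks, and `owner_rank`, `dispatch`, and `combine` are hypothetical helpers, not DeepEP or prime-rl APIs.

```python
from collections import defaultdict


def owner_rank(expert_id: int, experts_per_rank: int) -> int:
    """Assumed placement: experts are laid out contiguously across ranks."""
    return expert_id // experts_per_rank


def dispatch(routes, experts_per_rank):
    """Group (token, expert, gate weight) triples by the rank that owns the expert.

    routes[i] is the list of (expert_id, gate_weight) pairs chosen for token i.
    """
    inbox = defaultdict(list)
    for token_idx, choices in enumerate(routes):
        for expert_id, weight in choices:
            rank = owner_rank(expert_id, experts_per_rank)
            inbox[rank].append((token_idx, expert_id, weight))
    return inbox


def combine(inbox, tokens, expert_fn):
    """Run each rank's experts on its received tokens, then sum the
    gate-weighted expert outputs back into the original token order."""
    out = [0.0] * len(tokens)
    for _rank, items in inbox.items():
        for token_idx, expert_id, weight in items:
            out[token_idx] += weight * expert_fn(expert_id, tokens[token_idx])
    return out
```

In a real system the two steps are all-to-all collectives across GPUs; the sketch only shows which data moves where.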

Prefill and Decode Disaggregation: Some model-environment pairs hit a 4:1 prefill:decode token ratio. Shared workers would inflate end-to-end latency, which reduces the benefits of PipelineRL. P/D disaggregation separates prefill and decode workers. Long tool outputs then stop throttling decode workers.

KV cache management: High concurrency needs large KV cache space. prime-rl supports tiered offloading to CPU and disk. vLLM native offloading creates one pool per worker. Mooncake Store instead pools RAM and disk across all nodes centrally.

Request routing: prime-rl ships a fork of vllm-router by default. It also supports the NVIDIA Dynamo router as a drop-in. Routers score workers using KV cache reuse, queue depth, and live load.
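
A scoring rule of this kind might look like the sketch below. The formula and weights are invented for illustration and are not the policy of vllm-router or the Dynamo router; `WorkerState`, `score`, and `pick_worker` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class WorkerState:
    name: str
    cached_prefix_tokens: int  # prompt tokens already present in this worker's KV cache
    queue_depth: int           # requests waiting on this worker
    live_load: float           # current utilization in [0, 1]


def score(worker: WorkerState, prompt_tokens: int,
          w_cache: float = 1.0, w_queue: float = 0.1, w_load: float = 0.5) -> float:
    """Higher is better: reward KV cache reuse, penalize queueing and load.
    The linear form and the default weights are assumptions."""
    reuse = worker.cached_prefix_tokens / max(prompt_tokens, 1)
    return w_cache * reuse - w_queue * worker.queue_depth - w_load * worker.live_load


def pick_worker(workers, prompt_tokens: int) -> WorkerState:
    """Route the request to the best-scoring worker."""
    return max(workers, key=lambda w: score(w, prompt_tokens))
```

The point of the trade-off: a worker with the best cache hit rate can still lose to a slightly colder worker that has a shorter queue.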

Router replay (R3): Trainer-inference mismatch silently kills training. Router replay captures the routing decisions made during inference and replays them directly on the trainer. This cuts KL mismatch by roughly an order of magnitude. The routed-expert tensor has shape [num_layers, top_k, seq_len]. This payload can grow to hundreds of GB. At scale, the data rate reaches tens of Gbps. So prime-rl treats it as an opaque payload, and optimized PyTorch operations handle the processing.
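
To get a feel for how the [num_layers, top_k, seq_len] payload scales, here is a back-of-the-envelope calculation. The layer count, sequence length, and batch size are taken from elsewhere in this article; `top_k=8`, two bytes per expert index, and the treatment of every layer as an MoE layer are assumptions made only for this sketch, so the result is illustrative and not a figure from the announcement.

```python
def routed_experts_bytes(num_layers: int, top_k: int, seq_len: int,
                         bytes_per_index: int = 2) -> int:
    """Size of one rollout's routing record: one expert index per
    (layer, top-k slot, token position)."""
    return num_layers * top_k * seq_len * bytes_per_index


# 78 layers, 131k tokens (taken as 131,072), and 256 rollouts per step
# are from the article; top_k=8 and 2-byte indices are assumed.
per_rollout = routed_experts_bytes(num_layers=78, top_k=8, seq_len=131_072)
per_step = per_rollout * 256

print(f"per rollout: {per_rollout / 1e9:.2f} GB")
print(f"per step:    {per_step / 1e9:.1f} GB")
```

Wider index types, a larger top_k, or many rollouts in flight across several policy versions multiply this further.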

Training optimizations

The trainer builds on torchtitan, a PyTorch-native training codebase. It relies on 3-D parallelism: FSDP, CP, and EP. The GLM-5 case study uses all three.

Strategy	What it shards	Primary use	Key detail
FSDP (FSDP2)	Parameters, gradients, optimizer states	Baseline memory amortization	Gathers weights on demand per layer via `fully_shard`
Expert Parallelism (EP)	Experts within a layer	Shrinks active layer memory	`all2all` dispatch/combine; torch-native or DeepEP
Context Parallelism (CP)	The sequence dimension	Long-context activation memory	Ulysses (default) or Ring Attention

EP exists because layers stay huge after FSDP. With 78 layers and 800B params in float32, one layer's all-gather needs roughly 40GB. Overlapping one more layer's all-gather with compute pushes that to nearly 80GB. Setting EP=8 dispatches tokens instead of gathering full experts. torch-native all2all is slightly faster within one node. DeepEP wins when EP spans multiple nodes.
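
The 40GB and 80GB figures follow from simple arithmetic, shown here under the simplifying assumption that parameters are spread evenly across layers (embeddings and other non-layer weights are ignored). `layer_allgather_bytes` is a hypothetical helper for this sketch.

```python
def layer_allgather_bytes(total_params: float, num_layers: int,
                          bytes_per_param: int) -> float:
    """Memory needed to materialize one fully gathered layer,
    assuming parameters are evenly divided among layers."""
    return total_params / num_layers * bytes_per_param


GB = 1e9
one_layer = layer_allgather_bytes(total_params=800e9, num_layers=78, bytes_per_param=4)
# Prefetching the next layer while the current one computes keeps two layers resident.
two_layers = 2 * one_layer

print(f"one gathered layer:  {one_layer / GB:.1f} GB")
print(f"with one-layer overlap: {two_layers / GB:.1f} GB")
```

On GPUs with 141GB of memory such as the H200, that leaves little room for activations at long sequence lengths, which is the motivation for sharding experts rather than gathering them.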

CP matters at 131k+ sequence length. There, activations dominate memory, not parameters. GLM-5 uses DSA, which neither Ulysses nor Ring Attention parallelizes directly. So prime-rl ships a custom context-parallel implementation for it.

FP8 training. prime-rl uses DeepGEMM block-scaled FP8, as proposed by DeepSeek V3. This rarely raises throughput, because of quantization overhead. Its real value is matching trainer and inference precision. That reduces KL mismatch and stabilizes training.
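
The idea behind block scaling can be emulated in plain Python: each block gets its own scale so that its largest magnitude maps to the FP8 E4M3 maximum of 448, and values are then rounded to three mantissa bits. This sketch only imitates the numerics; it is not DeepGEMM's kernel, it ignores subnormals and NaN handling, and the function names are hypothetical.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format


def round_to_e4m3(x: float) -> float:
    """Clamp to +/-448 and round to 3 explicit mantissa bits (subnormals ignored)."""
    if x == 0.0:
        return 0.0
    x = max(-E4M3_MAX, min(E4M3_MAX, x))
    m, e = math.frexp(x)       # x = m * 2**e with 0.5 <= |m| < 1
    m_q = round(m * 16) / 16   # 1 implicit + 3 explicit mantissa bits
    return math.ldexp(m_q, e)


def quantize_block(block):
    """Quantize one block with a single shared scale; returns (codes, scale)."""
    amax = max(abs(v) for v in block)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in block], scale


def dequantize_block(codes, scale):
    """Recover approximate values from codes and the block's scale."""
    return [c * scale for c in codes]
```

Because the scale is chosen per block, an outlier in one block does not degrade the resolution of the others, and using the same scheme on the trainer and the inference engine keeps their rounding behaviour aligned.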

Use cases with examples

Long-horizon SWE agents: Train a model on real repository issues. Rollouts can span hundreds of turns and tool calls. P/D disaggregation keeps decode latency predictable here.
1T-scale post-training on fewer nodes: The GLM-5 run fit on 28 H200 nodes. Wide EP and KV offloading raise concurrency and throughput.
Stable agentic RL at scale: Router replay and FP8 training both reduce trainer-inference KL mismatch. Lower mismatch means steadier training.

The post Prime Intellect Releases prime-rl 0.6.0 to Train Trillion-Parameter MoE Models on Agentic RL Workloads appeared first on MarkTechPost.
