NVIDIA AI Releases Nemotron 3 Ultra: An Open 550B Mixture-of-Experts Hybrid Mamba-Transformer for Long-Running Agents

NVIDIA has launched Nemotron 3 Ultra, the largest model in its Nemotron 3 family. It targets a specific problem: long-running agents that plan, call tools, and reason across many turns. As agents run longer, token counts grow and inference cost climbs. Nemotron 3 Ultra is designed to keep accuracy high while making that inference faster and cheaper.

What is Nemotron 3 Ultra

Nemotron 3 Ultra is a Mixture-of-Experts (MoE) model with 550 billion total parameters. Only 55 billion parameters are active per token. The MoE design improves accuracy per active parameter.
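Sparse activation of this kind is usually implemented with top-k routing: a router scores every expert, and only the k best-scoring experts run for a given token. The sketch below is a minimal illustration with toy sizes and scalar "experts". The function names and sizes are assumptions for illustration, not NVIDIA's implementation.

```python
import math
import random


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, router_logits, k):
    """Weighted sum of the k selected experts; all other experts are skipped."""
    return sum(weight * experts[idx](x)
               for idx, weight in top_k_route(router_logits, k))


random.seed(0)
NUM_EXPERTS, TOP_K = 16, 2  # toy sizes, far smaller than the real model
# Toy scalar experts: expert i multiplies its input by (i + 1).
experts = [lambda x, s=i + 1: s * x for i in range(NUM_EXPERTS)]
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

routes = top_k_route(logits, TOP_K)
print(routes)
print(moe_forward(1.0, experts, logits, TOP_K))
```

Only `TOP_K` of the `NUM_EXPERTS` expert functions are ever called per input, which is why compute per token tracks active rather than total parameters.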

It uses a hybrid Mamba-Attention architecture instead of a pure Transformer. Mamba layers handle long sequences with sub-quadratic scaling. A few Attention layers are kept for precise recall over large contexts.

The model was pre-trained on 20 trillion text tokens. Its context was then extended to 1 million tokens. It was post-trained using Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), and Multi-teacher On-Policy Distillation (MOPD).

The NVIDIA team reports up to roughly 6x higher inference throughput than comparable open LLMs, at on-par accuracy.

https://analysis.nvidia.com/labs/nemotron/recordsdata/NVIDIA-Nemotron-3-Ultra-Technical-Report.pdf

The Architecture

The model has 108 layers and a model dimension of 8,192. It uses 64 query heads and only 2 key-value heads, which keeps the KV cache small. Each MoE layer holds 512 experts, with the top 22 activated per token.
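To see why 2 key-value heads matter, the following back-of-the-envelope sketch compares KV-cache size with 2 KV heads against a hypothetical design that keeps one KV head per query head. The head dimension (8,192 / 64 = 128), the number of attention layers, and the 2-byte cache entries are assumptions for illustration; the source does not state them. The 32x ratio does not depend on those assumptions.

```python
def kv_cache_bytes(num_attn_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one tensor for keys, one for values.
    return 2 * num_attn_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


HEAD_DIM = 8192 // 64   # assumption: model dimension divided by query heads
ATTN_LAYERS = 10        # hypothetical count of attention layers in the hybrid stack
SEQ_LEN = 1_000_000     # the 1M-token context

grouped = kv_cache_bytes(ATTN_LAYERS, 2, HEAD_DIM, SEQ_LEN)
full = kv_cache_bytes(ATTN_LAYERS, 64, HEAD_DIM, SEQ_LEN)

print(f"2 KV heads:  {grouped / 1e9:.2f} GB")
print(f"64 KV heads: {full / 1e9:.2f} GB")
print(f"reduction:   {full // grouped}x")  # 32x, independent of the assumptions above
```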

Three design decisions stand out:

LatentMoE routes experts more efficiently. It buys more routed experts at fixed inference cost by trading away hidden-dimension width. The NVIDIA team reports better accuracy per parameter than standard granular MoEs.
Multi-Token Prediction (MTP) predicts multiple future tokens in a single forward pass. It enables native speculative decoding for faster generation. Two MTP heads share parameters during training.
NVFP4 pre-training uses the E2M1 4-bit datatype with two-dimensional block quantization on weights. The NVIDIA team calls this the largest-scale demonstration of stable, accurate NVFP4 training to date.
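The E2M1 datatype mentioned in the NVFP4 item has 1 sign bit, 2 exponent bits, and 1 mantissa bit, which gives eight magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, and 6. Block quantization stores one scale per block so that the block's largest value maps onto that small grid. The sketch below is a simplified one-dimensional illustration with a full-precision scale; real NVFP4 also stores the scale in low precision, and the two-dimensional weight blocks described above are not modeled here.

```python
import math

# Every non-negative value an E2M1 element can represent.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(values):
    """Map a block of floats to E2M1 codes plus one shared scale."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return [0.0] * len(values), 1.0
    scale = amax / 6.0  # the largest magnitude in the block maps to 6.0
    codes = []
    for v in values:
        target = abs(v) / scale
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(math.copysign(nearest, v))
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate floats from codes and the block scale."""
    return [c * scale for c in codes]


block = [0.1, -0.7, 3.0, -6.0]
codes, scale = quantize_block(block)
print(codes, scale)
print(dequantize_block(codes, scale))
```

Small values next to a large outlier lose the most precision, which is why the scale is kept per block rather than per tensor.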

The hybrid Mamba-Attention stack is especially important for agents. Mamba’s per-step decode cost stays constant as sequence length grows, whereas an attention layer must read its entire KV cache for every new token. That is why throughput gains widen on long, decode-heavy workloads.
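A rough cost model makes the difference concrete. It counts how many cached values one layer must read to produce one new token. The state size, channel count, and head dimension are hypothetical placeholders, so only the scaling behavior is meaningful, not the absolute numbers.

```python
def attention_decode_reads(context_len, num_kv_heads=2, head_dim=128):
    """Values read per new token: every cached key and value is touched."""
    return 2 * context_len * num_kv_heads * head_dim


def ssm_decode_reads(state_size=128, channels=8192):
    """Values read per new token: a fixed-size recurrent state, whatever the context."""
    return state_size * channels


for context_len in (1_000, 100_000, 1_000_000):
    print(context_len, attention_decode_reads(context_len), ssm_decode_reads())
```

Attention reads grow linearly with context length, while the state-space read stays flat, so the more layers use the recurrent form, the less per-token cost depends on how long the agent has been running.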

Pretraining and the Data Release

Pretraining used a Warmup-Stable-Decay learning rate schedule over 20 trillion tokens. It was split into two phases. The first 15 trillion tokens were biased toward diversity. The final 5 trillion were biased toward high-quality data.
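A Warmup-Stable-Decay schedule ramps the learning rate up, holds it at a plateau for most of training, then decays it at the end. The sketch below assumes linear warmup and linear decay; the shape of each phase and all hyperparameter values here are illustrative assumptions, since the source gives only the schedule's name.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps, min_lr=0.0):
    """Learning rate at `step` under a Warmup-Stable-Decay schedule."""
    if step < warmup_steps:
        # Warmup: linear ramp up to the peak.
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        # Stable: hold the peak.
        return peak_lr
    # Decay: linear ramp down to min_lr.
    progress = min((step - decay_start) / decay_steps, 1.0)
    return peak_lr + (min_lr - peak_lr) * progress


# Hypothetical settings: 1,000 steps, 100 warmup, 200 decay.
for step in (0, 99, 500, 900, 1000):
    print(step, wsd_lr(step, 1000, 1e-3, 100, 200))
```

One practical property of this shape is that the decay phase can be started early from any stable-phase checkpoint, without replanning the whole schedule.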

The NVIDIA team also released new domain-specific pretraining datasets. These include 173 billion refreshed GitHub code tokens. In a Nemotron 3 Nano ablation, a synthetic legal set raised a proxy LegalBench average from 64.6 to 74.7. In a similar ablation, a Wiki-based fact-seeking set raised proxy SimpleQA from 40.2 to 50.2.

The post-training release is also large. NVIDIA adds 10 million new SFT samples, 1 million new RL tasks, and 15 new RL environments. Cumulative Nemotron open totals reach 50M SFT samples, 2M RL tasks, and 55 RL environments.

Training was not entirely smooth. NVIDIA documents two loss divergences and treats them as a useful engineering record. The first, near 8 trillion tokens, was traced to moving output-layer gradient reduction from FP32 to BF16. The MTP gradient contribution was effectively lost in BF16’s 7 mantissa bits. Reverting to FP32 gradient reduction re-stabilized training.
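The mechanism is easy to reproduce in isolation. BF16 keeps 7 explicit mantissa bits, so a small addend can vanish entirely when summed with a much larger value. The snippet below emulates FP32 and BF16 rounding with the standard library; it is a toy illustration of the rounding effect, not a reconstruction of the actual training reduction, and the two gradient magnitudes are made up.

```python
import struct


def to_float32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bfloat16(x):
    """Round a Python float to bfloat16 (round-to-nearest-even), returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFFFFFF
    bits &= 0xFFFF0000  # keep sign, 8 exponent bits, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


main_grad = 1.0        # stand-in for the dominant gradient term
small_grad = 2.0 ** -9  # stand-in for a much smaller contribution

print(to_float32(main_grad + small_grad))   # 1.001953125, the small term survives
print(to_bfloat16(main_grad + small_grad))  # 1.0, the small term is rounded away
```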

The second divergence, near 16 trillion tokens, had no confirmed root cause. NVIDIA mitigated it by annealing the learning rate early. It then cut the total token horizon to 20 trillion tokens.

Post-Training: SFT, RLVR, and MOPD

The post-training pipeline runs SFT, then unified RLVR, then MOPD warmup, MOPD, and MTP Boosting. The whole loop can repeat for several cycles.

RLVR stands for Reinforcement Learning with Verifiable Reward. It trains across many environments at once: terminal use, software engineering, search, math, code, safety, and more. The reward in these settings is often sparse and environment-dependent.

MOPD is the main new post-training method. Mixed-environment RLVR dilutes the learning signal as the number of environments grows. To address this, the NVIDIA team trained more than ten domain-specialized teacher models. Each teacher has its own training pipeline.

During MOPD, the student model generates its own rollouts across domains. Each rollout is scored by the matching teacher with dense, token-level guidance. This is a denser signal than RLVR’s sparse rewards. The process runs asynchronously, with rollout generation, teacher scoring, and student updates pipelined.
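The contrast between a dense and a sparse signal can be sketched as follows. The source does not state the exact MOPD objective; a per-token reverse KL divergence between student and teacher is one common choice for on-policy distillation and is used here purely as an assumption. The logits are toy values over a four-token vocabulary.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities for one token position."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one token position."""
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def token_level_signal(student_rollout, teacher_scores):
    """One loss value per generated token, rather than one reward per rollout."""
    return [reverse_kl(s, t) for s, t in zip(student_rollout, teacher_scores)]


# Toy rollout of three tokens sampled by the student, scored by one teacher.
student = [[2.0, 0.5, 0.1, 0.0], [0.2, 1.5, 0.3, 0.0], [0.0, 0.0, 3.0, 0.0]]
teacher = [[2.0, 0.5, 0.1, 0.0], [1.8, 0.1, 0.3, 0.0], [0.0, 0.0, 2.5, 0.5]]

print(token_level_signal(student, teacher))
```

The first token yields zero loss because student and teacher agree, while the second shows a large disagreement. A single end-of-rollout reward could not localize the problem this way.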

MOPD is also iterative. After one MOPD checkpoint, new teachers are initialized from the improved student. Their gains merge back into the next round. The NVIDIA team ran two MOPD iterations for Nemotron 3 Ultra.

One practical caveat is worth noting. MOPD works best when student rollouts stay within the teacher’s support. A brief SFT warmup aligns the two distributions first. The NVIDIA team found gains are smaller on self-contained reasoning tasks the student rarely samples.

Reasoning Effort Control

Nemotron 3 Ultra supports three reasoning modes: reasoning-off, regular, and medium-effort. The regular and medium modes also accept an inference-time budget control.

Medium-effort is the efficiency lever. The NVIDIA team reports it uses about 2.5x fewer tokens than regular mode. The cost is roughly a 7% drop in accuracy. For high-volume agent steps, that trade can lower spend meaningfully.
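One way to weigh that trade is tokens spent per correct answer. The sketch below uses the reported 2.5x token reduction and 7% accuracy drop, but the baseline token count and baseline accuracy are hypothetical, and the 7% figure is treated as a relative drop, which is an assumption since the source does not say whether it is relative or absolute.

```python
def tokens_per_correct_answer(tokens_per_attempt, accuracy):
    """Expected tokens spent for each correct result."""
    return tokens_per_attempt / accuracy


REGULAR_TOKENS = 10_000  # hypothetical tokens per attempt in regular mode
REGULAR_ACCURACY = 0.70  # hypothetical baseline accuracy

medium_tokens = REGULAR_TOKENS / 2.5               # reported: about 2.5x fewer tokens
medium_accuracy = REGULAR_ACCURACY * (1 - 0.07)    # assumed: 7% relative drop

regular_cost = tokens_per_correct_answer(REGULAR_TOKENS, REGULAR_ACCURACY)
medium_cost = tokens_per_correct_answer(medium_tokens, medium_accuracy)
ratio = medium_cost / regular_cost

print(f"regular: {regular_cost:.0f} tokens per correct answer")
print(f"medium:  {medium_cost:.0f} tokens per correct answer")
print(f"ratio:   {ratio:.2f}")  # about 0.43 under these assumptions
```

Under a relative-drop reading the ratio does not depend on the baseline values, since it reduces to (1 / 2.5) / 0.93.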

The Benchmark Case

The comparisons in NVIDIA’s technical report use GLM-5.1 (754B), Kimi-K2.6 (1T), and Qwen-3.5 (397B), among others. The picture is competitive rather than dominant.

On agentic tasks, Nemotron 3 Ultra posts 90.0 on PinchBench and 56.0 on ProfBench (Search). The NVIDIA team reserved both as held-out generalization gates, scored only once on the final model. It scores 71.9 on SWE-Bench Verified and 56.4 on Terminal Bench 2.1. On Terminal Bench, Kimi-K2.6 leads at 67.2.

On reasoning, it scores 570.0 on IOI 2025. The NVIDIA team frames this as top-3-human-level competitive programming. On AA-Omniscience, it records the highest non-hallucination score in the set at 78.7. That suggests a lower tendency to answer when uncertain.

Long context holds up at scale. The model scores 94.7 on RULER at 1 million tokens. Several larger comparison models top out at 256K context.

On an 8K input / 64K output setting at NVFP4 on GB200, Nemotron 3 Ultra reaches 5.9x the throughput of GLM-5.1. It is 4.8x faster than Kimi-K2.6 and 1.6x faster than Qwen-3.5. Note: Nemotron’s numbers use TRT-LLM, while the others use vLLM, so the comparison is not framework-matched.

The trade-off is visible on prefill-heavy work. On a 50K input / 2K output setting, it trails Qwen-3.5, because prefill cost tracks active parameters. The NVIDIA team also reports up to 30% lower cost to task completion, from fewer tokens per turn on SWE-Bench and Terminal Bench.

The NVIDIA team also stresses harness robustness. The model is trained under multiple agent harnesses per task type, not one. SWE-Bench Verified scores stay between 65% and 70.4% across Pi, OpenHands, Hermes, OpenCode, and Mini SWE Agent. The goal is consistent behavior regardless of deployment framework.

Quantization and Deployment

The NVIDIA team ships a single NVFP4 checkpoint. On Blackwell it runs with native FP4 math. On Hopper it runs as W4A16 (4-bit weights, 16-bit activations), since Hopper lacks native FP4 tensor cores.

The final solution operates at 5.03 bits-per-element. It mixes NVFP4 routed experts with FP8 layers for shared experts and Mamba linears. Attention layers stay in BF16. The NVIDIA team found accuracy saturated at this budget, so higher precision added no measurable gain.

The reduced weight footprint has a deployment benefit. The W4A16 path leaves room to fit MTP weights on a single 8-GPU H100 node. An FP8 checkpoint could not, without spanning two nodes.
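A quick estimate shows where the headroom comes from. The calculation below assumes the 5.03 bits-per-element average applies across all 550B parameters, assumes the 80 GB H100 variant, and ignores MTP weights, KV cache, and activations, so it is a rough bound rather than a deployment plan.

```python
TOTAL_PARAMS = 550e9
NODE_HBM_GB = 8 * 80  # assumption: eight 80 GB H100 GPUs


def weight_gb(bits_per_element):
    """Approximate weight storage in gigabytes at a given average precision."""
    return TOTAL_PARAMS * bits_per_element / 8 / 1e9


mixed = weight_gb(5.03)  # the mixed NVFP4 / FP8 / BF16 checkpoint
fp8 = weight_gb(8)       # a uniform FP8 checkpoint, for comparison

print(f"mixed-precision weights: {mixed:.1f} GB, headroom {NODE_HBM_GB - mixed:.1f} GB")
print(f"FP8 weights:             {fp8:.1f} GB, headroom {NODE_HBM_GB - fp8:.1f} GB")
```

Under these assumptions the mixed checkpoint leaves roughly 294 GB of the node free for MTP weights, KV cache, and activations, against roughly 90 GB for FP8.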

Key Takeaways

Nemotron 3 Ultra is a 550B open MoE (55B active) using a hybrid Mamba-Attention design for long-running agents.
NVIDIA reports up to ~6x higher inference throughput than comparable open LLMs at on-par accuracy (5.9x vs GLM-5.1 on 8K/64K).
It pairs a 1M-token context with the highest non-hallucination score in its comparison set (78.7 on AA-Omniscience).
Post-training centers on Multi-teacher On-Policy Distillation (MOPD), distilling 10+ specialized teachers into one student.
Weights, training data, and recipes ship openly under OpenMDW-1.1, with one NVFP4 checkpoint for Blackwell, Hopper, and Ampere.

Marktechpost’s Visual Explainer

Nemotron 3 Ultra: a 550B open MoE built for long-running agents

An open Mixture-of-Experts hybrid Mamba-Transformer for agentic reasoning, tool use, and long-context tasks.

Total / Active: 550B / 55B (sparse MoE, 55B active per token)
Context: 1M tokens (extended after 20T-token pretraining)
Throughput: up to ~6x vs comparable open LLMs
License: OpenMDW-1.1 (open weights, data, and recipes)

Pre-trained on 20T tokens, then post-trained with SFT, RLVR, and Multi-teacher On-Policy Distillation (MOPD).

What It Is

A hybrid Mamba-Attention MoE, not a pure Transformer

Hybrid stack: Mamba layers scale sub-quadratically; a few Attention layers preserve precise recall.
Sparse MoE: 550B total parameters, 55B active per token, improving accuracy per active parameter.
Long context: pretrained on 20T text tokens, then extended to a 1M-token window.
Open release: base, post-trained, and NVFP4 checkpoints, plus training data and recipes.

Throughput gains come mainly from the hybrid Mamba-Attention design, which bounds KV-cache footprint.

Architecture

108 layers, 512 experts per layer, top-22 routing

Layers: 108 (model dimension 8,192)
Attention: 64 query heads / 2 KV heads
Experts: 512 (top-22 activated per token)
Precision: NVFP4 (E2M1, 2D block quantization)

Key techniques

LatentMoE: more routed experts at fixed inference cost by trading hidden-dimension width.
Multi-Token Prediction (MTP): predicts multiple tokens per pass; two heads share parameters.
NVFP4 pre-training: NVIDIA’s largest-scale stable, accurate FP4 training run to date.

Pretraining & Data

20T tokens in two phases, plus new open datasets

Two-phase curriculum: 15T tokens biased toward diversity, then 5T biased toward quality.
Code refresh: 173B new GitHub tokens with a September 30, 2025 cutoff.
Domain data (Nano ablations): legal data lifted proxy LegalBench from 64.6 to 74.7; Wiki data lifted proxy SimpleQA from 40.2 to 50.2.
Post-training data: +10M SFT samples and +1M RL tasks; totals reach 50M SFT samples, 2M RL tasks, and 55 environments.

NVIDIA documents two loss divergences (near 8T and 16T tokens) and the fixes used to stabilize training.

Post-Training

MOPD: distilling 10+ specialized teachers into one student

SFT → RLVR → MOPD Warmup → MOPD → MTP Boosting

Why MOPD: mixed-environment RLVR dilutes the signal as the number of environments grows.
How it works: the student generates rollouts; each teacher scores them with dense token-level guidance.
Asynchronous: rollout generation, teacher scoring, and student updates run pipelined.
Iterative: NVIDIA ran two MOPD iterations, re-initializing teachers from the improved student.

A brief SFT warmup keeps student rollouts within each teacher’s support before distillation.

Benchmarks

Competitive across agentic, reasoning, and long-context tasks

PinchBench (held-out): 90.0 (top tier of evaluated open models)
SWE-Bench Verified: 71.9 (software engineering agents)
IOI 2025: 570.0 (top-3-human-level, per NVIDIA’s framing)
RULER @ 1M: 94.7 (long-context retrieval)

ProfBench (Search): 56.0, the second held-out generalization gate.
AA-Omniscience: highest non-hallucination score in the set at 78.7.
Terminal Bench 2.1: 56.4, where Kimi-K2.6 leads at 67.2.

Throughput & Efficiency

Faster on decode-heavy work; budget control for cost

vs GLM-5.1: 5.9x (8K in / 64K out, NVFP4 on GB200)
vs Kimi-K2.6: 4.8x (same decode-heavy setting)
vs Qwen-3.5: 1.6x (trails Qwen on prefill-heavy work)
Cost to complete: up to ~30% lower, from fewer tokens per turn

Reasoning modes: reasoning-off, regular, and medium-effort, with inference-time budget control.
Medium-effort: about 2.5x fewer tokens for roughly a 7% accuracy drop.

Throughput is reported with TRT-LLM for Nemotron and vLLM for the other models.

Quantization, Licensing & Takeaways

One NVFP4 checkpoint, across NVIDIA GPU generations

Single checkpoint: native FP4 on Blackwell, W4A16 on Hopper, also runs on Ampere.
Operating point: 5.03 bits-per-element, mixing NVFP4 experts with FP8 and BF16 layers.
Footprint win: the W4A16 path fits MTP weights on a single 8-GPU H100 node.
Fully open: weights, data, and recipes under OpenMDW-1.1; fine-tune via LoRA, SFT, or RL.

Not the top scorer on every benchmark. The design favors throughput, long context, and reliability for agents.

Curated by Marktechpost: AI/ML research & dev news for engineers and data scientists

Sources: NVIDIA Nemotron 3 Ultra technical report & blog · Verified Jun 4, 2026