NVIDIA AI Releases Nemotron 3 Ultra: An Open 550B Mixture-of-Experts Hybrid Mamba-Transformer for Long-Running Agents
NVIDIA has released Nemotron 3 Ultra, the largest model in its Nemotron 3 family. It targets a specific problem: long-running agents that plan, call tools, and reason across many turns. As agents run longer, token counts grow and inference cost climbs. Nemotron 3 Ultra is designed to keep accuracy high while making that inference faster and cheaper.
What is Nemotron 3 Ultra
Nemotron 3 Ultra is a Mixture-of-Experts (MoE) model with 550 billion total parameters. Only 55 billion parameters are active per token. The MoE design improves accuracy per active parameter.
It uses a hybrid Mamba-Attention architecture instead of a pure Transformer. Mamba layers handle long sequences with sub-quadratic scaling. A few Attention layers are kept for precise recall over large contexts.
The model was pre-trained on 20 trillion text tokens. Context was then extended to 1 million tokens. It was post-trained using Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), and Multi-teacher On-Policy Distillation (MOPD).
NVIDIA reports up to roughly 6x higher inference throughput than comparable open LLMs, at on-par accuracy.

The Architecture
The model has 108 layers and a model dimension of 8,192. It uses 64 query heads and only 2 key-value heads, which keeps the KV cache small. Each MoE layer holds 512 experts, with the top 22 activated per token.
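To make the KV-cache claim concrete, here is a rough back-of-the-envelope sketch. It assumes a head dimension of model dimension divided by query heads (128) and a BF16 cache; neither value is stated in the article, and the number of attention layers is not given, so the figures are per attention layer only.

```python
# Per-token KV-cache footprint of ONE attention layer under grouped-query attention.
# Assumptions (not from the article): head_dim = d_model / n_query_heads, BF16 cache.

D_MODEL = 8192
N_QUERY_HEADS = 64
N_KV_HEADS = 2
BYTES_PER_VALUE = 2                    # assumed BF16 cache entries
HEAD_DIM = D_MODEL // N_QUERY_HEADS    # 128, assumed


def kv_bytes_per_token(n_kv_heads: int,
                       head_dim: int = HEAD_DIM,
                       bytes_per_value: int = BYTES_PER_VALUE) -> int:
    """Bytes of keys plus values stored per token in a single attention layer."""
    return 2 * n_kv_heads * head_dim * bytes_per_value


gqa = kv_bytes_per_token(N_KV_HEADS)       # 2 KV heads, as in the article
mha = kv_bytes_per_token(N_QUERY_HEADS)    # hypothetical: one KV head per query head
active_expert_fraction = 22 / 512          # routed experts used per token

print(gqa, mha, mha // gqa)
print(f"{active_expert_fraction:.3%}")
```

Under these assumptions, 2 KV heads store 32x less per token than a one-KV-head-per-query-head layout would, and each token touches about 4.3% of the routed experts in a layer.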
Three design decisions stand out:
- LatentMoE routes experts more efficiently. It buys more routed experts at fixed inference cost by trading away hidden-dimension width. NVIDIA reports better accuracy per parameter than standard granular MoEs.
- Multi-Token Prediction (MTP) predicts several future tokens in a single forward pass. It enables native speculative decoding for faster generation. Two MTP heads share parameters during training.
- NVFP4 pre-training uses the E2M1 4-bit datatype with two-dimensional block quantization on weights. NVIDIA calls this the largest-scale demonstration of stable, accurate NVFP4 training to date.
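The NVFP4 bullet above can be illustrated with a simplified sketch of E2M1 rounding with one scale per two-dimensional tile. The eight E2M1 magnitudes are the standard FP4 grid; the tile size, the full-precision scale, and the tie-breaking rule are simplifying assumptions, and this is not NVIDIA's training kernel.

```python
# Simplified "fake quantization" to the E2M1 grid with one scale per 2D tile.
# Assumptions: tile size is a free parameter, scales stay in full precision,
# and ties round toward the smaller magnitude (hardware typically rounds to even).

E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_MAX = 6.0


def round_to_e2m1(x: float) -> float:
    """Round x to the nearest E2M1-representable value, saturating at +/-6."""
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E2M1_MAX)
    nearest = min(E2M1_MAGNITUDES, key=lambda v: abs(v - mag))
    return sign * nearest


def quantize_block_2d(weights, block=2):
    """Quantize and dequantize a 2D list of floats, one shared scale per tile."""
    rows, cols = len(weights), len(weights[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            tile = [(r, c)
                    for r in range(r0, min(r0 + block, rows))
                    for c in range(c0, min(c0 + block, cols))]
            amax = max(abs(weights[r][c]) for r, c in tile)
            # Map the largest magnitude in the tile onto the largest E2M1 value.
            scale = amax / E2M1_MAX if amax > 0 else 1.0
            for r, c in tile:
                out[r][c] = round_to_e2m1(weights[r][c] / scale) * scale
    return out


example = quantize_block_2d([[6.0, 2.4], [-0.7, 0.2]], block=2)
print(example)
```

Sharing a scale across a square tile, rather than across a single row segment, means a weight matrix and its transpose see the same quantized values, which is one plausible motivation for a two-dimensional scheme on weights.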
The hybrid Mamba-Attention stack matters for agents. Mamba's per-step decode cost stays constant as sequence length grows, whereas an attention layer must read a KV cache that grows with every token. That is why throughput gains widen on long, decode-heavy workloads.
Pretraining and the Data Release
Pretraining used a Warmup-Stable-Decay learning rate schedule over 20 trillion tokens. It was split into two phases. The first 15 trillion tokens were biased toward diversity. The final 5 trillion were biased toward high-quality data.
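A Warmup-Stable-Decay schedule can be sketched as below. The linear warmup and linear decay shapes, the peak rate, and the warmup and decay lengths are all hypothetical, since the article gives only the schedule name, the 20 trillion token horizon, and the 15T/5T data split.

```python
# Minimal Warmup-Stable-Decay (WSD) schedule, expressed in tokens seen.
# Shapes and all numeric hyperparameters below are illustrative assumptions.

def wsd_lr(tokens_seen, total_tokens, peak_lr, warmup_tokens, decay_tokens, min_lr=0.0):
    """Linear warmup to peak_lr, a flat plateau, then linear decay to min_lr."""
    if tokens_seen < warmup_tokens:
        return peak_lr * tokens_seen / warmup_tokens
    decay_start = total_tokens - decay_tokens
    if tokens_seen < decay_start:
        return peak_lr
    progress = min(1.0, (tokens_seen - decay_start) / decay_tokens)
    return peak_lr + (min_lr - peak_lr) * progress


def data_phase(tokens_seen, phase_boundary=15e12):
    """Data-mix phase from the article: diversity first, then high-quality data."""
    return "diversity" if tokens_seen < phase_boundary else "high-quality"


T = 1e12
for t in (0.05 * T, 8 * T, 16 * T, 20 * T):
    # Hypothetical: peak 1e-4, 0.1T warmup, 5T decay.
    print(f"{t / T:5.2f}T  lr={wsd_lr(t, 20 * T, 1e-4, 0.1 * T, 5 * T):.2e}  {data_phase(t)}")
```

The flat plateau is what makes a schedule like this convenient for long runs: the decay can be started from any stable checkpoint without having committed to a total horizon up front.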
NVIDIA also released new domain-specific pretraining datasets. These include 173 billion refreshed GitHub code tokens. In a Nemotron 3 Nano ablation, a synthetic legal set raised a proxy LegalBench average from 64.6 to 74.7. In a similar ablation, a Wiki-based fact-seeking set raised proxy SimpleQA from 40.2 to 50.2.
The post-training release is also large. NVIDIA adds 10 million new SFT samples, 1 million new RL tasks, and 15 new RL environments. Cumulative Nemotron open totals reach 50M SFT samples, 2M RL tasks, and 55 RL environments.
Training was not entirely smooth. NVIDIA documents two loss divergences and treats them as a useful engineering record. The first, near 8 trillion tokens, was traced to moving output-layer gradient reduction from FP32 to BF16. The MTP gradient contribution was effectively lost in BF16's 7 mantissa bits. Reverting to FP32 gradient reduction re-stabilized training.
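The precision mechanism behind the first divergence can be reproduced in a few lines. This is only an illustration of BF16 rounding with made-up magnitudes, not NVIDIA's reduction code: it emulates BF16 by rounding a float32 bit pattern to its top 16 bits, and does not handle NaN or infinity.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add half of the discarded range, plus the kept LSB so that ties go to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


# Hypothetical magnitudes: a dominant gradient term and a much smaller contribution.
main_term = 1.0
small_term = 2.0 ** -9

acc_bf16 = main_term
acc_wide = main_term
for _ in range(100):
    acc_bf16 = to_bf16(acc_bf16 + small_term)  # rounded after every addition
    acc_wide = acc_wide + small_term           # wider accumulator keeps the term

print(acc_bf16, acc_wide)
```

With 7 explicit mantissa bits, the spacing between BF16 values near 1.0 is 2^-7, so a contribution of 2^-9 rounds away on every addition. The BF16 accumulator never moves, while the wider accumulator reflects all 100 contributions.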
The second divergence, near 16 trillion tokens, had no confirmed root cause. NVIDIA mitigated it by annealing the learning rate early. It then cut the total token horizon to 20 trillion tokens.
Post-Training: SFT, RLVR, and MOPD
The post-training pipeline runs SFT, then unified RLVR, then MOPD warmup, MOPD, and MTP Boosting. The whole loop can repeat for several cycles.
RLVR stands for Reinforcement Learning with Verifiable Reward. It trains across many environments at once: terminal use, software engineering, search, math, code, safety, and more. The reward in these settings is often sparse and environment-dependent.
MOPD is the main new post-training method. Mixed-environment RLVR dilutes the learning signal as the number of environments grows. To address this, NVIDIA trained more than ten domain-specialized teacher models. Each teacher has its own training pipeline.
During MOPD, the student model generates its own rollouts across domains. Each rollout is scored by the matching teacher with dense, token-level guidance. This is a denser signal than RLVR's sparse rewards. The process runs asynchronously, with rollout generation, teacher scoring, and student updates pipelined.
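The contrast between a sparse reward and token-level teacher guidance can be sketched as follows. The article does not give the MOPD objective, so this uses one common on-policy distillation signal, the teacher's log-probability minus the student's on each student-sampled token, purely as an assumption. The teacher interface, domain names, and numbers are hypothetical stand-ins.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: a teacher maps a token sequence to one log-probability
# per token. Real teachers would be full models; these are stand-ins.
TeacherScorer = Callable[[Sequence[str]], List[float]]


def token_level_signal(student_logprobs: Sequence[float],
                       teacher_logprobs: Sequence[float]) -> List[float]:
    """Per-token signal: positive where the teacher rates the sampled token
    as more likely than the student did, negative where it rates it less likely."""
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("student and teacher must score the same tokens")
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]


def score_rollout(domain: str,
                  tokens: Sequence[str],
                  student_logprobs: Sequence[float],
                  teachers: Dict[str, TeacherScorer]) -> List[float]:
    """Route a student rollout to the teacher for its domain and score every token."""
    teacher = teachers[domain]
    return token_level_signal(student_logprobs, teacher(tokens))


# Toy teachers that return a constant log-probability per token.
teachers: Dict[str, TeacherScorer] = {
    "math": lambda tokens: [-0.5 for _ in tokens],
    "code": lambda tokens: [-1.0 for _ in tokens],
}

rollout_tokens = ["x", "=", "2"]          # sampled by the student, not the teacher
student_lp = [-1.0, -0.25, -2.0]
dense = score_rollout("math", rollout_tokens, student_lp, teachers)
print(dense)  # one value per token, versus a single scalar reward per rollout
```

Because the rollout comes from the student, the teacher only ever grades text the student actually produces, which is what makes the distillation on-policy.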
MOPD is also iterative. After one MOPD checkpoint, new teachers are initialized from the improved student. Their gains merge back into the next round. NVIDIA ran two MOPD iterations for Nemotron 3 Ultra.
One practical caveat is worth noting. MOPD works best when student rollouts stay within the teacher's support. A brief SFT warmup aligns the two distributions first. NVIDIA found that gains are smaller on self-contained reasoning tasks the student rarely samples.
Reasoning Effort Control
Nemotron 3 Ultra supports three reasoning modes: reasoning-off, regular, and medium-effort. The regular and medium modes also accept an inference-time budget control.
Medium-effort is the efficiency lever. NVIDIA reports that it uses about 2.5x fewer tokens than regular mode. The cost is roughly a 7% drop in accuracy. For high-volume agent steps, that trade can lower spend meaningfully.
The Benchmark Case
The comparisons in NVIDIA's report use GLM-5.1 (754B), Kimi-K2.6 (1T), and Qwen-3.5 (397B), among others. The picture is competitive rather than dominant.
On agentic tasks, Nemotron 3 Ultra posts 90.0 on PinchBench and 56.0 on ProfBench (Search). NVIDIA reserved both as held-out generalization gates, scored only once on the final model. It scores 71.9 on SWE-Bench Verified and 56.4 on Terminal Bench 2.1. On Terminal Bench, Kimi-K2.6 leads at 67.2.
On reasoning, it scores 570.0 on IOI 2025. NVIDIA frames this as top-3-human-level competitive programming. On AA-Omniscience, it records the highest non-hallucination score in the set at 78.7. That suggests a lower tendency to answer when uncertain.
Long context holds up at scale. The model scores 94.7 on RULER at 1 million tokens. Several larger comparison models top out at 256K context.
On an 8K input / 64K output setting at NVFP4 on GB200, Nemotron 3 Ultra reaches 5.9x the throughput of GLM-5.1. It is 4.8x faster than Kimi-K2.6 and 1.6x faster than Qwen-3.5. Note: Nemotron's numbers use TRT-LLM, while the others use vLLM.
The trade-off is visible on prefill-heavy work. On a 50K input / 2K output setting, it trails Qwen-3.5, because prefill cost tracks active parameters. NVIDIA also reports up to 30% lower cost to task completion, from fewer tokens per turn on SWE-Bench and Terminal Bench.
NVIDIA also stresses harness robustness. The model is trained under multiple agent harnesses per task type, not one. SWE-Bench Verified scores stay between 65% and 70.4% across Pi, OpenHands, Hermes, OpenCode, and Mini SWE Agent. The goal is consistent behavior regardless of deployment framework.
Quantization and Deployment
NVIDIA ships a single NVFP4 checkpoint. On Blackwell it runs with native FP4 math. On Hopper it runs as W4A16, since Hopper lacks native FP4 tensor cores.
The final configuration operates at 5.03 bits per element. It mixes NVFP4 routed experts with FP8 layers for shared experts and Mamba linears. Attention layers stay in BF16. NVIDIA found that accuracy saturated at this budget, so higher precision added no measurable gain.
The reduced weight footprint has a deployment benefit. The W4A16 path leaves room to fit MTP weights on a single 8-GPU H100 node. An FP8 checkpoint could not, without spanning two nodes.
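A rough footprint calculation shows why the lower bit budget matters on one node. It assumes 80 GB of memory per H100 and applies the 5.03 bits-per-element average to all 550 billion parameters. Both are simplifications, and the estimate ignores MTP weights, KV cache, Mamba state, and activations.

```python
# Back-of-the-envelope weight footprint on a single 8-GPU H100 node.
# Assumptions: 80 GB per H100, and the average bit width applies to every parameter.

TOTAL_PARAMS = 550e9
H100_MEMORY_GB = 80      # assumed per-GPU capacity
GPUS_PER_NODE = 8


def weight_footprint_gb(n_params: float, bits_per_element: float) -> float:
    """Weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_element / 8 / 1e9


node_gb = H100_MEMORY_GB * GPUS_PER_NODE
mixed_gb = weight_footprint_gb(TOTAL_PARAMS, 5.03)   # mixed NVFP4/FP8/BF16 checkpoint
fp8_gb = weight_footprint_gb(TOTAL_PARAMS, 8.0)      # hypothetical all-FP8 checkpoint

print(f"node capacity: {node_gb} GB")
print(f"5.03-bit weights: {mixed_gb:.1f} GB, headroom {node_gb - mixed_gb:.1f} GB")
print(f"FP8 weights:      {fp8_gb:.1f} GB, headroom {node_gb - fp8_gb:.1f} GB")
```

Under these assumptions the mixed-precision weights take roughly 346 GB of a 640 GB node, while FP8 weights alone would take 550 GB and leave about 90 GB for everything else, which is consistent with the claim that an FP8 checkpoint plus MTP weights and runtime state would spill onto a second node.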
Key Takeaways
- Nemotron 3 Ultra is a 550B open MoE (55B active) using a hybrid Mamba-Attention design for long-running agents.
- NVIDIA reports up to ~6x higher inference throughput than comparable open LLMs at on-par accuracy (5.9x vs GLM-5.1 on 8K/64K).
- It pairs a 1M-token context with the highest non-hallucination score in its comparison set (78.7 on AA-Omniscience).
- Post-training centers on Multi-teacher On-Policy Distillation (MOPD), distilling 10+ specialized teachers into one student.
- Weights, training data, and recipes ship openly under OpenMDW-1.1, with one NVFP4 checkpoint for Blackwell, Hopper, and Ampere.
The post NVIDIA AI Releases Nemotron 3 Ultra: An Open 550B Mixture-of-Experts Hybrid Mamba-Transformer for Long-Running Agents appeared first on MarkTechPost.
