
NVIDIA Releases Nemotron-Labs-TwoTower: an Open-Weight Diffusion Language Model Built on a Frozen Autoregressive Nemotron-3-Nano-30B-A3B Backbone

NVIDIA has released Nemotron-Labs-TwoTower, a diffusion language model built on a pretrained autoregressive backbone. It ships as open weights under the NVIDIA Nemotron Open Model License. The release targets a throughput bottleneck in text generation.

Autoregressive (AR) models decode one token at a time. That serial process caps generation throughput. Discrete diffusion language models take another route. They generate tokens in parallel and refine them iteratively.

Most diffusion language models use one network for two jobs. It represents clean tokens and denoises corrupted ones at every step. TwoTower separates these jobs into two towers. It retains 98.7% of the AR baseline's aggregate benchmark quality. It also reports 2.42× higher wall-clock generation throughput.

TL;DR

  • TwoTower splits diffusion into a frozen AR context tower and a trained denoiser tower.
  • It retains 98.7% of AR quality at 2.42× throughput (γ=0.8, S=16, 2×H100).
  • The denoiser was trained on ~2.1T tokens; the backbone used 25T.
  • One checkpoint runs diffusion, mock-AR, and AR decoding modes.

Nemotron-Labs-TwoTower

TwoTower is a block-wise autoregressive diffusion model. It is instantiated on Nemotron-3-Nano-30B-A3B, an open-weight hybrid backbone. That backbone interleaves Mamba-2, self-attention, and mixture-of-experts (MoE) layers.

Each tower has 52 layers: 23 Mamba-2, 6 self-attention, and 23 MoE. The released checkpoint ships both towers, roughly 60B total parameters. Active parameters per token are about 3B per tower. Each MoE layer uses 128 routable experts, of which 6 activate per token, plus 2 shared experts.
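The layer and expert counts above can be checked with a little arithmetic. The sketch below is illustrative only: `route_token` is a generic top-k routing pattern with random scores, not NVIDIA's router, and all names in it are hypothetical.

```python
import random

# Per-tower layer mix quoted in the article.
LAYERS = {"mamba2": 23, "self_attention": 6, "moe": 23}
ROUTED_EXPERTS = 128   # routable experts per MoE layer
TOP_K = 6              # routed experts activated per token
SHARED_EXPERTS = 2     # shared experts that run for every token

def total_layers(layers):
    """Sum the per-type layer counts."""
    return sum(layers.values())

def route_token(router_scores, k=TOP_K):
    """Generic top-k routing (assumption): keep the k highest-scoring experts."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda e: router_scores[e], reverse=True)
    return sorted(ranked[:k])

rng = random.Random(0)
scores = [rng.random() for _ in range(ROUTED_EXPERTS)]  # stand-in router scores
chosen = route_token(scores)
experts_run_per_token = len(chosen) + SHARED_EXPERTS

print(total_layers(LAYERS))     # 52
print(experts_run_per_token)    # 8
```

Only 8 of the 130 experts in a layer run for a given token, which is why the active parameter count (about 3B per tower) is so much smaller than the total.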

Both towers start as copies of the same backbone checkpoint. Only the denoiser tower is trained. The AR context tower stays frozen. The denoiser was trained on ~2.1T tokens, a fraction of the backbone's 25T-token pretraining.

How the Two Towers Work

The AR context tower runs causally over the prompt and committed tokens. It produces a per-layer KV cache and final Mamba-2 states. It preserves the backbone's autoregressive capability.

The diffusion denoiser tower refines noisy blocks. Within a block, it uses bidirectional in-block attention. It stays causal with respect to preceding clean blocks.
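One way to picture this visibility rule is as a block-wise attention mask. The sketch below is a simplification: it lays everything out as one flat sequence, whereas in TwoTower the clean context reaches the denoiser through the context tower. The helper name `block_mask` is hypothetical.

```python
def block_mask(seq_len, block_size):
    """mask[i][j] is True when position i may attend to position j."""
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            # Same block: bidirectional. Earlier block: visible. Later block: hidden.
            row.append(j // block_size <= i // block_size)
        mask.append(row)
    return mask

mask = block_mask(seq_len=6, block_size=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xxx...
# xxx...
# xxx...
# xxxxxx
# xxxxxx
# xxxxxx
```

Positions in the first block see each other in both directions but nothing later; positions in the second block see the whole first block plus their own block.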

The towers connect layer by layer. Denoiser layer i cross-attends to context tower layer i. This layer-aligned cross-attention provides multi-scale access to the backbone's representations. Prior approaches broadcast only the last hidden state.
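The difference between layer-aligned cross-attention and broadcasting only the last layer can be shown with a toy example. This is a minimal sketch in plain Python, assuming standard scaled dot-product attention; the two-layer `context_kv` cache and the single query vector are made-up values, and a real denoiser would update its hidden state between layers.

```python
import math

def cross_attend(queries, keys, values):
    """Scaled dot-product attention: mix values by softmax(q.k / sqrt(d))."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]   # numerically stable softmax
        norm = sum(weights)
        weights = [w / norm for w in weights]
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out

# Hypothetical per-layer context caches: context_kv[i] = (keys, values) of context layer i.
context_kv = [
    ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]),    # context layer 0
    ([[1.0, 1.0], [1.0, -1.0]], [[2.0, 0.0], [0.0, 2.0]]),   # context layer 1
]
denoiser_hidden = [[1.0, 0.0]]  # one position of the noisy block

# Layer-aligned: denoiser layer i reads context layer i.
layer_aligned = [cross_attend(denoiser_hidden, *context_kv[i])
                 for i in range(len(context_kv))]
# Last-layer broadcast: every denoiser layer reads only the final context layer.
last_only = [cross_attend(denoiser_hidden, *context_kv[-1])
             for _ in range(len(context_kv))]

print(layer_aligned[0], last_only[0])
```

In the layer-aligned variant the first denoiser layer sees early-layer features that the broadcast variant never exposes; the two variants agree only at the last layer.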

Two more denoiser modifications matter. Mamba-2 layers seed their initial state from the context tower's Mamba state. The diffusion timestep modulates each layer through adaLN-single time conditioning. That adaLN module adds only ~1.5M parameters.
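The article does not spell out TwoTower's exact adaLN-single formulation. As a rough sketch of the general idea used in diffusion transformers, one shared network maps the timestep to a shift and a scale, each layer adds its own small learned offset, and the result modulates the normalized activations. Everything below is an assumption, including the toy `shared_time_mlp`.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def shared_time_mlp(t, dim):
    """Toy stand-in for the single timestep network shared by all layers."""
    shift = [math.sin(t * (k + 1)) for k in range(dim)]
    scale = [math.cos(t * (k + 1)) for k in range(dim)]
    return shift, scale

def adaln(x, t, layer_shift, layer_scale):
    """Shared (shift, scale) plus per-layer offsets, applied as norm(x) * (1 + scale) + shift."""
    base_shift, base_scale = shared_time_mlp(t, len(x))
    shift = [b + l for b, l in zip(base_shift, layer_shift)]
    scale = [b + l for b, l in zip(base_scale, layer_scale)]
    return [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]

x = [1.0, 2.0, 3.0, 4.0]
no_offset = [0.0] * len(x)   # a layer whose learned offsets are still zero
early = adaln(x, t=0.9, layer_shift=no_offset, layer_scale=no_offset)
late = adaln(x, t=0.1, layer_shift=no_offset, layer_scale=no_offset)
print(early)
print(late)
```

Because the timestep network is shared and each layer only stores small offsets, the added parameter count stays tiny relative to the tower, which is consistent with the ~1.5M figure quoted above.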

Generation runs block by block. Each block begins as S [MASK] tokens. The denoiser refines it over up to T steps, then commits it. The context tower then processes the committed tokens to update its caches.

This explains why multiple denoising steps can still beat one-token decoding. Autoregressive decoding commits exactly one token per step. TwoTower commits multiple tokens per step early in refinement.
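The block-by-block loop and the multi-token commits described above can be sketched as a toy simulation. This is not the released implementation: `toy_denoiser` returns random tokens and random confidences, and the rule that at least one slot is unmasked per step is an assumption made so the loop always terminates.

```python
import random

MASK = None  # stands in for the [MASK] token

def toy_denoiser(block, rng):
    """Hypothetical stand-in: propose (token, confidence) for every masked slot."""
    return {i: (rng.randrange(1000), rng.random())
            for i, tok in enumerate(block) if tok is MASK}

def denoise_block(block_size, max_steps, gamma, rng):
    """Refine one block of [MASK] tokens with confidence-threshold unmasking."""
    block = [MASK] * block_size
    steps = 0
    while any(tok is MASK for tok in block) and steps < max_steps:
        steps += 1
        proposals = toy_denoiser(block, rng)
        accepted = [i for i, (_, conf) in proposals.items() if conf >= gamma]
        if not accepted:
            # Assumption: always unmask the most confident slot so each step progresses.
            accepted = [max(proposals, key=lambda i: proposals[i][1])]
        if steps == max_steps:
            accepted = list(proposals)  # final step: fill whatever is still masked
        for i in accepted:
            block[i] = proposals[i][0]
    return block, steps

def generate(num_blocks, block_size=16, max_steps=16, gamma=0.8, seed=0):
    rng = random.Random(seed)
    committed, total_steps = [], 0
    for _ in range(num_blocks):
        block, steps = denoise_block(block_size, max_steps, gamma, rng)
        committed.extend(block)  # the context tower would ingest this block here
        total_steps += steps
    return committed, total_steps

for gamma in (0.5, 0.8, 0.95):
    tokens, steps = generate(num_blocks=50, gamma=gamma)
    print(f"gamma={gamma}: {len(tokens) / steps:.2f} tokens per denoiser step")
```

Whenever a block finishes in fewer than S steps, the average exceeds one token per step, which is the quantity AR decoding is fixed at. Tokens per step is not the same as wall-clock throughput, since each committed block also costs a context tower pass.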

Benchmarks

Evaluations use BF16 on 2×H100 GPUs. The default operating point is confidence unmasking with threshold γ=0.8 and block size S=16. The table compares the AR baseline against TwoTower diffusion decoding.

| Task | Nemotron-3-Nano-30B-A3B (AR) | Nemotron-Labs-TwoTower (diffusion) |
|---|---|---|
| MMLU (5-shot, acc) | 78.56 | 78.24 |
| MMLU-Pro (5-shot, CoT EM) | 62.59 | 60.93 |
| ARC-Challenge (25-shot, acc_norm) | 91.72 | 92.66 |
| WinoGrande (5-shot, acc) | 76.09 | 76.09 |
| RACE (0-shot, acc) | 88.90 | 88.90 |
| HumanEval (0-shot) | 79.27 | 75.58 |
| MBPP-Sanitized (3-shot) | 74.71 | 74.28 |
| GSM8K (8-shot, acc) | 92.49 | 90.14 |
| MATH-500 (4-shot) | 84.40 | 80.60 |
| MMLU Global Lite (5-shot) | 73.97 | 73.94 |
| MGSM (8-shot, avg acc) | 80.80 | 80.40 |
| Quality retained | 100% | 98.7% |
| Generation throughput (× AR) | 1.0× | 2.42× |
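The "Quality retained" row can be sanity-checked from the per-task scores. The article does not say how the aggregate was computed, so the sketch below tries two plausible aggregations, both of which are assumptions: the mean of per-task ratios and the ratio of mean scores.

```python
# (task, AR score, TwoTower diffusion score), copied from the table above.
SCORES = [
    ("MMLU", 78.56, 78.24),
    ("MMLU-Pro", 62.59, 60.93),
    ("ARC-Challenge", 91.72, 92.66),
    ("WinoGrande", 76.09, 76.09),
    ("RACE", 88.90, 88.90),
    ("HumanEval", 79.27, 75.58),
    ("MBPP-Sanitized", 74.71, 74.28),
    ("GSM8K", 92.49, 90.14),
    ("MATH-500", 84.40, 80.60),
    ("MMLU Global Lite", 73.97, 73.94),
    ("MGSM", 80.80, 80.40),
]

# Aggregation 1: average the per-task retention ratios.
ratios = [diffusion / ar for _, ar, diffusion in SCORES]
mean_of_ratios = 100 * sum(ratios) / len(ratios)

# Aggregation 2: compare the two column averages.
ratio_of_means = (100 * sum(diffusion for _, _, diffusion in SCORES)
                  / sum(ar for _, ar, _ in SCORES))

print(f"{mean_of_ratios:.1f}% {ratio_of_means:.1f}%")  # both round to 98.7%

# Largest per-task drop, in points.
worst = max(SCORES, key=lambda row: row[1] - row[2])
print(worst[0], round(worst[1] - worst[2], 2))
```

Both aggregations land on the reported 98.7%, and the largest single drop comes from the code and math tasks.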

General knowledge stays within about one point of the AR baseline. Code and math show modest degradation. Commonsense and multilingual scores are recovered or slightly improved. Lowering γ commits more tokens per step and raises throughput, at the cost of reduced quality.

Running It: Three Generation Modes

The checkpoint exposes three inference paths. Full two-tower diffusion uses 2 GPUs, about 59GB per GPU in BF16. AR-only mode runs on a single 80GB GPU.
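These memory figures are roughly what a weights-only estimate predicts. The sketch below assumes exactly 30B parameters per tower and 2 bytes per BF16 parameter; it ignores KV cache, Mamba states, and activations, so treat it as a lower bound and not a measurement.

```python
BYTES_PER_BF16 = 2  # bfloat16 stores each parameter in 16 bits

def weight_gb(params_billion, bytes_per_param=BYTES_PER_BF16):
    """Weights-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bytes_per_param / 1e9

per_tower = weight_gb(30)     # one tower, assuming a round 30B parameters
both_towers = weight_gb(60)   # roughly 60B across the two towers

print(per_tower, both_towers)  # 60.0 120.0
```

About 60GB per tower is in the same range as the reported ~59GB per GPU and fits one tower per 80GB card, while both towers together (about 120GB) do not fit on a single 80GB GPU. AR-only mode needs just the context tower, which matches the single-GPU claim.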

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, trust_remote_code=True,
)
# context tower -> GPU 0, denoiser tower -> GPU 1
model.place_towers_on_devices("cuda:0", "cuda:1")
model.eval()

prompt = "France is a nation "
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

# Block-wise mask diffusion at the default operating point (S=16, γ=0.8)
outputs = model.generate_mask_diffusion(
    inputs["input_ids"], max_new_tokens=128,
    block_size=16, steps_per_block=16, mask_token_id=3,
    temperature=0.1, confidence_threshold=0.8,
    eos_token_id=tokenizer.eos_token_id,
)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

The three modes are generate_mask_diffusion(), generate_mock_ar(), and generate_ar(). Mask diffusion commits up to block_size tokens per step. Mock-AR and AR commit one token per step.

Where It Fits: Use Cases

The most direct use case is faster batch generation. A data team producing synthetic text can trade a small quality drop for throughput. At γ=0.8, that trade is 1.3% quality for 2.42× speed.
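To make that trade concrete, here is a back-of-the-envelope calculation. The 100-hour AR job is a hypothetical number, and the sketch assumes the reported 2.42× speedup carries over unchanged to the workload and hardware in question, which the article does not guarantee.

```python
SPEEDUP = 2.42           # reported wall-clock throughput relative to AR
QUALITY_RETAINED = 98.7  # reported aggregate quality, percent of AR

def diffusion_hours(ar_hours, speedup=SPEEDUP):
    """Time for the same token count at a higher, constant throughput."""
    return ar_hours / speedup

ar_job_hours = 100.0  # hypothetical AR batch job
tt_job_hours = diffusion_hours(ar_job_hours)
quality_drop = round(100 - QUALITY_RETAINED, 1)

print(f"{tt_job_hours:.1f} hours instead of {ar_job_hours:.0f}")
print(f"{quality_drop} points of aggregate quality given up")
```

Under these assumptions the job takes a bit over 41 hours, a reduction of close to 59% in generation time.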

A second use case is tuning the quality–throughput trade-off. Raising γ preserves more quality, according to NVIDIA's paper. Lowering γ commits more tokens per step for speed.

A third use case is drop-in adaptation. The context tower retains its LM head for speculative decoding, verification, or AR scoring. Teams can run AR and diffusion from one checkpoint.

Strengths and Weaknesses

Strengths:

  • Open weights under the NVIDIA Nemotron Open Model License; ready for commercial use
  • 98.7% of AR quality retained at 2.42× throughput at the default operating point
  • One checkpoint supports diffusion, mock-AR, and AR decoding
  • Denoiser trained on ~2.1T tokens, not a full re-pretrain
  • Sequence-length cache memory scales like the AR baseline

Weaknesses:

  • Full two-tower diffusion needs 2 GPUs and ~59GB per GPU in BF16
  • Code and math degrade more than general knowledge (HumanEval 79.27 → 75.58)
  • Keeping both towers resident raises the fixed model-weight memory footprint
  • Released checkpoint is a base model, before instruction tuning or alignment
  • Throughput past 3× comes with larger quality loss
