How to Speed Up Transformer Training Using NVIDIA Apex (FusedAdam, FusedLayerNorm) and Native torch.amp
In this tutorial, we walk through an implementation of NVIDIA Apex, focusing on the components that still matter in modern GPU training workflows. Instead of treating Apex as a general mixed-precision library, we separate the older components from the still-useful ones and test them directly. We start by checking the CUDA runtime, building Apex with the required CUDA and C++ extensions, and detecting which fused kernels are actually available in the environment. This matters because a Python-only Apex install can appear successful while silently lacking the high-performance kernels that make Apex useful. After the setup, we benchmark FusedAdam against PyTorch AdamW, compare FusedLayerNorm and FusedRMSNorm with standard normalization layers, and run both legacy apex.amp and modern torch.amp examples. We then bring everything together in a small Transformer training experiment, where we compare a vanilla FP32 PyTorch path with a fused Apex-plus-AMP path to assess the real impact on throughput.
import os, sys, time, subprocess, importlib
import torch

assert torch.cuda.is_available(), (
    "No CUDA GPU found. In Colab: Runtime > Change runtime type > Hardware accelerator = GPU"
)
DEV = torch.device("cuda")
print(f"[env] torch {torch.__version__} | CUDA {torch.version.cuda} | GPU {torch.cuda.get_device_name(0)}")

def _module_present(name: str) -> bool:
    # True only if the module actually imports in this environment.
    try:
        importlib.import_module(name)
        return True
    except Exception:
        return False

def _build_apex():
    print("[apex] building from source with CUDA + C++ extensions "
          "(~10-20 min on first run; grab a coffee)...")
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", "ninja", "packaging"], check=True)
    if not os.path.isdir("apex"):
        subprocess.run(["git", "clone", "--depth", "1",
                        "https://github.com/NVIDIA/apex"], check=True)
    # Request the C++ and CUDA extensions through environment variables.
    env = os.environ.copy()
    env["APEX_CPP_EXT"] = "1"
    env["APEX_CUDA_EXT"] = "1"
    env["MAX_JOBS"] = "4"
    env["NVCC_APPEND_FLAGS"] = "--threads 4"
    cmd = [sys.executable, "-m", "pip", "install", "-v",
           "--no-build-isolation", "--no-cache-dir", "./apex"]
    proc = subprocess.run(cmd, env=env)
    if proc.returncode != 0:
        print("[apex] CUDA build failed -> falling back to PYTHON-ONLY install "
              "(fused kernels will be unavailable, tutorial still runs).")
        # Same command without the extension variables set above.
        subprocess.run([sys.executable, "-m", "pip", "install", "-v",
                        "--no-build-isolation", "--no-cache-dir", "./apex"], check=False)

if not _module_present("amp_C"):
    _build_apex()
    # Let the import system see packages installed during this session.
    importlib.invalidate_caches()

HAS_AMP_C = _module_present("amp_C")
HAS_FLN = _module_present("fused_layer_norm_cuda")
try:
    import apex
    from apex.optimizers import FusedAdam
    from apex.normalization import FusedLayerNorm
    try:
        from apex.normalization import FusedRMSNorm
        HAS_RMS = True
    except Exception:
        HAS_RMS = False
    from apex import amp
    APEX_OK = True
except Exception as e:
    print(f"[apex] import failed: {e}")
    APEX_OK = False
    HAS_RMS = False

print("\n[capabilities]")
print(f" apex importable : {APEX_OK}")
print(f" FusedAdam kernels : {HAS_AMP_C}")
print(f" FusedLayerNorm krnl: {HAS_FLN}")
print(f" FusedRMSNorm : {APEX_OK and HAS_RMS}")
print("=" * 78)
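
def bench(fn, iters=50, warmup=10):
    # Mean wall-clock milliseconds per call. CUDA work is asynchronous,
    # so we synchronize before starting and before stopping the timer.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1e3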
We begin by preparing the CUDA environment, checking GPU availability, and printing the active PyTorch, CUDA, and GPU details. We then build NVIDIA Apex from source with CUDA and C++ extensions so that the fused kernels can be used directly rather than relying on a limited Python-only install. We also detect whether FusedAdam, FusedLayerNorm, FusedRMSNorm, and legacy AMP are available, and define a reusable benchmarking helper for subsequent tests.
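The availability check can also be done without importing the extension modules at all. The sketch below is one possible variant, not the helper used in this tutorial: it asks the standard-library `importlib.util.find_spec` whether a module can be located. The function names `module_installed` and `detect_apex_capabilities` are hypothetical; the extension names `amp_C` and `fused_layer_norm_cuda` are the ones checked in the setup code.

```python
import importlib.util


def module_installed(name: str) -> bool:
    """Return True if `name` can be located, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for malformed or half-installed packages.
        return False


def detect_apex_capabilities() -> dict:
    """Hypothetical helper: map each capability name to a boolean flag."""
    return {
        "apex": module_installed("apex"),
        "fused_adam_kernels": module_installed("amp_C"),
        "fused_layer_norm_kernels": module_installed("fused_layer_norm_cuda"),
    }


if __name__ == "__main__":
    for capability, present in detect_apex_capabilities().items():
        print(f"{capability:26s}: {present}")
```

Locating a module is a weaker guarantee than importing it: a compiled extension built against a different PyTorch or CUDA version can be found and still fail to load. The import-based check in the tutorial is stricter for that reason, and this variant is mainly useful as a cheap first pass.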
print("\n### SECTION A: FusedAdam vs AdamW ###")

def make_many_param_model(n_layers=60, dim=512):
    return torch.nn.Sequential(*[torch.nn.Linear(dim, dim) for _ in range(n_layers)]).to(DEV)

def opt_step_factory(optimizer, model, dim=512):
    x = torch.randn(64, dim, device=DEV)
    def step():
        optimizer.zero_grad(set_to_none=True)
        out = model(x).pow(2).mean()
        out.backward()
        optimizer.step()
    return step

m1 = make_many_param_model()
torch_adam = torch.optim.AdamW(m1.parameters(), lr=1e-3)
ms_torch = bench(opt_step_factory(torch_adam, m1))
print(f" torch.optim.AdamW : {ms_torch:6.2f} ms / step")
if HAS_AMP_C and APEX_OK:
    m2 = make_many_param_model()
    m2.load_state_dict(m1.state_dict())
    fused_adam = FusedAdam(m2.parameters(), lr=1e-3)
    ms_fused = bench(opt_step_factory(fused_adam, m2))
    print(f" apex.FusedAdam : {ms_fused:6.2f} ms / step "
          f"(~{ms_torch/ms_fused:0.2f}x on optimizer-bound step)")
else:
    print(" apex.FusedAdam : SKIPPED (cuda ext not built)")
We benchmark PyTorch AdamW against Apex FusedAdam using a model with many linear layers to make optimizer overhead visible. We run the same optimizer step pattern for both methods, so the comparison focuses on update speed rather than model differences. We then report the step time and speedup to assess whether the fused multi-tensor optimizer provides a practical benefit in the current GPU runtime.
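To make concrete what both optimizers compute for every parameter element, here is a plain-Python sketch of a single AdamW update on one scalar. It is a reference for the arithmetic only, not Apex or PyTorch code, and the function name `adamw_step` is hypothetical. A fused multi-tensor optimizer performs this same elementwise update but batches many parameter tensors into a small number of kernel launches. One assumption worth verifying against your installed versions: `torch.optim.AdamW` defaults to `weight_decay=0.01`, while Apex `FusedAdam` is understood to default to `weight_decay=0.0`, so passing the same value to both keeps the comparison like-for-like.

```python
import math


def adamw_step(p, g, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """One AdamW update for a scalar parameter p with gradient g.

    m and v are the running first and second moment estimates and
    step is the 1-based update count. Returns the new (p, m, v).
    """
    # Decoupled weight decay: shrink the parameter directly.
    p = p * (1.0 - lr * weight_decay)
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    # Bias correction for the zero-initialized moments.
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v


if __name__ == "__main__":
    p, m, v = 1.0, 0.0, 0.0
    for t in range(1, 4):
        p, m, v = adamw_step(p, g=0.5, m=m, v=v, step=t)
        print(f"step {t}: p = {p:.6f}")
```

With a constant gradient the bias-corrected ratio stays close to 1, so each step moves the parameter by roughly `lr`. That makes this a convenient sanity check when comparing two optimizer implementations.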
print("\n### SECTION B: FusedLayerNorm / FusedRMSNorm ###")
B, T, H = 32, 512, 1024
x = torch.randn(B, T, H, device=DEV, requires_grad=True)
torch_ln = torch.nn.LayerNorm(H).to(DEV)

def ln_torch():
    y = torch_ln(x); y.sum().backward()

ms_ln_torch = bench(ln_torch)
print(f" nn.LayerNorm : {ms_ln_torch:6.2f} ms / fwd+bwd")
if HAS_FLN and APEX_OK:
    fused_ln = FusedLayerNorm(H).to(DEV)
    # Copy the affine parameters so both layers compute the same function.
    with torch.no_grad():
        fused_ln.weight.copy_(torch_ln.weight); fused_ln.bias.copy_(torch_ln.bias)
    diff = (fused_ln(x.detach()) - torch_ln(x.detach())).abs().max().item()
    print(f" max|fused - torch| = {diff:.2e} (should be ~1e-3 or smaller)")

    def ln_fused():
        y = fused_ln(x); y.sum().backward()

    ms_ln_fused = bench(ln_fused)
    print(f" apex.FusedLayerNorm: {ms_ln_fused:6.2f} ms / fwd+bwd "
          f"(~{ms_ln_torch/ms_ln_fused:0.2f}x)")
    if HAS_RMS:
        fused_rms = FusedRMSNorm(H).to(DEV)

        def rms_fused():
            y = fused_rms(x); y.sum().backward()

        print(f" apex.FusedRMSNorm : {bench(rms_fused):6.2f} ms / fwd+bwd "
              f"(RMSNorm: no mean-subtraction, used by LLaMA-style models)")
else:
    print(" apex.FusedLayerNorm: SKIPPED (cuda ext not built)")
We compare the standard PyTorch LayerNorm with Apex FusedLayerNorm on a large tensor resembling transformer hidden states. We first verify numerical correctness by copying the same affine parameters and measuring the maximum difference between fused and standard outputs. We then benchmark forward and backward passes and, when available, test FusedRMSNorm to demonstrate how Apex supports normalization layers used in LLaMA-style models.
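The difference between the two normalizations is easiest to see on a single vector. The following plain-Python sketch is for illustration only and makes no claim about how the fused kernels are implemented; the function names `layer_norm` and `rms_norm` are hypothetical, and `eps=1e-5` is assumed to match the usual `nn.LayerNorm` default.

```python
import math


def layer_norm(x, weight=None, bias=None, eps=1e-5):
    """LayerNorm over a 1-D list: subtract the mean, divide by the std."""
    n = len(x)
    mean = sum(x) / n
    var = sum((xi - mean) ** 2 for xi in x) / n  # biased variance
    inv_std = 1.0 / math.sqrt(var + eps)
    weight = weight or [1.0] * n
    bias = bias or [0.0] * n
    return [(xi - mean) * inv_std * w + b for xi, w, b in zip(x, weight, bias)]


def rms_norm(x, weight=None, eps=1e-5):
    """RMSNorm over a 1-D list: divide by the root mean square, no centering."""
    n = len(x)
    inv_rms = 1.0 / math.sqrt(sum(xi * xi for xi in x) / n + eps)
    weight = weight or [1.0] * n
    return [xi * inv_rms * w for xi, w in zip(x, weight)]


if __name__ == "__main__":
    hidden = [1.0, 2.0, 3.0, 4.0]
    print("layer_norm:", [round(v, 4) for v in layer_norm(hidden)])
    print("rms_norm  :", [round(v, 4) for v in rms_norm(hidden)])
```

LayerNorm output is centered around zero, while RMSNorm only rescales its input. RMSNorm therefore needs one reduction fewer and has no bias term, which is part of why it tends to be cheaper.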
print("\n### SECTION C: mixed precision (apex.amp opt-levels, DEPRECATED) ###")

def tiny_net():
    return torch.nn.Sequential(
        torch.nn.Linear(256, 256), torch.nn.ReLU(),
        torch.nn.Linear(256, 256), torch.nn.ReLU(),
        torch.nn.Linear(256, 10),
    ).to(DEV)

if APEX_OK:
    for level in ["O0", "O1", "O2"]:
        net = tiny_net()
        optimizer = (FusedAdam(net.parameters(), lr=1e-3) if HAS_AMP_C
                     else torch.optim.AdamW(net.parameters(), lr=1e-3))
        # Legacy API: amp patches the model and optimizer for the chosen opt level.
        net, optimizer = amp.initialize(net, optimizer, opt_level=level, verbosity=0)
        xb = torch.randn(128, 256, device=DEV)
        yb = torch.randint(0, 10, (128,), device=DEV)
        lossfn = torch.nn.CrossEntropyLoss()
        for _ in range(20):
            optimizer.zero_grad()
            loss = lossfn(net(xb), yb)
            with amp.scale_loss(loss, optimizer) as scaled_loss:
                scaled_loss.backward()
            optimizer.step()
        print(f" opt_level={level}: final loss = {loss.item():.4f}")
else:
    print(" apex.amp: SKIPPED (apex not importable)")

print("\n >> Modern recommended equivalent (torch.amp, no Apex needed):")
net = tiny_net()
optimizer = torch.optim.AdamW(net.parameters(), lr=1e-3)
scaler = torch.amp.GradScaler("cuda")
xb = torch.randn(128, 256, device=DEV); yb = torch.randint(0, 10, (128,), device=DEV)
lossfn = torch.nn.CrossEntropyLoss()
for _ in range(20):
    optimizer.zero_grad()
    with torch.amp.autocast("cuda", dtype=torch.float16):
        loss = lossfn(net(xb), yb)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
print(f" torch.amp: final loss = {loss.item():.4f}")
We demonstrate the legacy apex.amp mixed-precision workflow by running small training loops across different opt levels, such as O0, O1, and O2. We use amp.initialize and amp.scale_loss to show how Apex handles model wrapping and loss scaling in the older API. We then run the same kind of mixed-precision training with modern torch.amp, which is the recommended approach for new PyTorch code.
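Both APIs rely on dynamic loss scaling to keep FP16 gradients from underflowing. The toy class below sketches that idea in plain Python under simplifying assumptions. It is not the implementation of `torch.amp.GradScaler` or `apex.amp`, and the class name `ToyLossScaler` is hypothetical. The default values are chosen to mirror what we understand to be the `GradScaler` defaults (initial scale 65536, growth factor 2, backoff factor 0.5, growth interval 2000).

```python
import math


class ToyLossScaler:
    """Hypothetical, simplified model of dynamic loss scaling."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def step(self, scaled_grads):
        """Return unscaled gradients, or None if the step must be skipped."""
        if any(math.isinf(g) or math.isnan(g) for g in scaled_grads):
            # Overflow: skip the optimizer step and shrink the scale.
            self.scale *= self.backoff_factor
            self._good_steps = 0
            return None
        self._good_steps += 1
        unscaled = [g / self.scale for g in scaled_grads]
        if self._good_steps >= self.growth_interval:
            # A long run of clean steps: try a larger scale again.
            self.scale *= self.growth_factor
            self._good_steps = 0
        return unscaled


if __name__ == "__main__":
    scaler = ToyLossScaler(init_scale=8.0, growth_interval=2)
    for grads in ([float("inf")], [4.0], [8.0]):
        print(scaler.step(grads), "scale is now", scaler.scale)
```

In the real training loops, the loss is multiplied by the current scale before `backward()`, and a skipped step simply leaves the weights unchanged for that iteration.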
print("\n### SECTION D: end-to-end Transformer (vanilla fp32 vs Apex fused + AMP) ###")
VOCAB, D, NHEAD, LAYERS, SEQ, BATCH, STEPS = 2000, 256, 4, 4, 128, 32, 60

class Block(torch.nn.Module):
    # Pre-norm Transformer block; norm_cls selects nn.LayerNorm or FusedLayerNorm.
    def __init__(self, d, nhead, norm_cls):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(d, nhead, batch_first=True)
        self.ff = torch.nn.Sequential(torch.nn.Linear(d, 4 * d), torch.nn.GELU(),
                                      torch.nn.Linear(4 * d, d))
        self.n1, self.n2 = norm_cls(d), norm_cls(d)

    def forward(self, x):
        h = self.n1(x); x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ff(self.n2(x))

class TinyTransformer(torch.nn.Module):
    def __init__(self, norm_cls):
        super().__init__()
        self.emb = torch.nn.Embedding(VOCAB, D)
        self.blocks = torch.nn.ModuleList([Block(D, NHEAD, norm_cls) for _ in range(LAYERS)])
        self.norm = norm_cls(D)
        self.head = torch.nn.Linear(D, VOCAB)

    def forward(self, idx):
        x = self.emb(idx)
        for b in self.blocks:
            x = b(x)
        return self.head(self.norm(x))

# Fixed random token batch; targets are the inputs shifted by one position.
g = torch.Generator(device="cpu").manual_seed(0)
data = torch.randint(0, VOCAB, (BATCH, SEQ + 1), generator=g).to(DEV)
inp, tgt = data[:, :-1], data[:, 1:]
lossfn = torch.nn.CrossEntropyLoss()

def run_training(use_apex):
    torch.manual_seed(0)
    norm_cls = (FusedLayerNorm if (use_apex and HAS_FLN and APEX_OK) else torch.nn.LayerNorm)
    model = TinyTransformer(norm_cls).to(DEV)
    if use_apex and HAS_AMP_C and APEX_OK:
        optimizer = FusedAdam(model.parameters(), lr=3e-4)
    else:
        optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    # With enabled=False the scaler and autocast are no-ops (plain FP32 path).
    scaler = torch.amp.GradScaler("cuda", enabled=use_apex)

    def one_step():
        optimizer.zero_grad(set_to_none=True)
        with torch.amp.autocast("cuda", dtype=torch.float16, enabled=use_apex):
            logits = model(inp)
            loss = lossfn(logits.reshape(-1, VOCAB), tgt.reshape(-1))
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        return loss

    for _ in range(5):  # warmup
        one_step()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(STEPS):
        loss = one_step()
    torch.cuda.synchronize()
    dt = time.perf_counter() - t0
    return loss.item(), (STEPS * BATCH * SEQ) / dt, dt

loss_v, tps_v, dt_v = run_training(use_apex=False)
print(f" vanilla (fp32, nn.LayerNorm, AdamW) : "
      f"{dt_v:5.2f}s | {tps_v:9.0f} tok/s | final loss {loss_v:.3f}")
if APEX_OK and (HAS_AMP_C or HAS_FLN):
    loss_a, tps_a, dt_a = run_training(use_apex=True)
    print(f" apex (fp16, FusedLayerNorm, FusedAdam) : "
          f"{dt_a:5.2f}s | {tps_a:9.0f} tok/s | final loss {loss_a:.3f}")
    print(f" ----> speedup: {tps_a / tps_v:0.2f}x throughput")
else:
    print(" apex path SKIPPED (no fused kernels built)")

print("\n" + "=" * 78)
print("DONE. Key takeaways:")
print(" - FusedAdam/FusedLayerNorm/FusedRMSNorm are the still-relevant Apex pieces;")
print(" speedups grow with model size & parameter count (tiny demo understates it).")
print(" - apex.amp is deprecated -> prefer torch.amp.autocast + torch.amp.GradScaler.")
print(" - FusedAdam composes cleanly with native torch.amp (Section D).")
print(" - On real workloads, also try a larger model and bf16 autocast (no scaler needed).")
print("=" * 78)
We build a small Transformer with attention blocks, feed-forward layers, embeddings, and normalization to test Apex in an end-to-end training workload. We train it once with vanilla FP32 PyTorch using AdamW and standard LayerNorm, then train it again with fused Apex components and native PyTorch AMP when the kernels are available. We finally compare runtime, token throughput, final loss, and speedup to understand how fused kernels affect real training performance.
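The throughput figure in this comparison is tokens processed divided by elapsed wall-clock time, and the speedup is the ratio of the two throughputs. As a small sketch of that arithmetic (the helper names `tokens_per_second` and `speedup` are hypothetical and not part of the tutorial script):

```python
def tokens_per_second(steps: int, batch: int, seq_len: int, seconds: float) -> float:
    """Tokens processed per second over a timed run."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return steps * batch * seq_len / seconds


def speedup(candidate_tps: float, baseline_tps: float) -> float:
    """Throughput ratio; values above 1.0 mean the candidate is faster."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return candidate_tps / baseline_tps


if __name__ == "__main__":
    # STEPS, BATCH and SEQ match Section D. The two timings are
    # illustrative placeholders, not measurements.
    base = tokens_per_second(60, 32, 128, seconds=4.0)
    fused = tokens_per_second(60, 32, 128, seconds=2.5)
    print(f"baseline {base:.0f} tok/s | fused {fused:.0f} tok/s | "
          f"{speedup(fused, base):.2f}x")
```

Because both runs process the same number of tokens, the throughput ratio equals the inverse ratio of the elapsed times. The warmup steps are excluded from the timed window in the tutorial code, so they do not count toward either figure.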
In conclusion, we now have a clear and practical understanding of where NVIDIA Apex still fits in a 2026 deep learning workflow. We saw that Apex is no longer primarily about mixed precision, since native PyTorch AMP now handles that side more cleanly. However, its fused optimizer and fused normalization kernels can still be useful when the environment supports a proper CUDA extension build. We also learned how to write Apex-aware code that doesn’t break when fused kernels are unavailable, making the tutorial more reliable across Colab runtimes. The final Transformer benchmark gives us a complete view of how FusedAdam, FusedLayerNorm, and torch.amp can work together in an end-to-end training loop. We also used this tutorial to move beyond installation and API usage and evaluated Apex properly: by checking kernel availability, comparing against PyTorch baselines, and measuring performance in an actual training workload.
The post How to Speed Up Transformer Training Using NVIDIA Apex (FusedAdam, FusedLayerNorm) and Native torch.amp appeared first on MarkTechPost.
