|

MoonMath AI Open-Sources a HIP Attention Kernel for AMD MI300X That Beats AITER v3 on Every Shape and Rounding Mode

MoonMath AI group has launched a bf16 ahead consideration kernel for AMD’s MI300X GPU. It is written in HIP, not hand-written meeting. The code is open-source below the MIT license. The MoonMath.ai group studies it beats AITER v3, AMD’s personal optimized kernel, on each examined form. Bare-metal entry got here from HotAisle, an AMD cloud supplier.

Attention is the fused softmax(QKᵀ/√d)·V operation inside each transformer. The MI300X is AMD’s CDNA3 data-center GPU, with the ISA goal (gfx942). This kernel runs on that {hardware} solely.

TL;DR

  • MoonMath.ai open-sources a bf16 ahead consideration kernel for AMD MI300X, written in HIP, not meeting (MIT).
  • It beats AMD’s AITER v3 on each form and rounding mode — geomean 1.18×/1.15×/1.08×, as much as 1.26×.
  • The core trick: one-instruction asm wrappers allow you to choose the opcode whereas the compiler allocates registers.
  • Most of the speedup is reminiscence placement — Okay in LDS, V sizzling in L1, Q and accumulators in registers.
  • An actual SGLang PR used it to hurry up Wan2.1 video diffusion by 1.23×, with no high quality regression.

Understanding Kernel

A kernel is a small program that runs immediately on the GPU’s many cores to carry out one particular computation—right here, the eye math—as quick because the {hardware} permits. The kernel computes ahead consideration in bf16 on MI300X solely. It takes inputs in both BSHD or BHSD structure, with no transpose. Head dimension is mounted at 128. It helps any sequence size, together with cross-attention.

There are actual limits. There is not any causal masks, no GQA, and no varlen batching. Outputs are bf16, and it runs on gfx942 {hardware} completely.

Numerics are tightly managed. All three rounding modes match AITER’s per-mode rounding rule. Every finite output sits inside 1 bf16 ULP of AITER. NaN and Inf dealing with is bit-identical, and outcomes are deterministic.

The Core Trick: One-Instruction asm Wrappers

The core method avoids a acquainted dilemma. Compiler intrinsics hold code tidy however let the compiler reorder or rename operands. Raw inline meeting offers management however forces handbook register and handle administration.

MoonMath wraps precisely one instruction in a __device__ __forceinline__ operate. Extended asm constraints describe the operands. The analysis group picks the opcode. The compiler nonetheless allocates registers and tracks knowledge circulate.

// in/out tied to the SAME VGPR → no accumulator rename, no v_mov copy.
__device__ __forceinline__ void asm_mfma(bf16x4_t a, bf16x4_t b, fp32x4_t& c) {
    asm risky("v_mfma_f32_16x16x16_bf16 %0, %1, %2, %0"
                 : "+v"(c) : "v"(a), "v"(b));
}

The "+v"(c) constraint ties the accumulator enter and output to the identical VGPR. No copy instruction is emitted. This retains the kernel near atypical HIP. It nonetheless steers the machine one instruction at a time.

The Architecture: Eight Waves, Two Groups, Two Barriers

A CDNA3 compute unit has 4 SIMD items. The textbook block is 4 waves. MoonMath as a substitute runs eight waves per block, in two teams of 4.

The two teams run the identical Q*Okay, softmax, O += P*V sequence. They are offset by a section. While one group saturates the matrix core, the opposite runs softmax and points hundreds. Then they swap, so the matrix core by no means idles.

There are two s_barriers per iteration. One sits on the section handoff. One sits on the iteration boundary. Per-counter waits deal with the remainder of the synchronization.

This echoes FlashAttention-3’s matmul and softmax alternation. It doesn’t copy FA3’s producer and shopper warp break up. On CDNA3, each reminiscence transfer is already asynchronous, so a devoted producer wave is pointless.

Where Data Lives, and Why 16×16×16

Most of the speedup comes from reminiscence placement. Okay streams from HBM into LDS, double-buffered, shared by all eight waves. V stays sizzling in L1, learn on each PV matmul. Q and accumulators reside in registers.

The analysis group picked the 16×16×16 MFMA over 32×32×8. Both shapes have equivalent throughput. The smaller tile accumulates into 4 fp32 parts per lane, in opposition to 16. Lower accumulator strain leaves room for deeper prefetch and a third Q tile.

Decision Choice Reason
Waves per block 8 (two teams of 4) Plan the pipeline immediately; share one Okay copy
MFMA form 16×16×16 bf16 Same throughput, decrease VGPR strain, higher energy effectivity
Okay placement LDS, double-buffered, 32 KiB Shared by all 8 waves, swapped per iteration
V placement L1, resident, prefetched Reread throughout PV, saved sizzling intentionally
Q + accumulators VGPRs Read each iteration, by no means reloaded

Two later wins shut the hole. A 3rd Q tile (3Q) raises knowledge reuse per loaded Okay and V tile. A Flash-Decoding-style tail KV break up rescues the stranded fractional spherical throughout MI300X’s 304 CUs. These wins cascade. Moving V to L1 freed the LDS that the third Q tile then fills.

Benchmark

Tests ran on MI300X in bf16, head dimension 128. Each form was measured at three rounding modes. RTNE rounds to nearest even. RTNA rounds to nearest, ties away from zero. RTZ truncates towards zero.

Shape (B, H, S, D) Round Ours (ms) AITER v3 (ms) vs AITER vs MAX
(2, 24, 8192, 128) RTNE 3.083 3.792 1.23× 1.37×
(2, 24, 16384, 128) RTNE 11.670 14.691 1.26× 1.54×
(4, 16, 16384, 128) RTZ 15.055 16.183 1.07× 1.47×
(2, 24, 32768, 128) RTNA 44.440 52.363 1.18× 1.57×
(1, 16, 131072, 128) RTNE 232.517 269.278 1.16× 1.46×

Geomeans throughout the sweep favor MoonMath. Versus AITER, it scores 1.18× (RTNE), 1.15× (RTNA), and 1.08× (RTZ). Versus Modular MAX, geomeans run 1.44× to 1.49×, and per-shape speedups attain 1.59×.

RTZ is AITER’s personal quickest mode and the tightest race. The (4, 16, 16384) RTZ form moved from 0.95× to 1.07×. The tail KV break up is what closed that remaining hole.

Interactive Explainer