|

Meet mKernel: A Multi-GPU, Multi-Node Fused Kernel Library for GPU-Driven Communication

⚡

GPU communication overhead is a measurable bottleneck in manufacturing AI workloads. According to knowledge cited by the mKernel undertaking, communication can devour 43.6% of the ahead go and 32% of end-to-end coaching time. Across fashionable Mixture-of-Experts (MoE) fashions, inter-device communication can account for as much as 47% of whole execution time. Researchers from UC Berkeley’s UCCL undertaking have launched mKernel, a library of persistent CUDA kernels that fuse intra-node NVLink communication, inter-node RDMA, and compute right into a single kernel.

The Problem: Host-Driven Communication

The commonplace mannequin for multi-GPU communication is host-driven: the CPU runs the management path and calls right into a library like NCCL or NVSHMEM. The library points the collective operation — an AllCut back, an AllCollect, and so on. — throughout GPUs. Compute and communication run on separate CUDA streams and overlap at kernel boundaries.

The analysis crew identifies two issues with this strategy:

(1) CPUs should not scaling with GPU compute. A GB300 NVL72 rack integrates 72 Blackwell Ultra GPUs and 36 Grace CPUs, delivering 720 PFLOP/s FP8/FP6, 1.44 EFLOP/s FP4 Tensor Core efficiency, and 130 TB/s of all-to-all intra-rack NVLink bandwidth. At these speeds, microsecond-scale host orchestration overhead — a cudaLaunchKernel name, a CPU-side “all writes performed” examine, an inter-stream occasion — reveals up instantly as pipeline bubbles.

(2) Host-driven programs overlap compute and communication at coarse kernel boundaries. Finer-grained overlap on the tile or chunk stage is just not potential from the host facet.

The different is GPU-driven communication: the GPU itself triggers transfers, with communication fused into the identical kernel because the compute. Most current fused kernel libraries function inside a single node, or a single GPU. mKernel targets the multi-node case.

What mKernel Does

mKernel is a library of persistent CUDA kernels. Each kernel fuses intra-node NVLink communication, inter-node RDMA, and dense compute right into a single kernel.

Multi-GPU + multi-node, in a single kernel: Both intra-node NVLink and inter-node RDMA reside inside the identical persistent kernel.

Fine-grained intra-kernel overlap: Compute and communication overlap at tile/chunk granularity, overlaying each intra-node and inter-node GPU communication.

Persistent kernel with SM specialization: CTAs self-assign roles: compute, intra-comm, inter-send, inter-reduce. The variety of SMs devoted to every function is tunable per form.

GPU-driven networking constructed on libibverbs: mKernel makes use of GPU-initiated RDMA writes with out relying on NCCL or NVSHMEM. The communication backend is written from scratch to maximise efficiency and assist heterogeneous networking gadgets.

The Five Fused Kernels

Kernel What it fuses Description
AllCollect + GEMM AllCollect → GEMM Each rank holds a shard of A. While ranks collect friends’ shards over NVLink/RDMA, the native GEMM consumes tiles as quickly as they arrive.
GEMM + AllCut back GEMM → AllCut back Computes C = A @ B and reduces partial outputs throughout all ranks in a single launch. Output tiles are pushed into the discount tree the moment they’re produced.
MoE Dispatch + GEMM All-to-All dispatch → grouped GEMM Routes MoE tokens to their knowledgeable ranks (intra-node NVLink + inter-node all-to-all) and runs the per-expert grouped GEMM in the identical kernel. Tokens are processed as quickly as they land — no staging buffer round-trip.
Ring Attention Ring KV trade → FlashAttention Sequence-parallel consideration throughout ranks. Each step rotates a KV chunk across the ring whereas the native FlashAttention consumes the previously-received chunk. Compute and the ring ship/recv run concurrently inside a single persistent kernel.
GEMM + ReduceScatter GEMM → ReduceScatter Computes C = A @ B and reduce-scatters the output. Each output tile is decreased and forwarded to its proudly owning rank as quickly as it’s produced.

Evaluation Setup

The analysis crew evaluated mKernel on two 2-node × 8-H200 clusters that differ solely of their inter-node cloth:

Testbed Nodes × GPUs Intra-node Inter-node transport NIC
AWS EFA 2 × 8 H200 NVLink AWS EFA / SRD 16 × 200 Gb/s EFA per node
ConnectX-7 2 × 8 H200 NVLink InfiniBand 8 × 400 Gb/s NVIDIA ConnectX-7 per node

mKernel was benchmarked towards NCCL, Triton-distributed, Flux, Mercury, MagiAttention, Transformer-Engine, and ring-flash-attention. The crew notes that additional benchmarking at bigger scale remains to be in progress.

Backends and Requirements

mKernel helps two networking backends:

Backend Macro Transport Where it runs
CX7 -DINTERNODE_BACKEND_IBVERBS libibverbs RC ConnectX-7 / InfiniBand / RoCE
EFA -DINTERNODE_BACKEND_EFA libibverbs + efadv (SRD) AWS p5/p5e (H200, EFA)

Both backends share the identical host-side API and the identical on-GPU kernel. Only the proxy/session implementation differs (session.h for CX7, session_efa.h for EFA). Requirements: NVIDIA Hopper GPUs (default construct targets sm_90a), CUDA 12.9, Python with PyTorch. The CX7 backend requires libibverbs improvement headers and libraries. The EFA backend requires AWS EFA set up with libfabric, libibverbs, efadv, and EFA headers beneath EFA_HOME=/decide/amazon/efa by default.

Marktechpost’s Visual Explainer

mKernel — Multi-GPU, Multi-Node Fused Kernels Guide

01 / 07 — Overview

What is mKernel?

mKernel is an open-source library of persistent CUDA kernels from UC Berkeley’s UCCL undertaking. It fuses intra-node NVLink communication, inter-node RDMA, and dense compute right into a single kernel.

Most current fused kernel libraries function inside a single node or a single GPU. mKernel is designed from the begin to span node boundaries.

43.6%
of ahead go consumed by communication in manufacturing

47%
of whole execution time in fashionable MoE fashions

32%
of end-to-end coaching time consumed by communication

02 / 07 — The Problem

Why Host-Driven Communication Falls Short

The commonplace mannequin is host-driven: the CPU calls NCCL or NVSHMEM, which points collective operations throughout GPUs. The UCCL crew identifies two issues.

⚡
CPUs should not scaling with GPUs. A GB300 NVL72 rack delivers 720 PFLOP/s FP8/FP6 and 1.44 EFLOP/s FP4. At these speeds, microsecond-scale overhead from cudaLaunchKernel, CPU-side sync checks, and inter-stream occasions reveals up instantly as pipeline bubbles.

🔲
Overlap is simply too coarse. Host-driven programs overlap compute and communication solely at kernel boundaries. Finer-grained overlap on the tile or chunk stage is just not potential from the host facet.

🔀
The reply: GPU-driven communication. The GPU itself triggers fine-grained transfers, fused into the identical kernel because the compute.

03 / 07 — Design

Four Core Design Properties

🖧
Multi-GPU + multi-node, in a single kernel. Intra-node NVLink and inter-node RDMA each reside inside the identical persistent kernel.

🔬
Fine-grained intra-kernel overlap. Compute and communication overlap at tile/chunk granularity, overlaying each intra-node and inter-node communication.

⚙
Persistent kernel with SM specialization. CTAs self-assign roles: compute, intra-comm, inter-send, inter-reduce. SM break up is tunable per form.

📡
GPU-driven networking by way of libibverbs. Uses GPU-initiated RDMA writes. No NCCL or NVSHMEM dependency. Communication backend is written from scratch.

04 / 07 — Kernels

The Five Fused Kernels

AllCollect + GEMM
AllCollect —> GEMM
Each rank holds a shard of A. The native GEMM consumes tiles over NVLink/RDMA as they arrive — matmul begins earlier than the collective finishes.

GEMM + AllCut back
GEMM —> AllCut back
Computes C = A @ B and reduces partial outputs throughout all ranks in a single launch. Output tiles enter the discount tree the moment they’re produced.

MoE Dispatch + GEMM
All-to-All dispatch —> grouped GEMM
Routes MoE tokens to knowledgeable ranks by way of NVLink + inter-node all-to-all, then runs per-expert grouped GEMM in the identical kernel. No staging buffer round-trip.

Ring Attention
Ring KV trade —> FlashAttention
Sequence-parallel consideration throughout ranks. Each step rotates a KV chunk across the ring whereas the native FlashAttention consumes the previously-received chunk.

GEMM + ReduceScatter
GEMM —> ReduceScatter
Computes C = A @ B and reduce-scatters the output. Each tile is decreased and forwarded to its proudly owning rank as quickly as it’s produced.

05 / 07 — Evaluation

Evaluation Setup

Tested on two 2-node × 8-H200 clusters differing solely in inter-node cloth.

Testbed Nodes × GPUs Inter-node NIC
AWS EFA 2 × 8 H200 AWS EFA / SRD 16 × 200 Gb/s EFA per node
ConnectX-7 2 × 8 H200 InfiniBand 8 × 400 Gb/s CX7 per node

Both testbeds use NVLink intra-node. Benchmarked towards: NCCL, Triton-distributed, Flux, Mercury, MagiAttention, Transformer-Engine, and ring-flash-attention. Larger-scale benchmarking remains to be in progress.

06 / 07 — Backends & Requirements

Backends & Requirements

Backend Transport Where it runs
CX7 libibverbs RC ConnectX-7 / InfiniBand / RoCE
EFA libibverbs + efadv (SRD) AWS p5/p5e (H200, EFA)
📋
Requirements: NVIDIA Hopper GPUs (default sm_90a), CUDA 12.9, Python with PyTorch. CX7 wants libibverbs headers. EFA wants libfabric, libibverbs, efadv beneath EFA_HOME=/decide/amazon/efa.

📝
License & Attribution: MIT licensed. MMA/compute code tailored from ThunderKittens (HazyResearch).

07 / 07 — Roadmap & Key Takeaways

Roadmap & Key Takeaways

Fused GPU-driven multi-node kernels (AG+GEMM, GEMM+AR, MoE Dispatch+GEMM, Ring Attention, GEMM+RS)
ConnectX-7 and AWS EFA backends
🚧
Full heterogeneous accelerator/NIC assist with topology-aware discovery, placement, routing
🚧
Inter-node megakernels: collapsing a number of fused steps right into a single megakernel spanning a transformer layer
🚧
Blackwell GPU assist
Fuses NVLink, inter-node RDMA, and compute right into a single persistent CUDA kernel
Five kernels: AllCollect+GEMM, GEMM+AllCut back, MoE Dispatch+GEMM, Ring Attention, GEMM+ReduceScatter
GPU-initiated RDMA by way of libibverbs — no NCCL or NVSHMEM dependency
Requires Hopper GPUs (sm_90a) and ConnectX-7 or AWS EFA networking

Key Takeaways

  • mKernel fuses intra-node NVLink, inter-node RDMA, and compute right into a single persistent CUDA kernel.
  • Communication overhead accounts for as much as 47% of execution time in MoE fashions per cited manufacturing knowledge.
  • Five kernels are included: AllCollect+GEMM, GEMM+AllCut back, MoE Dispatch+GEMM, Ring Attention, and GEMM+ReduceScatter.
  • GPU-initiated RDMA is carried out instantly by way of libibverbs — no NCCL or NVSHMEM dependency.
  • Currently requires Hopper GPUs (sm_90a) and ConnectX-7 or AWS EFA networking; Blackwell assist is on the roadmap.


Check out the Repo and Technical DetailsAlso, be happy to observe us on Twitter and don’t neglect to hitch our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

Need to associate with us for selling your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar and so on.? Connect with us

The put up Meet mKernel: A Multi-GPU, Multi-Node Fused Kernel Library for GPU-Driven Communication appeared first on MarkTechPost.

Similar Posts