Meet mKernel: A Multi-GPU, Multi-Node Fused Kernel Library for GPU-Driven Communication
GPU communication overhead is a measurable bottleneck in manufacturing AI workloads. According to knowledge cited by the mKernel undertaking, communication can devour 43.6% of the ahead go and 32% of end-to-end coaching time. Across fashionable Mixture-of-Experts (MoE) fashions, inter-device communication can account for as much as 47% of whole execution time. Researchers from UC Berkeley’s UCCL undertaking have launched mKernel, a library of persistent CUDA kernels that fuse intra-node NVLink communication, inter-node RDMA, and compute right into a single kernel.
The Problem: Host-Driven Communication
The commonplace mannequin for multi-GPU communication is host-driven: the CPU runs the management path and calls right into a library like NCCL or NVSHMEM. The library points the collective operation — an AllCut back, an AllCollect, and so on. — throughout GPUs. Compute and communication run on separate CUDA streams and overlap at kernel boundaries.
The analysis crew identifies two issues with this strategy:
(1) CPUs should not scaling with GPU compute. A GB300 NVL72 rack integrates 72 Blackwell Ultra GPUs and 36 Grace CPUs, delivering 720 PFLOP/s FP8/FP6, 1.44 EFLOP/s FP4 Tensor Core efficiency, and 130 TB/s of all-to-all intra-rack NVLink bandwidth. At these speeds, microsecond-scale host orchestration overhead — a cudaLaunchKernel name, a CPU-side “all writes performed” examine, an inter-stream occasion — reveals up instantly as pipeline bubbles.
(2) Host-driven programs overlap compute and communication at coarse kernel boundaries. Finer-grained overlap on the tile or chunk stage is just not potential from the host facet.
The different is GPU-driven communication: the GPU itself triggers transfers, with communication fused into the identical kernel because the compute. Most current fused kernel libraries function inside a single node, or a single GPU. mKernel targets the multi-node case.
What mKernel Does
mKernel is a library of persistent CUDA kernels. Each kernel fuses intra-node NVLink communication, inter-node RDMA, and dense compute right into a single kernel.
Multi-GPU + multi-node, in a single kernel: Both intra-node NVLink and inter-node RDMA reside inside the identical persistent kernel.
Fine-grained intra-kernel overlap: Compute and communication overlap at tile/chunk granularity, overlaying each intra-node and inter-node GPU communication.
Persistent kernel with SM specialization: CTAs self-assign roles: compute, intra-comm, inter-send, inter-reduce. The variety of SMs devoted to every function is tunable per form.
GPU-driven networking constructed on libibverbs: mKernel makes use of GPU-initiated RDMA writes with out relying on NCCL or NVSHMEM. The communication backend is written from scratch to maximise efficiency and assist heterogeneous networking gadgets.
The Five Fused Kernels
| Kernel | What it fuses | Description |
|---|---|---|
| AllCollect + GEMM | AllCollect → GEMM | Each rank holds a shard of A. While ranks collect friends’ shards over NVLink/RDMA, the native GEMM consumes tiles as quickly as they arrive. |
| GEMM + AllCut back | GEMM → AllCut back | Computes C = A @ B and reduces partial outputs throughout all ranks in a single launch. Output tiles are pushed into the discount tree the moment they’re produced. |
| MoE Dispatch + GEMM | All-to-All dispatch → grouped GEMM | Routes MoE tokens to their knowledgeable ranks (intra-node NVLink + inter-node all-to-all) and runs the per-expert grouped GEMM in the identical kernel. Tokens are processed as quickly as they land — no staging buffer round-trip. |
| Ring Attention | Ring KV trade → FlashAttention | Sequence-parallel consideration throughout ranks. Each step rotates a KV chunk across the ring whereas the native FlashAttention consumes the previously-received chunk. Compute and the ring ship/recv run concurrently inside a single persistent kernel. |
| GEMM + ReduceScatter | GEMM → ReduceScatter | Computes C = A @ B and reduce-scatters the output. Each output tile is decreased and forwarded to its proudly owning rank as quickly as it’s produced. |
Evaluation Setup
The analysis crew evaluated mKernel on two 2-node × 8-H200 clusters that differ solely of their inter-node cloth:
| Testbed | Nodes × GPUs | Intra-node | Inter-node transport | NIC |
|---|---|---|---|---|
| AWS EFA | 2 × 8 H200 | NVLink | AWS EFA / SRD | 16 × 200 Gb/s EFA per node |
| ConnectX-7 | 2 × 8 H200 | NVLink | InfiniBand | 8 × 400 Gb/s NVIDIA ConnectX-7 per node |
mKernel was benchmarked towards NCCL, Triton-distributed, Flux, Mercury, MagiAttention, Transformer-Engine, and ring-flash-attention. The crew notes that additional benchmarking at bigger scale remains to be in progress.
Backends and Requirements
mKernel helps two networking backends:
| Backend | Macro | Transport | Where it runs |
|---|---|---|---|
| CX7 | -DINTERNODE_BACKEND_IBVERBS |
libibverbs RC | ConnectX-7 / InfiniBand / RoCE |
| EFA | -DINTERNODE_BACKEND_EFA |
libibverbs + efadv (SRD) | AWS p5/p5e (H200, EFA) |
Both backends share the identical host-side API and the identical on-GPU kernel. Only the proxy/session implementation differs (session.h for CX7, session_efa.h for EFA). Requirements: NVIDIA Hopper GPUs (default construct targets sm_90a), CUDA 12.9, Python with PyTorch. The CX7 backend requires libibverbs improvement headers and libraries. The EFA backend requires AWS EFA set up with libfabric, libibverbs, efadv, and EFA headers beneath EFA_HOME=/decide/amazon/efa by default.
Marktechpost’s Visual Explainer
Key Takeaways
- mKernel fuses intra-node NVLink, inter-node RDMA, and compute right into a single persistent CUDA kernel.
- Communication overhead accounts for as much as 47% of execution time in MoE fashions per cited manufacturing knowledge.
- Five kernels are included: AllCollect+GEMM, GEMM+AllCut back, MoE Dispatch+GEMM, Ring Attention, and GEMM+ReduceScatter.
- GPU-initiated RDMA is carried out instantly by way of
libibverbs— no NCCL or NVSHMEM dependency. - Currently requires Hopper GPUs (
sm_90a) and ConnectX-7 or AWS EFA networking; Blackwell assist is on the roadmap.
Check out the Repo and Technical Details. Also, be happy to observe us on Twitter and don’t neglect to hitch our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
Need to associate with us for selling your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar and so on.? Connect with us
The put up Meet mKernel: A Multi-GPU, Multi-Node Fused Kernel Library for GPU-Driven Communication appeared first on MarkTechPost.








