Meet mKernel: A Multi-GPU, Multi-Node Fused Kernel Library for GPU-Driven Communication

GPU communication overhead is a measurable bottleneck in manufacturing AI workloads. According to knowledge cited by the mKernel undertaking, communication can devour 43.6% of the ahead go and 32% of end-to-end coaching time. Across fashionable Mixture-of-Experts (MoE) fashions, inter-device communication can account for as much as 47% of whole execution time. Researchers from UC Berkeley’s UCCL undertaking have launched mKernel, a library of persistent CUDA kernels that fuse intra-node NVLink communication, inter-node RDMA, and compute right into a single kernel.

The Problem: Host-Driven Communication

The commonplace mannequin for multi-GPU communication is host-driven: the CPU runs the management path and calls right into a library like NCCL or NVSHMEM. The library points the collective operation — an AllCut back, an AllCollect, and so on. — throughout GPUs. Compute and communication run on separate CUDA streams and overlap at kernel boundaries.

The analysis crew identifies two issues with this strategy:

(1) CPUs should not scaling with GPU compute. A GB300 NVL72 rack integrates 72 Blackwell Ultra GPUs and 36 Grace CPUs, delivering 720 PFLOP/s FP8/FP6, 1.44 EFLOP/s FP4 Tensor Core efficiency, and 130 TB/s of all-to-all intra-rack NVLink bandwidth. At these speeds, microsecond-scale host orchestration overhead — a cudaLaunchKernel name, a CPU-side “all writes performed” examine, an inter-stream occasion — reveals up instantly as pipeline bubbles.

(2) Host-driven programs overlap compute and communication at coarse kernel boundaries. Finer-grained overlap on the tile or chunk stage is just not potential from the host facet.

The different is GPU-driven communication: the GPU itself triggers transfers, with communication fused into the identical kernel because the compute. Most current fused kernel libraries function inside a single node, or a single GPU. mKernel targets the multi-node case.

What mKernel Does

mKernel is a library of persistent CUDA kernels. Each kernel fuses intra-node NVLink communication, inter-node RDMA, and dense compute right into a single kernel.

Multi-GPU + multi-node, in a single kernel: Both intra-node NVLink and inter-node RDMA reside inside the identical persistent kernel.

Fine-grained intra-kernel overlap: Compute and communication overlap at tile/chunk granularity, overlaying each intra-node and inter-node GPU communication.

Persistent kernel with SM specialization: CTAs self-assign roles: compute, intra-comm, inter-send, inter-reduce. The variety of SMs devoted to every function is tunable per form.

GPU-driven networking constructed on libibverbs: mKernel makes use of GPU-initiated RDMA writes with out relying on NCCL or NVSHMEM. The communication backend is written from scratch to maximise efficiency and assist heterogeneous networking gadgets.

The Five Fused Kernels

Kernel	What it fuses	Description
AllCollect + GEMM	AllCollect → GEMM	Each rank holds a shard of `A`. While ranks collect friends’ shards over NVLink/RDMA, the native GEMM consumes tiles as quickly as they arrive.
GEMM + AllCut back	GEMM → AllCut back	Computes `C = A @ B` and reduces partial outputs throughout all ranks in a single launch. Output tiles are pushed into the discount tree the moment they’re produced.
MoE Dispatch + GEMM	All-to-All dispatch → grouped GEMM	Routes MoE tokens to their knowledgeable ranks (intra-node NVLink + inter-node all-to-all) and runs the per-expert grouped GEMM in the identical kernel. Tokens are processed as quickly as they land — no staging buffer round-trip.
Ring Attention	Ring KV trade → FlashAttention	Sequence-parallel consideration throughout ranks. Each step rotates a KV chunk across the ring whereas the native FlashAttention consumes the previously-received chunk. Compute and the ring ship/recv run concurrently inside a single persistent kernel.
GEMM + ReduceScatter	GEMM → ReduceScatter	Computes `C = A @ B` and reduce-scatters the output. Each output tile is decreased and forwarded to its proudly owning rank as quickly as it’s produced.

Evaluation Setup

The analysis crew evaluated mKernel on two 2-node × 8-H200 clusters that differ solely of their inter-node cloth:

Testbed	Nodes × GPUs	Intra-node	Inter-node transport	NIC
AWS EFA	2 × 8 H200	NVLink	AWS EFA / SRD	16 × 200 Gb/s EFA per node
ConnectX-7	2 × 8 H200	NVLink	InfiniBand	8 × 400 Gb/s NVIDIA ConnectX-7 per node

mKernel was benchmarked towards NCCL, Triton-distributed, Flux, Mercury, MagiAttention, Transformer-Engine, and ring-flash-attention. The crew notes that additional benchmarking at bigger scale remains to be in progress.

Backends and Requirements

mKernel helps two networking backends:

Backend	Macro	Transport	Where it runs
CX7	`-DINTERNODE_BACKEND_IBVERBS`	libibverbs RC	ConnectX-7 / InfiniBand / RoCE
EFA	`-DINTERNODE_BACKEND_EFA`	libibverbs + efadv (SRD)	AWS p5/p5e (H200, EFA)

Both backends share the identical host-side API and the identical on-GPU kernel. Only the proxy/session implementation differs (session.h for CX7, session_efa.h for EFA). Requirements: NVIDIA Hopper GPUs (default construct targets sm_90a), CUDA 12.9, Python with PyTorch. The CX7 backend requires libibverbs improvement headers and libraries. The EFA backend requires AWS EFA set up with libfabric, libibverbs, efadv, and EFA headers beneath EFA_HOME=/decide/amazon/efa by default.

Marktechpost’s Visual Explainer

UCCL

mKernel — Multi-GPU, Multi-Node Fused Kernels Guide

01 / 07 — Overview

What is mKernel?

mKernel is an open-source library of persistent CUDA kernels from UC Berkeley’s UCCL undertaking. It fuses intra-node NVLink communication, inter-node RDMA, and dense compute right into a single kernel.

Most current fused kernel libraries function inside a single node or a single GPU. mKernel is designed from the begin to span node boundaries.

43.6%

of ahead go consumed by communication in manufacturing

47%

of whole execution time in fashionable MoE fashions

32%

of end-to-end coaching time consumed by communication

02 / 07 — The Problem

Why Host-Driven Communication Falls Short

The commonplace mannequin is host-driven: the CPU calls NCCL or NVSHMEM, which points collective operations throughout GPUs. The UCCL crew identifies two issues.

CPUs should not scaling with GPUs. A GB300 NVL72 rack delivers 720 PFLOP/s FP8/FP6 and 1.44 EFLOP/s FP4. At these speeds, microsecond-scale overhead from cudaLaunchKernel, CPU-side sync checks, and inter-stream occasions reveals up instantly as pipeline bubbles.

Overlap is simply too coarse. Host-driven programs overlap compute and communication solely at kernel boundaries. Finer-grained overlap on the tile or chunk stage is just not potential from the host facet.

The reply: GPU-driven communication. The GPU itself triggers fine-grained transfers, fused into the identical kernel because the compute.

03 / 07 — Design

Four Core Design Properties

🖧

Multi-GPU + multi-node, in a single kernel. Intra-node NVLink and inter-node RDMA each reside inside the identical persistent kernel.

Fine-grained intra-kernel overlap. Compute and communication overlap at tile/chunk granularity, overlaying each intra-node and inter-node communication.

Persistent kernel with SM specialization. CTAs self-assign roles: compute, intra-comm, inter-send, inter-reduce. SM break up is tunable per form.

GPU-driven networking by way of libibverbs. Uses GPU-initiated RDMA writes. No NCCL or NVSHMEM dependency. Communication backend is written from scratch.

04 / 07 — Kernels

The Five Fused Kernels

AllCollect + GEMM

AllCollect —> GEMM

Each rank holds a shard of A. The native GEMM consumes tiles over NVLink/RDMA as they arrive — matmul begins earlier than the collective finishes.

GEMM + AllCut back

GEMM —> AllCut back

Computes C = A @ B and reduces partial outputs throughout all ranks in a single launch. Output tiles enter the discount tree the moment they’re produced.

MoE Dispatch + GEMM

All-to-All dispatch —> grouped GEMM

Routes MoE tokens to knowledgeable ranks by way of NVLink + inter-node all-to-all, then runs per-expert grouped GEMM in the identical kernel. No staging buffer round-trip.

Ring Attention

Ring KV trade —> FlashAttention

Sequence-parallel consideration throughout ranks. Each step rotates a KV chunk across the ring whereas the native FlashAttention consumes the previously-received chunk.

GEMM + ReduceScatter

GEMM —> ReduceScatter

Computes C = A @ B and reduce-scatters the output. Each tile is decreased and forwarded to its proudly owning rank as quickly as it’s produced.

05 / 07 — Evaluation

Evaluation Setup

Tested on two 2-node × 8-H200 clusters differing solely in inter-node cloth.

Testbed	Nodes × GPUs	Inter-node	NIC
AWS EFA	2 × 8 H200	AWS EFA / SRD	16 × 200 Gb/s EFA per node
ConnectX-7	2 × 8 H200	InfiniBand	8 × 400 Gb/s CX7 per node

Both testbeds use NVLink intra-node. Benchmarked towards: NCCL, Triton-distributed, Flux, Mercury, MagiAttention, Transformer-Engine, and ring-flash-attention. Larger-scale benchmarking remains to be in progress.

06 / 07 — Backends & Requirements

Backends & Requirements

Backend	Transport	Where it runs
CX7	libibverbs RC	ConnectX-7 / InfiniBand / RoCE
EFA	libibverbs + efadv (SRD)	AWS p5/p5e (H200, EFA)

Requirements: NVIDIA Hopper GPUs (default sm_90a), CUDA 12.9, Python with PyTorch. CX7 wants libibverbs headers. EFA wants libfabric, libibverbs, efadv beneath EFA_HOME=/decide/amazon/efa.

License & Attribution: MIT licensed. MMA/compute code tailored from ThunderKittens (HazyResearch).

07 / 07 — Roadmap & Key Takeaways

Roadmap & Key Takeaways

✅

Fused GPU-driven multi-node kernels (AG+GEMM, GEMM+AR, MoE Dispatch+GEMM, Ring Attention, GEMM+RS)

✅

ConnectX-7 and AWS EFA backends

🚧

Full heterogeneous accelerator/NIC assist with topology-aware discovery, placement, routing

🚧

Inter-node megakernels: collapsing a number of fused steps right into a single megakernel spanning a transformer layer

🚧

Blackwell GPU assist

Fuses NVLink, inter-node RDMA, and compute right into a single persistent CUDA kernel

Five kernels: AllCollect+GEMM, GEMM+AllCut back, MoE Dispatch+GEMM, Ring Attention, GEMM+ReduceScatter

GPU-initiated RDMA by way of libibverbs — no NCCL or NVSHMEM dependency

Requires Hopper GPUs (sm_90a) and ConnectX-7 or AWS EFA networking

Key Takeaways

mKernel fuses intra-node NVLink, inter-node RDMA, and compute right into a single persistent CUDA kernel.
Communication overhead accounts for as much as 47% of execution time in MoE fashions per cited manufacturing knowledge.
Five kernels are included: AllCollect+GEMM, GEMM+AllCut back, MoE Dispatch+GEMM, Ring Attention, and GEMM+ReduceScatter.
GPU-initiated RDMA is carried out instantly by way of libibverbs — no NCCL or NVSHMEM dependency.
Currently requires Hopper GPUs (sm_90a) and ConnectX-7 or AWS EFA networking; Blackwell assist is on the roadmap.

Check out the Repo and Technical Details. Also, be happy to observe us on Twitter and don’t neglect to hitch our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

Need to associate with us for selling your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar and so on.? Connect with us

The put up Meet mKernel: A Multi-GPU, Multi-Node Fused Kernel Library for GPU-Driven Communication appeared first on MarkTechPost.

Meet mKernel: A Multi-GPU, Multi-Node Fused Kernel Library for GPU-Driven Communication

The Problem: Host-Driven Communication

What mKernel Does

The Five Fused Kernels

Evaluation Setup

Backends and Requirements

Marktechpost’s Visual Explainer

What is mKernel?

Why Host-Driven Communication Falls Short

Four Core Design Properties

The Five Fused Kernels

Evaluation Setup

Backends & Requirements

Roadmap & Key Takeaways

Key Takeaways

VLM2Vec-V2: A Unified Computer Vision Framework for Multimodal Embedding Learning Across Images, Videos, and Visual Documents

NVIDIA AI Releases Orchestrator-8B: A Reinforcement Learning Trained Controller for Efficient Tool and Model Selection

A Coding Implementation of Crawl4AI for Web Crawling, Markdown Generation, JavaScript Execution, and LLM-Based Structured Extraction

Alibaba Qwen Team Introduces Qwen3.5-LiveTranslate-Flash: Real-Time Multimodal Interpretation Across 60 Languages at 2.8-Second Latency

Your data engineers may be more influential than you think

OpenAI Scales Trusted Access for Cyber Defense With GPT-5.4-Cyber: a Fine-Tuned Model Built for Verified Security Defenders

Curated by experts. Filtered for relevance.

Resources

About

Subscribe & learn more every day!

The Problem: Host-Driven Communication

What mKernel Does

The Five Fused Kernels

Evaluation Setup

Backends and Requirements

Marktechpost’s Visual Explainer

What is mKernel?

Why Host-Driven Communication Falls Short

Four Core Design Properties

The Five Fused Kernels

Evaluation Setup

Backends & Requirements

Roadmap & Key Takeaways

Key Takeaways

Similar Posts

Curated by experts. Filtered for relevance.

Resources

About

Subscribe & learn more every day!