
NVIDIA AI Just Released cuda-oxide: An Experimental Rust-to-CUDA Compiler Backend that Compiles SIMT GPU Kernels Directly to PTX

NVIDIA AI researchers recently released cuda-oxide, an experimental compiler that lets developers write CUDA SIMT (Single Instruction, Multiple Threads) GPU kernels in standard Rust code. The project compiles Rust directly to PTX (Parallel Thread Execution) — the assembly-like intermediate representation that CUDA uses to target NVIDIA GPUs — without requiring domain-specific languages, foreign function interface bindings, or C/C++ code.

How This Makes a Difference

Writing GPU kernels directly typically means writing C++ and using the CUDA programming model first-hand, or relying on Python-level abstractions like Triton that generate CUDA under the hood. The Rust GPU ecosystem has had projects attempting to bridge this gap — Rust-GPU targets SPIR-V for Vulkan/graphics compute, rust-cuda uses a rustc codegen backend targeting NVVM IR, CubeCL uses an embedded DSL with a JIT runtime that cross-compiles to CUDA/ROCm/WGPU, and std::offload uses LLVM's implicit offload path.

cuda-oxide occupies a distinct place in this space. Its stated design center is "bringing CUDA into Rust" — kernel authoring, device intrinsics, the SIMT execution model, and the CUDA programming model expressed natively in safe Rust — closer in spirit to writing a __global__ function in C++ than to writing a generic Rust function that happens to run on a GPU. By contrast, its closest neighbor, rust-cuda, focuses on "bringing Rust to NVIDIA GPUs": Rust ergonomics like async/.await, parts of the standard library running on-device, and a Rust-first programming model that abstracts over CUDA concepts. The NVlabs team notes it has been coordinating with the rust-cuda maintainers and considers the two projects complementary.

The Compilation Pipeline

At the core of cuda-oxide is a custom rustc codegen backend — the layer in the Rust compiler responsible for producing machine code. Instead of emitting native CPU code, the rustc-codegen-cuda crate intercepts the compiler at the CodegenBackend::codegen_crate() entry point and runs a separate pipeline for device code:

Rust Source → rustc frontend → rustc_public (Stable MIR) → dialect-mir → mem2reg → dialect-llvm → LLVM IR (.ll) → PTX (.ptx)

Here are the most important components:

Why rustc_public? The raw internal MIR representation in rustc changes between nightly versions with no stability guarantees. cuda-oxide uses rustc_public — also known as Stable MIR — which is Rust's official versioned, stable API over the compiler's internals. This lets the backend read MIR without breaking on every nightly update.

What is Pliron? The middle stages use Pliron, a Rust-native MLIR-like IR framework written entirely in Rust. Choosing Pliron instead of upstream MLIR means the entire compiler builds with cargo — no C++ toolchain, no CMake, no TableGen. cuda-oxide defines three custom Pliron dialects: dialect-mir (modeling Rust MIR semantics — places, projections, rvalues, terminators), dialect-llvm (modeling LLVM IR with textual .ll export), and dialect-nvvm (NVIDIA GPU intrinsics such as thread indexing, barriers, and TMA).

What does llc do? After the dialect-llvm printer serializes the IR into a textual .ll file, the external llc binary (the LLVM static compiler with the NVPTX backend) compiles it to PTX assembly. This is the only stage outside pure Rust. The resulting .ptx file is written next to the host binary — for example, target/debug/vecadd.ptx — and loaded by the CUDA driver at runtime.

As a developer, you can inspect every stage with:

cargo oxide pipeline vecadd

This prints the full trace from Rust MIR through each dialect down to the PTX output.

Single-Source Compilation and the Host/Device Split

Host and device code live in the same .rs source file. cargo oxide sets -Z codegen-backend=librustc_codegen_cuda.so, which routes code generation through cuda-oxide's backend. The backend then scans the compiled code for monomorphized functions whose names carry the reserved cuda_oxide_kernel_<hash>_<name> prefix — the namespace that the #[kernel] proc macro creates. Functions matching that prefix go through the cuda-oxide pipeline to produce PTX; all other host code is delegated to rustc's standard LLVM backend. The result of a single cargo oxide build is a host binary plus a .ptx file.
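The prefix scan can be pictured with a small host-side sketch. Only the cuda_oxide_kernel_<hash>_<name> shape comes from the project's description; the hash width and helper names below are illustrative assumptions, not cuda-oxide internals.

```rust
// Sketch of the reserved-prefix naming scheme described above.
// The 16-hex-digit hash format is an assumption; only the
// `cuda_oxide_kernel_<hash>_<name>` shape comes from the project docs.
fn mangle_kernel_name(hash: u64, name: &str) -> String {
    format!("cuda_oxide_kernel_{hash:016x}_{name}")
}

// The backend's scan reduces to a prefix check over symbol names.
fn is_device_kernel(symbol: &str) -> bool {
    symbol.starts_with("cuda_oxide_kernel_")
}

fn main() {
    let sym = mangle_kernel_name(0x1234_abcd, "vecadd");
    assert!(is_device_kernel(&sym));       // routed to the PTX pipeline
    assert!(!is_device_kernel("main"));    // stays on the host LLVM backend
    println!("{sym}"); // → cuda_oxide_kernel_000000001234abcd_vecadd
}
```

Everything that fails the prefix check falls through to the ordinary host compilation path, which is why one build command can produce both artifacts.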

cargo oxide run vecadd
cargo oxide debug vecadd --tui    # debug with cuda-gdb

Device code from library dependencies is compiled lazily: the backend reads their Stable MIR from .rlib metadata on demand, compiling only the functions a kernel actually calls.

What You Can Write in a Kernel

cuda-oxide supports a meaningful subset of Rust in GPU kernel functions, which are marked with the #[kernel] attribute macro. This includes:

  • Generic functions with monomorphization — fn scale<T: Copy>(...) is compiled to a concrete PTX kernel per type used at the call site.
  • Closures with captures — closures passed from the host are scalarized and passed as PTX kernel parameters automatically.
  • User-defined structs and enums — standard Rust data structures work inside kernels.
  • Pattern matching — match, if let, and related constructs work in device code.
  • Full GPU intrinsics — the cuda-device crate provides wrappers for thread indexing, warp operations (shfl_sync, ballot_sync, etc.), shared memory, barriers, TMA (Tensor Memory Accelerator), Thread Block Clusters, and scoped atomics (6 types × 3 scopes × 5 orderings).

One important GPU-specific compiler detail: rustc's JumpThreading MIR optimization — which duplicates function calls into both branches of an if-statement — is disabled for device code in cuda-oxide. On CPUs this is a safe optimization, but on GPUs it breaks barrier semantics: all threads in a block must converge on the same bar.sync instruction, and duplicating it across branches violates that requirement. Additionally, sync primitives are marked convergent in the emitted LLVM IR so that LLVM's optimization passes cannot move or duplicate them across control flow.
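The convergence rule can be modeled on the CPU with plain Rust. The sketch below (hypothetical helper names, no device code) treats each barrier as identified by its instruction site, which is what CUDA's convergence requirement keys on:

```rust
// CPU-side model of SIMT barrier convergence: each thread records the
// *site* (instruction address, here a label) of the barrier it reaches.
// CUDA requires every thread of a block to reach the same bar.sync site.
fn barriers_converge(sites: &[&str]) -> bool {
    sites.windows(2).all(|w| w[0] == w[1])
}

fn main() {
    // Original code: one barrier after the branch — every thread reaches
    // the same site regardless of which branch it took.
    let original = ["bar_0", "bar_0", "bar_0", "bar_0"];
    assert!(barriers_converge(&original));

    // After a JumpThreading-style duplication, the barrier is cloned into
    // both branches; threads taking different branches now sit at
    // *different* barrier instructions, which is undefined behavior.
    let duplicated = ["bar_then", "bar_else", "bar_then", "bar_else"];
    assert!(!barriers_converge(&duplicated));
}
```

This is why the pass must be disabled rather than merely discouraged: the duplication is semantics-preserving for sequential CPU code but not for code containing block-wide synchronization.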

How to Use cuda-oxide: A Step-by-Step Guide

NVlabs cuda-oxide  ·  Rust → Stable MIR → Pliron IR → LLVM IR → PTX  ·  v0.1.0

Step 01 of 09  ·  Prerequisites

What You Need Before You Start

cuda-oxide has specific version requirements for each dependency. Before installing anything, verify that your system meets all of them. The project is currently Linux-only (tested on Ubuntu 24.04).

Linux (Ubuntu 24.04)
Rust nightly
CUDA Toolkit 12.x+
LLVM 21+
Clang 21 / libclang-common-21-dev
Git
ⓘ Why LLVM 21?
Simple kernels may work on LLVM 20, but anything targeting Hopper or Blackwell — TMA, tcgen05, WGMMA — requires llc from LLVM 21 or later. This is a hard requirement, not a suggestion.

Check your current CUDA version to confirm compatibility:

nvcc --version

Step 02 of 09  ·  Install Rust Nightly

Set Up the Rust Nightly Toolchain

cuda-oxide requires Rust nightly with two extra components: rust-src and rustc-dev. The toolchain is pinned to nightly-2026-04-03 via rust-toolchain.toml in the repository — it will be installed automatically when you first run a build inside the repo.

If you want to install it manually:

# Install the pinned nightly toolchain
rustup toolchain install nightly-2026-04-03

# Add required components
rustup component add rust-src rustc-dev \
  --toolchain nightly-2026-04-03

# Confirm the toolchain is active
rustup show
ⓘ Why these components?
rustc-dev exposes the internal compiler APIs that the custom codegen backend hooks into. rust-src is required so the compiler can find and compile its own standard library sources for the device target.

Step 03 of 09  ·  Install LLVM 21

Install LLVM 21 with the NVPTX Backend

The cuda-oxide pipeline emits textual LLVM IR (.ll files) and hands them to the external llc binary to produce PTX. You need LLVM 21 or later with the NVPTX backend enabled.

# Ubuntu/Debian
sudo apt install llvm-21

# Verify the NVPTX backend is present
llc-21 --version | grep nvptx

The pipeline auto-discovers llc-22 and llc-21 on your PATH, in that order. To pin a specific binary, set the environment variable:

# Pin to a specific llc binary
export CUDA_OXIDE_LLC=/usr/bin/llc-21
⚠ Common Failure
If NVPTX does not appear in the output of llc-21 --version, your LLVM build was compiled without the NVPTX target. Install from the official LLVM apt repository rather than your distro's default packages, which may omit GPU backends.

Step 04 of 09  ·  Install Clang

Install Clang 21 for the cuda-bindings Crate

The cuda-bindings crate uses bindgen to generate FFI bindings to cuda.h at build time. bindgen needs libclang — and specifically, it needs Clang's own resource directory (which contains stddef.h). A bare libclang1-* runtime package is not enough.

# Install the full clang-21 package (includes resource headers)
sudo apt install clang-21

# Alternatively, the -dev header package also works
sudo apt install libclang-common-21-dev
⚠ Symptom of Missing Clang
If you install only the runtime but not the headers, the host build will fail with a cryptic 'stddef.h' file not found error during bindgen. Run cargo oxide doctor in the next step to catch this before attempting a build.

Step 05 of 09  ·  Install cargo-oxide

Clone the Repo and Install cargo-oxide

cargo-oxide is a Cargo subcommand that drives the entire build pipeline — running cargo oxide build, cargo oxide run, cargo oxide debug, and cargo oxide pipeline.

Inside the repo (for trying the examples):

git clone https://github.com/NVlabs/cuda-oxide.git
cd cuda-oxide

# cargo oxide works out of the box via a workspace alias
cargo oxide run vecadd

Outside the repo (in your own projects):

# Install globally from the git source
cargo install \
  --git https://github.com/NVlabs/cuda-oxide.git \
  cargo-oxide

# On first run, cargo-oxide fetches and builds the codegen backend

Then verify that all prerequisites are in place with the built-in health check:

cargo oxide doctor
ⓘ What doctor checks
It validates your Rust toolchain (nightly, rust-src, rustc-dev), CUDA Toolkit, LLVM version and NVPTX support, Clang/libclang headers, and the codegen backend binary. Fix any red items before proceeding.

Step 06 of 09  ·  Run Your First Kernel

Build and Run the vecadd Example

The canonical first example is vecadd — a vector addition kernel that adds two arrays of 1,024 f32 values on the GPU and verifies the result on the host.

# Build and run end-to-end
cargo oxide run vecadd

If everything is configured correctly, you will see:

✓ SUCCESS: All 1024 elements correct!

To see the full compilation pipeline — from Rust MIR through each Pliron dialect down to PTX — run:

# Print the full Rust MIR → dialect-mir → mem2reg → dialect-llvm → LLVM IR → PTX trace
cargo oxide pipeline vecadd

To debug with cuda-gdb:

cargo oxide debug vecadd --tui
ⓘ Output artifacts
A successful build produces two files: target/debug/vecadd (the host binary) and target/debug/vecadd.ptx (the device code). The host binary loads the PTX file via the CUDA driver at runtime.

Step 07 of 09  ·  Write a Kernel

Writing Your Own #[kernel] Function

A kernel function is annotated with #[kernel]. Use DisjointSlice<T> for mutable outputs and &[T] for read-only inputs. Access the thread's unique hardware index with thread::index_1d().

use cuda_device::{kernel, thread, DisjointSlice};

// Tier 1 safety: race-free by construction, no `unsafe` needed.
// DisjointSlice::get_mut() only accepts a ThreadIndex —
// a hardware-derived opaque type guaranteeing unique writes per thread.
#[kernel]
pub fn scale(input: &[f32], factor: f32, mut out: DisjointSlice<f32>) {
    let idx = thread::index_1d();
    if let Some(elem) = out.get_mut(idx) {
        *elem = input[idx.get()] * factor;
    }
}
ⓘ Tier 1 Safety — how it works
ThreadIndex is an opaque newtype around usize that can only be created from hardware built-in registers (threadIdx, blockIdx, blockDim). Since each thread gets a unique value, and DisjointSlice::get_mut() only accepts a ThreadIndex, writes are race-free by construction — no unsafe anywhere in the kernel.
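The pattern is easy to model on the CPU. The sketch below mirrors the names from above (ThreadIndex, DisjointSlice, index_1d), but its exact shape is an assumption — on a real GPU, index_1d would read hardware registers rather than take a simulated thread id:

```rust
// CPU model of the Tier 1 pattern: an opaque index type with a private
// field, so the only way to obtain one is from the (simulated) hardware
// registers — one unique value per thread. Names mirror the article;
// the implementation is an illustrative assumption.
mod device {
    pub struct ThreadIndex(usize); // private field: not constructible outside

    impl ThreadIndex {
        pub fn get(&self) -> usize { self.0 }
    }

    // Stand-in for threadIdx/blockIdx/blockDim: on a GPU this would read
    // built-in registers; here it just hands out the given id.
    pub fn index_1d(simulated_tid: usize) -> ThreadIndex {
        ThreadIndex(simulated_tid)
    }

    pub struct DisjointSlice<'a, T>(&'a mut [T]);

    impl<'a, T> DisjointSlice<'a, T> {
        pub fn new(s: &'a mut [T]) -> Self { Self(s) }
        // Accepts only a ThreadIndex, so each thread can address exactly
        // one element — races are ruled out by construction.
        pub fn get_mut(&mut self, idx: &ThreadIndex) -> Option<&mut T> {
            self.0.get_mut(idx.get())
        }
    }
}

fn main() {
    let mut out = vec![0.0f32; 4];
    let mut slice = device::DisjointSlice::new(&mut out);
    for tid in 0..4 {
        let idx = device::index_1d(tid);
        if let Some(elem) = slice.get_mut(&idx) {
            *elem = tid as f32 * 2.5;
        }
    }
    assert_eq!(out, vec![0.0, 2.5, 5.0, 7.5]);
}
```

Because the constructor is private to the module, safe user code cannot forge a ThreadIndex pointing at another thread's element — the type system carries the disjointness proof.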

Step 08 of 09  ·  Launch from Host

Launching the Kernel from Host Code

Host and device code live in the same .rs file. The host side uses CudaContext, DeviceBuffer, and the cuda_launch! macro to manage GPU memory and dispatch.

use cuda_core::{CudaContext, DeviceBuffer, LaunchConfig};
use cuda_host::{cuda_launch, load_kernel_module};

fn main() {
    // Initialize the GPU context on device 0
    let ctx    = CudaContext::new(0).unwrap();
    let stream = ctx.default_stream();
    let module = load_kernel_module(&ctx, "scale_example").unwrap();

    // Upload input data to GPU memory
    let data: Vec<f32> = (0..1024).map(|i| i as f32).collect();
    let input  = DeviceBuffer::from_host(&stream, &data).unwrap();
    let mut output = DeviceBuffer::<f32>::zeroed(&stream, 1024).unwrap();

    // Dispatch the kernel — LaunchConfig auto-sizes blocks/grids
    cuda_launch! {
        kernel: scale,
        stream: stream,
        module: module,
        config: LaunchConfig::for_num_elems(1024),
        args: [slice(input), 2.5f32, slice_mut(output)]
    }.unwrap();

    // Download the result back to the host
    let result = output.to_host_vec(&stream).unwrap();
    assert!((result[1] - 2.5).abs() < 1e-5);
    println!("✓ Kernel ran successfully!");
}
ⓘ What cuda_launch! does
It scalarizes the argument list — flattening slices, scalars, and captured closures — into PTX kernel parameters and dispatches the kernel on the given stream. No manual argument marshalling is required.
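That flattening can be sketched roughly as follows. The KernelParam enum and the (ptr, len) pair encoding are illustrative assumptions, not cuda-oxide's actual types:

```rust
// Rough model of cuda_launch!'s scalarization step: the heterogeneous
// argument list is flattened into a linear sequence of raw parameters.
// The enum and layout here are illustrative, not cuda-oxide internals.
#[derive(Debug, PartialEq)]
enum KernelParam {
    Ptr(usize), // device pointer
    Len(usize), // slice length
    F32(f32),   // scalar passed directly (or a scalarized closure capture)
}

// slice(input) → (ptr, len); a scalar passes through; slice_mut likewise.
fn scalarize(input: (usize, usize), factor: f32, output: (usize, usize)) -> Vec<KernelParam> {
    vec![
        KernelParam::Ptr(input.0), KernelParam::Len(input.1),
        KernelParam::F32(factor),
        KernelParam::Ptr(output.0), KernelParam::Len(output.1),
    ]
}

fn main() {
    // args: [slice(input), 2.5f32, slice_mut(output)] → 5 raw parameters
    let params = scalarize((0x1000, 1024), 2.5, (0x2000, 1024));
    assert_eq!(params.len(), 5);
    assert_eq!(params[2], KernelParam::F32(2.5));
}
```

The same idea extends to captured closures: each capture becomes one more scalar in the flattened parameter list, which is why host closures can cross the launch boundary without manual marshalling.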

Step 09 of 09  ·  Next Steps

What to Explore Next

You now have a working cuda-oxide setup. Here are the high-value paths forward, ordered by complexity:

  • Generic kernels with monomorphization — try the generic example (cargo oxide run generic) to see how fn scale<T: Copy> compiles to separate PTX kernels per type.
  • Closures with captures — the host_closure example shows how a move |x: f32| x * factor closure is scalarized and passed as PTX kernel parameters automatically.
  • Async GPU execution — cuda_launch_async! returns a lazy DeviceOperation that executes on .sync() or .await. See the async_mlp and async_vecadd examples.
  • Shared memory and warp intrinsics — these require scoped unsafe blocks with documented safety contracts. See Tier 2 in the safety model documentation.
  • GEMM at Speed-of-Light — the gemm_sol example achieves 868 TFLOPS on B200 (58% of cuBLAS SoL) using cta_group::2, CLC, and a 4-stage pipeline.
  • Blackwell tensor cores — the tcgen05 example targets sm_100a with TMEM, MMA, and cta_group::2. Requires LLVM 21+.
ⓘ Known Limitation in v0.1.0
index_2d(stride) is documented as currently unsound — if threads in the same kernel use different stride values, two threads can get &mut T to the same element with no unsafe in sight. Until the fix lands (lifting stride into a type parameter), bind stride to a single let binding and reuse it at every call site.
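The hazard is plain arithmetic. A minimal sketch (with a hypothetical linear_index helper) shows two differently-strided 2D indices landing on the same linear element:

```rust
// Demonstration of the aliasing hazard described above: with a row-major
// 2D index idx = y * stride + x, two threads that use *different* stride
// values can map to the same linear element.
fn linear_index(x: usize, y: usize, stride: usize) -> usize {
    y * stride + x
}

fn main() {
    // Thread A uses stride 8, thread B uses stride 4 — both land on
    // element 8, so both would receive `&mut T` to the same slot.
    let a = linear_index(0, 1, 8); // 1 * 8 + 0 = 8
    let b = linear_index(0, 2, 4); // 2 * 4 + 0 = 8
    assert_eq!(a, b); // same element, two "exclusive" references
}
```

Lifting stride into a type parameter would make the two call sites distinct types and rule the collision out at compile time, which is the fix the project documents.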

Full documentation: nvlabs.github.io/cuda-oxide  ·  Source: github.com/NVlabs/cuda-oxide

Document Created by Marktechpost.com

Key Takeaways

  • cuda-oxide is a custom rustc codegen backend from NVlabs that compiles #[kernel]-annotated Rust functions to PTX through a Rust → rustc_public Stable MIR → Pliron IR → LLVM IR → PTX pipeline, all buildable with cargo.
  • Host and device code coexist in a single .rs file, compiled with one cargo oxide build command; the output is a host binary plus a .ptx file placed next to it.
  • The safety model has three documented tiers: Tier 1 (race-free by construction via DisjointSlice<T> + ThreadIndex), Tier 2 (scoped unsafe for shared memory, warp intrinsics, atomics), and Tier 3 (raw hardware intrinsics for TMA, WGMMA, tcgen05). index_2d(stride) is documented as currently unsound in the 0.x release.
  • The gemm_sol example hits 868 TFLOPS on the B200 (58% of cuBLAS SoL) using a multi-phase GEMM pipeline with CLC and cta_group::2.

Check out the GitHub Repo. Also, feel free to follow us on Twitter, and don't forget to join our 150k+ ML SubReddit and subscribe to our Newsletter. On Telegram? You can now connect with us there as well.


The post NVIDIA AI Just Released cuda-oxide: An Experimental Rust-to-CUDA Compiler Backend that Compiles SIMT GPU Kernels Directly to PTX appeared first on MarkTechPost.
