NVIDIA cuTile Python Tutorial: Building Tiled GPU Kernels for Vector Addition, Matrix Addition, and Matrix Multiplication in Colab

In this tutorial, we implement an advanced hands-on workflow for NVIDIA cuTile Python, a tile-based GPU programming interface for writing efficient CUDA-style kernels directly in Python. We begin by preparing a Colab-friendly environment and checking the available GPU, driver, CUDA, and cuTile installations before running any kernel code. We then build tiled examples for vector addition, matrix addition, and matrix multiplication while keeping a PyTorch fallback, so the notebook stays executable even when Colab does not meet cuTile's latest runtime requirements. Along the way, we see how tiled programming works, how tensors are loaded, computed, stored, and validated, and how custom GPU kernels can be compared against standard PyTorch operations.

Setting Up NVIDIA cuTile Python and Checking GPU, CUDA, and Driver Runtime in Colab

import os
import sys
import math
import time
import json
import shutil
import subprocess
import textwrap
import warnings
warnings.filterwarnings("ignore")

def run_cmd(cmd, check=False, capture=True):
    # Run a shell command, echo its output, and optionally raise on a non-zero exit code.
    print(f"\n$ {cmd}")
    result = subprocess.run(
        cmd,
        shell=True,
        text=True,
        capture_output=capture
    )
    if capture:
        if result.stdout.strip():
            print(result.stdout.strip())
        if result.stderr.strip():
            print(result.stderr.strip())
    if check and result.returncode != 0:
        raise RuntimeError(f"Command failed: {cmd}")
    return result

print("=" * 90)
print("cuTile Python Advanced Colab Tutorial")
print("=" * 90)
print("\n[1] Installing Python dependencies")
run_cmd(f"{sys.executable} -m pip install -q -U pip setuptools wheel", check=False)
run_cmd(f"{sys.executable} -m pip install -q -U torch numpy pandas matplotlib", check=False)
print("\n[2] Trying to install cuTile Python")
print("Package name on PyPI: cuda-tile[tileiras]")
install_result = run_cmd(
    f'{sys.executable} -m pip install -q -U "cuda-tile[tileiras]"',
    check=False
)
print("\n[3] Runtime and GPU diagnostics")
run_cmd("python --version", check=False)
run_cmd("nvidia-smi", check=False)
try:
    import torch
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
except Exception as e:
    raise RuntimeError(f"Core dependency import failed: {e}")
cuda_available = torch.cuda.is_available()
print(f"\nPyTorch CUDA available: {cuda_available}")
if cuda_available:
    device_name = torch.cuda.get_device_name(0)
    capability = torch.cuda.get_device_capability(0)
    print(f"GPU: {device_name}")
    print(f"Compute capability: sm_{capability[0]}{capability[1]}")
else:
    print("No CUDA GPU detected. Colab: Runtime -> Change runtime type -> GPU")

def parse_driver_major():
    # Return (major_version, full_version_string), or (None, None) if nvidia-smi is unavailable.
    try:
        out = subprocess.check_output(
            "nvidia-smi --query-gpu=driver_version --format=csv,noheader",
            shell=True,
            text=True
        ).strip().splitlines()[0]
        return int(out.split(".")[0]), out
    except Exception:
        return None, None

driver_major, driver_full = parse_driver_major()
print(f"NVIDIA driver version: {driver_full}")
ct = None
cutile_import_ok = False
try:
    import cuda.tile as ct
    cutile_import_ok = True
    print("cuda.tile import: OK")
except Exception as e:
    print("cuda.tile import: FAILED")
    print(str(e))
# The cuTile path needs a CUDA GPU, a working cuda.tile import, and a recent enough driver.
likely_runtime_ok = (
    cuda_available
    and cutile_import_ok
    and driver_major is not None
    and driver_major >= 580
)
if likely_runtime_ok:
    print("\ncuTile path is enabled.")
else:
    print("\ncuTile path is not enabled in this runtime.")
    print("The tutorial will still run using a PyTorch fallback.")
    print("For real cuTile execution, use a runtime with NVIDIA Driver R580+ and CUDA Toolkit 13.1+.")
DEVICE = "cuda" if cuda_available else "cpu"

We prepare the Colab environment by installing the required Python packages and attempting to install cuTile Python. We then inspect the runtime by checking the Python version, the GPU, CUDA availability, and the NVIDIA driver version. Finally, we decide whether the notebook can use the real cuTile backend or should proceed with the PyTorch fallback: the cuTile path is enabled only when CUDA is available, cuda.tile imports successfully, and the driver major version is at least 580.
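The gating decision can also be written as pure functions, which makes it easy to test without a GPU. The following is a minimal sketch, not part of the original notebook: the names `parse_driver_version`, `cutile_path_enabled`, and `MIN_DRIVER_MAJOR` are hypothetical, and the 580 threshold is simply the value the tutorial's own check uses.

```python
from typing import Optional, Tuple

# Threshold copied from the tutorial's check (driver_major >= 580).
MIN_DRIVER_MAJOR = 580


def parse_driver_version(raw: Optional[str]) -> Tuple[Optional[int], Optional[str]]:
    """Return (major, full_version) from nvidia-smi style text, or (None, None)."""
    if not raw:
        return None, None
    lines = raw.strip().splitlines()
    if not lines:
        return None, None
    first = lines[0].strip()
    try:
        return int(first.split(".")[0]), first
    except ValueError:
        # Output such as "N/A" does not start with an integer major version.
        return None, None


def cutile_path_enabled(cuda_available: bool, cutile_imported: bool,
                        driver_raw: Optional[str]) -> bool:
    """Hypothetical helper mirroring the tutorial's likely_runtime_ok expression."""
    major, _ = parse_driver_version(driver_raw)
    return bool(
        cuda_available
        and cutile_imported
        and major is not None
        and major >= MIN_DRIVER_MAJOR
    )


print(cutile_path_enabled(True, True, "580.65.06"))   # True
print(cutile_path_enabled(True, True, "535.104.05"))  # False
print(cutile_path_enabled(True, False, "590.10"))     # False
```

Keeping the parsing separate from the `nvidia-smi` call means the decision logic can be exercised with plain strings, while the subprocess call stays a thin wrapper around it.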

Building Timing, Correctness, and Benchmark Reporting Utilities for cuTile Kernels

print("\n" + "=" * 90)
print("[4] Utilities: timing, correctness checks, and compact reporting")
print("=" * 90)

def sync():
    # Wait for queued GPU work so wall-clock timings include kernel execution.
    if torch.cuda.is_available():
        torch.cuda.synchronize()

def benchmark(fn, warmup=5, repeat=20, label="function"):
    # Warm up first so one-time costs are excluded from the timed runs.
    for _ in range(warmup):
        fn()
    sync()
    times = []
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        sync()
        end = time.perf_counter()
        times.append((end - start) * 1000)
    return {
        "label": label,
        "mean_ms": float(np.mean(times)),
        "median_ms": float(np.median(times)),
        "min_ms": float(np.min(times)),
        "max_ms": float(np.max(times)),
    }

def show_result_table(rows, title):
    df = pd.DataFrame(rows)
    print("\n" + title)
    print(df.to_string(index=False))
    return df

def assert_close(name, actual, expected, atol=1e-4, rtol=1e-4):
    # Raises AssertionError if the tensors differ beyond the given tolerances.
    torch.testing.assert_close(actual, expected, atol=atol, rtol=rtol)
    print(f"{name}: correctness check passed")

We define helper functions that make the tutorial easier to run, test, and benchmark. We synchronize GPU execution, measure runtime across multiple repeats after a warmup phase, and organize benchmark results into readable tables. We also add a correctness-checking function that compares each custom operation against the expected PyTorch output.
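The warmup-then-measure pattern does not depend on PyTorch. As a rough illustration, here is a standard-library-only sketch of the same idea; `time_callable` and its `sync` hook are hypothetical names, with the hook standing in for a device synchronization call such as the tutorial's `sync()`.

```python
import statistics
import time
from typing import Callable, Dict


def time_callable(fn: Callable[[], object], warmup: int = 3, repeat: int = 10,
                  sync: Callable[[], None] = lambda: None) -> Dict[str, float]:
    """Time fn() in milliseconds; sync is a hook for device synchronization."""
    # Untimed warmup calls absorb one-time costs before measurement starts.
    for _ in range(warmup):
        fn()
    sync()
    samples = []
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        sync()  # with asynchronous work, stop the clock only after it finishes
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(samples),
        "median_ms": statistics.median(samples),
        "min_ms": min(samples),
        "max_ms": max(samples),
    }


# CPU-only usage: the default sync hook does nothing.
stats = time_callable(lambda: sum(range(10_000)))
print(stats)
```

Reporting the median alongside the mean is a deliberate choice: a single slow run shifts the mean noticeably but leaves the median mostly unchanged.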

Defining Tiled cuTile Kernels for Vector Addition, Matrix Addition, and Matrix Multiplication

print("\n" + "=" * 90)
print("[5] cuTile kernels are defined only if cuda.tile imports successfully")
print("=" * 90)
if cutile_import_ok:
    ConstInt = ct.Constant[int]

    @ct.kernel
    def cutile_vec_add_direct_kernel(a, b, c, TILE: ConstInt):
        # Each block loads one TILE-sized slice, adds it, and stores the result.
        bid = ct.bid(0)
        a_tile = ct.load(a, index=(bid,), shape=(TILE,))
        b_tile = ct.load(b, index=(bid,), shape=(TILE,))
        c_tile = a_tile + b_tile
        ct.store(c, index=(bid,), tile=c_tile)

    @ct.kernel
    def cutile_vec_add_gather_kernel(a, b, c, TILE: ConstInt):
        # Explicit element offsets; the tutorial relies on gather/scatter for the last partial tile.
        bid = ct.bid(0)
        offsets = bid * TILE + ct.arange(TILE, dtype=torch.int32)
        a_tile = ct.gather(a, offsets)
        b_tile = ct.gather(b, offsets)
        c_tile = a_tile + b_tile
        ct.scatter(c, offsets, c_tile)

    @ct.kernel
    def cutile_matrix_add_gather_kernel(a, b, c, TILE_M: ConstInt, TILE_N: ConstInt):
        bid_m = ct.bid(0)
        bid_n = ct.bid(1)
        rows = bid_m * TILE_M + ct.arange(TILE_M, dtype=torch.int32)
        cols = bid_n * TILE_N + ct.arange(TILE_N, dtype=torch.int32)
        # Broadcast row and column offsets into a (TILE_M, TILE_N) index grid.
        rows = rows[:, None]
        cols = cols[None, :]
        a_tile = ct.gather(a, (rows, cols))
        b_tile = ct.gather(b, (rows, cols))
        c_tile = a_tile + b_tile
        ct.scatter(c, (rows, cols), c_tile)

    @ct.kernel
    def cutile_matmul_kernel(A, B, C, TM: ConstInt, TN: ConstInt, TK: ConstInt):
        bid_m = ct.bid(0)
        bid_n = ct.bid(1)
        num_tiles_k = ct.num_tiles(A, axis=1, shape=(TM, TK))
        # Accumulate in float32 regardless of the input dtype.
        acc = ct.full((TM, TN), 0, dtype=ct.float32)
        # Zero padding makes out-of-range reads contribute nothing to the accumulator.
        zero_pad = ct.PaddingMode.ZERO
        # float32 inputs are cast to TF32 for the multiply-accumulate; other dtypes are used as-is.
        compute_dtype = ct.tfloat32 if A.dtype == ct.float32 else A.dtype
        for k in range(num_tiles_k):
            a_tile = ct.load(
                A,
                index=(bid_m, k),
                shape=(TM, TK),
                padding_mode=zero_pad
            ).astype(compute_dtype)
            b_tile = ct.load(
                B,
                index=(k, bid_n),
                shape=(TK, TN),
                padding_mode=zero_pad
            ).astype(compute_dtype)
            acc = ct.mma(a_tile, b_tile, acc)
        out = ct.astype(acc, C.dtype)
        ct.store(C, index=(bid_m, bid_n), tile=out)
else:
    print("Skipping cuTile kernel definitions because cuda.tile is unavailable.")
print("\n" + "=" * 90)
print("[6] High-level wrappers")
print("=" * 90)

def vec_add_tutorial(a, b, use_gather=True):
    if a.shape != b.shape:
        raise ValueError("a and b must have the same shape")
    if likely_runtime_ok and a.is_cuda:
        c = torch.empty_like(a)
        TILE = 256 if use_gather else min(1024, 2 ** math.ceil(math.log2(a.numel())))
        # One block per tile; the last block may cover a partial tile.
        grid = (math.ceil(a.numel() / TILE), 1, 1)
        kernel = cutile_vec_add_gather_kernel if use_gather else cutile_vec_add_direct_kernel
        ct.launch(torch.cuda.current_stream(), grid, kernel, (a, b, c, TILE))
        return c
    return a + b

def matrix_add_tutorial(a, b):
    if a.shape != b.shape:
        raise ValueError("a and b must have the same shape")
    if likely_runtime_ok and a.is_cuda:
        c = torch.empty_like(a)
        TILE_M = 16
        TILE_N = 64
        grid = (math.ceil(a.shape[0] / TILE_M), math.ceil(a.shape[1] / TILE_N), 1)
        ct.launch(
            torch.cuda.current_stream(),
            grid,
            cutile_matrix_add_gather_kernel,
            (a, b, c, TILE_M, TILE_N)
        )
        return c
    return a + b

def matmul_tutorial(A, B):
    if A.shape[1] != B.shape[0]:
        raise ValueError("A.shape[1] must equal B.shape[0]")
    if likely_runtime_ok and A.is_cuda:
        # Larger tiles for half-precision inputs, smaller tiles for float32.
        if A.dtype in (torch.float16, torch.bfloat16):
            TM, TN, TK = 128, 128, 64
        else:
            TM, TN, TK = 32, 32, 32
        C = torch.empty((A.shape[0], B.shape[1]), device=A.device, dtype=A.dtype)
        grid = (math.ceil(A.shape[0] / TM), math.ceil(B.shape[1] / TN), 1)
        ct.launch(
            torch.cuda.current_stream(),
            grid,
            cutile_matmul_kernel,
            (A, B, C, TM, TN, TK)
        )
        return C
    return A @ B

print("Wrappers ready.")
print(f"Execution backend: {'cuTile' if likely_runtime_ok else 'PyTorch fallback'}")

We define the main cuTile kernels for vector addition, matrix addition, and matrix multiplication when cuda.tile is available. We use tiled load, store, gather, scatter, and matrix-multiply-accumulate operations to illustrate how GPU computation is structured in cuTile. We then wrap these kernels in Python functions that automatically fall back to PyTorch when the current runtime does not support cuTile.
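To make the structure of the matmul kernel easier to follow without a GPU, here is a plain-Python emulation of the same tiling scheme. It is only a sketch under stated assumptions: `load_tile` and `tiled_matmul` are hypothetical helpers, the two outer loops stand in for the 2D launch grid, and zero padding on out-of-range reads is modeled the way the kernel's `padding_mode` argument is used. It does not call cuTile and makes no claim about how cuTile schedules work internally.

```python
import math
from typing import List

Matrix = List[List[float]]


def load_tile(mat: Matrix, tile_row: int, tile_col: int, rows: int, cols: int) -> Matrix:
    """Copy a rows x cols tile; positions outside the matrix read as 0.0."""
    n_rows, n_cols = len(mat), len(mat[0])
    r0, c0 = tile_row * rows, tile_col * cols
    return [
        [mat[r][c] if r < n_rows and c < n_cols else 0.0 for c in range(c0, c0 + cols)]
        for r in range(r0, r0 + rows)
    ]


def tiled_matmul(A: Matrix, B: Matrix, tm: int, tn: int, tk: int) -> Matrix:
    M, K, N = len(A), len(A[0]), len(B[0])
    if len(B) != K:
        raise ValueError("inner dimensions must match")
    C = [[0.0] * N for _ in range(M)]
    grid = (math.ceil(M / tm), math.ceil(N / tn))
    num_tiles_k = math.ceil(K / tk)
    # On a GPU, each (bid_m, bid_n) pair would be an independent block.
    for bid_m in range(grid[0]):
        for bid_n in range(grid[1]):
            acc = [[0.0] * tn for _ in range(tm)]
            for k in range(num_tiles_k):
                a_tile = load_tile(A, bid_m, k, tm, tk)
                b_tile = load_tile(B, k, bid_n, tk, tn)
                # Tile-level multiply-accumulate: acc += a_tile @ b_tile.
                for i in range(tm):
                    for j in range(tn):
                        acc[i][j] += sum(a_tile[i][p] * b_tile[p][j] for p in range(tk))
            # Store only the in-bounds part of the output tile.
            for i in range(tm):
                for j in range(tn):
                    r, c = bid_m * tm + i, bid_n * tn + j
                    if r < M and c < N:
                        C[r][c] = acc[i][j]
    return C


# Dimensions that are not multiples of the tile sizes exercise the padding path.
A = [[float(i + j) for j in range(7)] for i in range(5)]
B = [[float(i - j) for j in range(3)] for i in range(7)]
C = tiled_matmul(A, B, tm=2, tn=2, tk=4)
print(len(C), len(C[0]))  # 5 3
```

Because padded positions are zero, partial tiles along K add nothing to the accumulator, and partial tiles along M and N are simply dropped at store time.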

Running Tiled Examples and Validating float32 and float16 Matmul Against PyTorch

print("\n" + "=" * 90)
print("[7] Example 1: tiled vector addition")
print("=" * 90)
torch.manual_seed(42)
# A length that is not a multiple of the tile size exercises the boundary handling.
N = 1_000_003
a = torch.randn(N, device=DEVICE, dtype=torch.float32)
b = torch.randn(N, device=DEVICE, dtype=torch.float32)
c = vec_add_tutorial(a, b, use_gather=True)
expected = a + b
assert_close("Vector addition", c, expected)
print(f"Input shape: {tuple(a.shape)}")
print(f"Output shape: {tuple(c.shape)}")
print(f"First 5 output values: {c[:5].detach().cpu().numpy()}")
print("\n" + "=" * 90)
print("[8] Example 2: tiled matrix addition with boundary-safe gather/scatter")
print("=" * 90)
M, N = 777, 1001
A = torch.randn(M, N, device=DEVICE, dtype=torch.float32)
B = torch.randn(M, N, device=DEVICE, dtype=torch.float32)
C = matrix_add_tutorial(A, B)
expected = A + B
assert_close("Matrix addition", C, expected)
print(f"A shape: {tuple(A.shape)}")
print(f"B shape: {tuple(B.shape)}")
print(f"C shape: {tuple(C.shape)}")
print("\n" + "=" * 90)
print("[9] Example 3: tiled matrix multiplication")
print("=" * 90)
M, K, N = 512, 768, 384
A32 = torch.randn(M, K, device=DEVICE, dtype=torch.float32)
B32 = torch.randn(K, N, device=DEVICE, dtype=torch.float32)
if DEVICE == "cuda":
    torch.set_float32_matmul_precision("high")
C32 = matmul_tutorial(A32, B32)
expected32 = A32 @ B32
# Looser tolerances on CUDA, where the float32 path may use reduced-precision TF32 math.
if DEVICE == "cuda":
    atol, rtol = 1e-2, 1e-2
else:
    atol, rtol = 1e-4, 1e-4
assert_close("Float32 matmul", C32, expected32, atol=atol, rtol=rtol)
print(f"A32 shape: {tuple(A32.shape)}")
print(f"B32 shape: {tuple(B32.shape)}")
print(f"C32 shape: {tuple(C32.shape)}")
print("\n" + "=" * 90)
print("[10] Example 4: half precision matmul")
print("=" * 90)
if DEVICE == "cuda":
    A16 = torch.randn(M, K, device=DEVICE, dtype=torch.float16)
    B16 = torch.randn(K, N, device=DEVICE, dtype=torch.float16)
    C16 = matmul_tutorial(A16, B16)
    expected16 = A16 @ B16
    assert_close("Float16 matmul", C16, expected16, atol=5e-2, rtol=5e-2)
    print(f"A16 shape: {tuple(A16.shape)}")
    print(f"B16 shape: {tuple(B16.shape)}")
    print(f"C16 shape: {tuple(C16.shape)}")
else:
    print("Skipping float16 GPU matmul because CUDA is unavailable.")

We run the actual examples for tiled vector addition, matrix addition, float32 matrix multiplication, and float16 matrix multiplication. We create random tensors, execute the tutorial functions, and compare the results with standard PyTorch operations. We also print tensor shapes and sample outputs to confirm that every stage behaves as expected.
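The tolerances passed to the correctness check are easier to choose once the acceptance rule is written out. `torch.testing.assert_close` documents an elementwise test of the form |actual - expected| <= atol + rtol * |expected|. The sketch below restates that rule in plain Python; `is_close` and `all_close` are hypothetical helpers that ignore the dtype, device, and NaN handling the real function performs.

```python
from typing import Sequence


def is_close(actual: float, expected: float, atol: float, rtol: float) -> bool:
    """Elementwise rule: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def all_close(actual: Sequence[float], expected: Sequence[float],
              atol: float = 1e-4, rtol: float = 1e-4) -> bool:
    """Apply the elementwise rule to two equally long sequences."""
    if len(actual) != len(expected):
        return False
    return all(is_close(x, y, atol, rtol) for x, y in zip(actual, expected))


# Large expected values get most of their slack from rtol ...
print(is_close(100.5, 100.0, atol=1e-2, rtol=1e-2))  # True:  0.5 <= 0.01 + 1.0
# ... while values near zero are governed by atol alone.
print(is_close(0.5, 0.0, atol=1e-2, rtol=1e-2))      # False: 0.5 >  0.01
# The tutorial's default tolerances reject the first case.
print(is_close(100.5, 100.0, atol=1e-4, rtol=1e-4))  # False: 0.5 >  0.0101
```

This is why matmul outputs, whose entries can be large sums of products, are usually checked with a relative term rather than an absolute one alone.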

Benchmarking cuTile Operations Against PyTorch and Visualizing Median Runtimes

print("\n" + "=" * 90)
print("[11] Benchmarks")
print("=" * 90)
# Label the tutorial path distinctly so fallback rows do not share a label
# (and a bar position in the chart) with the PyTorch baseline rows.
backend_label = "cuTile" if likely_runtime_ok else "PyTorch fallback"
bench_rows = []
bench_rows.append(
    benchmark(
        lambda: vec_add_tutorial(a, b, use_gather=True),
        label=f"{backend_label} vector add"
    )
)
bench_rows.append(
    benchmark(
        lambda: a + b,
        label="PyTorch vector add"
    )
)
bench_rows.append(
    benchmark(
        lambda: matrix_add_tutorial(A, B),
        label=f"{backend_label} matrix add"
    )
)
bench_rows.append(
    benchmark(
        lambda: A + B,
        label="PyTorch matrix add"
    )
)
bench_rows.append(
    benchmark(
        lambda: matmul_tutorial(A32, B32),
        label=f"{backend_label} fp32 matmul"
    )
)
bench_rows.append(
    benchmark(
        lambda: A32 @ B32,
        label="PyTorch fp32 matmul"
    )
)
bench_df = show_result_table(bench_rows, "Benchmark summary in milliseconds")
print("\n" + "=" * 90)
print("[12] Simple benchmark visualization")
print("=" * 90)
try:
    plt.figure(figsize=(10, 5))
    plt.bar(bench_df["label"], bench_df["median_ms"])
    plt.xticks(rotation=35, ha="right")
    plt.ylabel("Median time in ms")
    plt.title("cuTile tutorial benchmark comparison")
    plt.tight_layout()
    plt.show()
except Exception as e:
    print(f"Plot skipped: {e}")
print("\n" + "=" * 90)
print("[13] What to change next")
print("=" * 90)
next_steps = [
   {
       "experiment": "Tile size sweep",
       "what_to_change": "Change TILE, TILE_M, TILE_N, TM, TN, and TK",
       "why_it_matters": "Tile shape controls memory access, occupancy, and Tensor Core usage"
   },
   {
       "experiment": "Non-multiple dimensions",
       "what_to_change": "Use dimensions like 1003 x 771",
       "why_it_matters": "Tests padding, gather/scatter, and boundary behavior"
   },
   {
       "experiment": "Precision comparison",
       "what_to_change": "Compare float32, float16, and bfloat16",
       "why_it_matters": "Tensor Core paths are strongest for reduced precision"
   },
   {
       "experiment": "Operation fusion",
       "what_to_change": "Extend vector add to compute c = relu(a + b)",
       "why_it_matters": "Fusion reduces memory traffic and is a common GPU-kernel optimization"
   },
   {
       "experiment": "Attention kernel study",
       "what_to_change": "Study the repo's AttentionFMHA.py sample",
       "why_it_matters": "Attention shows why tiled kernels matter for transformer workloads"
   }
]
next_df = pd.DataFrame(next_steps)
print(next_df.to_string(index=False))
print("\n" + "=" * 90)
print("Tutorial completed.")
print("=" * 90)
if likely_runtime_ok:
    print("Real cuTile kernels were used.")
else:
    print("This runtime used the PyTorch fallback.")
    print("To run real cuTile kernels, use a GPU machine with NVIDIA Driver R580+ and CUDA Toolkit 13.1+.")

We benchmark the tutorial operations and compare their median runtimes with those of the equivalent PyTorch operations. We then visualize the results with a simple bar chart to make the performance comparison easier to read. Finally, we list practical next experiments, such as tile-size tuning, precision comparison, operation fusion, and the study of advanced cuTile samples like attention.
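Before running a tile-size sweep on a GPU, it can help to see how each candidate tile changes the launch grid and the amount of padded work. The following sketch uses the same ceiling-division grid formula as the wrappers; `launch_geometry` is a hypothetical helper, and the shapes are the ones used in the examples above.

```python
import math
from typing import Dict, Tuple


def launch_geometry(shape: Tuple[int, ...], tile: Tuple[int, ...]) -> Dict[str, object]:
    """Grid size, block count, and padded element count for a tiled launch."""
    if len(shape) != len(tile):
        raise ValueError("shape and tile must have the same number of dimensions")
    # One block per tile along each axis, rounding up for partial tiles.
    grid = tuple(math.ceil(n / t) for n, t in zip(shape, tile))
    covered = math.prod(g * t for g, t in zip(grid, tile))
    return {
        "grid": grid,
        "blocks": math.prod(grid),
        "padded_elements": covered - math.prod(shape),
    }


# 1D sweep over the vector length used in the vector-add example.
for tile_size in (128, 256, 512, 1024):
    print(tile_size, launch_geometry((1_000_003,), (tile_size,)))

# 2D case from the matrix-add example: 777 x 1001 with 16 x 64 tiles.
print(launch_geometry((777, 1001), (16, 64)))
```

Larger tiles mean fewer blocks but can waste more work on padding at the edges; which effect dominates depends on the hardware, so these numbers are a starting point for measurement rather than a prediction of runtime.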

Conclusion

In conclusion, we have a complete cuTile Python workflow that covers environment setup, kernel definition, execution, validation, and benchmarking. We implemented direct tile operations, gather/scatter-based indexing, and tiled matrix multiplication, and verified correctness against PyTorch outputs at every stage. The fallback path keeps the tutorial practical for Colab users, while the cuTile path shows how the same structure can run on a compatible NVIDIA GPU environment. This gives us a starting point for experimenting with tile sizes, precision formats, fused operations, and more advanced GPU workloads such as attention, layer normalization, and custom deep learning kernels.

The post NVIDIA cuTile Python Tutorial: Building Tiled GPU Kernels for Vector Addition, Matrix Addition, and Matrix Multiplication in Colab appeared first on MarkTechPost.
