
A Coding Implementation to Master GPU Computing with CuPy, Custom CUDA Kernels, Streams, Sparse Matrices, and Profiling

In this tutorial, we explore CuPy as a robust GPU-accelerated alternative to NumPy for high-performance numerical computing in Python. We begin by inspecting the available CUDA device, checking the CuPy version, runtime details, GPU memory, and compute capability so that we understand the hardware environment before running heavy computations. Then, we compare NumPy and CuPy on large matrix multiplication and FFT workloads to see how GPU acceleration changes execution speed. We also work with memory pools, custom elementwise kernels, reduction kernels, raw CUDA kernels, CUDA streams, sparse matrices, dense linear solvers, GPU image processing, DLPack interoperability, event-based profiling, cupyx.jit, and kernel fusion. Through these examples, we build a practical understanding of how CuPy lets us write familiar Python code while still accessing advanced CUDA-level performance features.

import sys, time, subprocess
try:
   import cupy as cp
except ImportError:
   subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "cupy-cuda12x"])
   import cupy as cp
import numpy as np
import matplotlib.pyplot as plt
from cupyx.scipy import sparse as cps
from cupyx.scipy import ndimage as cdi
from cupyx import jit
def header(t): print("\n" + "="*64 + f"\n{t}\n" + "="*64)
def bench(fn, *args, n=5, warmup=2, gpu=True):
   for _ in range(warmup): fn(*args)
   if gpu: cp.cuda.Stream.null.synchronize()
   t0 = time.perf_counter()
   for _ in range(n): r = fn(*args)
   if gpu: cp.cuda.Stream.null.synchronize()
   return (time.perf_counter() - t0) / n
header("1. GPU INTROSPECTION")
props = cp.cuda.runtime.getDeviceProperties(0)
print(f"CuPy version       : {cp.__version__}")
print(f"CUDA runtime       : {cp.cuda.runtime.runtimeGetVersion()}")
print(f"Device             : {props['name'].decode()}")
print(f"Compute capability : {props['major']}.{props['minor']}")
print(f"SMs                : {props['multiProcessorCount']}")
print(f"Global memory      : {props['totalGlobalMem']/1e9:.2f} GB")
header("2. NUMPY vs CUPY BENCHMARK")
N = 4096
A_np = np.random.rand(N, N).astype(np.float32)
B_np = np.random.rand(N, N).astype(np.float32)
A_cp, B_cp = cp.asarray(A_np), cp.asarray(B_np)
t_np = bench(np.matmul, A_np, B_np, n=2, gpu=False)
t_cp = bench(cp.matmul, A_cp, B_cp, n=3, gpu=True)
print(f"Matmul {N}x{N}  NumPy={t_np*1000:7.1f} ms  CuPy={t_cp*1000:7.1f} ms  ({t_np/t_cp:.1f}x)")
x_np = np.random.rand(2**21).astype(np.complex64)
x_cp = cp.asarray(x_np)
t_np = bench(np.fft.fft, x_np, n=3, gpu=False)
t_cp = bench(cp.fft.fft, x_cp, n=5, gpu=True)
print(f"FFT 2^21        NumPy={t_np*1000:7.1f} ms  CuPy={t_cp*1000:7.1f} ms  ({t_np/t_cp:.1f}x)")

We start by setting up CuPy, NumPy, Matplotlib, sparse utilities, image-processing tools, and JIT support so the tutorial has all of the required GPU-computing components. We define helper functions for section headers and reliable benchmarking, then inspect the available CUDA device to understand the GPU environment. We also compare NumPy and CuPy on large matrix multiplication and FFT operations to observe the performance difference between CPU-based and GPU-accelerated computation.
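The warm-up-then-average timing pattern used by `bench` applies equally on the CPU; the GPU-specific part is only the `synchronize()` calls, which force pending kernels to finish before the clock is read. A minimal CPU-only sketch of the same pattern (my own `cpu_bench` helper, not from the original code, requiring no CuPy):

```python
import time
import numpy as np

def cpu_bench(fn, *args, n=5, warmup=2):
    # Warm-up runs populate caches and trigger any lazy initialization
    for _ in range(warmup):
        fn(*args)
    t0 = time.perf_counter()
    for _ in range(n):
        fn(*args)
    # Average wall time per call over n measured repeats
    return (time.perf_counter() - t0) / n

a = np.random.rand(256, 256).astype(np.float32)
b = np.random.rand(256, 256).astype(np.float32)
t = cpu_bench(np.matmul, a, b, n=3)
print(f"256x256 matmul: {t*1e3:.3f} ms")
```

Without the warm-up iterations, the first call would fold one-time costs (BLAS thread-pool startup, cache misses) into the measurement; on the GPU side, skipping synchronization would instead measure only kernel *launch* time, since CUDA calls return asynchronously.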

header("3. MEMORY POOL")
pool = cp.get_default_memory_pool()
pinned = cp.get_default_pinned_memory_pool()
print(f"Used  : {pool.used_bytes()/1e6:8.2f} MB")
print(f"Total : {pool.total_bytes()/1e6:8.2f} MB")
del A_cp, B_cp, x_cp
pool.free_all_blocks(); pinned.free_all_blocks()
print(f"After free_all_blocks → Used: {pool.used_bytes()/1e6:.2f} MB")
header("4. ELEMENTWISE KERNEL")
robust_norm = cp.ElementwiseKernel(
   in_params ='float32 x, float32 y, float32 eps',
   out_params='float32 z',
   operation ='z = sqrtf((x - y)*(x - y) + eps)',
   name      ='robust_norm')
x = cp.random.rand(2_000_000, dtype=cp.float32)
y = cp.random.rand(2_000_000, dtype=cp.float32)
z = robust_norm(x, y, cp.float32(1e-6))
print(f"Output shape={z.shape}  mean={float(z.mean()):.5f}")
header("5. REDUCTION KERNEL — L2 NORM")
l2 = cp.ReductionKernel(
   in_params  ='T x',
   out_params ='T y',
   map_expr   ='x * x',
   reduce_expr='a + b',
   post_map_expr='y = sqrt(a)',
   identity   ='0',
   name       ='l2norm')
v = cp.random.rand(5_000_000, dtype=cp.float32)
print(f"Custom : {float(l2(v)):.6f}")
print(f"cupy   : {float(cp.linalg.norm(v)):.6f}")

We examine CuPy’s memory pool to understand how GPU memory is allocated, reused, and released during execution. We then create a custom ElementwiseKernel to perform a per-element robust distance calculation directly on the GPU. After that, we define a custom ReductionKernel for L2 norm computation and compare its result with CuPy’s built-in linear algebra function.
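As a sanity check, both kernels are easy to mirror in NumPy. A CPU reference sketch (my own, not part of the original code) that the GPU results should match to float32 tolerance:

```python
import numpy as np

def robust_norm_ref(x, y, eps=np.float32(1e-6)):
    # Per-element smoothed distance, same formula as the ElementwiseKernel
    return np.sqrt((x - y) ** 2 + eps)

def l2_ref(v):
    # map x -> x*x, reduce with a + b, post-map sqrt:
    # exactly the three stages of the ReductionKernel
    return np.sqrt(np.sum(v * v))

x = np.random.rand(1000).astype(np.float32)
y = np.random.rand(1000).astype(np.float32)
print(robust_norm_ref(x, y).mean())
print(l2_ref(x), np.linalg.norm(x))
```

Comparing `l2_ref` against `np.linalg.norm` on the CPU is the same consistency check the tutorial performs on the GPU against `cp.linalg.norm`.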

header("6. RAW CUDA KERNEL — MANDELBROT")
mandel = cp.RawKernel(r'''
extern "C" __global__
void mandel(float xmin, float xmax, float ymin, float ymax,
           int W, int H, int max_iter, int* out) {
   int ix = blockDim.x * blockIdx.x + threadIdx.x;
   int iy = blockDim.y * blockIdx.y + threadIdx.y;
   if (ix >= W || iy >= H) return;
   float cx = xmin + (xmax - xmin) * ix / (W - 1);
   float cy = ymin + (ymax - ymin) * iy / (H - 1);
   float zx = 0.f, zy = 0.f;
   int it = 0;
   while (zx*zx + zy*zy < 4.f && it < max_iter) {
       float t = zx*zx - zy*zy + cx;
       zy = 2.f*zx*zy + cy;
       zx = t; ++it;
   }
   out[iy*W + ix] = it;
}
''', 'mandel')
W, H, ITER = 1024, 1024, 400
img = cp.zeros((H, W), dtype=cp.int32)
threads = (16, 16)
blocks = ((W + 15)//16, (H + 15)//16)
mandel(blocks, threads,
      (cp.float32(-2.0), cp.float32(1.0),
       cp.float32(-1.5), cp.float32(1.5),
       W, H, ITER, img))
cp.cuda.Stream.null.synchronize()
print(f"Mandelbrot completed. max iter reached={int(img.max())}")
plt.figure(figsize=(6,6))
plt.imshow(cp.asnumpy(cp.log1p(img)), cmap='twilight_shifted', extent=[-2,1,-1.5,1.5])
plt.title("Mandelbrot set — computed with a CuPy RawKernel")
plt.axis('off'); plt.show()
header("7. CUDA STREAMS")
s1, s2 = cp.cuda.Stream(non_blocking=True), cp.cuda.Stream(non_blocking=True)
with s1:
   a1 = cp.random.rand(2000, 2000, dtype=cp.float32)
   b1 = cp.random.rand(2000, 2000, dtype=cp.float32)
   c1 = a1 @ b1
with s2:
   a2 = cp.random.rand(2000, 2000, dtype=cp.float32)
   b2 = cp.random.rand(2000, 2000, dtype=cp.float32)
   c2 = a2 @ b2
s1.synchronize(); s2.synchronize()
print(f"Stream-1 mean={float(c1.mean()):.4f}")
print(f"Stream-2 mean={float(c2.mean()):.4f}")

We use a raw CUDA C kernel through CuPy’s RawKernel interface to compute the Mandelbrot set directly on the GPU. We launch the kernel with custom thread and block dimensions, synchronize execution, and visualize the resulting fractal using Matplotlib. We also explore CUDA streams by running two independent matrix multiplications concurrently and checking the output means from both streams.
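The escape-time loop inside the RawKernel can be cross-checked against a vectorized NumPy version. A small CPU sketch of the same logic (my own, at much lower resolution than the GPU run):

```python
import numpy as np

def mandelbrot_cpu(xmin, xmax, ymin, ymax, W, H, max_iter):
    # linspace matches the kernel's pixel mapping: xmin + (xmax-xmin)*ix/(W-1)
    cx = np.linspace(xmin, xmax, W, dtype=np.float32)
    cy = np.linspace(ymin, ymax, H, dtype=np.float32)
    c = cx[None, :] + 1j * cy[:, None]        # (H, W) complex grid
    z = np.zeros_like(c)
    it = np.zeros(c.shape, dtype=np.int32)
    for _ in range(max_iter):
        mask = np.abs(z) < 2.0                # same escape test as zx*zx+zy*zy < 4
        z[mask] = z[mask] ** 2 + c[mask]      # only update points still bounded
        it[mask] += 1
    return it

img = mandelbrot_cpu(-2.0, 1.0, -1.5, 1.5, 64, 64, 50)
print(img.shape, img.max())
```

Points inside the set never escape and accumulate the full `max_iter` count, which is what the GPU run reports as "max iter reached".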

header("8. SPARSE LINEAR ALGEBRA")
N, density = 8000, 5e-4
nnz = int(N*N*density)
data = cp.random.rand(nnz, dtype=cp.float32)
rows = cp.random.randint(0, N, nnz)
cols = cp.random.randint(0, N, nnz)
A_sp = cps.csr_matrix((data, (rows, cols)), shape=(N, N))
xv   = cp.random.rand(N, dtype=cp.float32)
print(f"NNZ           : {A_sp.nnz}")
print(f"Sparse matvec : {bench(lambda: A_sp @ xv)*1000:.3f} ms")
A_dense = A_sp.toarray()
print(f"Dense  matvec : {bench(lambda: A_dense @ xv)*1000:.3f} ms")
header("9. LINEAR SYSTEM Ax = b")
N = 2000
M = cp.random.rand(N, N, dtype=cp.float32)
A = M @ M.T + N * cp.eye(N, dtype=cp.float32)
b = cp.random.rand(N, dtype=cp.float32)
x_sol = cp.linalg.solve(A, b)
res   = cp.linalg.norm(A @ x_sol - b) / cp.linalg.norm(b)
print(f"Solved {N}x{N} SPD system. Relative residual = {float(res):.2e}")
header("10. GAUSSIAN FILTER ON GPU")
big = cp.random.rand(4096, 4096, dtype=cp.float32)
t = bench(cdi.gaussian_filter, big, 5.0, n=3)
print(f"4096x4096 Gaussian σ=5  →  {t*1000:.2f} ms")
header("11. INTEROP & ZERO-COPY (DLPack)")
g = cp.arange(8, dtype=cp.float32)
h = cp.asnumpy(g)
back = cp.asarray(h)
dl = g.toDlpack()
restored = cp.from_dlpack(dl)
print(f"NumPy view : {h}")
print(f"DLPack RT  : {restored}")

We use sparse linear algebra by generating a random sparse CSR matrix and comparing sparse matrix-vector multiplication with dense multiplication. We then solve a large symmetric positive definite linear system using CuPy’s dense linear algebra tools and verify the solution through a relative residual. Finally, we apply a Gaussian filter to a large image-like array on the GPU and demonstrate interoperability between NumPy, CuPy, and DLPack for data movement and zero-copy exchange.
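cupyx.scipy.sparse deliberately mirrors scipy.sparse, so the CSR construction above translates almost verbatim to the CPU. A small scipy sketch of the same pattern (my own scaled-down sizes, seeded for reproducibility):

```python
import numpy as np
from scipy import sparse

N, density = 1000, 5e-4
nnz = int(N * N * density)
rng = np.random.default_rng(0)
data = rng.random(nnz).astype(np.float32)
rows = rng.integers(0, N, nnz)
cols = rng.integers(0, N, nnz)
# Duplicate (row, col) pairs are summed during CSR construction,
# which is why A.nnz can come out slightly below the requested count
A = sparse.csr_matrix((data, (rows, cols)), shape=(N, N))
x = rng.random(N).astype(np.float32)
y = A @ x
print(A.nnz, y.shape)
```

Swapping `scipy` for `cupyx.scipy` (and `np` for `cp`) moves this workload onto the GPU with no other code changes, which is the interoperability point the section demonstrates.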

header("12. CUDA EVENTS")
A = cp.random.rand(4000, 4000, dtype=cp.float32)
B = cp.random.rand(4000, 4000, dtype=cp.float32)
e0, e1 = cp.cuda.Event(), cp.cuda.Event()
e0.record(); C = A @ B; e1.record(); e1.synchronize()
print(f"4000x4000 matmul = {cp.cuda.get_elapsed_time(e0, e1):.3f} ms (CUDA occasions)")
header("13. cupyx.jit — SAXPY")
@jit.rawkernel()
def saxpy(a, x, y, out, n):
   tid = jit.blockIdx.x * jit.blockDim.x + jit.threadIdx.x
   if tid < n:
       out[tid] = a * x[tid] + y[tid]
n = 2_000_000
xv = cp.random.rand(n, dtype=cp.float32)
yv = cp.random.rand(n, dtype=cp.float32)
out = cp.empty_like(xv)
TPB = 256
blocks = (n + TPB - 1) // TPB
saxpy((blocks,), (TPB,), (cp.float32(2.5), xv, yv, out, n))
print("Correctness:", bool(cp.allclose(out, 2.5*xv + yv)))
header("14. KERNEL FUSION with @cp.fuse")
@cp.fuse()
def fused(x, y, z):
   return cp.sqrt(x*x + y*y + z*z) * cp.exp(-0.5*(x+y+z))
def unfused(x, y, z):
   return cp.sqrt(x*x + y*y + z*z) * cp.exp(-0.5*(x+y+z))
n = 4_000_000
x = cp.random.rand(n, dtype=cp.float32)
y = cp.random.rand(n, dtype=cp.float32)
z = cp.random.rand(n, dtype=cp.float32)
fused(x, y, z)
t1 = bench(unfused, x, y, z)
t2 = bench(fused,   x, y, z)
print(f"Unfused : {t1*1e3:6.3f} ms")
print(f"Fused   : {t2*1e3:6.3f} ms   (speedup {t1/t2:.2f}x)")
print("\n" + "="*64)
print("DONE — explore: cupy.linalg, cupyx.scipy.signal, cupy.cuda.Graph")
print("="*64)

We profile a large GPU matrix multiplication using CUDA events to obtain accurate device-side timing. We then write a SAXPY kernel with cupyx.jit, launch it manually, and verify its correctness against the equivalent CuPy expression. We also use @cp.fuse to combine multiple array operations into a fused kernel and compare its speed with the unfused version.
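The fusion payoff comes from eliminating intermediate arrays: evaluated eagerly, the unfused expression materializes a temporary for each sub-operation, while @cp.fuse compiles the whole expression into one kernel pass. A NumPy reference sketch of the same math (my own; on the CPU there is no fusion, this only documents the expected results):

```python
import numpy as np

def saxpy_ref(a, x, y):
    # Reference for the cupyx.jit SAXPY kernel: out[i] = a*x[i] + y[i]
    return a * x + y

def fused_ref(x, y, z):
    # Same expression as the @cp.fuse example; eager evaluation builds
    # several temporaries (x*x, y*y, the sum, exp, ...) that GPU fusion avoids
    return np.sqrt(x*x + y*y + z*z) * np.exp(-0.5 * (x + y + z))

x = np.random.rand(1000).astype(np.float32)
y = np.random.rand(1000).astype(np.float32)
z = np.random.rand(1000).astype(np.float32)
print(saxpy_ref(np.float32(2.5), x, y).shape)
print(float(fused_ref(x, y, z).mean()))
```

These CPU references mirror the correctness checks in the tutorial: the jit kernel is compared against `2.5*x + y`, and the fused and unfused GPU versions compute identical values, differing only in speed.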

In conclusion, we gained a complete hands-on overview of CuPy’s advanced GPU computing capabilities. We learned how to benchmark GPU operations correctly, manage GPU memory, create custom kernels, run concurrent CUDA streams, process sparse and dense numerical problems, and apply GPU acceleration to image filtering and scientific workloads. We also explored interoperability through NumPy and DLPack, profiled operations using CUDA events, and improved performance with JIT kernels and fused computations. Finally, we saw how CuPy provides NumPy-like syntax while allowing us to go deeper into CUDA programming when we need more control, speed, and scalability for real-world numerical and scientific computing tasks.




The post A Coding Implementation to Master GPU Computing with CuPy, Custom CUDA Kernels, Streams, Sparse Matrices, and Profiling appeared first on MarkTechPost.
