
How to Build High-Performance GPU-Accelerated Simulations and Differentiable Physics Workflows Using NVIDIA Warp Kernels

In this tutorial, we explore how to use NVIDIA Warp to build high-performance GPU and CPU simulations directly from Python. We begin by setting up a Colab-compatible environment and initializing Warp so that our kernels can run on either CUDA GPUs or CPUs, depending on availability. We then implement several custom Warp kernels that demonstrate core parallel computing concepts, including vector operations, procedural field generation, particle dynamics, and differentiable physics. By launching these kernels across thousands or millions of threads, we observe how Warp enables efficient scientific computing and simulation workflows through a simple Python interface. Over the course of the tutorial, we build a complete pipeline that spans basic kernel execution to advanced simulation and optimization. Check out the Full Codes and Notebook.

import sys
import subprocess
import pkgutil


def _install_if_missing(packages):
   missing = [p for p in packages if pkgutil.find_loader(p["import_name"]) is None]
   if missing:
       subprocess.check_call([sys.executable, "-m", "pip", "install", "-q"] + [p["pip_name"] for p in missing])


_install_if_missing([
   {"import_name": "warp", "pip_name": "warp-lang"},
   {"import_name": "numpy", "pip_name": "numpy"},
   {"import_name": "matplotlib", "pip_name": "matplotlib"},
])


import math
import time
import numpy as np
import matplotlib.pyplot as plt
import warp as wp


wp.init()


device = "cuda:0" if wp.is_cuda_available() else "cpu"
print(f"Using Warp device: {device}")


@wp.kernel
def saxpy_kernel(a: wp.float32, x: wp.array(dtype=wp.float32), y: wp.array(dtype=wp.float32), out: wp.array(dtype=wp.float32)):
   i = wp.tid()
   out[i] = a * x[i] + y[i]


@wp.kernel
def image_sdf_kernel(width: int, height: int, pixels: wp.array(dtype=wp.float32)):
   tid = wp.tid()
   x = tid % width
   y = tid // width
   fx = 2.0 * (wp.float32(x) / wp.float32(width - 1)) - 1.0
   fy = 2.0 * (wp.float32(y) / wp.float32(height - 1)) - 1.0
   r1 = wp.sqrt((fx + 0.35) * (fx + 0.35) + fy * fy) - 0.28
   r2 = wp.sqrt((fx - 0.25) * (fx - 0.25) + (fy - 0.15) * (fy - 0.15)) - 0.18
   wave = fy + 0.25 * wp.sin(8.0 * fx)
   d = wp.min(r1, r2)
   d = wp.max(d, -wave)
   value = wp.exp(-18.0 * wp.abs(d))
   pixels[tid] = value

We set up the environment and ensure that all required libraries, such as Warp, NumPy, and Matplotlib, are installed. We initialize Warp and check whether a CUDA GPU is available so that our computations run on the appropriate device. We also define the first Warp kernels, including a SAXPY vector operation and a procedural signed-distance-field generator that demonstrates parallel kernel execution.
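The SDF kernel above is launched over a flat 1-D thread range, so each thread recovers its pixel coordinate with `x = tid % width` and `y = tid // width` before mapping it into the [-1, 1]² domain. A plain-Python sketch of the same index arithmetic (no Warp required) shows how thread ids map to pixels:

```python
width, height = 512, 512

def tid_to_coords(tid, width, height):
    """Recover pixel indices and normalized [-1, 1] coordinates from a flat thread id."""
    x = tid % width          # column within the row
    y = tid // width         # row index
    fx = 2.0 * (x / (width - 1)) - 1.0
    fy = 2.0 * (y / (height - 1)) - 1.0
    return x, y, fx, fy

# Thread 0 maps to pixel (0, 0); the last thread maps to the opposite corner.
print(tid_to_coords(0, width, height))
print(tid_to_coords(width * height - 1, width, height))
```

Because consecutive thread ids sweep each row in order, the flat pixel buffer can later be reshaped directly into a `(height, width)` image.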

@wp.kernel
def init_particles_kernel(
   n_particles: int,
   px0: wp.array(dtype=wp.float32),
   py0: wp.array(dtype=wp.float32),
   vx0: wp.array(dtype=wp.float32),
   vy0: wp.array(dtype=wp.float32),
   px: wp.array(dtype=wp.float32),
   py: wp.array(dtype=wp.float32),
   vx: wp.array(dtype=wp.float32),
   vy: wp.array(dtype=wp.float32),
):
   p = wp.tid()
   px[p] = px0[p]
   py[p] = py0[p]
   vx[p] = vx0[p]
   vy[p] = vy0[p]


@wp.kernel
def simulate_particles_kernel(
   n_particles: int,
   dt: wp.float32,
   gravity: wp.float32,
   damping: wp.float32,
   bounce: wp.float32,
   radius: wp.float32,
   px: wp.array(dtype=wp.float32),
   py: wp.array(dtype=wp.float32),
   vx: wp.array(dtype=wp.float32),
   vy: wp.array(dtype=wp.float32),
):
   tid = wp.tid()
   s = tid // n_particles
   p = tid % n_particles
   i0 = s * n_particles + p
   i1 = (s + 1) * n_particles + p
   x = px[i0]
   y = py[i0]
   u = vx[i0]
   v = vy[i0]
   v = v + gravity * dt
   x = x + u * dt
   y = y + v * dt
   if y < radius:
       y = radius
       v = -bounce * v
       u = damping * u
   if x < -1.0 + radius:
       x = -1.0 + radius
       u = -bounce * u
   if x > 1.0 - radius:
       x = 1.0 - radius
       u = -bounce * u
   px[i1] = x
   py[i1] = y
   vx[i1] = u
   vy[i1] = v

We implement the kernels responsible for initializing and simulating particle motion. We create a kernel that copies the initial particle positions and velocities into the simulation state arrays. We then implement the particle simulation kernel, which updates positions and velocities over time while applying gravity, damping, and boundary collision behavior.
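The state arrays are flat buffers in step-major order: the value for particle `p` at step `s` lives at index `s * n_particles + p`, which is why the trajectories can later be recovered with a single reshape. A toy NumPy sketch (hypothetical sizes, not the tutorial's) confirms the layout:

```python
import numpy as np

n_particles = 4      # toy sizes for illustration
steps = 3

# Flattened state buffer, filled so each element's value encodes (step, particle).
state = np.empty((steps + 1) * n_particles, dtype=np.float32)
for s in range(steps + 1):
    for p in range(n_particles):
        state[s * n_particles + p] = 10.0 * s + p

# Row-major reshape to (steps + 1, n_particles) recovers one row per time step.
traj = state.reshape(steps + 1, n_particles)
print(traj[2, 3])   # step 2, particle 3 -> 23.0
```

The same index math (`i0 = s * n_particles + p`, `i1 = (s + 1) * n_particles + p`) is what the simulation kernel uses to read one time step and write the next.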

@wp.kernel
def init_projectile_kernel(
   x_hist: wp.array(dtype=wp.float32),
   y_hist: wp.array(dtype=wp.float32),
   vx_hist: wp.array(dtype=wp.float32),
   vy_hist: wp.array(dtype=wp.float32),
   init_vx: wp.array(dtype=wp.float32),
   init_vy: wp.array(dtype=wp.float32),
):
   x_hist[0] = 0.0
   y_hist[0] = 0.0
   vx_hist[0] = init_vx[0]
   vy_hist[0] = init_vy[0]


@wp.kernel
def projectile_step_kernel(
   dt: wp.float32,
   gravity: wp.float32,
   x_hist: wp.array(dtype=wp.float32),
   y_hist: wp.array(dtype=wp.float32),
   vx_hist: wp.array(dtype=wp.float32),
   vy_hist: wp.array(dtype=wp.float32),
):
   s = wp.tid()
   x = x_hist[s]
   y = y_hist[s]
   vx = vx_hist[s]
   vy = vy_hist[s]
   vy = vy + gravity * dt
   x = x + vx * dt
   y = y + vy * dt
   if y < 0.0:
       y = 0.0
   x_hist[s + 1] = x
   y_hist[s + 1] = y
   vx_hist[s + 1] = vx
   vy_hist[s + 1] = vy


@wp.kernel
def projectile_loss_kernel(
   steps: int,
   target_x: wp.float32,
   target_y: wp.float32,
   x_hist: wp.array(dtype=wp.float32),
   y_hist: wp.array(dtype=wp.float32),
   loss: wp.array(dtype=wp.float32),
):
   dx = x_hist[steps] - target_x
   dy = y_hist[steps] - target_y
   loss[0] = dx * dx + dy * dy

We define the kernels for a differentiable projectile simulation. We initialize the projectile state and implement a time-stepping kernel that updates the trajectory under the influence of gravity. We also define a loss kernel that computes the squared distance between the final projectile position and a target point, which enables gradient-based optimization.
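It helps to know what gradient the tape should produce. Ignoring the ground clamp (valid while the projectile stays above y = 0), semi-implicit Euler gives a closed form after S steps: x_S = vx·S·dt and y_S = vy·S·dt + g·dt²·S(S+1)/2, so the loss gradients follow directly from the chain rule. A plain-NumPy sketch with illustrative values (a smaller step count than the tutorial's, chosen so the clamp never triggers):

```python
def projectile_loss_and_grads(vx, vy, dt, g, steps, tx, ty):
    """Closed-form loss and gradients for semi-implicit Euler without the ground clamp.

    After `steps` updates (velocity updated before position):
        x_S = vx * steps * dt
        y_S = vy * steps * dt + g * dt**2 * steps * (steps + 1) / 2
    """
    T = steps * dt
    x_S = vx * T
    y_S = vy * T + g * dt * dt * steps * (steps + 1) / 2.0
    loss = (x_S - tx) ** 2 + (y_S - ty) ** 2
    # Chain rule: dL/dvx = 2 * (x_S - tx) * T, and likewise for vy.
    return loss, 2.0 * (x_S - tx) * T, 2.0 * (y_S - ty) * T

# Illustrative values: 40 steps of dt = 0.025 with a hypothetical target (2.0, 0.5).
loss, gvx, gvy = projectile_loss_and_grads(2.0, 6.5, 0.025, -9.8, 40, 2.0, 0.5)
print(loss, gvx, gvy)
```

Comparing these analytic gradients against the tape's output (for trajectories that never touch the ground) is a quick way to validate the differentiable pipeline.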

n = 1_000_000
a = np.float32(2.5)
x_np = np.linspace(0.0, 1.0, n, dtype=np.float32)
y_np = np.linspace(1.0, 2.0, n, dtype=np.float32)


x_wp = wp.array(x_np, dtype=wp.float32, device=device)
y_wp = wp.array(y_np, dtype=wp.float32, device=device)
out_wp = wp.empty(n, dtype=wp.float32, device=device)


t0 = time.time()
wp.launch(kernel=saxpy_kernel, dim=n, inputs=[a, x_wp, y_wp], outputs=[out_wp], device=device)
wp.synchronize()
t1 = time.time()


out_np = out_wp.numpy()
expected = a * x_np + y_np
max_err = np.max(np.abs(out_np - expected))
print(f"SAXPY runtime: {t1 - t0:.4f}s, max error: {max_err:.6e}")


width, height = 512, 512
pixels_wp = wp.empty(width * height, dtype=wp.float32, device=device)
wp.launch(kernel=image_sdf_kernel, dim=width * height, inputs=[width, height], outputs=[pixels_wp], device=device)
wp.synchronize()
img = pixels_wp.numpy().reshape(height, width)


plt.figure(figsize=(6, 6))
plt.imshow(img, origin="lower")
plt.title(f"Warp procedural field on {device}")
plt.axis("off")
plt.show()


n_particles = 256
steps = 300
dt = np.float32(0.01)
gravity = np.float32(-9.8)
damping = np.float32(0.985)
bounce = np.float32(0.82)
radius = np.float32(0.03)

We now run the computational experiments using the Warp kernels. We execute the SAXPY kernel across a large vector to demonstrate high-throughput parallel computation and verify numerical correctness. We also generate a procedural field image with the SDF kernel and visualize the result to see how Warp kernels can produce structured numerical patterns.
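The field itself is built from classic SDF operations: `wp.min` unions the two circle distances, `wp.max(d, -wave)` subtracts the half-plane below a sine wave, and `exp(-18·|d|)` turns distance into brightness that peaks on shape boundaries. The same per-pixel math in NumPy (same constants as the kernel) is easy to probe at individual points:

```python
import numpy as np

def field_value(fx, fy):
    """Mirror of image_sdf_kernel's per-pixel math over [-1, 1]^2 coordinates."""
    r1 = np.sqrt((fx + 0.35) ** 2 + fy ** 2) - 0.28           # circle 1 SDF
    r2 = np.sqrt((fx - 0.25) ** 2 + (fy - 0.15) ** 2) - 0.18  # circle 2 SDF
    wave = fy + 0.25 * np.sin(8.0 * fx)
    d = np.minimum(r1, r2)        # union of the two circles
    d = np.maximum(d, -wave)      # subtract the half-plane below the wave
    return np.exp(-18.0 * np.abs(d))  # brightness decays with distance from a boundary

# On a shape boundary the distance is zero, so brightness is ~1; far away it vanishes.
print(field_value(-0.35, 0.28))   # a point on circle 1's boundary
print(field_value(0.0, 0.9))      # a point far from both shapes
```

Probing single points like this is a handy way to check the kernel's constants before launching it over all 512 × 512 pixels.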

angles = np.linspace(0.0, 2.0 * np.pi, n_particles, endpoint=False, dtype=np.float32)
px0_np = 0.4 * np.cos(angles).astype(np.float32)
py0_np = (0.7 + 0.15 * np.sin(angles)).astype(np.float32)
vx0_np = (-0.8 * np.sin(angles)).astype(np.float32)
vy0_np = (0.8 * np.cos(angles)).astype(np.float32)


px0_wp = wp.array(px0_np, dtype=wp.float32, device=device)
py0_wp = wp.array(py0_np, dtype=wp.float32, device=device)
vx0_wp = wp.array(vx0_np, dtype=wp.float32, device=device)
vy0_wp = wp.array(vy0_np, dtype=wp.float32, device=device)


state_size = (steps + 1) * n_particles
px_wp = wp.empty(state_size, dtype=wp.float32, device=device)
py_wp = wp.empty(state_size, dtype=wp.float32, device=device)
vx_wp = wp.empty(state_size, dtype=wp.float32, device=device)
vy_wp = wp.empty(state_size, dtype=wp.float32, device=device)


wp.launch(
   kernel=init_particles_kernel,
   dim=n_particles,
   inputs=[n_particles, px0_wp, py0_wp, vx0_wp, vy0_wp],
   outputs=[px_wp, py_wp, vx_wp, vy_wp],
   device=device,
)


wp.launch(
   kernel=simulate_particles_kernel,
   dim=steps * n_particles,
   inputs=[n_particles, dt, gravity, damping, bounce, radius],
   outputs=[px_wp, py_wp, vx_wp, vy_wp],
   device=device,
)
wp.synchronize()


px_traj = px_wp.numpy().reshape(steps + 1, n_particles)
py_traj = py_wp.numpy().reshape(steps + 1, n_particles)


sample_ids = np.linspace(0, n_particles - 1, 16, dtype=int)
plt.figure(figsize=(8, 6))
for idx in sample_ids:
   plt.plot(px_traj[:, idx], py_traj[:, idx], linewidth=1.5)
plt.axhline(radius, linestyle="--")
plt.xlim(-1.05, 1.05)
plt.ylim(0.0, 1.25)
plt.title(f"Warp particle trajectories on {device}")
plt.xlabel("x")
plt.ylabel("y")
plt.show()


proj_steps = 180
proj_dt = np.float32(0.025)
proj_g = np.float32(-9.8)
target_x = np.float32(3.8)
target_y = np.float32(0.0)


vx_value = np.float32(2.0)
vy_value = np.float32(6.5)
lr = 0.08
iters = 60


loss_history = []
vx_history = []
vy_history = []


for it in range(iters):
   init_vx_wp = wp.array(np.array([vx_value], dtype=np.float32), dtype=wp.float32, device=device, requires_grad=True)
   init_vy_wp = wp.array(np.array([vy_value], dtype=np.float32), dtype=wp.float32, device=device, requires_grad=True)


   x_hist_wp = wp.zeros(proj_steps + 1, dtype=wp.float32, device=device, requires_grad=True)
   y_hist_wp = wp.zeros(proj_steps + 1, dtype=wp.float32, device=device, requires_grad=True)
   vx_hist_wp = wp.zeros(proj_steps + 1, dtype=wp.float32, device=device, requires_grad=True)
   vy_hist_wp = wp.zeros(proj_steps + 1, dtype=wp.float32, device=device, requires_grad=True)
   loss_wp = wp.zeros(1, dtype=wp.float32, device=device, requires_grad=True)


   tape = wp.Tape()
   with tape:
       wp.launch(
           kernel=init_projectile_kernel,
           dim=1,
           inputs=[],
           outputs=[x_hist_wp, y_hist_wp, vx_hist_wp, vy_hist_wp, init_vx_wp, init_vy_wp],
           device=device,
       )
       wp.launch(
           kernel=projectile_step_kernel,
           dim=proj_steps,
           inputs=[proj_dt, proj_g],
           outputs=[x_hist_wp, y_hist_wp, vx_hist_wp, vy_hist_wp],
           device=device,
       )
       wp.launch(
           kernel=projectile_loss_kernel,
           dim=1,
           inputs=[proj_steps, target_x, target_y],
           outputs=[x_hist_wp, y_hist_wp, loss_wp],
           device=device,
       )


   tape.backward(loss=loss_wp)
   wp.synchronize()


   current_loss = float(loss_wp.numpy()[0])
   grad_vx = float(init_vx_wp.grad.numpy()[0])
   grad_vy = float(init_vy_wp.grad.numpy()[0])


   vx_value = np.float32(vx_value - lr * grad_vx)
   vy_value = np.float32(vy_value - lr * grad_vy)


   loss_history.append(current_loss)
   vx_history.append(float(vx_value))
   vy_history.append(float(vy_value))


   if it % 10 == 0 or it == iters - 1:
       print(f"iter={it:02d} loss={current_loss:.6f} vx={vx_value:.4f} vy={vy_value:.4f} grad=({grad_vx:.4f}, {grad_vy:.4f})")


final_init_vx_wp = wp.array(np.array([vx_value], dtype=np.float32), dtype=wp.float32, device=device)
final_init_vy_wp = wp.array(np.array([vy_value], dtype=np.float32), dtype=wp.float32, device=device)
x_hist_wp = wp.zeros(proj_steps + 1, dtype=wp.float32, device=device)
y_hist_wp = wp.zeros(proj_steps + 1, dtype=wp.float32, device=device)
vx_hist_wp = wp.zeros(proj_steps + 1, dtype=wp.float32, device=device)
vy_hist_wp = wp.zeros(proj_steps + 1, dtype=wp.float32, device=device)


wp.launch(
   kernel=init_projectile_kernel,
   dim=1,
   inputs=[],
   outputs=[x_hist_wp, y_hist_wp, vx_hist_wp, vy_hist_wp, final_init_vx_wp, final_init_vy_wp],
   device=device,
)
wp.launch(
   kernel=projectile_step_kernel,
   dim=proj_steps,
   inputs=[proj_dt, proj_g],
   outputs=[x_hist_wp, y_hist_wp, vx_hist_wp, vy_hist_wp],
   device=device,
)
wp.synchronize()


x_path = x_hist_wp.numpy()
y_path = y_hist_wp.numpy()


fig = plt.figure(figsize=(15, 4))


ax1 = fig.add_subplot(1, 3, 1)
ax1.plot(loss_history)
ax1.set_title("Optimization loss")
ax1.set_xlabel("Iteration")
ax1.set_ylabel("Squared distance")


ax2 = fig.add_subplot(1, 3, 2)
ax2.plot(vx_history, label="vx")
ax2.plot(vy_history, label="vy")
ax2.set_title("Learned preliminary velocity")
ax2.set_xlabel("Iteration")
ax2.legend()


ax3 = fig.add_subplot(1, 3, 3)
ax3.plot(x_path, y_path, linewidth=2)
ax3.scatter([target_x], [target_y], s=80, marker="x")
ax3.set_title("Differentiable projectile trajectory")
ax3.set_xlabel("x")
ax3.set_ylabel("y")
ax3.set_ylim(-0.1, max(1.0, float(np.max(y_path)) + 0.3))


plt.tight_layout()
plt.show()


final_dx = float(x_path[-1] - target_x)
final_dy = float(y_path[-1] - target_y)
final_dist = math.sqrt(final_dx * final_dx + final_dy * final_dy)
print(f"Final target miss distance: {final_dist:.6f}")
print(f"Optimized initial velocity: vx={vx_value:.6f}, vy={vy_value:.6f}")

We run the full particle simulation and visualize the trajectories of selected particles over time. We then perform differentiable physics optimization using Warp's automatic differentiation and gradient-tape mechanism to learn an initial projectile velocity that reaches the target. Finally, we visualize the optimization process and the resulting trajectory, demonstrating Warp's simulation-driven optimization capabilities.
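As a sanity check on the converged result, the projectile stepper is simple enough to mirror in pure Python; running this reference integrator with any candidate initial velocity gives a trajectory to compare against the Warp one (the velocities below are illustrative, not the optimized values):

```python
def simulate_projectile(vx, vy, dt=0.025, g=-9.8, steps=180):
    """Pure-Python mirror of projectile_step_kernel: semi-implicit Euler with a floor at y = 0."""
    x, y = 0.0, 0.0
    xs, ys = [x], [y]
    for _ in range(steps):
        vy += g * dt          # velocity updated first
        x += vx * dt
        y += vy * dt
        if y < 0.0:           # clamp to the ground, as the kernel does
            y = 0.0
        xs.append(x)
        ys.append(y)
    return xs, ys

xs, ys = simulate_projectile(2.0, 6.5)
print(f"final position: ({xs[-1]:.3f}, {ys[-1]:.3f})")
```

Note that, like the kernel, the clamp zeroes the position but not the velocity, so once the projectile touches down it stays at y = 0 while x keeps advancing.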

In this tutorial, we demonstrated how NVIDIA Warp enables highly parallel numerical computation and simulation in Python while leveraging GPU acceleration. We built kernels for vector arithmetic, procedural image generation, particle simulation, and differentiable projectile optimization, showing how Warp integrates computation, visualization, and automatic differentiation within a single framework. By executing these kernels on large datasets and simulation states, we saw how Warp provides both performance and flexibility for scientific computing tasks.



The post How to Build High-Performance GPU-Accelerated Simulations and Differentiable Physics Workflows Using NVIDIA Warp Kernels appeared first on MarkTechPost.
