
NVIDIA AI Releases Dynamo Snapshot: A CRIU-Based Fast Startup System for AI Inference on Kubernetes

In production inference deployments, demand fluctuates over time, requiring inference replicas to scale elastically. Cold-starting inference workloads on Kubernetes can take several minutes. During that time, GPUs are allocated but idle, producing no tokens and serving no requests.

‘Cold start’ means the full sequence a model server must complete before serving any request: pulling the container image, loading model weights into GPU memory, warming up CUDA kernels, compiling or capturing CUDA graphs, and registering with the service discovery layer. This delay increases the risk of SLA violations during traffic spikes, because the system cannot scale quickly enough to absorb sudden increases in demand.

The cold-start latency for a single-GPU vLLM (v0.20.0) workload breaks into three segments: container/image pull, engine initialization (weight loading, kernel warmup, graph compilation), and distributed runtime startup.

To address this, NVIDIA’s AI research team has released NVIDIA Dynamo Snapshot: a checkpoint/restore approach for AI inference workloads on Kubernetes.

https://developer.nvidia.com/weblog/nvidia-dynamo-snapshot-fast-startup-for-inference-workloads-on-kubernetes/?linkId=100000423964029

What are CRIU and cuda-checkpoint?

A running inference worker’s checkpointable state has two components. Device state (GPU-side) includes CUDA contexts, streams, device memory, and virtual address mappings; none of this is visible to the host. To serialize it, cuda-checkpoint uses the checkpointing capability of the CUDA driver to dump the device state to CPU memory of the process owning each CUDA context. Host state (CPU-side) includes CPU memory, threads, file descriptors, and namespaces. CRIU (Checkpoint/Restore in Userspace) walks the Linux kernel’s bookkeeping and serializes the process tree’s state to disk.

When checkpointing, the two tools run in order: cuda-checkpoint dumps all device state into CPU memory first, then CRIU dumps all host-side process tree state to a folder in storage. When restoring on the same or a different node, the order reverses: CRIU restores the process tree from distributed storage such as NFS or SMB first, then cuda-checkpoint restores the GPU state from what is now in CPU memory onto the new GPUs.
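The ordering above can be sketched as two command lists. This is a minimal illustration, not the snapshot-agent's actual invocation, which the article does not show. It assumes the `cuda-checkpoint --toggle --pid` form of the NVIDIA utility and the basic `criu dump` / `criu restore` options; a real container checkpoint would need additional CRIU flags, and the helper names are hypothetical.

```python
from pathlib import Path
from typing import List


def checkpoint_commands(pid: int, images_dir: Path) -> List[List[str]]:
    """Commands in checkpoint order: GPU state first, then host state."""
    return [
        # Move the process's CUDA state into its own host memory.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        # Dump the process tree rooted at pid into images_dir.
        ["criu", "dump", "--tree", str(pid), "--images-dir", str(images_dir)],
    ]


def restore_commands(pid: int, images_dir: Path) -> List[List[str]]:
    """Commands in restore order: host state first, then GPU state."""
    return [
        # Recreate the process tree (including the saved CUDA state in RAM).
        ["criu", "restore", "--images-dir", str(images_dir), "--restore-detached"],
        # Push the saved CUDA state back onto a GPU.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
    ]
```

Each list could be passed to `subprocess.run(cmd, check=True)` in sequence; the point of the sketch is only that the two tools swap places between checkpoint and restore.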

CRIU is fundamentally a freeze-and-thaw mechanism. When a process is restored, execution resumes at the exact instruction where it was checkpointed, completely unaware that checkpointing or restoration occurred. As a result, any coordination required before checkpointing (such as quiescing the workload) or after restoration (such as re-establishing external state) must be handled externally through an orchestrator or workload-specific hooks.

How Dynamo Snapshot Works on Kubernetes

In Kubernetes, workloads run inside containers inside pods. Because CRIU checkpoints include references to the container’s writable filesystem layer, checkpointing is done at the container level so the process tree state and filesystem travel together.

NVIDIA provides a privileged DaemonSet, snapshot-agent, installable through a Helm chart. An agent runs on every node and handles checkpoint and restore for runc-managed containers without requiring changes to runc itself. On checkpoint, the agent waits for the workload’s readiness probe, invokes cuda-checkpoint and CRIU from the host side, and writes the artifact to shared storage. The workload may have created or deleted files local to the container (the overlay filesystem), which the agent also checkpoints after the CRIU stage.

On restore, the agent launches a lightweight placeholder pod, restores the overlay filesystem, and restores the CRIU/CUDA checkpoint into its namespaces. Each agent operates independently on its local node, allowing checkpoints and restores to parallelize naturally across the cluster.

This DaemonSet approach was chosen over Kubernetes native checkpoint/restore support in runc for three reasons: it is fully portable without depending on cloud-provider feature gates, it gives tighter control over CRIU for performance tuning, and it allows checkpoint artifacts to live in flexible storage backends rather than being embedded into OCI images.

Quiesce/resume hooks: A Dynamo inference worker initializes in two ordered phases. First, engine initialization: communicators are initialized, weights are loaded, kernels are warmed up, and CUDA graphs are compiled. The worker is fully warm at this point but not yet discoverable outside its pod. Second, distributed runtime startup: the worker connects to the Dynamo control plane and registers with the discovery backend. Open TCP connections to the control plane exist from this point onward.

If the checkpoint were taken after distributed runtime startup, there would be active TCP connections that CRIU cannot capture. The solution is quiesce/resume hooks: the worker writes a ‘ready for checkpoint’ signal file after engine initialization but before distributed runtime startup. The worker then enters a polling loop waiting for a ‘restore complete’ signal file while the snapshot agent checkpoints it externally. Because CRIU restores execution at the exact instruction where checkpointing occurred, the worker resumes directly inside the polling loop, detects the signal file, and proceeds with distributed runtime initialization without requiring additional synchronization.
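A minimal sketch of the worker side of this handshake follows. The file names, the directory, and the function name are hypothetical, since the article does not give the actual paths Dynamo uses; the `timeout` parameter exists only so the sketch can be exercised without a real agent.

```python
import time
from pathlib import Path
from typing import Optional


def quiesce_until_restored(signal_dir: Path, poll_seconds: float = 0.05,
                           timeout: Optional[float] = None) -> bool:
    """Announce readiness, then poll until the restore signal file appears."""
    ready = signal_dir / "ready-for-checkpoint"   # hypothetical file name
    restored = signal_dir / "restore-complete"    # hypothetical file name

    # Phase 1 (engine initialization) is assumed to be finished by now.
    ready.touch()

    deadline = None if timeout is None else time.monotonic() + timeout
    # The checkpoint is taken while the process sits in this loop, so a
    # restored process wakes up here and simply keeps polling.
    while not restored.exists():
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
    return True  # caller can now start phase 2 (distributed runtime startup)
```

Because the loop holds no sockets and no external state, the same code path serves both the original process (which is frozen and discarded) and every restored copy.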

The quiesce/resume pattern is also important for multi-GPU and multi-node checkpoints (planned for a future release): outbound TCP connections used for RPC cannot be checkpointed in an established state because the pod IP changes between checkpoint and restore, and RDMA registrations and NIC state must be recreated post-restore.

Optimization 1: KV Cache Unmap and Release

After measuring peak GPU memory usage while weights, CUDA graphs, and other buffers are allocated, inference engines allocate the remaining GPU memory as a large KV cache buffer. Since the checkpoint is taken before the replica has served any requests, this KV cache buffer does not need to be checkpointed at all. However, its virtual address must remain stable because it is baked into the CUDA graph.

The solution is to allocate the KV cache through the CUDA Virtual Memory Management API (cuMemCreate and cuMemMap), then free the underlying physical allocation with cuMemUnmap and cuMemRelease, but not cuMemAddressFree. This keeps the virtual address range intact while releasing the physical memory. This functionality is natively available in vLLM via sleep() and wake_up() and in SGLang via torch_memory_saver.
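The invariant can be illustrated with a toy model in plain Python. It makes no CUDA calls; the class, its methods, and the 180 GiB figure are hypothetical, and `cuMemAddressReserve` (the reservation call in the same driver API) is named here even though the article does not mention it.

```python
class ToyAddressSpace:
    """Toy model: a virtual range can stay reserved with no physical backing."""

    def __init__(self) -> None:
        self._next_va = 0x7000_0000_0000  # arbitrary illustrative start address
        self.reserved = {}  # virtual address -> reserved size in bytes
        self.backed = {}    # virtual address -> physical bytes mapped there

    def reserve(self, size: int) -> int:
        """Models cuMemAddressReserve: hand out an address range."""
        va = self._next_va
        self._next_va += size
        self.reserved[va] = size
        return va

    def create_and_map(self, va: int) -> None:
        """Models cuMemCreate + cuMemMap: back the range with memory."""
        self.backed[va] = self.reserved[va]

    def unmap_and_release(self, va: int) -> None:
        """Models cuMemUnmap + cuMemRelease: drop the memory, keep the range."""
        del self.backed[va]

    def physical_bytes(self) -> int:
        return sum(self.backed.values())


space = ToyAddressSpace()
kv_cache_va = space.reserve(180 * 2**30)   # hypothetical KV cache size
space.create_and_map(kv_cache_va)
space.unmap_and_release(kv_cache_va)       # before checkpoint: nothing to dump
checkpointed_bytes = space.physical_bytes()
space.create_and_map(kv_cache_va)          # after restore: same address again
```

The address captured in the CUDA graph stays valid throughout because the reservation is never freed; only the bytes behind it come and go.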

For Qwen3-0.6B on a B200, this reduces the total artifact size from ~190 GiB to ~6 GiB. The wins are most pronounced for large KV cache sizes, that is, when model weights are small relative to GPU memory.

Optimization 2: Speeding Up CRIU Memory Restore

Even after the artifact is smaller, upstream CRIU restore time remains a bottleneck. For larger models, restore time actually exceeds cold-start time, which negates the benefit of checkpointing.

Note: The CRIU optimizations described below are not yet shipped as part of Dynamo Snapshot. They may be available once merged into upstream CRIU.

2.1 Parallel memfd restore: vLLM’s sleep()/wake_up() path and SGLang’s torch_memory_saver move weight-tagged GPU allocations into pinned CPU shadow buffers. CUDA backs these allocations with shared anonymous memory, pinned through the NVIDIA driver. Inside the Linux kernel, these appear as memfds: anonymous, RAM-backed files mapped with MAP_SHARED. For gpt-oss-120b, these buffers consumed more than 120 GiB, split across many independent buffers of 2 GiB or less. Upstream CRIU restores these buffers serially. The modified CRIU enumerates all unique shmem-backed objects, then uses a thread pool to restore them in parallel, allowing restore to use available storage bandwidth and CPU parallelism.
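The serial-versus-parallel difference can be sketched in a few lines. CRIU itself is written in C and restores into shmem-backed objects; this Python version only illustrates the pattern of giving each independent buffer to a pool worker, and the function names and the one-file-per-buffer layout are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Dict, List


def restore_one(path: Path) -> bytearray:
    """Allocate a buffer of the right size and fill it from one image file."""
    size = path.stat().st_size
    buf = bytearray(size)
    view = memoryview(buf)
    filled = 0
    with open(path, "rb", buffering=0) as f:
        while filled < size:
            n = f.readinto(view[filled:])
            if not n:
                raise EOFError(f"{path} ended after {filled} of {size} bytes")
            filled += n
    return buf


def restore_all(paths: List[Path], workers: int = 8) -> Dict[Path, bytearray]:
    """Restore every buffer concurrently; each worker is fully independent."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(restore_one, paths)))
```

File reads release the interpreter lock, so the threads overlap their I/O; the serial baseline would be the same `restore_one` called in a plain loop.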

2.2 Linux native AIO for anonymous memory: In upstream CRIU, the memory restore path is a synchronous preadv loop with exactly one read in flight at any moment, leaving the storage device idle between requests. The replacement uses Linux native AIO: CRIU submits a batch of iocbs via io_submit and keeps a sliding window of up to 128 reads in flight concurrently. As completions arrive via io_getevents, new submissions backfill the window.

Where the storage backend supports it, both anonymous and shared memory reads use O_DIRECT, avoiding unnecessary page cache pressure during the one-pass restore stream. Linux native AIO is only truly asynchronous on files opened with O_DIRECT. On filesystems where O_DIRECT is unavailable, such as some NFS deployments, restore falls back to buffered I/O with sequential readahead, and the gains from AIO are significantly reduced.
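The sliding-window idea can be shown without libaio. In the sketch below, a thread pool running `os.pread` stands in for `io_submit` and `io_getevents`, so it is a substitute technique rather than native AIO, and it uses buffered reads rather than O_DIRECT. The function name and the `(offset, length)` extent format are assumptions.

```python
import os
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import List, Tuple


def windowed_read(fd: int, extents: List[Tuple[int, int]],
                  window: int = 128) -> List[bytes]:
    """Read (offset, length) extents from fd with up to `window` in flight."""
    results: List[bytes] = [b""] * len(extents)
    pending = {}  # future -> index into extents
    next_idx = 0
    with ThreadPoolExecutor(max_workers=window) as pool:
        while next_idx < len(extents) or pending:
            # Backfill: submit new reads until the window is full again.
            while next_idx < len(extents) and len(pending) < window:
                offset, length = extents[next_idx]
                pending[pool.submit(os.pread, fd, length, offset)] = next_idx
                next_idx += 1
            # Reap at least one completion before submitting more.
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                results[pending.pop(fut)] = fut.result()
    return results
```

With `window=1` this degenerates to the one-read-at-a-time loop described for upstream CRIU; larger windows keep the device queue populated. A production version would also handle short reads, which `os.pread` may return.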

Combined results across three models (checkpoint sizes after KV cache unmap):

| Model | Checkpoint size | CRIU (upstream) | CRIU (AIO) | CRIU (AIO + parallel memfd) | Speedup | SOL* |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3-0.6B | 6.2 GiB | 6.8 s | 2.9 s | 2.4 s | 2.8× | 0.95 s |
| Qwen3-8B | 26 GiB | 24 s | 11 s | 4.7 s | 5.1× | 1.8 s |
| gpt-oss-120b | 129 GiB | 119 s | 54 s | 15 s | 7.9× | 11 s |

*SOL (speed of light) is the theoretical maximum restore speed given available storage bandwidth: the floor below which restore time cannot go.
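As a quick check on the reported figures, the speedup column is the upstream restore time divided by the fully optimized time, and the remaining headroom is the optimized time divided by SOL. The snippet below recomputes both from the published numbers; the `summarize` helper is only illustrative.

```python
# (upstream s, AIO + parallel memfd s, SOL s), copied from the results above.
rows = {
    "Qwen3-0.6B": (6.8, 2.4, 0.95),
    "Qwen3-8B": (24.0, 4.7, 1.8),
    "gpt-oss-120b": (119.0, 15.0, 11.0),
}


def summarize(upstream_s: float, optimized_s: float, sol_s: float):
    """Return (speedup over upstream, ratio of optimized time to SOL)."""
    speedup = upstream_s / optimized_s
    gap_to_sol = optimized_s / sol_s
    return round(speedup, 1), round(gap_to_sol, 1)


summary = {model: summarize(*times) for model, times in rows.items()}
for model, (speedup, gap) in summary.items():
    print(f"{model}: {speedup}x faster than upstream, {gap}x of SOL")
```

The recomputed speedups match the published column, and the ratio to SOL is smallest for gpt-oss-120b, the largest checkpoint.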

At this point CRIU restore time is close to SOL, but end-to-end restore is still dominated by transferring large model weights sequentially from storage through host memory onto the GPU. This is a serial bottleneck: cuda-checkpoint cannot restore GPU memory until CRIU materializes the weights in host memory.

Optimization 3: GPU Memory Service (GMS)

To eliminate the serial weight-transfer bottleneck, NVIDIA’s research team developed the GPU Memory Service (GMS). GMS uses the CUDA Virtual Memory Management (VMM) API to decouple large model weights from the inference worker’s process lifetime, offloading the majority of process memory into a separate GMS artifact. By removing weights from the core CRIU checkpoint, GMS allows process state restoration and weight restoration to run concurrently using different memory bandwidth channels. Weight restoration can use the fastest available paths such as GPUDirect Storage (GDS) or peer-GPU RDMA/NVLink.

Checkpoint artifact sizes with GMS:

| Model | CRIU checkpoint (baseline) | CRIU checkpoint (with GMS) | GMS weight artifact |
| --- | --- | --- | --- |
| Qwen3-0.6B | 6.2 GiB | 4.3 GiB | 1.2 GiB |
| Qwen3-8B | 26 GiB | 4.8 GiB | 15 GiB |
| gpt-oss-120b | 129 GiB | 6.7 GiB | 74 GiB |

In a proof-of-concept weight restoration backend that stripes weights across 8 local NVMe SSDs, weight restoration completes in parallel with CRIU process restore, bringing total end-to-end startup time for gpt-oss-120b under 5 seconds, a 21× reduction. Restore times are measured from a common restore trigger timestamp, excluding container startup time.

Deployment: Kubernetes Resources

The deployment workflow uses three Kubernetes resources. The snapshot-agent DaemonSet is installed via Helm chart. The DynamoCheckpoint custom resource (shortname: dckpt) defines which model configuration to checkpoint. The DynamoGraphDeployment CR references the checkpoint for restore.

Prerequisites from the documentation: x86_64 (amd64) GPU nodes; NVIDIA driver 580.xx or newer on GPU nodes (590.xx or newer for multi-GPU snapshots); ReadWriteMany storage for cross-node restore; current backend support is vLLM only, in limited preview.

The DynamoCheckpoint identity is a 16-character SHA256 hash of fields that affect runtime state: model, backendFramework, dynamoVersion, tensorParallelSize, pipelineParallelSize, dtype, maxModelLen, and extraParameters. Fields that do not affect the hash include replica count, node placement, resource limits, and observability configuration.
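A sketch of how such an identity could be derived is below. The field list comes from the article, but the serialization (sorted-key JSON) and the truncation (first 16 hex characters) are assumptions, so this will not reproduce the hashes the Dynamo operator actually computes. The function name and the sample values are hypothetical.

```python
import hashlib
import json

# Fields the article lists as affecting runtime state.
IDENTITY_FIELDS = (
    "model", "backendFramework", "dynamoVersion", "tensorParallelSize",
    "pipelineParallelSize", "dtype", "maxModelLen", "extraParameters",
)


def checkpoint_identity(spec: dict) -> str:
    """Hash only the identity fields; everything else in spec is ignored."""
    subset = {key: spec.get(key) for key in IDENTITY_FIELDS}
    # Assumed canonical form: compact JSON with sorted keys.
    canonical = json.dumps(subset, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


base = {  # illustrative values only
    "model": "Qwen/Qwen3-0.6B", "backendFramework": "vllm",
    "dynamoVersion": "1.1.1", "tensorParallelSize": 1,
    "pipelineParallelSize": 1, "dtype": "bfloat16",
    "maxModelLen": 8192, "extraParameters": {},
}
same_identity = checkpoint_identity({**base, "replicas": 4})      # not hashed
other_identity = checkpoint_identity({**base, "dtype": "float16"})  # hashed
```

Changing the replica count leaves the identity unchanged, so scaled-out replicas can share one checkpoint, while changing `dtype` yields a different identity and therefore a different checkpoint.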

Two deployment modes exist. The explicit checkpointRef mode references a ready DynamoCheckpoint by name. Auto mode has the operator compute the identity hash, look for a matching DynamoCheckpoint, and create one only when no match exists; in that case the first worker cold-starts and the checkpoint is created in the background for subsequent scale events.

Current limitations: checkpoint/restore supports vLLM workers only, in limited preview; specialized workers (multimodal, embedding, diffusion) are not supported; multi-GPU tensor-parallel configurations have limited validation; GMS restore is not yet available; snapshot-agent must run privileged; and restore is sensitive to live TCP socket state.

Key Takeaways

  • Dynamo Snapshot uses CRIU and cuda-checkpoint to freeze and restore single-GPU inference workers on Kubernetes, avoiding full cold-start latency.
  • KV cache unmap via cuMemUnmap and cuMemRelease reduces checkpoint artifact size from ~190 GiB to ~6 GiB for Qwen3-0.6B on a B200.
  • Linux native AIO and parallel memfd restore cut CRIU restore time by up to 7.9× over upstream CRIU; these optimizations are pending upstream CRIU merge.
  • The GPU Memory Service (GMS) decouples model weights from the CRIU artifact, enabling concurrent process and weight restoration over channels like GPUDirect Storage.
  • In a proof-of-concept using 8 striped local NVMe SSDs, gpt-oss-120b startup time is reduced by 21× to under 5 seconds.

Marktechpost’s Visual Explainer

NVIDIA
Dynamo Snapshot — Kubernetes Inference Guide


01 — Overview
What Is NVIDIA Dynamo Snapshot?

In production inference deployments, demand fluctuates over time, requiring inference replicas to scale elastically.
Cold-starting inference workloads on Kubernetes can take several minutes. During that time, GPUs are allocated but idle,
producing no tokens and serving no requests.

NVIDIA Dynamo Snapshot is a checkpoint/restore system for AI inference workloads on Kubernetes.
It serializes the full state of a running inference worker, both GPU-side and CPU-side, and restores it on the same or a different node,
skipping the cold-start sequence entirely.

21× startup speedup: gpt-oss-120b (PoC)
Under 5 s restore time: 8× NVMe SSDs (PoC)
6 GiB checkpoint size: vs ~190 GiB baseline

02 — Core Tools
CRIU and cuda-checkpoint

A running inference worker has two kinds of checkpointable state. Dynamo Snapshot uses one tool per kind:

  • cuda-checkpoint: serializes GPU device state (CUDA contexts, streams, device memory, virtual address mappings) into CPU memory of the process owning each CUDA context. Uses the checkpointing capability of the CUDA driver.
  • CRIU (Checkpoint/Restore in Userspace): walks Linux kernel bookkeeping and serializes the host-side process tree (CPU memory, threads, file descriptors, namespaces) to disk.
Checkpoint order: cuda-checkpoint dumps GPU state to CPU memory first, then CRIU dumps all host-side state to storage.
Restore order: CRIU reconstructs the process tree from storage first, then cuda-checkpoint restores GPU state from what is now in CPU memory onto the new GPUs.

03 — Kubernetes Architecture
The snapshot-agent DaemonSet

Dynamo Snapshot is deployed as a privileged DaemonSet called snapshot-agent, installed via a Helm chart.
One agent runs on every node and handles checkpoint and restore for runc-managed containers without modifying runc itself.

  • On checkpoint: agent waits for the workload readiness probe, invokes cuda-checkpoint and CRIU, then writes the artifact to shared storage. The overlay filesystem is also checkpointed after the CRIU stage.
  • On restore: agent launches a lightweight placeholder pod, restores the overlay filesystem, then restores the CRIU/CUDA checkpoint into its namespaces.
  • Parallelism: each agent operates independently on its local node, so checkpoints and restores parallelize naturally across the cluster.
  • Portability: no cloud-provider feature gate dependency; checkpoint artifacts live in flexible storage backends, not embedded in OCI images.

04 — Workload Coordination
Quiesce/Resume Hooks

A Dynamo inference worker initializes in two ordered phases. The checkpoint must be taken between them:

  • Phase 1, engine initialization: communicators initialized, weights loaded, kernels warmed up, CUDA graphs compiled. Worker is fully warm but not yet discoverable outside its pod.
  • Phase 2, distributed runtime startup: worker connects to the Dynamo control plane and registers with the discovery backend. Open TCP connections exist from this point onward, and CRIU cannot capture them.
Implementation: the worker writes a "ready for checkpoint" signal file after Phase 1 but before Phase 2, then enters a polling loop.
The snapshot agent checkpoints it while it waits. On restore, CRIU resumes execution inside the polling loop.
The worker detects the "restore complete" signal file and proceeds with Phase 2; no further synchronization is needed.

05 — Optimization 1
KV Cache Unmap and Release

Inference engines allocate remaining GPU memory as a large KV cache buffer after weights and CUDA graphs are placed.
Since the checkpoint is taken before any requests are served, the KV cache contents do not need to be checkpointed.
However, its virtual address must stay stable because it is baked into the CUDA graph.

  • Allocate the KV cache through the CUDA Virtual Memory Management API: cuMemCreate + cuMemMap.
  • Free physical memory with cuMemUnmap + cuMemRelease, but not cuMemAddressFree. The virtual address range stays intact.
  • Already natively available in vLLM via sleep() / wake_up() and in SGLang via torch_memory_saver.
Before unmap: ~190 GiB (Qwen3-0.6B on B200)
After unmap: ~6 GiB (same model, same GPU)

06 — Optimization 2.1
Parallel memfd Restore

vLLM’s sleep()/wake_up() and SGLang’s torch_memory_saver move weight-tagged GPU allocations
into pinned CPU shadow buffers. CUDA backs these with shared anonymous memory, which appears in the Linux kernel as
memfds: anonymous, RAM-backed files mapped with MAP_SHARED.

  • For gpt-oss-120b, these buffers consumed more than 120 GiB, split across many independent ≤2 GiB buffers.
  • Upstream CRIU restores these buffers serially: create one shmem-backed object, resize, map, read, then move to the next.
  • Modified CRIU enumerates all unique shmem-backed objects, then launches a thread pool to restore them in parallel. Each worker allocates its buffer and reads from the checkpoint independently.
Note: These CRIU optimizations are not yet shipped as part of Dynamo Snapshot. They will likely be available once merged into upstream CRIU.

07 — Optimization 2.2
Linux Native AIO for Anonymous Memory

After restoring shared resources, CRIU must fill each process’s private memory (heap pages, stacks, anonymous mappings,
and copy-on-write private file mappings) at the exact virtual addresses they had before checkpoint.

  • Upstream CRIU: synchronous preadv loop with one read in flight at a time. The storage device is idle between requests, so it cannot saturate fast NVMe bandwidth.
  • Modified CRIU: Linux native AIO. Submits batches of iocbs via io_submit and keeps a sliding window of up to 128 reads in flight. Completions arrive via io_getevents; new submissions backfill the window.
  • Where supported, both anonymous and shared memory reads use O_DIRECT to avoid unnecessary page cache pressure. AIO is only truly asynchronous on O_DIRECT files; on some NFS deployments without O_DIRECT, gains are reduced.

08 — Results
CRIU Restore Time Comparison
Combined results across three models after KV cache unmap. SOL = speed of light (theoretical maximum restore speed given available storage bandwidth).

| Model | Ckpt size | Upstream | AIO only | AIO + memfd | Speedup | SOL |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3-0.6B | 6.2 GiB | 6.8 s | 2.9 s | 2.4 s | 2.8× | 0.95 s |
| Qwen3-8B | 26 GiB | 24 s | 11 s | 4.7 s | 5.1× | 1.8 s |
| gpt-oss-120b | 129 GiB | 119 s | 54 s | 15 s | 7.9× | 11 s |

Note: These CRIU optimizations are pending upstream CRIU merge and are not yet shipped in Dynamo Snapshot.

09 — Optimization 3
GPU Memory Service (GMS)

Even after optimizing CRIU, a hard bottleneck remained: cuda-checkpoint cannot restore GPU memory until CRIU
fully materializes the weights in host memory, which is a serial dependency. GMS eliminates it.

  • GMS uses the CUDA Virtual Memory Management (VMM) API to decouple large model weights from the inference worker’s process lifetime into a separate GMS artifact.
  • Process state (CRIU) and weight restoration (GMS) run concurrently using different memory bandwidth channels.
  • Weight restoration can use the fastest available paths: GPUDirect Storage (GDS) or peer-GPU RDMA/NVLink.

| Model | CRIU (baseline) | CRIU + GMS | GMS artifact |
| --- | --- | --- | --- |
| Qwen3-0.6B | 6.2 GiB | 4.3 GiB | 1.2 GiB |
| Qwen3-8B | 26 GiB | 4.8 GiB | 15 GiB |
| gpt-oss-120b | 129 GiB | 6.7 GiB | 74 GiB |

10 — Deployment & Roadmap
Deploying Snapshot & What’s Next
Prerequisites (from docs.nvidia.com/dynamo v1.1.1):

  • x86_64 (amd64) GPU nodes; NVIDIA driver 580.xx+ (590.xx+ for multi-GPU snapshots)
  • ReadWriteMany storage for cross-node restore
  • vLLM backend only, in limited preview; specialized workers (multimodal, embedding, diffusion) not supported
  • Checkpoint identity is a 16-character SHA256 hash of: model, backendFramework, tensorParallelSize, dtype, maxModelLen, dynamoVersion, pipelineParallelSize, extraParameters

Roadmap (in progress):

  • GMS restore path with pluggable backends (GDS, UCX) — gated on pending CUDA driver patch
  • TensorRT-LLM support
  • Multi-GPU and multi-node support via quiesce/resume hooks for PyTorch, NCCL, NIXL



The post NVIDIA AI Releases Dynamo Snapshot: A CRIU-Based Fast Startup System for AI Inference on Kubernetes appeared first on MarkTechPost.
