NVIDIA AI Releases Dynamo Snapshot: A CRIU-Based Fast Startup System for AI Inference on Kubernetes
In production inference deployments, demand fluctuates over time, requiring inference replicas to scale elastically. Cold-starting inference workloads on Kubernetes can take several minutes. During that time, GPUs are allocated but idle, producing no tokens and serving no requests.
'Cold start' means the full sequence a model server must complete before serving any request: pulling the container image, loading model weights into GPU memory, warming up CUDA kernels, compiling or capturing CUDA graphs, and registering with the service discovery layer. This delay increases the risk of SLA violations during traffic spikes, because the system cannot scale quickly enough to absorb sudden increases in demand.
The cold-start latency for a single-GPU vLLM (v0.20.0) workload breaks into three segments: container/image pull, engine initialization (weight loading, kernel warmup, graph compilation), and distributed runtime startup.
To address this, NVIDIA's AI research team has released NVIDIA Dynamo Snapshot: a checkpoint/restore approach for AI inference workloads on Kubernetes.

What are CRIU and cuda-checkpoint?
A running inference worker's checkpointable state has two parts. Device state (GPU-side) includes CUDA contexts, streams, device memory, and virtual address mappings; none of this is visible to the host. To serialize it, cuda-checkpoint uses the checkpointing capability of the CUDA driver to dump the device state into the CPU memory of the process that owns each CUDA context. Host state (CPU-side) includes CPU memory, threads, file descriptors, and namespaces. CRIU (Checkpoint/Restore in Userspace) walks the Linux kernel's bookkeeping and serializes the process tree's state to disk.
When checkpointing, the two tools run in order: cuda-checkpoint dumps all device state into CPU memory first, then CRIU dumps all host-side process tree state to a folder in storage. When restoring on the same or a different node, the order is reversed: CRIU first restores the process tree from distributed storage such as NFS or SMB, then cuda-checkpoint restores the GPU state from what is now in CPU memory onto the new GPUs.
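The ordering above can be sketched as two lists of commands. This is a minimal illustration, not the snapshot-agent's actual invocation, which the article does not show. It assumes the publicly documented `cuda-checkpoint --toggle --pid` form and CRIU's `dump` and `restore` subcommands with `--tree`, `--images-dir`, and `--restore-detached`; the function names are hypothetical, and the commands are only built, never executed.

```python
from typing import List


def checkpoint_commands(pid: int, image_dir: str) -> List[List[str]]:
    """Build argv lists in checkpoint order: device state first, then host state."""
    return [
        # Step 1: move GPU state into the owning process's host memory.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        # Step 2: dump the process tree, which now holds the GPU state, to disk.
        ["criu", "dump", "--tree", str(pid), "--images-dir", image_dir],
    ]


def restore_commands(pid: int, image_dir: str) -> List[List[str]]:
    """Build argv lists in restore order: host state first, then device state.

    Assumes the restored process keeps its original PID.
    """
    return [
        # Step 1: rebuild the process tree from the image directory.
        ["criu", "restore", "--images-dir", image_dir, "--restore-detached"],
        # Step 2: push the GPU state held in host memory back onto a GPU.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
    ]


if __name__ == "__main__":
    for argv in checkpoint_commands(1234, "/mnt/checkpoints/demo"):
        print(" ".join(argv))
    for argv in restore_commands(1234, "/mnt/checkpoints/demo"):
        print(" ".join(argv))
```

The restore list is the checkpoint list mirrored, which is the property the two-tool design depends on: CRIU never has to understand GPU state, because by the time it runs, that state is ordinary host memory.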
CRIU is fundamentally a freeze-and-thaw mechanism. When a process is restored, execution resumes at the exact instruction where it was checkpointed, completely unaware that checkpointing or restoration occurred. Because of this, any coordination required before checkpointing (such as quiescing the workload) or after restoration (such as re-establishing external state) must be handled externally through an orchestrator or workload-specific hooks.
How Dynamo Snapshot Works on Kubernetes
In Kubernetes, workloads run inside containers inside pods. Because CRIU checkpoints include references to the container's writable filesystem layer, checkpointing is done at the container level so the process tree state and filesystem travel together.
NVIDIA provides a privileged DaemonSet, snapshot-agent, installable through a Helm chart. An agent runs on every node and handles checkpoint and restore for runc-managed containers without requiring changes to runc itself. On checkpoint, the agent waits for the workload's readiness probe, invokes cuda-checkpoint and CRIU from the host side, and writes the artifact to shared storage. The workload may have created or deleted files local to the container (the overlay filesystem), which the agent also checkpoints after the CRIU stage.
On restore, the agent launches a lightweight placeholder pod, restores the overlay filesystem, and restores the CRIU/CUDA checkpoint into its namespaces. Each agent operates independently on its local node, allowing checkpoints and restores to parallelize naturally across the cluster.
This DaemonSet approach was chosen over Kubernetes native checkpoint/restore support in runc for three reasons: it is fully portable without depending on cloud-provider feature gates, it gives tighter control over CRIU for performance tuning, and it allows checkpoint artifacts to live in flexible storage backends rather than being embedded into OCI images.
Quiesce/resume hooks: A Dynamo inference worker initializes in two ordered phases. First, engine initialization: communicators are initialized, weights are loaded, kernels are warmed up, and CUDA graphs are compiled. The worker is fully warm at this point but not yet discoverable outside its pod. Second, distributed runtime startup: the worker connects to the Dynamo control plane and registers with the discovery backend. Open TCP connections to the control plane exist from this point onward.
If a checkpoint were taken after distributed runtime startup, there would be active TCP connections that CRIU cannot capture. The solution is quiesce/resume hooks: the worker writes a 'ready for checkpoint' signal file after engine initialization but before distributed runtime startup. The worker then enters a polling loop waiting for a 'restore complete' signal file while the snapshot agent checkpoints it externally. Because CRIU restores execution at the exact instruction where checkpointing occurred, the worker resumes directly inside the polling loop, detects the signal file, and proceeds with distributed runtime initialization without requiring additional synchronization.
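The worker side of this handshake can be sketched in a few lines. This is a minimal illustration under stated assumptions: the file names `ready-for-checkpoint` and `restore-complete`, the function name, and the `max_polls` parameter are all hypothetical and are not taken from Dynamo's source.

```python
import os
import time
from typing import Optional


def quiesce_and_wait(signal_dir: str,
                     poll_interval: float = 0.05,
                     max_polls: Optional[int] = None) -> bool:
    """Signal readiness, then poll until the restore-complete file appears.

    Returns True once the restore signal is seen, or False if max_polls
    is exhausted first. With max_polls=None the loop polls indefinitely.
    """
    ready_path = os.path.join(signal_dir, "ready-for-checkpoint")      # hypothetical name
    restored_path = os.path.join(signal_dir, "restore-complete")       # hypothetical name

    # Engine is warm and no control-plane sockets exist yet: safe to freeze.
    with open(ready_path, "w") as f:
        f.write("ready\n")

    polls = 0
    # A checkpoint may be taken at any iteration of this loop. After restore,
    # execution continues from that same iteration on the new node.
    while max_polls is None or polls < max_polls:
        if os.path.exists(restored_path):
            return True
        time.sleep(poll_interval)
        polls += 1
    return False


if __name__ == "__main__":
    import tempfile
    import threading

    signal_dir = tempfile.mkdtemp()
    # Stand-in for the external agent dropping the restore signal.
    threading.Timer(
        0.2, lambda: open(os.path.join(signal_dir, "restore-complete"), "w").close()
    ).start()
    if quiesce_and_wait(signal_dir):
        print("restored: starting distributed runtime")
```

The bound is expressed as a poll count rather than a wall-clock deadline because wall-clock time keeps advancing while the process is frozen, so a deadline computed before the checkpoint could already have passed by the time the process is restored.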
The quiesce/resume pattern is also necessary for multi-GPU and multi-node checkpoints (planned for a future release): outbound TCP connections used for RPC cannot be checkpointed in an established state because the pod IP changes between checkpoint and restore, and RDMA registrations and NIC state must be recreated post-restore.
Optimization 1: KV Cache Unmap and Release
After measuring peak GPU memory usage while weights, CUDA graphs, and other buffers are allocated, inference engines allocate the remaining GPU memory as a large KV cache buffer. Since the checkpoint is taken before the replica has served any requests, this KV cache buffer does not need to be checkpointed at all. However, its virtual address must remain stable because it is baked into the CUDA graph.
The solution is to allocate the KV cache via the CUDA Virtual Memory Management API (cuMemCreate and cuMemMap), then free the underlying physical allocation with cuMemUnmap and cuMemRelease, but not cuMemAddressFree. This keeps the virtual address range intact while releasing the physical memory. This functionality is natively available in vLLM via sleep() and wake_up() and in SGLang via torch_memory_saver.
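The call sequence can be followed with a toy state model. The sketch below does not call CUDA at all; it only tracks which resources would exist after each step. The class and method names are hypothetical, and cuMemAddressReserve (the driver call that reserves a virtual range) is mentioned in comments as an assumption about how the range is obtained, since the article does not name it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KVCacheRegion:
    """Toy model of one VMM-managed buffer. It does not touch a GPU."""
    size_bytes: int
    virtual_address: Optional[int] = None   # stands in for a reserved VA range
    physical_handle: Optional[int] = None   # stands in for a cuMemCreate handle
    mapped: bool = False

    def reserve_and_map(self, address: int, handle: int) -> None:
        # Models reserving a range (cuMemAddressReserve, assumed),
        # then cuMemCreate + cuMemMap.
        self.virtual_address = address
        self.physical_handle = handle
        self.mapped = True

    def release_physical(self) -> None:
        # Models cuMemUnmap + cuMemRelease. Deliberately no cuMemAddressFree,
        # so virtual_address is left untouched.
        self.mapped = False
        self.physical_handle = None

    def remap(self, handle: int) -> None:
        # Models a fresh cuMemCreate + cuMemMap at the same reserved address.
        if self.virtual_address is None:
            raise RuntimeError("address range was never reserved")
        self.physical_handle = handle
        self.mapped = True

    def checkpointed_bytes(self) -> int:
        # Only physically backed memory would contribute to the artifact.
        return self.size_bytes if self.mapped else 0


if __name__ == "__main__":
    region = KVCacheRegion(size_bytes=1 << 30)
    region.reserve_and_map(address=0x7F0000000000, handle=1)
    region.release_physical()                 # before checkpoint
    print(region.checkpointed_bytes())        # contributes nothing to the artifact
    region.remap(handle=2)                    # after restore
    print(hex(region.virtual_address))        # same address the CUDA graph captured
```

The point of the model is the invariant: `virtual_address` never changes between `reserve_and_map` and `remap`, which is what lets a captured CUDA graph keep using the buffer after restore.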
For Qwen3-0.6B on a B200, this reduces the total artifact size from ~190 GiB to ~6 GiB. The wins are most pronounced for large KV cache sizes, that is, when model weights are small relative to GPU memory.
Optimization 2: Speeding Up CRIU Memory Restore
Even after the artifact is smaller, upstream CRIU restore time remains a bottleneck. For larger models, restore time actually exceeds cold-start time, which negates the benefit of checkpointing.
Note: The CRIU optimizations described below are not yet shipped as part of Dynamo Snapshot. They may be available once merged into upstream CRIU.
2.1 - Parallel memfd restore: vLLM's sleep()/wake_up() path and SGLang's torch_memory_saver move weight-tagged GPU allocations into pinned CPU shadow buffers. CUDA backs these allocations with shared anonymous memory, pinned through the NVIDIA driver. Inside the Linux kernel, these appear as memfds: anonymous, RAM-backed files mapped with MAP_SHARED. For gpt-oss-120b, these buffers consumed more than 120 GiB, split across many independent buffers of 2 GiB or less. Upstream CRIU restores these buffers serially. The modified CRIU enumerates all unique shmem-backed objects, then uses a thread pool to restore them in parallel, allowing restore to use the available storage bandwidth and CPU parallelism.
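The serial-versus-parallel difference can be illustrated with ordinary files standing in for the shared-memory images. This is a sketch of the general pattern only: CRIU is written in C and its actual change is not shown in the article, and the function names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def restore_buffer(path: str) -> bytes:
    """Read one independent buffer image fully into memory."""
    with open(path, "rb") as f:
        return f.read()


def restore_serial(paths: List[str]) -> Dict[str, bytes]:
    """Baseline: one buffer at a time, as the article describes upstream CRIU."""
    return {p: restore_buffer(p) for p in paths}


def restore_parallel(paths: List[str], workers: int = 8) -> Dict[str, bytes]:
    """Enumerate unique buffers, then restore them concurrently."""
    unique = list(dict.fromkeys(paths))  # de-duplicate while keeping order
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its data.
        return dict(zip(unique, pool.map(restore_buffer, unique)))


if __name__ == "__main__":
    import os
    import tempfile

    tmp = tempfile.mkdtemp()
    paths = []
    for i in range(4):
        path = os.path.join(tmp, f"shmem-{i}.img")
        with open(path, "wb") as f:
            f.write(bytes([i]) * 4096)
        paths.append(path)
    assert restore_parallel(paths) == restore_serial(paths)
    print("parallel restore matches serial restore")
```

Because the buffers are independent, no ordering between them needs to be preserved, which is what makes a plain thread pool sufficient.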
2.2 - Linux native AIO for anonymous memory: In upstream CRIU, the memory restore path is a synchronous preadv loop with exactly one read in flight at any moment, leaving the storage device idle between requests. The replacement uses Linux native AIO: CRIU submits a batch of iocbs via io_submit and keeps a sliding window of up to 128 reads in flight concurrently. As completions arrive via io_getevents, new submissions backfill the window.
Where the storage backend supports it, both anonymous and shared memory reads use O_DIRECT, avoiding unnecessary page cache pressure during the one-pass restore stream. Linux native AIO is only truly asynchronous on files opened with O_DIRECT. On filesystems where O_DIRECT is unavailable, such as some NFS deployments, restore falls back to buffered I/O with sequential readahead, and the gains from AIO are significantly reduced.
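The sliding-window idea can be shown without kernel AIO. Python's standard library has no io_submit binding, so the sketch below substitutes a thread pool issuing os.pread calls; it illustrates only the windowing logic (keep up to N reads in flight, backfill as each completes), not native AIO or O_DIRECT. The function name and parameters are hypothetical.

```python
import os
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def windowed_read(path: str, chunk_size: int, window: int = 128) -> bytes:
    """Read a file in chunks, keeping up to `window` reads in flight."""
    size = os.path.getsize(path)
    offsets = list(range(0, size, chunk_size))
    out = bytearray(size)
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=window) as pool:
            pending = {}   # future -> file offset of that read
            next_idx = 0
            while next_idx < len(offsets) or pending:
                # Backfill: submit new reads until the window is full.
                while next_idx < len(offsets) and len(pending) < window:
                    offset = offsets[next_idx]
                    future = pool.submit(os.pread, fd, chunk_size, offset)
                    pending[future] = offset
                    next_idx += 1
                # Block until at least one read completes, then place its data.
                done, _ = wait(pending, return_when=FIRST_COMPLETED)
                for future in done:
                    offset = pending.pop(future)
                    data = future.result()
                    out[offset:offset + len(data)] = data
    finally:
        os.close(fd)
    return bytes(out)


if __name__ == "__main__":
    import tempfile

    fd, path = tempfile.mkstemp()
    payload = os.urandom(1 << 20)
    os.write(fd, payload)
    os.close(fd)
    assert windowed_read(path, chunk_size=64 * 1024, window=8) == payload
    print("windowed read reproduced the file")
```

Completions can arrive out of order, so each read carries its own offset and writes into a preallocated buffer rather than appending.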
Combined results across three models (checkpoint sizes after KV cache unmap):
| Model | Checkpoint Size | CRIU (upstream) | CRIU (AIO) | CRIU (AIO + parallel memfd) | Speedup | SOL* |
|---|---|---|---|---|---|---|
| Qwen3-0.6B | 6.2 GiB | 6.8 s | 2.9 s | 2.4 s | 2.8× | 0.95 s |
| Qwen3-8B | 26 GiB | 24 s | 11 s | 4.7 s | 5.1× | 1.8 s |
| gpt-oss-120b | 129 GiB | 119 s | 54 s | 15 s | 7.9× | 11 s |
*SOL (speed of light) is the theoretical maximum restore speed given available storage bandwidth, that is, the floor below which restore time cannot go.
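The Speedup column is the upstream restore time divided by the time with both optimizations applied. A short check using only the numbers in the table reproduces it. The effective throughput figure (checkpoint size divided by optimized restore time) is a derived quantity added here for illustration and is not reported in the article.

```python
# Values copied from the table: checkpoint size, upstream CRIU time,
# and time with AIO + parallel memfd restore.
rows = {
    "Qwen3-0.6B": {"size_gib": 6.2, "upstream_s": 6.8, "optimized_s": 2.4},
    "Qwen3-8B": {"size_gib": 26.0, "upstream_s": 24.0, "optimized_s": 4.7},
    "gpt-oss-120b": {"size_gib": 129.0, "upstream_s": 119.0, "optimized_s": 15.0},
}


def speedup(row: dict) -> float:
    """Upstream restore time divided by optimized restore time."""
    return row["upstream_s"] / row["optimized_s"]


def effective_throughput_gib_s(row: dict) -> float:
    """Derived figure: checkpoint size divided by optimized restore time."""
    return row["size_gib"] / row["optimized_s"]


if __name__ == "__main__":
    for model, row in rows.items():
        print(f"{model}: {speedup(row):.1f}x speedup, "
              f"{effective_throughput_gib_s(row):.1f} GiB/s effective")
```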
At this point CRIU restore time is close to SOL, but end-to-end restore is still dominated by moving large model weights sequentially from storage through host memory onto the GPU. This is a serial bottleneck: cuda-checkpoint cannot restore GPU memory until CRIU materializes the weights in host memory.
Optimization 3: GPU Memory Service (GMS)
To eliminate the serial weight-transfer bottleneck, NVIDIA's research team developed the GPU Memory Service (GMS). GMS uses the CUDA Virtual Memory Management (VMM) API to decouple large model weights from the inference worker's process lifetime, offloading the majority of process memory into a separate GMS artifact. By removing weights from the core CRIU checkpoint, GMS allows process state restoration and weight restoration to run concurrently using different memory bandwidth channels. Weight restoration can use the fastest available paths, such as GPUDirect Storage (GDS) or peer-GPU RDMA/NVLink.
Checkpoint artifact sizes with GMS:
| Model | CRIU checkpoint (baseline) | CRIU checkpoint (with GMS) | GMS weight artifact |
|---|---|---|---|
| Qwen3-0.6B | 6.2 GiB | 4.3 GiB | 1.2 GiB |
| Qwen3-8B | 26 GiB | 4.8 GiB | 15 GiB |
| gpt-oss-120b | 129 GiB | 6.7 GiB | 74 GiB |
In a proof-of-concept weight restoration backend that stripes weights across 8 local NVMe SSDs, weight restoration completes in parallel with CRIU process restore, bringing total end-to-end startup time for gpt-oss-120b under 5 seconds, a 21× reduction. Restore times are measured from a common restore trigger timestamp, excluding container startup time.
Deployment: Kubernetes Resources
The deployment workflow uses three Kubernetes resources. The snapshot-agent DaemonSet is installed via Helm chart. The DynamoCheckpoint custom resource (shortname: dckpt) defines which model configuration to checkpoint. The DynamoGraphDeployment CR references the checkpoint for restore.
Prerequisites from the documentation: x86_64 (amd64) GPU nodes; NVIDIA driver 580.xx or newer on GPU nodes (590.xx or newer for multi-GPU snapshots); ReadWriteMany storage for cross-node restore; current backend support is vLLM only, in limited preview.
The DynamoCheckpoint identity is a 16-character SHA256 hash of fields that affect runtime state: model, backendFramework, dynamoVersion, tensorParallelSize, pipelineParallelSize, dtype, maxModelLen, and extraParameters. Fields that do not affect the hash include replica count, node placement, resource limits, and observability configuration.
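A content-derived identity of this kind can be sketched as follows. The field list comes from the article; everything else is an assumption made for illustration, including the canonical JSON serialization, the use of the first 16 hex characters of the digest, and the function name. The operator's real hashing scheme may differ.

```python
import hashlib
import json
from typing import Any, Dict

# Fields the article lists as affecting runtime state.
IDENTITY_FIELDS = (
    "model",
    "backendFramework",
    "dynamoVersion",
    "tensorParallelSize",
    "pipelineParallelSize",
    "dtype",
    "maxModelLen",
    "extraParameters",
)


def checkpoint_identity(config: Dict[str, Any]) -> str:
    """Hash only the identity fields; everything else in config is ignored."""
    relevant = {key: config.get(key) for key in IDENTITY_FIELDS}
    # Assumed canonical form: sorted keys, no whitespace, so equal configs
    # always serialize to the same bytes.
    canonical = json.dumps(relevant, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


if __name__ == "__main__":
    base = {
        "model": "Qwen/Qwen3-0.6B",
        "backendFramework": "vllm",
        "dtype": "bfloat16",
        "tensorParallelSize": 1,
        "replicas": 2,  # not an identity field
    }
    scaled = dict(base, replicas=10)
    other_dtype = dict(base, dtype="float16")
    print(checkpoint_identity(base) == checkpoint_identity(scaled))       # same checkpoint
    print(checkpoint_identity(base) == checkpoint_identity(other_dtype))  # different checkpoint
```

Excluding fields such as replica count from the hash is what lets a scale-out event reuse a checkpoint created by an earlier replica.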
Two deployment modes exist. The explicit checkpointRef mode references a ready DynamoCheckpoint by name. In auto mode, the operator computes the identity hash, looks for a matching DynamoCheckpoint, and creates one only when no match exists; in that case the first worker cold-starts and the checkpoint is created in the background for subsequent scale events.
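Auto mode amounts to a look-up-or-create step keyed by the identity hash. The sketch below models it with an in-memory dictionary; the function name, the `ckpt-` naming scheme, and the dictionary standing in for the cluster's DynamoCheckpoint objects are all hypothetical.

```python
from typing import Dict, Tuple


def resolve_checkpoint(identity: str, existing: Dict[str, str]) -> Tuple[str, bool]:
    """Return (checkpoint_name, created) for the given identity hash.

    `existing` maps identity hash -> checkpoint name and stands in for
    the DynamoCheckpoint objects already present in the cluster.
    """
    if identity in existing:
        # Match found: this worker can restore instead of cold-starting.
        return existing[identity], False
    # No match: register a new checkpoint. Per the article, the first worker
    # cold-starts and the artifact is produced in the background.
    name = f"ckpt-{identity}"  # hypothetical naming scheme
    existing[identity] = name
    return name, True


if __name__ == "__main__":
    registry: Dict[str, str] = {}
    print(resolve_checkpoint("0123456789abcdef", registry))  # first call creates
    print(resolve_checkpoint("0123456789abcdef", registry))  # second call reuses
```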
Current limitations: checkpoint/restore supports vLLM workers only, in limited preview; specialized workers (multimodal, embedding, diffusion) are not supported; multi-GPU tensor-parallel configurations have limited validation; GMS restore is not yet available; snapshot-agent must run privileged; and restore is sensitive to live TCP socket state.
Key Takeaways
- Dynamo Snapshot uses CRIU and cuda-checkpoint to freeze and restore single-GPU inference workers on Kubernetes, avoiding full cold-start latency.
- KV cache unmap via cuMemUnmap and cuMemRelease reduces checkpoint artifact size from ~190 GiB to ~6 GiB for Qwen3-0.6B on a B200.
- Linux native AIO and parallel memfd restore cut CRIU restore time by up to 7.9× over upstream CRIU; these optimizations are pending upstream CRIU merge.
- The GPU Memory Service (GMS) decouples model weights from the CRIU artifact, enabling concurrent process and weight restoration over channels like GPUDirect Storage.
- In a proof-of-concept using 8 striped local NVMe SSDs, gpt-oss-120b startup time is reduced by 21× to under 5 seconds.
The put up NVIDIA AI Releases Dynamo Snapshot: A CRIU-Based Fast Startup System for AI Inference on Kubernetes appeared first on MarkTechPost.
