
A Coding Implementation on kvcached for Elastic KV Cache Memory, Bursty LLM Serving, and Multi-Model GPU Sharing

In this tutorial, we explore kvcached, a dynamic KV-cache implementation that sits on top of vLLM, to understand how elastic KV-cache allocation transforms GPU memory usage for large language models. We begin by setting up the environment and deploying lightweight Qwen2.5 models behind an OpenAI-compatible API, ensuring a realistic inference workflow. We then design controlled experiments that simulate bursty workloads to observe how memory behaves under both elastic and static allocation strategies. Through systematic measurement and visualization, we directly compare VRAM usage and latency, and extend the setup to a multi-model scenario where we watch memory flexibly shift across active workloads in real time.

import os, sys, time, json, subprocess, threading, signal, shutil
from pathlib import Path


def sh(cmd, check=True):
    """Run a shell or argv-style command, raising on failure when check=True."""
    return subprocess.run(cmd, check=check, shell=isinstance(cmd, str))


try:
    import torch
except ImportError:
    sh([sys.executable, "-m", "pip", "install", "-q", "torch"])
    import torch


assert torch.cuda.is_available(), \
    "No GPU detected. In Colab: Runtime > Change runtime type > GPU."
props = torch.cuda.get_device_properties(0)
print(f"[GPU] {torch.cuda.get_device_name(0)}  "
      f"({props.total_memory / 1e9:.1f} GB, "
      f"compute capability {props.major}.{props.minor})")


def pip_install(*pkgs, extra=()):
    """Quietly pip-install packages, passing any extra flags through."""
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs, *extra],
                   check=True)


print("[install] vLLM ...")
pip_install("vllm==0.10.2")
print("[install] kvcached (compiles a small CUDA extension) ...")
pip_install("kvcached", additional=["--no-build-isolation"])
print("[install] misc (matplotlib, requests, pynvml) ...")
pip_install("matplotlib", "requests", "pynvml", "numpy")


MODEL_A = "Qwen/Qwen2.5-0.5B-Instruct"
MODEL_B = "Qwen/Qwen2.5-1.5B-Instruct"
PORT_A, PORT_B = 8001, 8002
MAX_MODEL_LEN = 2048

We start by setting up the environment and verifying that a GPU is available for our experiments. We install the required dependencies, including vLLM and kvcached, along with supporting libraries. We then define our model configurations and ports in preparation for launching the inference servers.
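As an optional step not in the original flow, the model weights can be fetched ahead of time so the server-launch timeout below is not spent on downloads. This is a minimal sketch assuming huggingface_hub (which vLLM already installs) is available.

# Optional: pre-fetch model weights so vLLM startup is not dominated by downloads.
# snapshot_download caches the repo locally; vLLM reuses the cached copy.
from huggingface_hub import snapshot_download

for repo in (MODEL_A, MODEL_B):
    print(f"[prefetch] {repo}")
    snapshot_download(repo)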

def launch_vllm(model, port, kvcached=True, gpu_mem_util=0.55, log_path=None):
    """Start a vLLM OpenAI-compatible server as a subprocess. With kvcached=True,
    the autopatch hooks replace vLLM's KV-cache allocator with the elastic one."""
    env = os.environ.copy()
    env["VLLM_USE_V1"] = "1"
    if kvcached:
        env["ENABLE_KVCACHED"]    = "true"
        env["KVCACHED_AUTOPATCH"] = "1"
        env["KVCACHED_IPC_NAME"]  = f"kvc_{port}"
    cmd = [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--model", model, "--port", str(port),
        "--max-model-len", str(MAX_MODEL_LEN),
        "--disable-log-requests",
        "--no-enable-prefix-caching",
        "--enforce-eager",
    ]
    if not kvcached:
        # The static baseline pre-reserves a fixed fraction of GPU memory for the KV pool.
        cmd += ["--gpu-memory-utilization", str(gpu_mem_util)]
    log = open(log_path or os.devnull, "w")
    proc = subprocess.Popen(cmd, env=env, stdout=log, stderr=subprocess.STDOUT,
                            preexec_fn=os.setsid)
    return proc, log


def wait_ready(port, timeout=420):
    """Poll the server's /v1/models endpoint until it responds or the timeout expires."""
    import requests
    url = f"http://localhost:{port}/v1/models"
    t0 = time.time()
    while time.time() - t0 < timeout:
        try:
            if requests.get(url, timeout=2).status_code == 200:
                return True
        except Exception:
            pass
        time.sleep(3)
    raise TimeoutError(f"vLLM on port {port} did not come up within {timeout}s")


def shutdown(proc, log):
    """Terminate the server's process group (SIGTERM, then SIGKILL) and close its log."""
    if proc and proc.poll() is None:
        try:
            os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
            proc.wait(timeout=45)
        except Exception:
            os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
    if log and not log.closed:
        log.close()
    time.sleep(3)

We implement helper functions to launch and manage the vLLM server with and without kvcached enabled. We configure environment variables to activate the dynamic KV-cache behavior and ensure proper server initialization. We also define utilities to wait for server readiness and safely shut down processes after execution.
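As a small debugging aid, not part of the original tutorial, a helper like the one below can print the tail of a server log so we can eyeball whether vLLM started cleanly and whether the kvcached autopatch reported anything; tail_log is a hypothetical name introduced here.

# Hypothetical helper: print the last lines of a vLLM server log for a quick sanity check.
def tail_log(path, n=20):
    try:
        lines = Path(path).read_text(errors="ignore").splitlines()
        print("\n".join(lines[-n:]))
    except FileNotFoundError:
        print(f"{path} does not exist yet")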

import pynvml
pynvml.nvmlInit()
NV_HANDLE = pynvml.nvmlDeviceGetHandleByIndex(0)


def vram_used_mb():
    """Return total GPU memory currently in use, in MB, as reported by NVML."""
    info = pynvml.nvmlDeviceGetMemoryInfo(NV_HANDLE)
    return info.used / (1024 ** 2)


class MemorySampler(threading.Thread):
    """Background thread that records (elapsed_s, vram_used_mb) samples at a fixed interval."""
    def __init__(self, interval=0.2):
        super().__init__(daemon=True)
        self.interval = interval
        self.samples  = []
        self._stop    = threading.Event()
    def run(self):
        t0 = time.time()
        while not self._stop.is_set():
            self.samples.append((time.time() - t0, vram_used_mb()))
            time.sleep(self.interval)
    def stop(self):
        self._stop.set(); self.join()


import requests
from concurrent.futures import ThreadPoolExecutor


PROMPTS = [
   "Explain quantum entanglement to a curious 10-year-old.",
   "Write a Python function that detects cycles in a linked list.",
   "Summarize the plot of Hamlet in one paragraph.",
   "List 5 surprising household uses for baking soda with explanations.",
   "Compose a vivid haiku about rainy Monday mornings.",
   "Describe the Fermi paradox and three plausible resolutions.",
   "Translate 'knowledge is power' into French, German, and Japanese.",
   "Explain the difference between TCP and UDP with real examples.",
]


def bursty_workload(port, model, n_bursts=3, burst_size=6, pause=6.0,
                    max_tokens=180):
    """Fire n_bursts waves of burst_size concurrent requests with an idle
    gap between waves. The idle gap is where kvcached releases physical
    VRAM -- something a static-allocation engine simply cannot do."""
    url = f"http://localhost:{port}/v1/chat/completions"
    def one(i):
        body = {
            "model": model,
            "messages": [{"role": "user", "content": PROMPTS[i % len(PROMPTS)]}],
            "max_tokens": max_tokens, "temperature": 0.7,
        }
        t0 = time.time()
        r = requests.post(url, json=body, timeout=180)
        r.raise_for_status()
        return time.time() - t0
    latencies = []
    with ThreadPoolExecutor(max_workers=burst_size) as ex:
        for b in range(n_bursts):
            print(f"    burst {b+1}/{n_bursts}  ({burst_size} concurrent)")
            latencies += list(ex.map(one, range(burst_size)))
            if b < n_bursts - 1:
                time.sleep(pause)
    return latencies

We initialize GPU memory monitoring with pynvml so we can watch VRAM usage in real time. We create a background sampling thread that continuously records memory consumption during the experiments. We then define a bursty workload generator that sends concurrent requests to simulate realistic LLM usage patterns.
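Before running the full experiments, the sampler can be exercised on its own; this short sketch, not in the original flow, simply collects a couple of seconds of idle measurements to confirm that NVML readings are flowing.

# Quick standalone check of the sampler: gather ~2 s of idle VRAM readings.
probe = MemorySampler(interval=0.5)
probe.start()
time.sleep(2)
probe.stop()
print(f"collected {len(probe.samples)} samples; current VRAM: {vram_used_mb():.0f} MB")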

print("n=== Experiment 1: vLLM + kvcached ===")
proc, log = launch_vllm(MODEL_A, PORT_A, kvcached=True,
                       log_path="/tmp/vllm_kvc.log")
strive:
   wait_ready(PORT_A)
   idle_kvc = vram_used_mb()
   print(f"  Idle VRAM after load (weights solely): {idle_kvc:.0f} MB")
   sampler = MemorySampler(); sampler.begin()
   lat_kvc = bursty_workload(PORT_A, MODEL_A)
   time.sleep(6)
   sampler.cease()
   mem_kvc = sampler.samples
lastly:
   shutdown(proc, log)


print("n=== Experiment 2: vLLM baseline (static KV allocation) ===")
proc, log = launch_vllm(MODEL_A, PORT_A, kvcached=False,
                       log_path="/tmp/vllm_base.log")
strive:
   wait_ready(PORT_A)
   idle_base = vram_used_mb()
   print(f"  Idle VRAM (weights + pre-reserved KV pool): {idle_base:.0f} MB")
   sampler = MemorySampler(); sampler.begin()
   lat_base = bursty_workload(PORT_A, MODEL_A)
   time.sleep(6)
   sampler.cease()
   mem_base = sampler.samples
lastly:
   shutdown(proc, log)

We run the first experiment with kvcached enabled and capture both memory usage and latency metrics. We then execute the same workload under the baseline static-allocation setup for comparison. We collect and store all results to enable a clear side-by-side evaluation of the two approaches.
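If you want to re-plot or compare runs later without re-running the servers, the raw measurements can be dumped to disk. This is an optional convenience sketch; the output path is an assumption.

# Optional: persist the raw measurements so the comparison can be re-plotted later.
results = {
    "idle_kvc_mb": idle_kvc, "idle_base_mb": idle_base,
    "lat_kvc_s": lat_kvc, "lat_base_s": lat_base,
    "mem_kvc": mem_kvc, "mem_base": mem_base,   # lists of (elapsed_s, used_mb) samples
}
with open("/tmp/kvcached_results.json", "w") as f:   # assumed path
    json.dump(results, f)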

import numpy as np
import matplotlib.pyplot as plt


fig, axes = plt.subplots(1, 2, figsize=(14, 4.5))


tk, mk = zip(*mem_kvc); tb, mb = zip(*mem_base)
axes[0].plot(tk, mk, label="with kvcached", linewidth=2, color="#1f77b4")
axes[0].plot(tb, mb, label="baseline (static)", linewidth=2,
             linestyle="--", color="#d62728")
axes[0].axhline(idle_kvc,  color="#1f77b4", alpha=.3, linestyle=":")
axes[0].axhline(idle_base, color="#d62728", alpha=.3, linestyle=":")
axes[0].set_xlabel("time (s)"); axes[0].set_ylabel("GPU memory used (MB)")
axes[0].set_title("VRAM under a bursty workload\n(dotted = idle-baseline VRAM)")
axes[0].grid(alpha=.3); axes[0].legend()


axes[1].boxplot([lat_kvc, lat_base], labels=["kvcached", "baseline"])
axes[1].set_ylabel("request latency (s)")
axes[1].set_title(f"Latency throughout {len(lat_kvc)} requests")
axes[1].grid(alpha=.3)


plt.tight_layout()
plt.savefig("/content material/kvcached_single_model.png", dpi=120, bbox_inches="tight")
plt.present()


print("n--- Single-model abstract --------------------------------------------")
print(f"  Idle VRAM    kvcached: {idle_kvc:>6.0f} MB   "
     f"baseline: {idle_base:>6.0f} MB  "
     f"(financial savings: {idle_base - idle_kvc:>5.0f} MB)")
print(f"  Peak VRAM    kvcached: {max(mk):>6.0f} MB   "
     f"baseline: {max(mb):>6.0f} MB")
print(f"  Median lat.  kvcached: {np.median(lat_kvc):>6.2f} s   "
     f"baseline: {np.median(lat_base):>6.2f} s")
print(f"  VRAM flex    kvcached: peak-idle = {max(mk)-min(mk):>5.0f} MB  "
     f"(baseline cannot launch -- static pool)")


print("n=== Experiment 3: Two LLMs sharing one GPU (kvcached on each) ===")
pA, lA = launch_vllm(MODEL_A, PORT_A, kvcached=True, log_path="/tmp/mA.log")
strive:
   wait_ready(PORT_A)
   pB, lB = launch_vllm(MODEL_B, PORT_B, kvcached=True, log_path="/tmp/mB.log")
   strive:
       wait_ready(PORT_B)
       print(f"  Both fashions loaded. Idle VRAM: {vram_used_mb():.0f} MB")


       sampler = MemorySampler(); sampler.begin()
       for i in vary(4):
           port, mannequin = ((PORT_A, MODEL_A) if i % 2 == 0
                          else (PORT_B, MODEL_B))
           print(f"  spherical {i+1}: driving {mannequin}")
           bursty_workload(port, mannequin, n_bursts=1, burst_size=4, pause=0)
           time.sleep(5)
       sampler.cease()
       t, m = zip(*sampler.samples)


       plt.determine(figsize=(11, 4.2))
       plt.plot(t, m, colour="#c2410c", linewidth=2)
       plt.xlabel("time (s)"); plt.ylabel("GPU reminiscence used (MB)")
       plt.title("Two LLMs on one T4 through kvcached — reminiscence flexes per energetic mannequin")
       plt.grid(alpha=.3); plt.tight_layout()
       plt.savefig("/content material/kvcached_multillm.png", dpi=120,
                   bbox_inches="tight")
       plt.present()
   lastly:
       shutdown(pB, lB)
lastly:
   shutdown(pA, lA)


print("n=== Bonus: kvcached ships CLI instruments ===")
print("  kvtop  — dwell per-instance KV reminiscence monitor (like nvtop for kvcached)")
print("  kvctl  — set/restrict per-instance reminiscence budgets in shared reminiscence")
for software in ("kvtop", "kvctl"):
   path = shutil.which(software)
   print(f"    {software}: {path or 'not on PATH'}")
print("nAll plots saved to /content material/. Done.")

We visualize the collected data by plotting VRAM usage trends and latency distributions for both setups. We compute summary statistics to quantify the gains in memory efficiency and performance. We finally extend the experiment to a multi-model scenario, observe how memory dynamically adapts across active models, and conclude with a look at the bundled tooling.
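As an optional follow-up, not part of the original walkthrough, the idle savings can also be expressed as a fraction of total GPU memory, reusing the values computed above and the NVML handle.

# Optional: express the idle-VRAM savings as a share of total GPU memory.
total_mb = pynvml.nvmlDeviceGetMemoryInfo(NV_HANDLE).total / (1024 ** 2)
idle_savings_mb = idle_base - idle_kvc
print(f"idle savings: {idle_savings_mb:.0f} MB "
      f"({100 * idle_savings_mb / total_mb:.1f}% of {total_mb:.0f} MB total)")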

In conclusion, we demonstrated how dynamic KV-cache management fundamentally improves GPU efficiency compared to traditional static allocation. We observed that kvcached delivers significant VRAM savings during idle periods while maintaining competitive latency under load, making it especially effective for bursty or multi-tenant inference environments. By running multiple models on a single GPU and alternating traffic between them, we saw clearly how memory is allocated only when needed and released when idle, validating the core premise of demand-driven caching. Overall, we established a practical and reproducible framework for evaluating memory-optimization techniques in LLM serving and highlighted how this approach can scale to more complex, production-grade deployments.

