A Coding Implementation on kvcached for Elastic KV Cache Memory, Bursty LLM Serving, and Multi-Model GPU Sharing
In this tutorial, we explore kvcached, an elastic KV-cache implementation on top of vLLM, to understand how dynamic KV-cache allocation transforms GPU memory utilization for large language models. We begin by setting up the environment and deploying lightweight Qwen2.5 models behind an OpenAI-compatible API, ensuring a realistic inference workflow. We then design controlled experiments in which we simulate bursty workloads to observe how memory behaves under both elastic and static allocation strategies. Through systematic measurement and visualization, we directly compare VRAM utilization and latency, and extend the setup to a multi-model scenario where we watch memory flexibly shift across active workloads in real time.
import os, sys, time, json, subprocess, threading, signal, shutil
from pathlib import Path

def sh(cmd, check=True):
    return subprocess.run(cmd, check=check, shell=isinstance(cmd, str))

try:
    import torch
except ImportError:
    sh([sys.executable, "-m", "pip", "install", "-q", "torch"])
    import torch

assert torch.cuda.is_available(), \
    "No GPU detected. In Colab: Runtime > Change runtime type > GPU."
props = torch.cuda.get_device_properties(0)
print(f"[GPU] {torch.cuda.get_device_name(0)} "
      f"({props.total_memory / 1e9:.1f} GB, "
      f"compute capability {props.major}.{props.minor})")
def pip_install(*pkgs, extra=()):
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs, *extra],
                   check=True)

print("[install] vLLM ...")
pip_install("vllm==0.10.2")
print("[install] kvcached (compiles a small CUDA extension) ...")
pip_install("kvcached", extra=["--no-build-isolation"])
print("[install] misc (matplotlib, requests, pynvml) ...")
pip_install("matplotlib", "requests", "pynvml", "numpy")

MODEL_A = "Qwen/Qwen2.5-0.5B-Instruct"
MODEL_B = "Qwen/Qwen2.5-1.5B-Instruct"
PORT_A, PORT_B = 8001, 8002
MAX_MODEL_LEN = 2048
We begin by setting up the environment and verifying that a GPU is available for our experiments. We install all required dependencies, including vLLM and kvcached along with supporting libraries. We then define our model configurations and ports to prepare for launching the inference servers.
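Before launching anything, it helps to estimate how much VRAM the KV cache of these two models can consume, since that is exactly the memory kvcached allocates and releases elastically. The sketch below does the back-of-envelope arithmetic; the layer/head counts are our assumptions taken from the published Qwen2.5 model configs (verify them against each model's `config.json` before relying on the numbers):

```python
# Back-of-envelope KV-cache sizing: bytes per token =
#   2 (K and V) * layers * kv_heads * head_dim * dtype_bytes (fp16 = 2).
# The config values below are assumptions drawn from the published
# Qwen2.5 configs -- check config.json for each model.
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes

qwen_05b = kv_bytes_per_token(layers=24, kv_heads=2, head_dim=64)   # GQA
qwen_15b = kv_bytes_per_token(layers=28, kv_heads=2, head_dim=128)  # GQA

for name, per_tok in [("0.5B", qwen_05b), ("1.5B", qwen_15b)]:
    full_seq_mib = per_tok * 2048 / 2**20  # one MAX_MODEL_LEN sequence
    print(f"Qwen2.5-{name}: {per_tok} B/token, "
          f"{full_seq_mib:.0f} MiB per 2048-token sequence")
```

Under these assumed configs, a single full-length sequence costs tens of MiB, and a burst of concurrent sequences multiplies that, which is the memory a static allocator pre-reserves even while idle.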
def launch_vllm(model, port, kvcached=True, gpu_mem_util=0.55, log_path=None):
    """Start a vLLM OpenAI-compatible server as a subprocess. With kvcached=True,
    the autopatch hooks replace vLLM's KV-cache allocator with the elastic one."""
    env = os.environ.copy()
    env["VLLM_USE_V1"] = "1"
    if kvcached:
        env["ENABLE_KVCACHED"] = "true"
        env["KVCACHED_AUTOPATCH"] = "1"
        env["KVCACHED_IPC_NAME"] = f"kvc_{port}"
    cmd = [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--model", model, "--port", str(port),
        "--max-model-len", str(MAX_MODEL_LEN),
        "--disable-log-requests",
        "--no-enable-prefix-caching",
        "--enforce-eager",
    ]
    if not kvcached:
        cmd += ["--gpu-memory-utilization", str(gpu_mem_util)]
    log = open(log_path or os.devnull, "w")
    proc = subprocess.Popen(cmd, env=env, stdout=log, stderr=subprocess.STDOUT,
                            preexec_fn=os.setsid)
    return proc, log
def wait_ready(port, timeout=420):
    import requests
    url = f"http://localhost:{port}/v1/models"
    t0 = time.time()
    while time.time() - t0 < timeout:
        try:
            if requests.get(url, timeout=2).status_code == 200:
                return True
        except Exception:
            pass
        time.sleep(3)
    raise TimeoutError(f"vLLM on port {port} did not come up within {timeout}s")
def shutdown(proc, log):
    if proc and proc.poll() is None:
        try:
            os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
            proc.wait(timeout=45)
        except Exception:
            os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
    if log and not log.closed:
        log.close()
    time.sleep(3)
We implement helper functions to launch and manage the vLLM server with and without kvcached enabled. We configure environment variables to activate the dynamic KV-cache behavior and ensure proper server initialization. We also define utilities to wait for server readiness and safely shut down processes after execution.
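The flag logic in `launch_vllm` can be inspected without starting a server. The sketch below uses a hypothetical helper of our own, `build_launch_cmd`, that mirrors only the env/command construction from the function above (the real function additionally copies the full `os.environ`), so you can unit-test which variables and flags each mode produces:

```python
import sys

# Hypothetical helper mirroring launch_vllm's env/cmd construction so the
# flag logic can be checked without a GPU. Unlike launch_vllm, it returns
# only the *added* environment variables, not a full os.environ copy.
def build_launch_cmd(model, port, kvcached=True, gpu_mem_util=0.55,
                     max_model_len=2048):
    env = {"VLLM_USE_V1": "1"}
    if kvcached:
        env.update(ENABLE_KVCACHED="true", KVCACHED_AUTOPATCH="1",
                   KVCACHED_IPC_NAME=f"kvc_{port}")
    cmd = [sys.executable, "-m", "vllm.entrypoints.openai.api_server",
           "--model", model, "--port", str(port),
           "--max-model-len", str(max_model_len),
           "--disable-log-requests", "--no-enable-prefix-caching",
           "--enforce-eager"]
    if not kvcached:
        # Only the static baseline pins a fixed VRAM fraction up front.
        cmd += ["--gpu-memory-utilization", str(gpu_mem_util)]
    return env, cmd

env, cmd = build_launch_cmd("Qwen/Qwen2.5-0.5B-Instruct", 8001, kvcached=True)
print("--gpu-memory-utilization" in cmd)  # → False: elastic mode omits it
```

The key asymmetry is visible immediately: only the static baseline passes `--gpu-memory-utilization`, which is what makes its KV pool a fixed reservation.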
import pynvml
pynvml.nvmlInit()
NV_HANDLE = pynvml.nvmlDeviceGetHandleByIndex(0)

def vram_used_mb():
    info = pynvml.nvmlDeviceGetMemoryInfo(NV_HANDLE)
    return info.used / (1024 ** 2)

class MemorySampler(threading.Thread):
    def __init__(self, interval=0.2):
        super().__init__(daemon=True)
        self.interval = interval
        self.samples = []
        self._stop = threading.Event()
    def run(self):
        t0 = time.time()
        while not self._stop.is_set():
            self.samples.append((time.time() - t0, vram_used_mb()))
            time.sleep(self.interval)
    def stop(self):
        self._stop.set(); self.join()
import requests
from concurrent.futures import ThreadPoolExecutor
PROMPTS = [
"Explain quantum entanglement to a curious 10-year-old.",
"Write a Python function that detects cycles in a linked list.",
"Summarize the plot of Hamlet in one paragraph.",
"List 5 surprising household uses for baking soda with explanations.",
"Compose a vivid haiku about rainy Monday mornings.",
"Describe the Fermi paradox and three plausible resolutions.",
"Translate 'knowledge is power' into French, German, and Japanese.",
"Explain the difference between TCP and UDP with real examples.",
]
def bursty_workload(port, model, n_bursts=3, burst_size=6, pause=6.0,
                    max_tokens=180):
    """Fire n_bursts waves of burst_size concurrent requests with an idle
    gap between waves. The idle gap is where kvcached releases physical
    VRAM -- something a static-allocation engine simply cannot do."""
    url = f"http://localhost:{port}/v1/chat/completions"
    def one(i):
        body = {
            "model": model,
            "messages": [{"role": "user", "content": PROMPTS[i % len(PROMPTS)]}],
            "max_tokens": max_tokens, "temperature": 0.7,
        }
        t0 = time.time()
        r = requests.post(url, json=body, timeout=180)
        r.raise_for_status()
        return time.time() - t0
    latencies = []
    with ThreadPoolExecutor(max_workers=burst_size) as ex:
        for b in range(n_bursts):
            print(f"  burst {b+1}/{n_bursts} ({burst_size} concurrent)")
            latencies += list(ex.map(one, range(burst_size)))
            if b < n_bursts - 1:
                time.sleep(pause)
    return latencies
We initialize GPU memory monitoring with pynvml to track VRAM usage in real time. We create a background sampling thread that continuously records memory consumption during the experiments. We then define a bursty workload generator that sends concurrent requests to simulate realistic LLM usage patterns.
print("\n=== Experiment 1: vLLM + kvcached ===")
proc, log = launch_vllm(MODEL_A, PORT_A, kvcached=True,
                        log_path="/tmp/vllm_kvc.log")
try:
    wait_ready(PORT_A)
    idle_kvc = vram_used_mb()
    print(f"  Idle VRAM after load (weights only): {idle_kvc:.0f} MB")
    sampler = MemorySampler(); sampler.start()
    lat_kvc = bursty_workload(PORT_A, MODEL_A)
    time.sleep(6)
    sampler.stop()
    mem_kvc = sampler.samples
finally:
    shutdown(proc, log)

print("\n=== Experiment 2: vLLM baseline (static KV allocation) ===")
proc, log = launch_vllm(MODEL_A, PORT_A, kvcached=False,
                        log_path="/tmp/vllm_base.log")
try:
    wait_ready(PORT_A)
    idle_base = vram_used_mb()
    print(f"  Idle VRAM (weights + pre-reserved KV pool): {idle_base:.0f} MB")
    sampler = MemorySampler(); sampler.start()
    lat_base = bursty_workload(PORT_A, MODEL_A)
    time.sleep(6)
    sampler.stop()
    mem_base = sampler.samples
finally:
    shutdown(proc, log)
We run the first experiment with kvcached enabled and capture both memory-usage and latency metrics. We then execute the same workload under a baseline static-allocation setup for comparison. We collect and store all results to enable a clear side-by-side evaluation of both approaches.
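Before plotting, each trace of `(seconds, MB)` samples can be reduced to three numbers that capture the comparison: the idle floor, the peak, and the "flex" (how much VRAM was given back between bursts). The helper below is our own addition, not part of kvcached or vLLM:

```python
# Our own small helper (not part of kvcached/vLLM): reduce a trace of
# (seconds, MB) samples to idle floor, peak, and flex = peak - idle.
def summarize_trace(samples):
    mem = [m for _, m in samples]
    idle, peak = min(mem), max(mem)
    return {"idle_mb": idle, "peak_mb": peak, "flex_mb": peak - idle}

# Toy trace: memory climbs during a burst, drops back in the idle gap.
trace = [(0.0, 1000), (1.0, 1800), (2.0, 2400), (3.0, 1100), (4.0, 2300)]
print(summarize_trace(trace))
# → {'idle_mb': 1000, 'peak_mb': 2400, 'flex_mb': 1400}
```

Applied to the elastic trace, flex should be large (memory returns to the floor between bursts); for the static baseline it stays near zero, since the pre-reserved pool never shrinks.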
import numpy as np
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(14, 4.5))
tk, mk = zip(*mem_kvc); tb, mb = zip(*mem_base)
axes[0].plot(tk, mk, label="with kvcached", linewidth=2, color="#1f77b4")
axes[0].plot(tb, mb, label="baseline (static)", linewidth=2,
             linestyle="--", color="#d62728")
axes[0].axhline(idle_kvc, color="#1f77b4", alpha=.3, linestyle=":")
axes[0].axhline(idle_base, color="#d62728", alpha=.3, linestyle=":")
axes[0].set_xlabel("time (s)"); axes[0].set_ylabel("GPU memory used (MB)")
axes[0].set_title("VRAM under a bursty workload\n(dotted = idle-baseline VRAM)")
axes[0].grid(alpha=.3); axes[0].legend()
axes[1].boxplot([lat_kvc, lat_base], labels=["kvcached", "baseline"])
axes[1].set_ylabel("request latency (s)")
axes[1].set_title(f"Latency across {len(lat_kvc)} requests")
axes[1].grid(alpha=.3)
plt.tight_layout()
plt.savefig("/content/kvcached_single_model.png", dpi=120, bbox_inches="tight")
plt.show()

print("\n--- Single-model summary --------------------------------------------")
print(f"  Idle VRAM    kvcached: {idle_kvc:>6.0f} MB   "
      f"baseline: {idle_base:>6.0f} MB   "
      f"(savings: {idle_base - idle_kvc:>5.0f} MB)")
print(f"  Peak VRAM    kvcached: {max(mk):>6.0f} MB   "
      f"baseline: {max(mb):>6.0f} MB")
print(f"  Median lat.  kvcached: {np.median(lat_kvc):>6.2f} s   "
      f"baseline: {np.median(lat_base):>6.2f} s")
print(f"  VRAM flex    kvcached: peak-idle = {max(mk)-min(mk):>5.0f} MB "
      f"(baseline can't release -- static pool)")
print("\n=== Experiment 3: Two LLMs sharing one GPU (kvcached on both) ===")
pA, lA = launch_vllm(MODEL_A, PORT_A, kvcached=True, log_path="/tmp/mA.log")
try:
    wait_ready(PORT_A)
    pB, lB = launch_vllm(MODEL_B, PORT_B, kvcached=True, log_path="/tmp/mB.log")
    try:
        wait_ready(PORT_B)
        print(f"  Both models loaded. Idle VRAM: {vram_used_mb():.0f} MB")
        sampler = MemorySampler(); sampler.start()
        for i in range(4):
            port, model = ((PORT_A, MODEL_A) if i % 2 == 0
                           else (PORT_B, MODEL_B))
            print(f"  round {i+1}: driving {model}")
            bursty_workload(port, model, n_bursts=1, burst_size=4, pause=0)
        time.sleep(5)
        sampler.stop()
        t, m = zip(*sampler.samples)
        plt.figure(figsize=(11, 4.2))
        plt.plot(t, m, color="#c2410c", linewidth=2)
        plt.xlabel("time (s)"); plt.ylabel("GPU memory used (MB)")
        plt.title("Two LLMs on one T4 via kvcached — memory flexes per active model")
        plt.grid(alpha=.3); plt.tight_layout()
        plt.savefig("/content/kvcached_multillm.png", dpi=120,
                    bbox_inches="tight")
        plt.show()
    finally:
        shutdown(pB, lB)
finally:
    shutdown(pA, lA)
print("\n=== Bonus: kvcached ships CLI tools ===")
print("  kvtop — live per-instance KV memory monitor (like nvtop for kvcached)")
print("  kvctl — set/limit per-instance memory budgets in shared memory")
for tool in ("kvtop", "kvctl"):
    path = shutil.which(tool)
    print(f"  {tool}: {path or 'not on PATH'}")
print("\nAll plots saved to /content/. Done.")
We visualize the collected data by plotting VRAM-usage trends and latency distributions across both setups. We compute summary statistics to quantify the improvements in memory efficiency and performance. We finally extend the experiment to a multi-model scenario, observe how memory dynamically adapts across active models, and conclude with additional insights into tooling.
In conclusion, we demonstrated how dynamic KV-cache management fundamentally improves GPU efficiency compared to traditional static-allocation approaches. We observed that kvcached enables significant VRAM savings during idle periods while maintaining competitive latency under load, making it especially effective for bursty or multi-tenant inference environments. By running multiple models on a single GPU and alternating traffic between them, we saw how memory is allocated only when needed and released when idle, validating the core premise of demand-driven caching. Overall, we established a practical and reproducible framework for evaluating memory-optimization techniques in LLM serving and highlighted how this approach can scale to more complex, production-grade deployments.
The post A Coding Implementation on kvcached for Elastic KV Cache Memory, Bursty LLM Serving, and Multi-Model GPU Sharing appeared first on MarkTechPost.
