Meet ‘kvcached’: A Machine Learning Library to Enable Virtualized, Elastic KV Cache for LLM Serving on Shared GPUs
Large language model serving often wastes GPU memory because engines pre-reserve large static KV cache regions per model, even when requests are bursty or idle. Meet 'kvcached', a library to enable a virtualized, elastic KV cache for LLM serving on shared GPUs. kvcached has been developed by a research team from Berkeley's Sky Computing Lab (University of California, Berkeley) in close collaboration with Rice University and UCLA, and with valuable input from collaborators and colleagues at NVIDIA, Intel Corporation, and Stanford University. It introduces an OS-style virtual memory abstraction for the KV cache that lets serving engines reserve contiguous virtual space first, then back only the active portions with physical GPU pages on demand. This decoupling raises memory utilization, reduces cold starts, and enables multiple models to time-share and space-share a device without heavy engine rewrites.

What does kvcached change?
With kvcached, an engine creates a KV cache pool that is contiguous in the virtual address space. As tokens arrive, the library maps physical GPU pages lazily, at a fine granularity, using CUDA virtual memory APIs. When requests complete or models go idle, pages are unmapped and returned to a shared pool that other colocated models can reuse immediately. This preserves simple pointer arithmetic in kernels and removes the need for per-engine, user-level paging. The project targets SGLang and vLLM integration, and it is released under the Apache 2.0 license. Installation and a one-command quick start are documented in the Git repository.
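The reserve-then-map pattern described above corresponds to the CUDA driver's virtual memory management calls. The following minimal C++ sketch shows that pattern in isolation, assuming a single device and a single page of backing memory; it is illustrative only and is not kvcached's actual implementation or memory layout.

```cpp
// Minimal sketch of reserve-then-map with the CUDA driver VMM API.
// Assumptions: one GPU, one page mapped at the start of the reserved range.
// Build (illustrative): nvcc vmm_sketch.cpp -lcuda
#include <cuda.h>
#include <cstdio>
#include <cstdlib>

#define CHECK(call)                                                              \
  do {                                                                           \
    CUresult e = (call);                                                         \
    if (e != CUDA_SUCCESS) {                                                     \
      fprintf(stderr, "CUDA driver error %d at %s:%d\n", e, __FILE__, __LINE__); \
      exit(1);                                                                   \
    }                                                                            \
  } while (0)

int main() {
  CHECK(cuInit(0));
  CUdevice dev;
  CHECK(cuDeviceGet(&dev, 0));
  CUcontext ctx;
  CHECK(cuDevicePrimaryCtxRetain(&ctx, dev));
  CHECK(cuCtxSetCurrent(ctx));

  // Physical allocations live on this device.
  CUmemAllocationProp prop = {};
  prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
  prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  prop.location.id = dev;

  // Minimum mapping granularity (typically 2 MiB).
  size_t page = 0;
  CHECK(cuMemGetAllocationGranularity(&page, &prop,
                                      CU_MEM_ALLOC_GRANULARITY_MINIMUM));

  // 1) Reserve a large contiguous *virtual* range up front (8 GiB here).
  //    This step consumes no physical GPU memory.
  const size_t reserve_bytes = (size_t)8 << 30;
  CUdeviceptr base = 0;
  CHECK(cuMemAddressReserve(&base, reserve_bytes, 0, 0, 0));

  // 2) Back only the active portion with a physical page, on demand.
  CUmemGenericAllocationHandle handle;
  CHECK(cuMemCreate(&handle, page, &prop, 0));
  CHECK(cuMemMap(base, page, 0, handle, 0));

  CUmemAccessDesc access = {};
  access.location = prop.location;
  access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
  CHECK(cuMemSetAccess(base, page, &access, 1));
  // Kernels can now address [base, base + page) with plain pointer arithmetic.

  // 3) On completion or idleness, unmap and release the physical page.
  //    The virtual reservation stays valid and can be re-backed later.
  CHECK(cuMemUnmap(base, page));
  CHECK(cuMemRelease(handle));

  CHECK(cuMemAddressFree(base, reserve_bytes));
  CHECK(cuDevicePrimaryCtxRelease(dev));
  printf("reserved %zu bytes virtually, mapped/unmapped one %zu-byte page\n",
         reserve_bytes, page);
  return 0;
}
```

The key property is that the large reservation costs nothing until cuMemMap is called, and unmapping returns the physical page to the device while leaving the virtual range, and any pointers into it, intact.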

What is the impact at scale?
Production workloads host many models with long-tail traffic and spiky bursts. Static reservations leave memory stranded and slow down time to first token when models must be activated or swapped. The Prism research paper shows that multi-LLM serving requires cross-model memory coordination at runtime, not just compute scheduling. Prism implements on-demand mapping of physical to virtual pages and a two-stage scheduler, and reports more than 2 times cost savings and 3.3 times higher TTFT SLO attainment versus prior systems on real traces. kvcached focuses on the memory coordination primitive and provides a reusable component that brings this capability to mainstream engines.

Performance indicators
The kvcached team reports 1.2 times to 28 times faster time to first token in multi-model serving, thanks to immediate reuse of freed pages and the elimination of large static allocations. These numbers come from multi-LLM scenarios where activation latency and memory headroom dominate tail latency. The research team notes kvcached's compatibility with SGLang and vLLM and describes elastic KV allocation as the core mechanism.

How is it related to recent research?
Recent work has moved from fixed partitioning to virtual-memory-based methods for KV management. Prism extends VMM-based allocation to multi-LLM settings with cross-model coordination and scheduling. Prior efforts like vAttention explore CUDA VMM for single-model serving to avoid fragmentation without PagedAttention. The arc is clear: use virtual memory to keep the KV cache contiguous in virtual space, then map physical pages elastically as the workload evolves. kvcached operationalizes this idea as a library, which simplifies adoption within existing engines.

Practical Applications for Devs
Colocation across models: Engines can colocate several small or medium models on one device. When one model goes idle, its KV pages are freed quickly and another model can expand its working set without a restart. This reduces head-of-line blocking during bursts and improves TTFT SLO attainment; a minimal sketch of this shared-pool accounting follows this list.
Activation behavior: Prism reports activation times of about 0.7 seconds for an 8B model and about 1.5 seconds for a 70B model with streaming activation. kvcached benefits from similar principles, because virtual reservations let engines prepare address ranges in advance and then map pages as tokens arrive.
Autoscaling for serverless LLM: Fine-grained page mapping makes it feasible to scale replicas more frequently and to keep cold models in a warm state with a minimal memory footprint. This enables tighter autoscaling loops and reduces the blast radius of hot spots.
Offloading and future work: Virtual memory opens the door to KV offload to host memory or NVMe when the access pattern allows it. NVIDIA's recent guide on managed memory for KV offload on GH200-class systems shows how unified address spaces can extend capacity at acceptable overheads. The kvcached maintainers also discuss offload and compaction directions in public threads. Verify throughput and latency in your own pipeline, since access locality and PCIe topology have strong effects.
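The colocation item above relies on one simple invariant: freed physical pages go back to a pool shared by every model on the device. The following CPU-only C++ sketch models that accounting with hypothetical PagePool and ModelKV types; it makes no GPU calls and does not show kvcached's real data structures, only why an idle model's pages become immediately available to its neighbors.

```cpp
// Conceptual, CPU-only sketch of a physical page pool shared by two
// colocated models. PageId stands in for a physical allocation handle.
#include <cstdio>
#include <vector>

using PageId = int;

struct PagePool {
  std::vector<PageId> free_pages;
  explicit PagePool(int n) { for (int i = 0; i < n; ++i) free_pages.push_back(i); }
  bool acquire(PageId* out) {
    if (free_pages.empty()) return false;
    *out = free_pages.back();
    free_pages.pop_back();
    return true;
  }
  void release(PageId p) { free_pages.push_back(p); }
};

struct ModelKV {
  const char* name;
  std::vector<PageId> mapped;          // pages currently backing this model's range
  bool grow(PagePool& pool) {
    PageId p;
    if (!pool.acquire(&p)) return false;  // pool exhausted: caller waits or evicts
    mapped.push_back(p);                  // conceptually: map page into reserved range
    return true;
  }
  void shrink_all(PagePool& pool) {       // model goes idle: return every page
    for (PageId p : mapped) pool.release(p);
    mapped.clear();
  }
};

int main() {
  PagePool pool(/*n=*/4);                 // e.g. 4 physical pages on the device
  ModelKV a{"model-A"}, b{"model-B"};

  while (a.grow(pool)) {}                 // model A bursts and takes every page
  printf("A holds %zu pages, B can grow: %d\n", a.mapped.size(), (int)b.grow(pool));

  a.shrink_all(pool);                     // A idles; its pages return to the pool
  while (b.grow(pool)) {}                 // B reuses them at once, without a restart
  printf("after A idles, B holds %zu pages\n", b.mapped.size());
  return 0;
}
```

In a real engine, each acquired page would additionally be mapped into the model's reserved virtual range, as in the earlier driver-API sketch.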

Key Takeaways
- kvcached virtualizes the KV cache using GPU virtual memory: engines reserve contiguous virtual space and map physical pages on demand, enabling elastic allocation and reclamation under dynamic loads.
- It integrates with mainstream inference engines, specifically SGLang and vLLM, and is released under Apache 2.0, making adoption and modification straightforward for production serving stacks.
- Public benchmarks report 1.2 times to 28 times faster time to first token in multi-model serving, thanks to immediate reuse of freed KV pages and the elimination of large static reservations.
- Prism shows that cross-model memory coordination, implemented via on-demand mapping and two-stage scheduling, delivers more than 2 times cost savings and 3.3 times higher TTFT SLO attainment on real traces; kvcached offers the memory primitive that mainstream engines can reuse.
- For clusters that host many models with bursty, long-tail traffic, a virtualized KV cache enables safe colocation, faster activation, and tighter autoscaling, with reported activation around 0.7 seconds for an 8B model and 1.5 seconds for a 70B model in the Prism research.
Editorial Comments
kvcached is an effective approach to GPU memory virtualization for LLM serving, not a full operating system, and that distinction matters. The library reserves virtual address space for the KV cache, then maps physical pages on demand, which enables elastic sharing across models with minimal engine changes. This aligns with evidence that cross-model memory coordination is essential for multi-model workloads and improves SLO attainment and cost under real traces. Overall, kvcached advances GPU memory coordination for LLM serving; its production value depends on per-cluster validation.
Check out the GitHub repo, Paper 1, Paper 2, and the technical details.
