Top 10 KV Cache Compression Techniques for LLM Inference: Reducing Memory Overhead Across Eviction, Quantization, and Low-Rank Methods
As large language models scale to longer context windows and serve more concurrent users, the key-value (KV) cache has emerged as a major memory bottleneck in production inference systems. For a 30-billion-parameter model with a batch size of 128 and an input length of 1,024 tokens, the resulting KV cache can occupy up to 180 GB of memory. For reference, a 7-billion-parameter model's parameters consume 14 GB of GPU memory, while the KV cache for the same model can require around 72 GB.
Compressing the KV cache reduces memory pressure, increases batch sizes, and directly improves throughput without retraining the base model. Over the past two years, a number of distinct compression strategies have emerged from research. This article breaks down the ten most important ones, with emphasis on how each works and where it fits in a practical inference pipeline.
Token Eviction with H2O (Heavy Hitter Oracle)
H2O, published at NeurIPS 2023, is one of the foundational token eviction methods. Its core observation is that a small portion of tokens contributes the majority of attention score mass during generation; these are called Heavy Hitters (H2). H2O dynamically retains a balance of recent tokens and H2 tokens, keeping a fixed KV cache size across Transformer layers. The selection process is driven by cumulative attention scores averaged across all queries and tokens.
The attention weight distribution follows a power law, which means evicting low-scoring tokens incurs minimal accuracy loss in practice. H2O is a decoding-phase method and does not reduce prefill computation, which remains a limitation for long-context prompts. With 20% heavy hitters, H2O improves throughput over Hugging Face Accelerate by up to 29× on OPT-6.7B and OPT-30B.
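To make the mechanism concrete, here is a minimal sketch of heavy-hitter eviction in the spirit of H2O (an illustration under simplified assumptions, not the paper's implementation): maintain a running sum of attention received by each cached token, always protect a recency window, and evict the lowest-scoring older token once the cache exceeds its budget.

```python
import torch

def h2o_evict(cum_scores, attn_row, keep_recent, budget):
    """One decode step of H2O-style eviction (illustrative sketch).
    cum_scores:  running attention mass received by each cached token.
    attn_row:    attention weights of the newest query over the cache.
    keep_recent: number of most recent tokens that are never evicted.
    budget:      maximum number of KV entries to retain."""
    cum_scores = cum_scores + attn_row           # accumulate attention mass
    n = cum_scores.shape[0]
    if n <= budget:
        return cum_scores, None                  # under budget, keep everything
    # Heavy hitters are chosen only among the non-recent tokens.
    candidates = cum_scores[: n - keep_recent]
    evict_idx = int(torch.argmin(candidates))    # lowest cumulative score goes
    keep = torch.ones(n, dtype=torch.bool)
    keep[evict_idx] = False
    return cum_scores[keep], evict_idx           # caller drops K/V row evict_idx

scores = torch.zeros(6)
scores, evicted = h2o_evict(scores, torch.rand(6), keep_recent=2, budget=5)
```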
StreamingLLM (Attention Sink Retention)
StreamingLLM is designed for scenarios where LLMs must handle very long or effectively infinite input streams. Its strategy is to always keep the KV states of the first few tokens, which serve as attention sinks, and combine them with a sliding window of the most recent tokens up to the available memory budget.
The insight is that initial tokens, regardless of their semantic content, function as structural anchors that receive disproportionately high attention throughout generation. Dropping them causes significant accuracy degradation, while preserving them alongside a recency window stabilizes outputs. StreamingLLM is fast and hardware-friendly but does not use importance scoring, which means it can discard semantically important middle-context tokens. It is best suited for streaming dialogue applications where recent context dominates.
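The retention policy is simple enough to express in a few lines. The sketch below (with illustrative default values, not the paper's exact configuration) computes which token positions a StreamingLLM-style cache keeps:

```python
def streaming_llm_keep(seq_len, n_sinks=4, window=1020):
    """Indices retained by a StreamingLLM-style cache: the first n_sinks
    tokens (attention sinks) plus a sliding window of the most recent
    tokens. Everything in between is dropped."""
    if seq_len <= n_sinks + window:
        return list(range(seq_len))                  # cache fits, keep all
    sinks = list(range(n_sinks))                     # structural anchor tokens
    recent = list(range(seq_len - window, seq_len))  # recency window
    return sinks + recent

# At 10,000 generated tokens the cache still holds only 4 + 1020 entries.
print(len(streaming_llm_keep(10_000)))               # -> 1024
```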
SnapKV (Observation Window Compression)
SnapKV addresses the prefill stage specifically, targeting long-prompt scenarios. It uses a small observation window at the end of the prompt to predict token importance. The attention scores from queries in this observation window are aggregated to vote for important positions (the heavy hitters) in the prefix.
Unlike H2O, SnapKV applies a pooling layer over the observation window's attention scores to select clustered important KV positions per attention head, rather than using a flat cumulative importance score across the whole sequence. This head-specific selection makes SnapKV more accurate than H2O at the same cache budget. SnapKV has become a widely used baseline for prefill-phase compression and is directly comparable to H2O on benchmarks such as LongBench.
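A minimal sketch of the vote-and-pool selection, assuming a toy attention tensor (the real method plugs into the model's prefill attention and also retains the observation window's own KV entries):

```python
import torch
import torch.nn.functional as F

def snapkv_select(attn, budget, pool_kernel=5):
    """SnapKV-style prefill selection (illustrative sketch). attn has shape
    (heads, obs_window, prefix_len): attention from observation-window
    queries to the prefix. Returns per-head indices of KV positions to keep."""
    votes = attn.sum(dim=1)                       # (heads, prefix_len) vote totals
    # 1D pooling clusters importance so neighbors of hot tokens survive too.
    votes = F.avg_pool1d(votes.unsqueeze(1), pool_kernel,
                         stride=1, padding=pool_kernel // 2).squeeze(1)
    topk = votes.topk(budget, dim=-1).indices     # independent choice per head
    return topk.sort(dim=-1).values               # keep positions in order

attn = torch.rand(8, 32, 4096).softmax(dim=-1)    # toy attention map
print(snapkv_select(attn, budget=512).shape)      # -> torch.Size([8, 512])
```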
PyramidKV / PyramidInfer (Layer-Wise Pyramidal Allocation)
A key limitation of H2O and SnapKV is that they apply a uniform compression budget across all Transformer layers. PyramidKV addresses this by allocating different cache sizes per layer based on the structure of attention patterns. The complementary system, PyramidInfer, extends this to the prefill phase itself.
PyramidInfer finds that the number of crucial keys and values influencing future generation decreases layer by layer, and identifies them by measuring the consistency of attention weights across recent tokens. By computing fewer keys and values in deeper layers during prefill, rather than pruning a pre-computed cache, PyramidInfer reduces memory earlier in the pipeline. Experimental results show PyramidInfer improves throughput by 2.2× compared to Hugging Face Accelerate, with over 54% GPU memory reduction in the KV cache.
The intuition aligns with how information funnels through Transformer depth: early layers need richer context, while deeper layers converge on a smaller set of salient tokens. Assigning compression budgets in proportion to each layer's actual information density is more efficient than applying a flat budget uniformly.
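As a toy illustration of the allocation idea, the sketch below assigns budgets that decay linearly with depth; the papers derive budgets from measured attention statistics rather than a fixed schedule, so treat this purely as an illustration of the pyramid shape:

```python
def pyramid_budgets(n_layers, max_budget, min_budget):
    """Per-layer KV cache budgets decaying linearly with depth (toy
    schedule illustrating the pyramidal allocation idea)."""
    step = (max_budget - min_budget) / (n_layers - 1)
    return [round(max_budget - i * step) for i in range(n_layers)]

budgets = pyramid_budgets(n_layers=32, max_budget=2048, min_budget=128)
print(budgets[0], budgets[15], budgets[31])   # -> 2048 1119 128
```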
KV Cache Quantization — KIVI
KIVI, published at ICML 2024, is a plug-and-play 2-bit KV cache quantization algorithm that requires no fine-tuning. It quantizes the key cache per-channel and the value cache per-token.
The asymmetric scheme is motivated by observed distributional differences: keys exhibit large channel-wise outliers, while values are better represented per-token. With this hardware-friendly design, KIVI enables models including Llama-2, Falcon, and Mistral to maintain comparable generation quality while reducing combined peak memory (model weights plus KV cache) by 2.6×. This enables up to 4× larger batch sizes and increases throughput by 2.35× to 3.47× on real inference workloads. The 2.6× figure covers model weights and KV cache together; at 2-bit precision the KV cache reduction itself is more aggressive, and it is this reduction that drives the batch-size scaling.
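The core asymmetry is easy to demonstrate with a generic uniform quantizer; the sketch below groups keys per-channel and values per-token (a simplified illustration, not KIVI's fused kernels or its full-precision handling of the most recent tokens):

```python
import torch

def fake_quant(x, bits, dim):
    """Uniform asymmetric quantize-dequantize along `dim`: each slice
    along the other axis gets its own scale and zero point."""
    lo = x.amin(dim=dim, keepdim=True)
    hi = x.amax(dim=dim, keepdim=True)
    scale = (hi - lo).clamp(min=1e-6) / (2 ** bits - 1)
    return ((x - lo) / scale).round() * scale + lo

k = torch.randn(1024, 128)                # (tokens, channels)
v = torch.randn(1024, 128)
k_q = fake_quant(k, bits=2, dim=0)        # per-channel: reduce over tokens
v_q = fake_quant(v, bits=2, dim=1)        # per-token: reduce over channels
print((k - k_q).abs().mean(), (v - v_q).abs().mean())
```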
KVQuant (Calibrated Mixed-Precision Quantization)
While KIVI applies a fixed asymmetric scheme, KVQuant takes a calibrated, multi-component approach to low-bit KV cache quantization. It combines per-channel key quantization, pre-RoPE key quantization (which avoids quantizing keys after positional embeddings have distorted their distribution), sensitivity-weighted non-uniform quantization that derives quantization levels from calibration data rather than fixed grids, and a dense-and-sparse decomposition that handles extreme outlier values separately from the bulk of the distribution.
This combination lets KVQuant push quantization to very low bit widths, including sub-4-bit, with better accuracy than fixed-precision schemes, targeting deployments that need to support extremely long contexts (the paper evaluates up to 10 million tokens of context). For production systems with stable workloads, the calibration cost is amortized across inference runs.
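Of the four components, the dense-and-sparse decomposition is the most self-contained to illustrate. The sketch below (an illustrative rendering with an assumed 1% outlier fraction) keeps the largest-magnitude values exact in a sparse tensor and leaves only the well-behaved remainder for low-bit quantization:

```python
import torch

def dense_sparse_split(x, outlier_frac=0.01):
    """Split x into a dense part (to be low-bit quantized) and a sparse
    part holding the extreme outliers at full precision."""
    k = max(1, int(x.numel() * outlier_frac))
    thresh = x.abs().flatten().topk(k).values.min()     # magnitude cutoff
    mask = x.abs() >= thresh
    sparse = torch.where(mask, x, torch.zeros_like(x))  # outliers, kept exact
    dense = torch.where(mask, torch.zeros_like(x), x)   # bulk, safe to quantize
    return dense, sparse

x = torch.randn(1024, 128)
dense, sparse = dense_sparse_split(x)
print(sparse.count_nonzero().item(), "outliers kept at full precision")
```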
TurboQuant (Near-Optimal Online KV Cache Quantization)
TurboQuant is Google Research's latest contribution in this area, accepted at ICLR 2026. It targets a known weakness in prior quantization methods: MSE-optimal scalar quantizers introduce systematic bias into inner product estimation, and that bias compounds across attention computations. TurboQuant addresses this with a two-stage pipeline.
The first stage, PolarQuant (AISTATS 2026), applies a random orthogonal rotation to each key and value vector before quantization. The rotation redistributes variance uniformly across all coordinates without changing the mathematical content, so each coordinate can be quantized accurately with a simple, analytically computed scalar quantizer. No training or calibration is required. The second stage applies a 1-bit Quantized Johnson-Lindenstrauss (QJL) correction to the quantization residual, which yields an unbiased inner product estimator. Together, the two stages achieve at least 6× memory reduction and up to 8× faster attention computation on NVIDIA H100 GPUs at 3-bit precision, operating within a factor of roughly 2.7 of the information-theoretic limit. Because TurboQuant uses random matrices rather than learned ones, it applies to any model at inference time with no offline preparation.
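The key property of the first stage, that an orthogonal rotation spreads variance across coordinates while leaving inner products (and thus attention logits) untouched, can be checked directly. This is a toy demonstration using a dense QR-sampled rotation; TurboQuant's actual rotation and its 1-bit QJL residual correction are more structured than this:

```python
import torch

def random_rotation(dim, seed=0):
    """Sample a random orthogonal matrix via QR decomposition."""
    g = torch.Generator().manual_seed(seed)
    q, r = torch.linalg.qr(torch.randn(dim, dim, generator=g))
    return q * torch.sign(torch.diagonal(r))    # sign fix for uniform sampling

rot = random_rotation(128)
k = torch.randn(1000, 128)
k_rot = k @ rot                                 # variance spread over coordinates
# Inner products are preserved exactly (up to float error), so quantizing
# the rotated vectors does not distort attention geometry systematically.
print((k @ k.T - k_rot @ k_rot.T).abs().max())  # -> tiny (float rounding only)
```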
Multi-Query Attention (MQA) and Grouped-Query Attention (GQA)
MQA and GQA are architectural modifications that shrink the KV cache by design rather than compressing an existing one. In MQA, all query heads share a single key and value head, dramatically reducing cache size. GQA groups multiple query heads to share a smaller set of key-value heads, offering a middle ground between full multi-head attention and MQA. Both require either training from scratch or fine-tuning; without proper training, applying them to pre-trained models typically degrades performance.
GQA has since become the de facto standard in modern open-weight LLMs. In Llama 2, only the 70B model used GQA; the 7B and 13B variants used standard multi-head attention. Llama 3 extended GQA across both the 8B and 70B sizes. Mistral adopted GQA from its initial 7B release in September 2023. For practitioners selecting or deploying new model families, GQA is now a baseline expectation rather than an optional optimization.
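The memory impact follows directly from the cache-size formula, since only KV heads (not query heads) are stored. The numbers below use a Llama-2-70B-like configuration (80 layers, head dimension 128, fp16) as an assumed example:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """KV cache size = 2 (K and V) x layers x kv_heads x head_dim
    x sequence length x batch x bytes per element."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

args = dict(n_layers=80, head_dim=128, seq_len=4096, batch=1)
mha = kv_cache_bytes(n_kv_heads=64, **args)   # full MHA: 64 KV heads
gqa = kv_cache_bytes(n_kv_heads=8, **args)    # GQA: 8 KV heads (Llama-2-70B)
mqa = kv_cache_bytes(n_kv_heads=1, **args)    # MQA: one shared KV head
print(f"MHA {mha/2**30:.1f} GiB, GQA {gqa/2**30:.2f} GiB, MQA {mqa/2**30:.2f} GiB")
```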
Multi-Head Latent Attention (MLA) — DeepSeek
MLA is DeepSeek's architectural answer to KV cache memory, first introduced in DeepSeek-V2 (May 2024) and carried forward in DeepSeek-V3 and DeepSeek-R1. It is an attention mechanism built around low-rank joint compression of keys and values. Rather than storing full-dimensional key and value tensors per token, MLA projects them into a compressed latent vector and stores that latent representation instead.
The results are the most dramatic of any technique on this list. Compared to DeepSeek's prior 67B dense model, DeepSeek-V2 with MLA reduces the KV cache by 93.3% while achieving stronger performance than standard multi-head attention. This is not a marginal improvement: it fundamentally changes the memory economics of serving large models, enabling significantly longer context windows and larger batch sizes on the same hardware. Research has also shown that MLA consistently offers higher expressive power than GQA under the same KV cache budget, providing a theoretical basis for the empirical gains. Among architectural approaches, MLA is currently the most validated at scale in open-weight models.
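Structurally, MLA caches a small latent vector per token and reconstructs keys and values by up-projection. The sketch below is a minimal rendering of that idea with illustrative dimensions; it omits DeepSeek's decoupled RoPE branch and the matrix-absorption tricks used in the real implementation:

```python
import torch
import torch.nn as nn

class LatentKV(nn.Module):
    """Minimal MLA-style joint KV compression: only c_kv is cached."""
    def __init__(self, d_model=4096, d_latent=512, d_kv=4096):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)  # compress
        self.up_k = nn.Linear(d_latent, d_kv, bias=False)     # rebuild K
        self.up_v = nn.Linear(d_latent, d_kv, bias=False)     # rebuild V

    def forward(self, h):
        c_kv = self.down(h)        # (tokens, d_latent): the cached tensor
        return self.up_k(c_kv), self.up_v(c_kv), c_kv

k, v, c = LatentKV()(torch.randn(16, 4096))
print(c.shape[-1] / (k.shape[-1] + v.shape[-1]))  # latent is ~6% of full K+V
```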
Low-Rank KV Cache Compression (Palu / LoRC)
Low-rank compression targets the hidden dimension of KV tensors rather than the sequence length or the bit width. Palu is a post-training KV cache compression framework that shrinks the cache through low-rank projection of the key and value weight matrices. It proposes a medium-grained, group-head low-rank decomposition that balances accuracy against reconstruction overhead, and uses an efficient rank search algorithm based on Fisher information to automatically assign larger ranks to more sensitive weight matrices and smaller ranks to less important ones.
Related methods in this family include LoRC, SVDq, CSKV, and ReCalKV, all of which exploit the observation that key and value matrices across attention heads exhibit significant low-rank structure, particularly at longer contexts. Low-rank methods are orthogonal to both quantization and token eviction and can be stacked with either for compounded compression. This family remains relatively underexplored compared to eviction-based methods, making it an active area of research.
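The common core of this family is a truncated-SVD factorization of a projection weight, so the cache stores rank-sized latents instead of full-width keys. A generic sketch under assumed toy dimensions (Palu additionally groups heads and searches per-matrix ranks via Fisher information):

```python
import torch

def low_rank_factor(w, rank):
    """Truncated SVD of a projection weight w (d_out, d_in): w ~ a @ b."""
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]     # (d_out, rank) up-projection
    b = vh[:rank]                  # (rank, d_in) down-projection
    return a, b

w_k = torch.randn(4096, 4096)       # toy key projection weight
a, b = low_rank_factor(w_k, rank=1024)
x = torch.randn(8, 4096)            # hidden states for 8 tokens
latent = x @ b.T                    # cached: 1024 dims instead of 4096
k_approx = latent @ a.T             # keys reconstructed on the fly
print(latent.shape, k_approx.shape)
```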
Key Takeaways:
- KV cache growth scales with both sequence length and batch size, making compression essential for high-throughput serving.
- Token eviction (H2O, StreamingLLM, SnapKV) is training-free and hardware-compatible but discards tokens permanently; SnapKV selects clustered important KV positions per head via pooled attention scores rather than flat cumulative scores.
- Quantization (KIVI, KVQuant, TurboQuant) reduces memory without removing tokens. KIVI achieves a 2.6× combined peak memory reduction (model weights + KV cache) at 2-bit precision; TurboQuant achieves 6× memory reduction at 3-bit precision with no calibration, operating near the information-theoretic limit.
- Low-rank methods (Palu, LoRC, MLA) target hidden-dimension redundancy and remain underexplored relative to token eviction.
- Architectural solutions (GQA, MLA) must be built in at training time. In Llama 2, only the 70B model used GQA; Llama 3 extended it across all sizes. MLA achieves a 93.3% KV cache reduction in DeepSeek-V2.
- The 2026 research frontier is moving toward latent-space compaction (Attention Matching, 50× compaction) and reasoning-aware compression (TriAttention, 10.7× memory reduction on AIME25 at matched accuracy).
