The KV Cache Compression Race: TurboQuant vs OSCAR vs EpiCache

Long-context large language models (LLMs) face a memory bottleneck that has nothing to do with model weights. During decoding, transformers cache the key and value (KV) vectors for every token at every layer so that they don’t have to recompute attention. This cache grows linearly with sequence length and batch size, and at long context with high concurrency it can dwarf the model’s own footprint.

Consider Llama-3.1-70B in BF16. Its KV cache costs about 0.31 MB per token (80 layers × 8 KV heads × 128 head-dim × 2 tensors × 2 bytes). At 128K tokens that’s ~40 GB; at 1M tokens it exceeds 300 GB, more than the 140 GB of weights themselves. Worse, every newly decoded token has to stream the entire cache out of high-bandwidth memory (HBM), which makes decoding memory-bandwidth-bound rather than compute-bound. Shrinking the KV cache is therefore the most direct lever for cutting both cost and decode latency.
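The arithmetic behind those figures can be reproduced in a few lines. This is a minimal sketch: the function names are made up for illustration, and sizes are reported in binary units (MiB and GiB), which is what the rounded numbers above correspond to.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache that one token occupies across all layers."""
    tensors = 2  # one key tensor and one value tensor per layer
    return layers * kv_heads * head_dim * tensors * bytes_per_value


def kv_cache_gib(tokens: int, per_token_bytes: int, batch: int = 1) -> float:
    """Total cache size in GiB for `batch` sequences of `tokens` tokens each."""
    return tokens * batch * per_token_bytes / 2**30


# Llama-3.1-70B figures quoted in the text: 80 layers, 8 KV heads, head-dim 128, BF16.
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
print(per_token)                             # 327680 bytes, about 0.31 MiB
print(kv_cache_gib(128 * 1024, per_token))   # 40.0
print(kv_cache_gib(1024 * 1024, per_token))  # 320.0
```

The `batch` argument makes the concurrency effect explicit: the cache scales with the product of sequence length and batch size, while the weights are paid for only once.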

Current approaches fall into roughly five families: token eviction (H2O, SnapKV), quantization (KIVI, GEAR), low-rank projection (Palu), merging (KVMerger), and architectural sharing (MLA). Recent 2026 work has pushed hard on the ultra-low-bit quantization frontier. Google and NYU’s TurboQuant (ICLR 2026) and Together AI’s OSCAR attack the same problem from opposite directions, while Apple’s EpiCache tackles a problem neither one addresses.

Most KV quantizers are fighting the same underlying enemy: outlier channels, a handful of channels with disproportionately large magnitudes that dominate the quantization range and squeeze the rest of the signal into just a few representable levels. This is why naive INT2 quantization (only four levels) collapses to near-zero accuracy.
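A toy example makes the squeeze concrete. The sketch below assumes a plain min-max uniform quantizer and hand-picked values; it is not taken from any of the papers discussed, only an illustration of how one large channel stretches the range.

```python
def quantize_uniform(values, bits):
    """Min-max uniform quantization; returns the dequantized values."""
    levels = 2**bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((v - lo) / scale) * scale for v in values]


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


normal = [-0.9, -0.4, -0.1, 0.2, 0.5, 0.8, -0.6, 0.3]
with_outlier = normal[:-1] + [40.0]  # one channel with a huge magnitude

deq_normal = quantize_uniform(normal, bits=2)
deq_outlier = quantize_uniform(with_outlier, bits=2)

err_normal = mean_abs_error(normal, deq_normal)
err_outlier = mean_abs_error(with_outlier, deq_outlier)

# Count how many of the four levels the seven ordinary channels actually use.
levels_used = len({round(v, 6) for v in deq_outlier[:-1]})
print(err_normal, err_outlier, levels_used)
```

With the outlier present, the step size is so large that every ordinary channel rounds to the same level, so three of the four INT2 levels are spent on a range that holds a single value.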

KIVI established the standard baseline here. It showed that key vectors have fixed outlier channels across tokens while value vectors don’t, so it quantizes keys per-channel and values per-token. That tuning-free 2-bit recipe cuts end-to-end peak memory (weights included) by about 2.6×, and it’s the reference point the newer methods build on.
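The difference between the two grouping directions is easy to see on a small key matrix. This is a sketch under stated assumptions: the matrix is invented, the quantizer is a simple min-max one, and KIVI's actual grouping and residual handling are not modeled.

```python
def quantize_uniform(values, bits=2):
    """Min-max uniform quantization; returns the dequantized values."""
    levels = 2**bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((v - lo) / scale) * scale for v in values]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def quantize_per_token(matrix, bits=2):
    """Each row (token) gets its own quantization range."""
    return [quantize_uniform(row, bits) for row in matrix]


def quantize_per_channel(matrix, bits=2):
    """Each column (channel) gets its own quantization range."""
    return transpose(quantize_per_token(transpose(matrix), bits))


def mse(a, b):
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


# Rows are tokens, columns are channels; channel 2 is an outlier for every token.
keys = [
    [0.3, -0.5, 12.1, 0.1],
    [-0.2, 0.4, 11.7, -0.6],
    [0.5, 0.1, 12.4, 0.2],
    [-0.4, -0.3, 11.9, 0.6],
]

per_token_err = mse(keys, quantize_per_token(keys))
per_channel_err = mse(keys, quantize_per_channel(keys))
print(per_token_err, per_channel_err)
```

Because the outlier lives in the same column for every token, a per-channel range isolates it, while a per-token range forces every row to cover it.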

TurboQuant: data-oblivious and theoretically optimal

TurboQuant handles outliers without ever looking at your data, in two stages:

Stage one: each vector is randomly rotated so its coordinates become nearly independent and roughly Gaussian, which lets an optimal precomputed scalar (Lloyd–Max) quantizer be applied per coordinate.
Stage two: a 1-bit Quantized Johnson–Lindenstrauss (QJL) transform is applied to the residual, giving a provably unbiased estimate of attention logits with no normalization-constant overhead.
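Both stages can be sketched in plain Python. Several things here are assumptions of the illustration rather than the paper's implementation: a sign-flip-plus-Hadamard transform stands in for the random rotation, the Lloyd–Max quantizer is omitted, the "residual" is just a random vector, and the sketch keeps the vector norm as one extra float without claiming anything about the real storage layout. All function names are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hadamard(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(vec)
    n, h = len(out), 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return [x / math.sqrt(n) for x in out]


def random_rotate(vec, signs):
    """Random sign flips followed by a Hadamard transform (stand-in rotation)."""
    return hadamard([s * x for s, x in zip(signs, vec)])


def qjl_encode(vec, projections):
    """Keep one sign bit per Gaussian projection, plus the vector norm."""
    bits = [1.0 if dot(row, vec) >= 0 else -1.0 for row in projections]
    return bits, math.sqrt(dot(vec, vec))


def qjl_inner_product(query, bits, norm, projections):
    """Unbiased estimate of <query, vec> from the sign bits of vec."""
    total = sum(dot(row, query) * b for row, b in zip(projections, bits))
    return math.sqrt(math.pi / 2) * norm * total / len(projections)


rng = random.Random(0)
dim = 16

# Stage one: an outlier-heavy key becomes flat after rotation.
key = [0.1] * (dim - 1) + [20.0]
signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
rotated = random_rotate(key, signs)
peak_before = max(abs(x) for x in key)
peak_after = max(abs(x) for x in rotated)

# Stage two: 1-bit sketch of a residual-like vector, queried at full precision.
residual = [rng.gauss(0, 1) for _ in range(dim)]
query = [rng.gauss(0, 1) for _ in range(dim)]
projections = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(4000)]
bits, norm = qjl_encode(residual, projections)
estimate = qjl_inner_product(query, bits, norm, projections)
exact = dot(query, residual)
print(peak_before, peak_after)
print(exact, estimate)
```

The rotation preserves the vector's norm but spreads the outlier's energy over all coordinates, which is what makes a single fixed scalar quantizer reasonable. The sign-bit estimator is noisy for a small number of projections; the 4000 used here is chosen only to make the agreement visible.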

The selling point is theoretical: TurboQuant’s distortion is provably within a small constant factor (≈ 2.7×) of the information-theoretic lower bound. In practice it reaches essentially full-precision recall on Needle-in-a-Haystack at 4× compression, and the paper reports absolute quality neutrality at 3.5 bits and only marginal degradation at 2.5 bits per channel. Because it needs no calibration, it works on any model untouched and doubles as a fast vector-database quantizer.

One caveat worth flagging: the widely repeated “8× faster attention on H100” figure comes from Google’s blog, not the paper, and refers to a narrow attention-logit microbenchmark. TurboQuant’s documented sweet spot is the 3–4 bit near-lossless regime.

*Image source:* Data from the TurboQuant paper – https://arxiv.org/abs/2504.19874

OSCAR: attention-aware and deployment-ready

OSCAR bets the other way. Its premise is that at INT2’s four levels, a data-oblivious rotation is the wrong tool: blindly smoothing ranges isn’t enough when there’s almost no precision to spare. So OSCAR computes an attention-aware rotation from a one-time offline calibration pass: keys are rotated into the eigenbasis of the query covariance, values into that of the score-weighted value covariance. A Hadamard transform plus a bit-reversal permutation then spread channel importance evenly across the quantization groups.
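The role of the bit-reversal permutation can be shown in isolation. The sketch assumes that channels come out of the eigenbasis rotation ordered from most to least important and that quantization groups are contiguous blocks of four; both are assumptions for illustration, and how OSCAR actually combines the permutation with the Hadamard transform is not modeled here.

```python
def bit_reverse(index: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversal_order(n: int):
    """Bit-reversal permutation of range(n); n must be a power of two."""
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]


# Channel ids sorted by importance: 0 is the most important, 15 the least.
channels_by_importance = list(range(16))
order = bit_reversal_order(16)
permuted = [channels_by_importance[j] for j in order]

group_size = 4  # hypothetical quantization group size
groups = [permuted[i:i + group_size] for i in range(0, len(permuted), group_size)]
print(groups)  # [[0, 8, 4, 12], [2, 10, 6, 14], [1, 9, 5, 13], [3, 11, 7, 15]]
```

Without the permutation, the first group would hold channels 0 to 3 and carry most of the importance; after it, every group contains one channel from each quarter of the ranking.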

What sets OSCAR apart is that it ships as a complete system, not just an algorithm:

Mixed-precision paged cache: sink and recent tokens stay in BF16 while the history compresses to INT2; at 128K context only ~0.24% of tokens remain in BF16.
Fused Triton kernels with full SGLang integration (paged-attention and prefix-cache compatible).
Precomputed rotations (a “RotationZoo”) for Qwen3-4B/8B/32B, GLM-4.7-FP8, and MiniMax-M2.7, so no recalibration is needed.
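The mixed-precision layout in the first item can be sketched as a simple position rule. The window sizes below are hypothetical, picked only so the BF16 fraction lands near the quoted ~0.24%; they are not OSCAR's documented settings, and quantization metadata such as scales is not modeled.

```python
def token_precision(position: int, context_len: int,
                    sink_tokens: int, recent_tokens: int) -> str:
    """Return the storage format for the token at `position`."""
    is_sink = position < sink_tokens
    is_recent = position >= context_len - recent_tokens
    return "bf16" if is_sink or is_recent else "int2"


def bf16_fraction(context_len: int, sink_tokens: int, recent_tokens: int) -> float:
    """Share of tokens that stay in BF16."""
    kept = min(context_len, sink_tokens + recent_tokens)
    return kept / context_len


SINK, RECENT = 64, 256  # hypothetical window sizes, not taken from the paper
ctx = 128 * 1024

frac = bf16_fraction(ctx, SINK, RECENT)
payload_bits = frac * 16 + (1 - frac) * 2  # average bits per value, payload only
print(f"{frac:.4%} of tokens in BF16, {payload_bits:.3f} payload bits per value")
```

Because the BF16 windows have a fixed size, their share shrinks as the context grows, which is why the high-precision tokens cost almost nothing at 128K.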

At an effective 2.28 bits, OSCAR lands within 1.42 points of BF16 on Qwen3-8B and is essentially on par on Qwen3-32B (a 0.02-point gap). On GLM-4.7-FP8, where naive INT2 collapses to zero and data-oblivious baselines reach only low single digits, OSCAR matches BF16 and even edges slightly ahead on the reported benchmarks (within noise). Together AI reports up to 7.83× job-level throughput and roughly 8× KV-cache memory reduction at 100K context, with up to ~3× faster decoding.

*Image source:* Data from the OSCAR paper – https://arxiv.org/abs/2605.17757

So which one wins?

Neither, and that’s the honest answer. For deployable INT2 at 128K tokens on supported models, OSCAR is currently the only demonstrated option that doesn’t collapse, and it comes with production-ready SGLang support. For training-free, model-agnostic quantization in the 3–4 bit regime, TurboQuant offers far broader generality.

OSCAR’s paper reports that TurboQuant drops by more than 40 points at a comparable budget, but that evaluation runs inside OSCAR’s own framework, quantizes all layers, uses a single random seed, and operates well below TurboQuant’s intended bit-width, so it’s a weak basis for a head-to-head verdict. The more interesting possibility is that the two are complementary: pairing a calibration-aware rotation with an optimal scalar quantizer is a promising combination nobody has shipped yet. (Both teams have publicly noted the same idea.)

*Image source:* Data from the OSCAR paper – https://arxiv.org/abs/2605.17757

The third axis: EpiCache

TurboQuant and OSCAR are both built for a single long context. Neither handles extended multi-turn conversations, where history piles up across many exchanges. Apple’s EpiCache is a training-free KV-cache management framework aimed precisely at that gap:

Block-wise prefill processes history in blocks to keep peak memory bounded.
Episodic clustering segments the conversation into coherent semantic “episodes,” each with its own compressed cache.
Episode-matched retrieval routes each query to the most relevant episode at inference time.
Adaptive layer-wise budget allocation measures each layer’s sensitivity to eviction and distributes the memory budget accordingly.
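Two of these steps, query routing and budget allocation, reduce to small computations that can be sketched directly. Everything specific here is assumed for illustration: the embeddings and sensitivity scores are invented, cosine similarity is one plausible matching rule, and proportional splitting is one plausible allocation rule. EpiCache's actual sensitivity measure and allocation formula are not described in this article.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route_to_episode(query_embedding, centroids):
    """Index of the episode whose centroid is most similar to the query."""
    return max(range(len(centroids)),
               key=lambda i: cosine(query_embedding, centroids[i]))


def allocate_budget(total_tokens, sensitivities):
    """Split a token budget across layers in proportion to their sensitivity."""
    total = sum(sensitivities)
    shares = [int(total_tokens * s / total) for s in sensitivities]
    leftover = total_tokens - sum(shares)
    # Hand any rounding leftovers to the most sensitive layers first.
    by_sensitivity = sorted(range(len(shares)), key=lambda i: -sensitivities[i])
    for i in by_sensitivity[:leftover]:
        shares[i] += 1
    return shares


# Hypothetical episode centroids and a query embedding.
centroids = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query = [0.1, 0.9, 0.05]
episode = route_to_episode(query, centroids)

# Hypothetical per-layer sensitivity scores for a four-layer model.
budget = allocate_budget(1000, [1.0, 3.0, 4.0, 2.0])
print(episode, budget)  # 1 [100, 300, 400, 200]
```

Only the cache of the selected episode needs to be resident for a given query, and layers that tolerate eviction poorly keep more of their tokens.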

Across LongMemEval, RealTalk, and LoCoMo, EpiCache reports up to 40% higher accuracy than eviction baselines, near-full-cache accuracy at 4–6× compression, and up to 3.5× lower peak memory (and ~2.4× lower latency). Because it decides which tokens to keep rather than how precisely to store them, it composes directly with OSCAR or TurboQuant for compounding savings.

Key Takeaways

TurboQuant pushes the theoretical, model-agnostic frontier: the go-to for 3–4 bit near-lossless compression on any model.
OSCAR leads on deployable INT2, with up to 7.83× throughput and ~8× memory reduction at 100K context on supported models.
EpiCache solves conversational memory across turns, with up to 40% accuracy gains over eviction and 3.5× lower peak memory, and composes with either quantizer.
Pick by constraint: bit-width budget, model portability, or conversation length, then combine the orthogonal methods that fit. These approaches are more complementary than competitive.

The post The KV Cache Compression Race: TurboQuant vs OSCAR vs EpiCache appeared first on MarkTechPost.