Together AI Open-Sources OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Serving
Long-context inference makes the KV cache one of the principal costs of serving LLMs. During autoregressive decoding, the cache grows with context length, batch size, and model depth. At high batch sizes and long contexts, for example 100K tokens across dozens of concurrent requests, the KV cache consumes a large fraction of GPU memory. Compressing it is a direct way to increase batch size and reduce memory traffic.
The obvious approach is quantization. But pushing KV caches to INT2 (2-bit) precision has been largely impractical. Prior methods either collapse in accuracy or require custom serving layouts incompatible with paged KV-cache systems. Together AI's OSCAR (Offline Spectral Covariance-Aware Rotation) addresses both problems.
Why INT2 KV Cache Quantization is Hard
KV activations contain channel-wise outliers. A small subset of channels holds extremely large values, while most channels are well-behaved. When you apply INT2 quantization, which has only four representable levels, those outliers dominate the scale factor. The quantizer wastes most of its range on rare spikes, and normal values get compressed into only one or two effective levels. This degrades attention quality significantly.
Rotation-based quantization addresses this by applying a fixed orthogonal transform, typically a Hadamard transform, to redistribute outlier energy across all channels. This approach works reasonably well at INT4. At INT2, a deeper problem remains: the rotation is data-oblivious. It can smooth activation ranges, but it does not know which directions the attention mechanism actually reads. Spreading quantization error uniformly is not the same as pushing it into low-importance directions. At INT2, with only four levels, that difference determines whether the model works at all.
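To make the outlier effect concrete, here is a minimal NumPy sketch. It is an illustration only, not OSCAR's quantizer: the helper name `quantize_int2_asym`, the synthetic Gaussian group, and the spike value of 50 are all assumptions chosen for the demo.

```python
import numpy as np

def quantize_int2_asym(x):
    """Asymmetric 2-bit quantization of one group: 4 levels spanning [min, max]."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 3.0 if hi > lo else 1.0  # 3 steps between 4 levels
    codes = np.clip(np.round((x - lo) / scale), 0, 3).astype(np.uint8)
    return codes, scale, lo

def dequantize(codes, scale, lo):
    return codes.astype(np.float64) * scale + lo

rng = np.random.default_rng(0)
group = rng.normal(0.0, 1.0, size=64)  # synthetic well-behaved channels
spiked = group.copy()
spiked[0] = 50.0                       # one synthetic outlier channel

codes_clean, s_clean, z_clean = quantize_int2_asym(group)
codes_spiked, s_spiked, z_spiked = quantize_int2_asym(spiked)

# Count the levels actually used by the 63 ordinary channels.
levels_clean = len(np.unique(codes_clean[1:]))
levels_spiked = len(np.unique(codes_spiked[1:]))

# Reconstruction error on the ordinary channels only.
err_clean = np.mean((dequantize(codes_clean, s_clean, z_clean)[1:] - group[1:]) ** 2)
err_spiked = np.mean((dequantize(codes_spiked, s_spiked, z_spiked)[1:] - spiked[1:]) ** 2)

print("levels used without outlier:", levels_clean)
print("levels used with outlier:   ", levels_spiked)
print("MSE on ordinary channels:   ", err_clean, "vs", err_spiked)
```

With the spike present, the step size is set by the outlier, so the ordinary channels share far fewer codes and their reconstruction error grows.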
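The redistribution effect of a Hadamard rotation can be sketched in a few lines. This is an illustrative example under stated assumptions: the `hadamard` helper uses the standard Sylvester construction, and the one-hot input vector is synthetic.

```python
import numpy as np

def hadamard(d):
    """Normalized Walsh-Hadamard matrix (Sylvester construction); d must be a power of two."""
    assert d > 0 and d & (d - 1) == 0
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(d)

d = 64
x = np.zeros(d)
x[3] = 50.0          # all energy sits in one outlier channel

H = hadamard(d)
y = x @ H            # rotated activation

print("max |x| before rotation:", np.abs(x).max())
print("max |y| after rotation: ", np.abs(y).max())
print("norm preserved:", np.isclose(np.linalg.norm(x), np.linalg.norm(y)))
```

The same `H` is applied regardless of the data, which is what makes the rotation data-oblivious: it flattens the range but carries no information about which directions attention depends on.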

What OSCAR Does Differently
OSCAR's key observation is that the rotation applied before quantization should be derived from attention statistics themselves, not from the raw distribution of KV activations.
For keys, the downstream error that matters is not the Euclidean reconstruction error of $K$. It is the error in attention logits. The research team showed this error is $\|QK^\top - Q\hat{K}^\top\|_F^2 = \operatorname{tr}\big((K - \hat{K})\,Q^\top Q\,(K - \hat{K})^\top\big)$. The weighting matrix is the query covariance $Q^\top Q$, not $K^\top K$. Directions where queries have large energy amplify quantization errors in logits. OSCAR estimates the empirical query covariance $C_Q = \frac{1}{N}\sum_n q_n^\top q_n$ from a calibration set, eigendecomposes it, and uses the eigenvectors $U_Q$ as the key rotation basis.
For values, the relevant error is in the attention output $SV$. This depends on how the attention score matrix $S$ weights each value row. The research team defines the score-weighted value covariance $C_S = \frac{1}{N} V^\top S^\top S V$. Directions that remain large after aggregation by $S$ are the ones that quantization error propagates through. OSCAR uses the eigenvectors $U_S$ of $C_S$ as the value rotation basis.
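The logit-error identity and the construction of the key basis can be checked numerically. The sketch below uses synthetic data: the shapes `N` and `d`, the anisotropic query scaling, and the noisy `K_hat` standing in for quantized keys are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 256, 16

# Synthetic queries whose energy is concentrated in the first few directions.
Q = rng.normal(size=(N, d)) * np.linspace(4.0, 0.25, d)
K = rng.normal(size=(N, d))
K_hat = K + 0.1 * rng.normal(size=(N, d))   # stand-in for quantized keys

# Logit error written two ways: directly, and as a trace weighted by Q^T Q.
E = K - K_hat
lhs = np.linalg.norm(Q @ K.T - Q @ K_hat.T, "fro") ** 2
rhs = np.trace(E @ (Q.T @ Q) @ E.T)
print("identity holds:", np.isclose(lhs, rhs))

# Empirical query covariance and its eigenbasis, sorted by decreasing eigenvalue.
C_Q = (Q.T @ Q) / N
eigvals, U_Q = np.linalg.eigh(C_Q)          # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, U_Q = eigvals[order], U_Q[:, order]
print("largest / smallest eigenvalue:", eigvals[0] / eigvals[-1])
```

In the basis `U_Q` the weighting matrix becomes diagonal, so the eigenvalues directly rank how strongly an error along each direction shows up in the logits.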
The final composed rotations are:
$$R_K = U_Q \, H_{\mathrm{Had}} \, P_{\mathrm{br}}, \qquad R_V = U_S \, H_{\mathrm{Had}} \, P_{\mathrm{br}}$$
Each of the three components addresses a distinct failure mode of per-group low-bit quantization:
- $U_Q$ / $U_S$ aligns channels with attention-importance directions. This diagonalizes the error-weighting matrix so the most important directions are identifiable.
- $H_{\mathrm{Had}}$ (Walsh-Hadamard transform) then equalizes channel importance exactly. Lemma 1 in the research paper proves every diagonal entry of $H_{\mathrm{Had}}^\top \Lambda H_{\mathrm{Had}}$ equals $\operatorname{tr}(\Lambda)/d$: the peaky eigenspectrum exposed by $U_Q$ is compressed to a uniform value across all channels.
- $P_{\mathrm{br}}$ (permuted bit-reversal) reorders channels so that for any power-of-two quantization group size, each group receives one representative from each level of the importance hierarchy.
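The equalization property attributed to Lemma 1 is easy to verify numerically for a normalized Hadamard matrix. In this sketch the spectrum `lam` is synthetic and the `hadamard` helper is an assumed Sylvester-style construction; the permutation $P_{\mathrm{br}}$ is not modeled.

```python
import numpy as np

def hadamard(d):
    """Normalized Walsh-Hadamard matrix (Sylvester construction); d must be a power of two."""
    assert d > 0 and d & (d - 1) == 0
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(d)

d = 32
rng = np.random.default_rng(0)
lam = np.sort(rng.exponential(size=d) ** 2)[::-1]  # synthetic peaky eigenspectrum
Lam = np.diag(lam)

H = hadamard(d)
M = H.T @ Lam @ H

print("spectrum max / min:", lam.max() / lam.min())
print("diag(M) max / min: ", np.diag(M).max() / np.diag(M).min())
print("tr(Lam) / d:       ", lam.sum() / d)
```

Every entry of a normalized Hadamard matrix has squared magnitude $1/d$, so each diagonal entry of `M` is the plain average of the eigenvalues, whatever their spread.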
The research team provides Theorem 1, proving $U_Q$ and $U_S$ are optimal under a frozen-error surrogate objective with diagonal residual assumptions.
The Serving System: Mixed-Precision Cache Layout
OSCAR integrates into SGLang's production serving stack as an INT2 KV-cache mode with full compatibility with paged attention.
The KV cache layout uses three regions per request:
- Sink tokens (first $S_0 = 64$ tokens): stored in BF16. These function as attention sinks.
- Recent tokens (last $W = 256$ tokens before the current position): stored in BF16.
- History tokens (everything in between): stored as INT2 after OSCAR rotation and clipping.
At 128K context length, the BF16 sink and recent windows represent only 0.24% of total tokens. The ablation (Table 5 in the research paper) shows that a sink size of 64 and a recent window of 256 (S=64, R=256) is the accuracy-efficiency knee: smaller windows noticeably hurt accuracy, while larger windows give negligible additional benefit at higher BF16 memory cost.
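The three-region split and the 0.24% figure can be reproduced with a short sketch. The function names `region` and `bf16_fraction` are hypothetical helpers for illustration, and 128K is taken to mean 131,072 tokens.

```python
def region(pos, seq_len, sink=64, recent=256):
    """Classify a token position into one of the three cache regions."""
    if pos < sink:
        return "sink"       # kept in BF16
    if pos >= seq_len - recent:
        return "recent"     # kept in BF16
    return "history"        # stored as INT2

def bf16_fraction(seq_len, sink=64, recent=256):
    """Fraction of tokens that stay in BF16 for a sequence of the given length."""
    return min(seq_len, sink + recent) / seq_len

seq_len = 128 * 1024
frac = bf16_fraction(seq_len)
print(f"{100 * frac:.2f}% of tokens stay in BF16")  # prints 0.24%
```

Because the BF16 windows have fixed size, their share shrinks as the context grows, which is why the mixed-precision layout costs little at long context.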

Write and read paths use fused Triton kernels. On the write path, each token is rotated, clipped to a calibration-derived percentile threshold (typical values: $c_K = 0.96$, $c_V = 0.92$), then quantized with per-token asymmetric INT2 at a default group size of $G_K = 64$ channels per group. On the read path, the INT2 kernel unpacks bytes, dequantizes, inverse-rotates, and passes results to the attention kernel, all in one fused pass without extra memory traffic. The value rotation $R_V$ is absorbed into the model's projection weights offline, eliminating its online compute cost.
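As a rough reference for what the two paths compute, here is an unfused NumPy sketch. It is not the Triton implementation, and several details are assumptions: the random orthogonal `R` stands in for a calibrated rotation, the clipping rule (shrinking each group's range about its midpoint by the factor `clip`) is a guess at how the threshold might be applied, and the byte-packing order is hypothetical.

```python
import numpy as np

GROUP = 64  # channels per quantization group

def write_path(x, R, clip=0.96):
    """Rotate one token vector, clip, quantize per group to INT2, pack 4 codes per byte."""
    g = (x @ R).reshape(-1, GROUP)
    g_min = g.min(axis=1, keepdims=True)
    g_max = g.max(axis=1, keepdims=True)
    mid, half = (g_max + g_min) / 2, clip * (g_max - g_min) / 2  # assumed clipping rule
    lo, hi = mid - half, mid + half
    scale = np.maximum(hi - lo, 1e-8) / 3.0                      # 3 steps between 4 levels
    codes = np.clip(np.round((g - lo) / scale), 0, 3).astype(np.int64)
    c = codes.reshape(-1, 4)
    packed = c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)
    return packed.astype(np.uint8), scale, lo

def read_path(packed, scale, lo, R):
    """Unpack bytes, dequantize per group, and apply the inverse rotation."""
    shifts = np.array([0, 2, 4, 6])
    codes = (packed.astype(np.int64)[:, None] >> shifts) & 3
    g = codes.reshape(-1, GROUP).astype(np.float64) * scale + lo
    return g.reshape(-1) @ R.T  # R is orthogonal, so its inverse is R.T

rng = np.random.default_rng(0)
d = 128
R, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal stand-in rotation
x = rng.normal(size=d)

packed, scale, lo = write_path(x, R)
x_hat = read_path(packed, scale, lo, R)
print("bytes for codes:", packed.nbytes, "vs BF16 bytes:", 2 * d)
print("relative error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```

The per-group scale and offset are kept as floats here for simplicity. A real kernel would store them compactly alongside the packed codes and fuse all of these steps into the attention read.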
Results
The research team evaluated OSCAR on four model configurations: Qwen3-4B-Thinking-2507, Qwen3-8B, Qwen3-32B, and GLM-4.7-FP8 (358B parameters). Benchmarks include AIME25, GPQA-Diamond, HumanEval, LiveCodeBench v6, and MATH500, all at 32K maximum generation length.
Accuracy (at 2.28 bits per KV element):
| Model | BF16 Mean | OSCAR Mean | Gap to BF16 |
|---|---|---|---|
| Qwen3-4B-Thinking-2507 | 75.64 | 71.86 | −3.78 |
| Qwen3-8B | 70.84 | 69.42 | −1.42 |
| Qwen3-32B | 74.19 | 74.17 | −0.02 |
| GLM-4.7-FP8 (358B) | 77.89 | 78.16 | +0.27 |
For context on how competing methods compare: naive INT2 (no rotation) scores 0.00 on both Qwen3-4B and Qwen3-8B. QuaRot-INT2 (Hadamard-only rotation) scores 1.40 on Qwen3-4B and 10.14 on Qwen3-8B. TurboQuant at 3.25 bits drops 43.90 points on Qwen3-4B-Thinking. Saw-INT4 at 4.25 bits reaches 73.11 on Qwen3-4B, while OSCAR at 2.28 bits reaches 71.86.

The research team also compared against channel-wise methods on AIME25 (Table 1). On Qwen3-8B, OSCAR at 2.38 BPE achieves 66.67±3.33, above KIVI-KV2* at 57.67 (2.26 BPE) and Kitty at 59.67 (2.39 BPE). Note that channel-wise methods require residual buffers or custom page layouts that do not fit standard paged-attention serving, so this comparison is limited to the only shared benchmark where results were available.
Long-context robustness (RULER-NIAH):
| Model | Method | 16K | 32K | 64K | 128K |
|---|---|---|---|---|---|
| Qwen3-4B-Thinking | BF16 | 99.7 | 99.3 | 85.3 | 81.0 |
| Qwen3-4B-Thinking | QuaRot-INT2 | 0.0 | 0.0 | 15.6 | 0.0 |
| Qwen3-4B-Thinking | OSCAR | 97.8 | 87.6 | 61.9 | 39.5 |
| Qwen3-8B | BF16 | 98.9 | 97.3 | 79.2 | 78.2 |
| Qwen3-8B | QuaRot-INT2 | 19.0 | 9.8 | 0.0 | 0.0 |
| Qwen3-8B | OSCAR | 93.9 | 86.3 | 61.9 | 45.0 |
On GLM-4.7-FP8, OSCAR matches the BF16 curve through 128K.
Throughput (H100, 100K context, batch size 1):
Decode throughput speedup relative to BF16, at increasing context lengths:
| Model | 30K | 60K | 100K |
|---|---|---|---|
| Qwen3-4B-Thinking | 1.98× | 2.52× | 3.08× |
| Qwen3-8B | 1.84× | 2.29× | 2.88× |
| GLM-4.7-FP8 | 1.98× | 2.49× | 2.83× |
At batch size 32, job-level throughput at 100K context reaches 6.17× over BF16 on Qwen3-4B-Thinking and 7.83× on GLM-4.7-FP8. The speedup increases with context length because decoding becomes increasingly KV-bandwidth-bound. Reducing KV memory by 8× directly reduces that bottleneck. The online rotation overhead is absorbed into the decode kernels.
Key Takeaways
- OSCAR quantizes LLM KV caches to 2-bit precision by rotating activations using attention-aware covariance matrices, not generic Hadamard transforms.
- At 2.28 bits per KV element, OSCAR stays within 3.78 points of BF16 accuracy on Qwen3-4B-Thinking while naive INT2 collapses to zero.
- KV cache memory drops roughly 8×, decode speed improves up to 3× at 100K context, and job-level throughput reaches up to 7.83× at large batch sizes.
- Pre-computed rotation matrices for Qwen3-4B/8B/32B, GLM-4.7-FP8, and MiniMax-M2.7 are available in RotationZoo, so no recalibration is needed.
- OSCAR integrates directly into SGLang with full paged KV-cache and prefix cache compatibility, requiring no changes to the inference client.
The post Together AI Open-Sources OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Serving appeared first on MarkTechPost.
