Together AI Open-Sources OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Serving

Long-context inference makes the KV cache one of the main costs of serving LLMs. During autoregressive decoding, the cache grows with context length, batch size, and model depth. At high batch sizes and long contexts, for example 100K tokens across dozens of concurrent requests, the KV cache consumes a large fraction of GPU memory. Compressing it is a direct way to increase batch size and reduce memory traffic.

The obvious approach is quantization. But pushing KV caches to INT2 (2-bit) precision has been largely impractical. Prior methods either collapse in accuracy or require custom serving layouts incompatible with paged KV-cache systems. Together AI’s OSCAR (Offline Spectral Covariance-Aware Rotation) addresses both problems.

Why INT2 KV Cache Quantization is Hard

KV activations contain channel-wise outliers. A small subset of channels holds extremely large values, while most channels are well-behaved. When you apply INT2 quantization, which has only four representable levels, those outliers dominate the scale factor. The quantizer wastes most of its range on rare spikes, and normal values get compressed into only one or two effective levels. This degrades attention quality significantly.
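A minimal sketch of this failure mode, assuming a plain min-max asymmetric quantizer. The helper `quantize_int2` is illustrative only and is not OSCAR's kernel. One outlier stretches the scale so far that every ordinary value lands on the same level:

```python
def quantize_int2(values):
    """Asymmetric min-max quantization to 4 levels (codes 0..3).

    Illustrative helper, not OSCAR's kernel.
    """
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 3 or 1.0  # 3 steps between the 4 levels
    codes = [min(3, max(0, round((v - lo) / scale))) for v in values]
    dequant = [lo + c * scale for c in codes]
    return codes, dequant


# Seven well-behaved channels plus one outlier channel.
group = [0.1, -0.2, 0.3, 0.0, -0.1, 0.2, -0.3, 12.0]
codes, dequant = quantize_int2(group)
print(codes)  # [0, 0, 0, 0, 0, 0, 0, 3]
```

Dropping the outlier from `group` lets the same quantizer use several of its four levels for the remaining values, which is the behavior the rotations described later try to recover.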

Rotation-based quantization addresses this by applying a fixed orthogonal transform, typically a Hadamard transform, to redistribute outlier energy across all channels. This approach works reasonably well at INT4. At INT2, a deeper problem remains: the rotation is data-oblivious. It can smooth activation ranges, but it does not know which directions the attention mechanism actually reads. Spreading quantization error uniformly is not the same as pushing it into low-importance directions. At INT2, with only four levels, that distinction determines whether the model works at all.
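To make the data-oblivious rotation concrete, here is a small sketch of an orthonormal fast Walsh-Hadamard transform applied to a vector with a single outlier channel. The function name `hadamard_rotate` is an illustrative choice; production kernels are fused and vectorized.

```python
import math


def hadamard_rotate(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y = list(x)
    n = len(y)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b  # butterfly step
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in y]


x = [0.0] * 8
x[5] = 8.0  # a single outlier channel
rotated = hadamard_rotate(x)
print([round(abs(v), 3) for v in rotated])  # every entry is 2.828 (= 8 / sqrt(8))
```

The outlier's energy is spread evenly and the total energy is unchanged, but nothing in the transform depends on the queries or attention scores, which is the limitation the paragraph above describes.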

What OSCAR Does Differently

OSCAR’s key observation is that the rotation applied before quantization should be derived from the attention statistics themselves, not from the raw distribution of KV activations.

For keys, the downstream error that matters is not the Euclidean reconstruction error of K. It is the error in the attention logits. The research team showed this error is: ‖QK^⊤ − QK̂^⊤‖_F² = tr((K − K̂) Q^⊤Q (K − K̂)^⊤). The weighting matrix is the query covariance Q^⊤Q, not K^⊤K. Directions where queries have large energy amplify quantization errors in the logits. OSCAR estimates the empirical query covariance CQ = (1/N) Σ_n q_n^⊤ q_n from a calibration set, eigen-decomposes it, and uses the eigenvectors UQ as the key rotation basis.
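The identity follows from writing the logit error in terms of E = K − K̂. A short derivation is sketched below; the last line additionally assumes the eigendecomposition of the query covariance, written here as C_Q = U_Q Λ U_Q^⊤ with eigenvalues λ_i and eigenvectors u_i (notation introduced for this sketch).

```latex
\begin{aligned}
E &= K - \hat{K}, \qquad QK^\top - Q\hat{K}^\top = QE^\top, \\
\|QE^\top\|_F^2 &= \operatorname{tr}\!\big((QE^\top)^\top (QE^\top)\big)
                 = \operatorname{tr}\!\big(E\,Q^\top Q\,E^\top\big), \\
\tfrac{1}{N}\,\|QE^\top\|_F^2 &= \operatorname{tr}\!\big(E\,C_Q\,E^\top\big)
                 = \sum_{i=1}^{d} \lambda_i \,\|E u_i\|_2^2 .
\end{aligned}
```

Read this way, the quantization error along each eigenvector u_i is penalized in proportion to λ_i, which is why the eigenvectors of the query covariance are a natural basis in which to control key error.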

For values, the relevant error is in the attention output SV. This depends on how the attention score matrix S weights each value row. The research team defines the score-weighted value covariance CS = (1/N) V^⊤S^⊤SV. Directions that remain large after aggregation by S are the ones through which quantization error propagates. OSCAR uses the eigenvectors US of CS as the value rotation basis.
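The two covariance estimates can be sketched in a few lines of NumPy. This is a toy illustration on random data, assuming NumPy is installed; the function names and the synthetic Q, S, V are assumptions for the example and do not reflect OSCAR's calibration code.

```python
import numpy as np


def key_rotation_basis(Q):
    """Eigenvectors of the empirical query covariance C_Q = Q^T Q / N.

    Q has shape (N, d): one calibration query per row.
    Returns eigenvalues (descending) and matching eigenvectors as columns.
    """
    C_Q = Q.T @ Q / Q.shape[0]
    eigvals, eigvecs = np.linalg.eigh(C_Q)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]


def value_rotation_basis(S, V):
    """Eigenvectors of the score-weighted value covariance C_S = V^T S^T S V / N.

    S has shape (N, T) (attention scores), V has shape (T, d).
    """
    SV = S @ V
    C_S = SV.T @ SV / S.shape[0]
    eigvals, eigvecs = np.linalg.eigh(C_S)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]


rng = np.random.default_rng(0)
N, T, d = 256, 32, 8
# Synthetic queries with one dominant direction (channel 0).
Q = rng.normal(size=(N, d)) * np.array([5, 1, 1, 1, 1, 1, 1, 1])
logits = rng.normal(size=(N, T))
S = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # row-wise softmax
V = rng.normal(size=(T, d))

lam_q, U_Q = key_rotation_basis(Q)
lam_s, U_S = value_rotation_basis(S, V)
print(np.round(lam_q, 2))
```

In this toy setup the leading column of `U_Q` lines up with the dominant query channel, so key error along that direction is the error that matters most for the logits.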

The final composed rotations are:

RK = UQ · HHad · Pbr
RV = US · HHad · Pbr

Each of the three components addresses a distinct failure mode of per-group low-bit quantization:

UQ / US aligns channels with attention-importance directions. This diagonalizes the error-weighting matrix so the most important directions are identifiable.
HHad (Walsh-Hadamard transform) then equalizes channel importance exactly. Lemma 1 in the research paper proves that every diagonal entry of HHad^⊤ Λ HHad equals tr(Λ)/d, so the peaky eigenspectrum exposed by UQ is flattened to a uniform value across all channels.
Pbr (permuted bit-reversal) reorders channels so that for any power-of-two quantization group size, each group receives one representative from each level of the importance hierarchy.
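Two of these properties are easy to check numerically. The sketch below verifies the Lemma 1 statement for a small orthonormal Hadamard matrix and shows how a plain bit-reversal ordering interleaves channels across power-of-two groups. Plain bit-reversal is used here as a stand-in: the paper's exact "permuted bit-reversal" construction may differ, and all function names are illustrative.

```python
import math


def hadamard_matrix(d):
    """Orthonormal Sylvester-Hadamard matrix; d must be a power of two."""
    H = [[1.0]]
    while len(H) < d:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    s = 1.0 / math.sqrt(d)
    return [[v * s for v in row] for row in H]


def rotated_diagonal(H, eigenvalues):
    """Diagonal of H^T diag(eigenvalues) H."""
    d = len(H)
    return [sum(eigenvalues[k] * H[k][j] ** 2 for k in range(d)) for j in range(d)]


def bit_reversal_order(d):
    """Return perm where perm[i] is i with its log2(d) bits reversed."""
    bits = d.bit_length() - 1
    return [int(format(i, f"0{bits}b")[::-1], 2) for i in range(d)]


spectrum = [40.0, 10.0, 4.0, 2.0, 1.0, 1.0, 1.0, 1.0]  # peaky eigenspectrum, trace 60
diag = rotated_diagonal(hadamard_matrix(8), spectrum)
print([round(v, 6) for v in diag])  # each entry is tr(Λ)/d = 7.5
order = bit_reversal_order(8)
print(order)                        # [0, 4, 2, 6, 1, 5, 3, 7]
```

Cutting `order` into consecutive groups of 2 or 4 gives groups such as [0, 4] or [0, 4, 2, 6], each drawing one index from every equal-sized slice of the original channel range, which is the flavor of balance the Pbr step is described as providing.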

The research team provides Theorem 1, which proves that UQ and US are optimal under a frozen-error surrogate objective with diagonal residual assumptions.

The Serving System: Mixed-Precision Cache Layout

OSCAR integrates into SGLang’s production serving stack as an INT2 KV-cache mode with full compatibility with paged attention.

The KV cache layout uses three regions per request:

Sink tokens (first S0 = 64 tokens): stored in BF16. These function as attention sinks.
Recent tokens (last W = 256 tokens before the current position): stored in BF16.
History tokens (everything in between): stored as INT2 after OSCAR rotation and clipping.

At 128K context length, the BF16 sink and recent windows represent only 0.24% of total tokens. The ablation (Table 5 in the research paper) shows that (S=64, R=256) is the accuracy-efficiency knee: smaller windows noticeably hurt accuracy, while larger windows give negligible additional benefit at higher BF16 memory cost.
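The three-region layout and the 0.24% figure can be reproduced with a short sketch. The helper names `cache_region` and `bf16_fraction` are illustrative, and 128K is taken as 128 × 1024 tokens, which is an assumption that matches the stated percentage.

```python
def cache_region(pos, cur_len, sink=64, recent=256):
    """Classify token position `pos` (0-based) in a sequence of `cur_len` tokens."""
    if pos < sink:
        return "sink"      # kept in BF16
    if pos >= cur_len - recent:
        return "recent"    # kept in BF16
    return "history"       # stored as INT2 after rotation and clipping


def bf16_fraction(cur_len, sink=64, recent=256):
    """Fraction of cached tokens kept in BF16."""
    return min(cur_len, sink + recent) / cur_len


print(cache_region(10, 1000), cache_region(500, 1000), cache_region(990, 1000))
print(f"{bf16_fraction(128 * 1024):.2%}")  # 0.24%
```

Because the BF16 windows have a fixed size, their share shrinks as the context grows, so the cache approaches pure INT2 storage at long contexts.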

Write and read paths use fused Triton kernels. On the write path, each token is rotated, clipped to a calibration-derived percentile threshold (typical values: cK = 0.96, cV = 0.92), then quantized with per-token asymmetric INT2 at a default group size of GK = 64 channels per group. On the read path, the INT2 kernel unpacks bytes, dequantizes, inverse-rotates, and passes results to the attention kernel, all in one fused pass without extra memory traffic. The value rotation RV is absorbed into the model’s projection weights offline, eliminating its online compute cost.
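A plain-Python sketch of the clip, quantize, pack, and unpack steps for one 64-channel group is shown below. It omits the rotation and is not the fused Triton kernel; the `clip` argument is a hypothetical absolute threshold standing in for whatever value the calibration-derived percentile produces, and the function names are illustrative.

```python
def quantize_group(values, clip):
    """Clip to [-clip, clip], then asymmetric 2-bit quantization of one group.

    Returns (packed_bytes, scale, zero_point). `clip` is a stand-in for the
    calibration-derived threshold; how OSCAR derives it is not modeled here.
    """
    clipped = [max(-clip, min(clip, v)) for v in values]
    lo, hi = min(clipped), max(clipped)
    scale = (hi - lo) / 3 or 1.0
    codes = [min(3, max(0, round((v - lo) / scale))) for v in clipped]
    packed = bytearray()
    for i in range(0, len(codes), 4):          # four 2-bit codes per byte
        byte = 0
        for j, c in enumerate(codes[i:i + 4]):
            byte |= c << (2 * j)
        packed.append(byte)
    return bytes(packed), scale, lo


def dequantize_group(packed, scale, zero_point, n):
    """Unpack n 2-bit codes and map them back to floats."""
    codes = [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(n)]
    return [zero_point + c * scale for c in codes]


group = [(-1) ** i * (i % 7) * 0.1 for i in range(64)]  # 64 channels, one group
group[10] = 9.0                                          # residual outlier
packed, scale, zp = quantize_group(group, clip=0.6)
restored = dequantize_group(packed, scale, zp, len(group))
print(len(packed))  # 16 bytes for 64 two-bit codes
```

Clipping before computing the scale keeps the residual outlier from stretching the four levels; the clipped channel is reconstructed at the threshold while the remaining channels keep a finer step size.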

Results

The research team evaluated OSCAR on four model configurations: Qwen3-4B-Thinking-2507, Qwen3-8B, Qwen3-32B, and GLM-4.7-FP8 (358B parameters). Benchmarks include AIME25, GPQA-Diamond, HumanEval, LiveCodeBench v6, and MATH500, all at 32K maximum generation length.

Accuracy (at 2.28 bits per KV element):

Model	BF16 Mean	OSCAR Mean	Gap to BF16
Qwen3-4B-Thinking-2507	75.64	71.86	−3.78
Qwen3-8B	70.84	69.42	−1.42
Qwen3-32B	74.19	74.17	−0.02
GLM-4.7-FP8 (358B)	77.89	78.16	+0.27

For context on how competing methods compare: naive INT2 (no rotation) scores 0.00 on both Qwen3-4B and Qwen3-8B. QuaRot-INT2 (Hadamard-only rotation) scores 1.40 on Qwen3-4B and 10.14 on Qwen3-8B. TurboQuant at 3.25 bits drops 43.90 points on Qwen3-4B-Thinking. Saw-INT4 at 4.25 bits reaches 73.11 on Qwen3-4B, while OSCAR at 2.28 bits reaches 71.86.

The research team also compared against channel-wise methods on AIME25 (Table 1). On Qwen3-8B, OSCAR at 2.38 BPE achieves 66.67±3.33, above KIVI-KV2* at 57.67 (2.26 BPE) and Kitty at 59.67 (2.39 BPE). Note that channel-wise methods require residual buffers or custom page layouts that do not fit standard paged-attention serving, so this comparison is limited to the single shared benchmark where results were available.

Long-context robustness (RULER-NIAH):

Model	Method	16K	32K	64K	128K
Qwen3-4B-Thinking	BF16	99.7	99.3	85.3	81.0
Qwen3-4B-Thinking	QuaRot-INT2	0.0	0.0	15.6	0.0
Qwen3-4B-Thinking	OSCAR	97.8	87.6	61.9	39.5
Qwen3-8B	BF16	98.9	97.3	79.2	78.2
Qwen3-8B	QuaRot-INT2	19.0	9.8	0.0	0.0
Qwen3-8B	OSCAR	93.9	86.3	61.9	45.0

On GLM-4.7-FP8, OSCAR matches the BF16 curve through 128K.
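For the two Qwen3 models, it can help to read the RULER-NIAH table as OSCAR's score relative to the BF16 baseline. The snippet below does that arithmetic on the numbers from the table; the `retention` helper is just a convenience for this example.

```python
ruler = {  # RULER-NIAH scores copied from the table above
    "Qwen3-4B-Thinking": {"BF16": [99.7, 99.3, 85.3, 81.0],
                          "OSCAR": [97.8, 87.6, 61.9, 39.5]},
    "Qwen3-8B": {"BF16": [98.9, 97.3, 79.2, 78.2],
                 "OSCAR": [93.9, 86.3, 61.9, 45.0]},
}
contexts = ["16K", "32K", "64K", "128K"]


def retention(model):
    """OSCAR score as a percentage of the BF16 score at each context length."""
    scores = ruler[model]
    return {c: round(100 * o / b, 1)
            for c, o, b in zip(contexts, scores["OSCAR"], scores["BF16"])}


for name in ruler:
    print(name, retention(name))
```

On these numbers, OSCAR keeps about 98% and 95% of the BF16 score at 16K but roughly 49% and 58% at 128K, so the gap to BF16 on the smaller models widens noticeably with context length.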

Throughput (H100, 100K context, batch size 1):

Decode throughput speedup relative to BF16, at increasing context lengths:

Model	30K	60K	100K
Qwen3-4B-Thinking	1.98×	2.52×	3.08×
Qwen3-8B	1.84×	2.29×	2.88×
GLM-4.7-FP8	1.98×	2.49×	2.83×

At batch size 32, job-level throughput at 100K context reaches 6.17× over BF16 on Qwen3-4B-Thinking and 7.83× on GLM-4.7-FP8. The speedup increases with context length because decoding becomes increasingly KV-bandwidth-bound. Reducing KV memory by roughly 8× directly reduces that bottleneck. The online rotation overhead is absorbed into the decode kernels.
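A back-of-the-envelope estimate shows where the memory saving comes from. The model shape below (36 layers, 8 KV heads, head dimension 128) is an assumed Qwen3-8B-like configuration used only for illustration, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits_per_element):
    """Bytes needed for keys and values of `tokens` cached positions."""
    elements = 2 * layers * kv_heads * head_dim * tokens  # 2 = keys + values
    return elements * bits_per_element / 8


# Assumed Qwen3-8B-like shape; check the model's config before relying on it.
shape = dict(layers=36, kv_heads=8, head_dim=128)
bf16 = kv_cache_bytes(100_000, bits_per_element=16, **shape)
oscar = kv_cache_bytes(100_000, bits_per_element=2.28, **shape)
print(f"BF16: {bf16 / 1e9:.2f} GB, 2.28-bit: {oscar / 1e9:.2f} GB, "
      f"ratio {bf16 / oscar:.2f}x")
# BF16: 14.75 GB, 2.28-bit: 2.10 GB, ratio 7.02x
```

The nominal 16-bit to 2-bit ratio is 8×; at the reported 2.28 effective bits per element the ratio works out closer to 7×, which is why the reduction is best read as approximate.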

Marktechpost’s Visual Explainer

OSCAR: How-To Guide

Overview

What is OSCAR?

OSCAR (Offline Spectral Covariance-Aware Rotation) is a 2-bit KV cache quantization system from Together AI for long-context LLM serving.

Instead of applying a generic Hadamard rotation, OSCAR derives attention-aware rotations from a one-time offline calibration pass, aligning quantization noise with the directions that attention is least sensitive to.

The result: INT2 precision with near-BF16 accuracy and full compatibility with paged KV-cache serving.

8× KV memory reduction
3× decode speedup
2.28 bits per KV element

Setup

Prerequisites

Before getting started, make sure you have the following in place:

01 Hardware: NVIDIA H100 GPU (80 GB) recommended. A100 may work for smaller models.
02 SGLang installed: OSCAR is integrated into the SGLang serving framework. Install the latest version from source.
03 Triton: Custom fused kernels are written in Triton. Triton ships with most recent PyTorch / SGLang installs.
04 A supported model: Qwen3-4B, Qwen3-8B, Qwen3-32B, GLM-4.7-FP8, or MiniMax-M2.7. Pre-computed rotations are available for all of these.

pip install "sglang[all]" --upgrade
pip install triton

Step 1

Download Pre-Computed Rotations via RotationZoo

Together AI publishes pre-computed rotation matrices and clip thresholds for supported models in RotationZoo on ModelScope. No recalibration is needed.

from modelscope import snapshot_download

# Download RotationZoo for your model
rotation_path = snapshot_download(
    'togethercomputer/OSCAR-RotationZoo'
)

The downloaded artifact contains per-layer RK, RV rotation matrices and clip thresholds cK, cV for each supported model. These are fixed offline parameters; they are not updated at runtime.
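The on-disk format of the artifact is not described here, so a cautious first step is simply to list what was downloaded. The helper below is a generic directory walk (the name `list_artifacts` is illustrative) and makes no assumption about file names or formats.

```python
import os


def list_artifacts(root):
    """Return (relative_path, size_in_bytes) for every file under `root`."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.append((os.path.relpath(full, root), os.path.getsize(full)))
    return sorted(found)


# Usage, with `rotation_path` returned by snapshot_download above:
# for rel_path, size in list_artifacts(rotation_path):
#     print(f"{size:>12}  {rel_path}")
```

Inspecting the listing before launching the server makes it easier to confirm that the rotations for your specific model are present.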

Qwen3-4B / 8B / 32B	2.28 BPE
GLM-4.7-FP8 (358B)	2.28 BPE
MiniMax-M2.7	2.28 BPE
Custom (run calibration)	any model

Step 2 (Optional)

Run Offline Calibration for a Custom Model

If your model is not in RotationZoo, run the one-time calibration pass. OSCAR dumps Q, K, V activations from a small dataset, estimates the attention-aware covariances, and writes out rotation matrices and clip thresholds.

python calibrate_oscar.py \
  --model-path /path/to/your-model \
  --calib-data gpqa_diamond \
  --calib-tokens 8192 \
  --output-dir ./oscar_rotations/

Calibration is not task-specific. The paper shows that results are relatively insensitive to the calibration domain (MMLU, WikiText, and GPQA-Diamond all produce comparable accuracy). Run it once and reuse the output across all tasks.

Typical values produced: cK ≈ 0.96, cV ≈ 0.92 per layer.

Step 3

Launch SGLang with INT2 KV Cache Enabled

Pass the rotation path and enable INT2 KV mode when launching the SGLang server.

python -m sglang.launch_server \
  --model-path Qwen/Qwen3-8B \
  --kv-cache-dtype int2 \
  --oscar-rotation-path ./oscar_rotations/ \
  --oscar-sink-size 64 \
  --oscar-recent-size 256 \
  --tp 1 \
  --port 30000

Tensor parallelism is supported. For Qwen3-32B use --tp 2 (2×H100). For GLM-4.7-FP8 use --tp 8 (8×H100).

The server exposes a standard OpenAI-compatible API. No client-side changes are needed.

Step 4

Key Configuration Parameters

Parameter	Default	What it controls
--oscar-sink-size	64	First N tokens stored in BF16 as attention sinks
--oscar-recent-size	256	Last N tokens stored in BF16 before the current position
cK (clip ratio)	0.96	Percentile clip for rotated key activations
cV (clip ratio)	0.92	Percentile clip for rotated value activations
Group size GK	64	Channels per INT2 quantization group (head dim)

The paper identifies (sink=64, recent=256) as the accuracy-efficiency knee. Smaller windows reduce accuracy noticeably; larger windows add BF16 memory overhead with negligible gain.

Step 5

Run Inference and Verify

Once the server is running, query it with the standard OpenAI client:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="none"
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user",
               "content": "Your long-context prompt here"}],
    max_tokens=1024
)
print(response.choices[0].message.content)

Prefix caching works out of the box. OSCAR preserves the standard paged KV-cache abstraction, so SGLang’s radix cache and prefix reuse function normally. No application-level changes are needed.
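One way to benefit from prefix reuse is to keep the long shared document at the start of every request and vary only the trailing question. The sketch below shows that request shape; `build_messages` and `ask` are hypothetical helpers, and whether a given request actually hits the cache depends on the server's state.

```python
def build_messages(shared_document, question):
    """Put the long shared document first so repeated requests share a prefix."""
    return [
        {"role": "system", "content": shared_document},
        {"role": "user", "content": question},
    ]


def ask(client, model, shared_document, question, max_tokens=256):
    """Send one question about the shared document via an OpenAI-compatible client."""
    response = client.chat.completions.create(
        model=model,
        messages=build_messages(shared_document, question),
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content


# Usage against the server started in Step 3 (not executed here):
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:30000/v1", api_key="none")
# for q in ["Summarize section 2.", "List the open questions."]:
#     print(ask(client, "Qwen/Qwen3-8B", long_document, q))
```

Keeping the variable part of the prompt at the end means the tokens of the shared document form an identical prefix across requests, which is the pattern prefix caching is designed for.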

Results

Accuracy vs BF16 Baseline

Averaged across AIME25, GPQA-Diamond, HumanEval, LiveCodeBench v6, and MATH500 at 32K generation length.

Model	Gap to BF16
Qwen3-4B-Thinking	−3.78
Qwen3-8B	−1.42
Qwen3-32B	−0.02
GLM-4.7-FP8 (358B)	+0.27

Paper: arXiv:2605.17757
RotationZoo: modelscope.cn/models/togethercomputer/OSCAR-RotationZoo

Key Takeaways

OSCAR quantizes LLM KV caches to 2-bit precision by rotating activations using attention-aware covariance matrices, not generic Hadamard transforms.
At 2.28 bits per KV element, OSCAR stays within 3.78 points of BF16 accuracy on Qwen3-4B-Thinking while naive INT2 collapses to zero.
KV cache memory drops roughly 8×, decode speed improves up to 3× at 100K context, and job-level throughput reaches up to 7.83× at large batch sizes.
Pre-computed rotation matrices for Qwen3-4B/8B/32B, GLM-4.7-FP8, and MiniMax-M2.7 are available in RotationZoo, so no recalibration is needed.
OSCAR integrates directly into SGLang with full paged KV-cache and prefix cache compatibility, requiring no changes to the inference client.

The post Together AI Open-Sources OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Serving appeared first on MarkTechPost.
