Together AI Open-Sources OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Serving
Long-context inference makes the KV cache one of the main costs of serving LLMs. During autoregressive decoding, the cache grows with context length, batch size, and model depth. At high batch sizes and long contexts, with 100K tokens across dozens of concurrent requests, the KV cache consumes a large fraction of GPU memory. Compressing it is…
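
To make the scaling concrete, here is a rough back-of-the-envelope sketch of how KV cache size grows with context length, batch size, and depth. The model shape (32 layers, 8 KV heads, head dimension 128) and the batch of 32 requests are illustrative assumptions, not figures from the article, and the 2-bit estimate ignores per-group scales, zero-points, and any other metadata a real quantization scheme would store.

```python
def kv_cache_bytes(tokens, batch, layers, kv_heads, head_dim, bits):
    """Approximate KV cache size in bytes.

    Counts one key and one value vector per token, per layer, per KV head.
    Ignores quantization metadata (scales, zero-points) and allocator overhead.
    """
    elements = 2 * layers * kv_heads * head_dim * tokens * batch
    return elements * bits // 8


# Hypothetical serving scenario: 100K-token contexts, 32 concurrent requests,
# and an assumed model shape chosen only for illustration.
cfg = dict(tokens=100_000, batch=32, layers=32, kv_heads=8, head_dim=128)

fp16_bytes = kv_cache_bytes(bits=16, **cfg)
int2_bytes = kv_cache_bytes(bits=2, **cfg)

GIB = 1024 ** 3
print(f"16-bit KV cache: {fp16_bytes / GIB:.1f} GiB")
print(f" 2-bit KV cache: {int2_bytes / GIB:.1f} GiB")
print(f"reduction: {fp16_bytes // int2_bytes}x")
```

Under these assumptions the 16-bit cache is far larger than the memory of a single GPU, while an idealized 2-bit representation is 8x smaller. The real saving depends on the metadata overhead of the chosen scheme.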
