Moonshot AI Open-Sources FlashKDA: CUTLASS Kernels for Kimi Delta Attention with Variable-Length Batching and H20 Benchmarks
The team behind Kimi.ai (Moonshot AI) has just made a significant contribution to open-source AI infrastructure: the release of FlashKDA (Flash Kimi Delta Attention), a high-performance CUTLASS-based kernel implementation of the Kimi Delta Attention (KDA) mechanism. The FlashKDA library is available on GitHub under an MIT license. It delivers prefill speedups of 1.72× to 2.22× over the flash-linear-attention baseline on NVIDIA H20 GPUs and works as a drop-in backend for the popular flash-linear-attention library.
What Is Kimi Delta Attention, and Why Does It Matter?
To understand FlashKDA, it helps to first understand where it sits in the LLM attention landscape.
Standard softmax attention has quadratic complexity with respect to sequence length, meaning that as you feed longer context into a model, compute costs grow very quickly. This has driven a wave of research into linear attention mechanisms, which approximate or replace the softmax operation to achieve linear scaling. Kimi Delta Attention (KDA) is Moonshot AI's contribution to this space: a linear attention mechanism that refines Gated DeltaNet with a finer-grained, channel-wise gating mechanism, enabling more effective use of the limited finite-state RNN memory.
KDA is not just a research prototype. It is the core attention mechanism in Kimi Linear, Moonshot AI's open-source hybrid model with 48B total parameters and 3B activated parameters. Kimi Linear uses a 3:1 KDA-to-MLA (Multi-Head Latent Attention) ratio, i.e. three KDA layers for every global attention layer, which reduces KV cache usage by up to 75% during long-sequence generation while achieving up to 6× higher decoding throughput at a 1 million token context length compared to full attention. FlashKDA is the production-grade CUDA kernel that makes that architecture fast during prefill.
Concretely, the KDA forward pass takes in queries (q), keys (k), values (v), a gate before activation (g), and beta logits (beta), along with a scale factor, an output tensor (out), and gate parameters: A_log (log-gate parameter per head), dt_bias (gate bias), and lower_bound (gate lower bound, ranging from -5.0 to 0). The sigmoid activation on beta is applied internally by the kernel. The mechanism also supports optional initial and final recurrent states, which is useful for multi-turn inference where you want to carry state across requests.
The recurrent formulation means the model can efficiently process long sequences during generation. But efficient prefill for these architectures still requires highly optimized GPU kernels, which is exactly what FlashKDA delivers.
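To make the recurrent view concrete, here is a naive, token-by-token PyTorch sketch of a channel-wise gated delta-rule recurrence in the spirit of KDA. It is illustrative only: the real kernel derives the gate from g, A_log, dt_bias, and lower_bound and uses a far more efficient chunked algorithm, and the gate placement shown here is an assumption rather than the official formulation.

```python
import torch

def gated_delta_rule_reference(q, k, v, g, beta, scale):
    """Naive channel-wise gated delta-rule recurrence (illustrative, not the exact KDA math).

    q, k, v, g: [B, T, H, D]; beta: [B, T, H] (logits); scale: float.
    """
    q, k, v, g = (x.float() for x in (q, k, v, g))   # run the reference in fp32
    beta = beta.float()
    B, T, H, D = q.shape
    S = q.new_zeros(B, H, D, D)                      # per-head state, [key_dim, value_dim]
    out = torch.empty_like(v)
    for t in range(T):
        k_t, v_t = k[:, t], v[:, t]                  # [B, H, D]
        a_t = g[:, t].exp()                          # assumed: g holds per-channel log-decays <= 0
        b_t = torch.sigmoid(beta[:, t]).unsqueeze(-1)  # [B, H, 1]; sigmoid as in the kernel
        S = S * a_t.unsqueeze(-1)                    # decay the state per key channel
        v_pred = torch.einsum('bhkd,bhk->bhd', S, k_t)   # value currently stored under k_t
        S = S + torch.einsum('bhk,bhd->bhkd', k_t, b_t * (v_t - v_pred))  # delta-rule correction
        out[:, t] = torch.einsum('bhk,bhkd->bhd', q[:, t] * scale, S)
    return out

# Tiny smoke test on random data (CPU is fine for the reference).
B, T, H, D = 1, 16, 2, 32
o = gated_delta_rule_reference(torch.randn(B, T, H, D), torch.randn(B, T, H, D),
                               torch.randn(B, T, H, D), -torch.rand(B, T, H, D),
                               torch.randn(B, T, H), D ** -0.5)
```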
Under the Hood: CUTLASS on Hopper
FlashKDA is built on CUTLASS, NVIDIA's open-source library of CUDA C++ template abstractions for high-performance linear algebra and custom kernel development. CUTLASS lets developers write kernels that take full advantage of NVIDIA's Tensor Core architecture, and it is the same foundation used by libraries like FlashAttention-3.
The library targets SM90 and above, meaning NVIDIA's Hopper architecture (H100, H20) and newer. The minimum requirements are CUDA 12.9 and PyTorch 2.4. The codebase is predominantly CUDA (56.4%), with Python (36.2%) bindings and C++ (6.7%) glue code.
The core API is flash_kda.fwd, which takes the following inputs:
- q, k, v, g: all in bf16, with shape [B, T, H, K] or [B, T, H, V] (where g is the gate before activation)
- beta: bf16 beta logits with shape [B, T, H] (sigmoid applied internally)
- scale: fp32 scalar scaling factor
- out: bf16 output tensor with shape [B, T, H, V]
- A_log, dt_bias, lower_bound: fp32 gate parameters
- initial_state, final_state: optional bf16 or fp32 recurrent states
- cu_seqlens: optional int64 cumulative sequence lengths for variable-length batching
One current constraint: the kernel requires K = V = 128 for the head dimension.
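Putting the parameter list together, a fixed-length call might look like the sketch below. The module name flash_kda and the parameter names come from the API description above, but the shapes of the gate parameters and the keyword-style calling convention are assumptions; the repository's tests are the authoritative reference.

```python
import torch
import flash_kda  # assumes the installed package exposes flash_kda.fwd

B, T, H, K, V = 2, 8192, 96, 128, 128      # K = V = 128 is currently required
dev, dt = "cuda", torch.bfloat16

q    = torch.randn(B, T, H, K, device=dev, dtype=dt)
k    = torch.randn(B, T, H, K, device=dev, dtype=dt)
v    = torch.randn(B, T, H, V, device=dev, dtype=dt)
g    = torch.randn(B, T, H, K, device=dev, dtype=dt)   # gate before activation
beta = torch.randn(B, T, H,    device=dev, dtype=dt)   # logits; sigmoid is applied in-kernel
out  = torch.empty(B, T, H, V, device=dev, dtype=dt)

# fp32 gate parameters; the per-head shape [H] is an assumption.
A_log       = torch.randn(H, device=dev, dtype=torch.float32)
dt_bias     = torch.randn(H, device=dev, dtype=torch.float32)
lower_bound = torch.full((H,), -5.0, device=dev, dtype=torch.float32)  # in [-5.0, 0]

# Assumed keyword-style call; see tests/test_fwd.py in the repo for exact usage.
flash_kda.fwd(q=q, k=k, v=v, g=g, beta=beta, scale=K ** -0.5, out=out,
              A_log=A_log, dt_bias=dt_bias, lower_bound=lower_bound)
```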
The variable-length batching support via cu_seqlens is particularly notable for production use. In real inference serving, requests in a batch rarely share the same sequence length. Being able to pack multiple sequences of different lengths into a single kernel call is a key requirement for high-throughput serving systems.
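With variable-length batching, the requests are packed along the token dimension and cu_seqlens carries the cumulative offsets that mark where each sequence starts and ends. A small sketch of how those offsets are typically constructed (the packed [1, total_tokens, H, D] layout is an assumption, mirroring common varlen conventions):

```python
import torch

# Per-request lengths from the varlen benchmark case reported below; they sum to 8192.
seq_lens = [1300, 547, 2048, 963, 271, 3063]

lens = torch.tensor(seq_lens, dtype=torch.int64)
cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.int64), lens.cumsum(0)]).cuda()
# cu_seqlens == [0, 1300, 1847, 3895, 4858, 5129, 8192]

total_T = int(cu_seqlens[-1])
H, K = 96, 128

# Assumed packed layout: all sequences concatenated along the token axis.
q_packed = torch.randn(1, total_T, H, K, device="cuda", dtype=torch.bfloat16)

# Tokens of sequence i live at q_packed[:, cu_seqlens[i]:cu_seqlens[i + 1]].
```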
Benchmark Results: 1.72× to 2.22× on H20
The benchmark results (as of April 20, 2026) compare flash_kda against fla_chunk_kda (the existing flash-linear-attention implementation) at a sequence length of T=8192, a head dimension of D=128, and two head-count configurations: H=96 and H=64. Each benchmark ran with 30 warmup iterations, 200 measurement iterations, and 5 repeats.
For H=96:
| Case | flash_kda (ms) | fla_chunk_kda (ms) | Speedup |
|---|---|---|---|
| Fixed | 2.6219 | 4.5052 | 1.72× |
| Varlen, seq_lens=[1300, 547, 2048, 963, 271, 3063] | 2.3420 | 4.5717 | 1.95× |
| Varlen, seq_lens=1024 × 8 | 2.0100 | 4.4668 | 2.22× |
For H=64:
| Case | flash_kda (ms) | fla_chunk_kda (ms) | Speedup |
|---|---|---|---|
| Fixed | 1.6199 | 2.9587 | 1.83× |
| Varlen, seq_lens=[1300, 547, 2048, 963, 271, 3063] | 1.7027 | 3.0595 | 1.80× |
| Varlen, seq_lens=1024 × 8 | 1.3930 | 3.0412 | 2.18× |
The peak speedup of 2.22× appears in the uniform variable-length case (seq_lens=1024 × 8, eight sequences of length 1024 summing to T=8192). The fixed-length case sets the floor of the range at 1.72×. Across both head configurations and all three sequence scenarios, FlashKDA consistently outperforms the flash-linear-attention baseline by a significant margin.
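For context, latency numbers like these are typically collected with CUDA-event timing after a warmup phase. The repository's actual benchmark script may differ, but a minimal sketch of that methodology looks like this:

```python
import torch

def bench_ms(fn, warmup=30, iters=200):
    """Mean kernel latency in milliseconds using CUDA events (warmup excluded)."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Best of 5 repeats, mirroring the reported setup (flash_kda_call is a hypothetical callable):
# latency = min(bench_ms(flash_kda_call) for _ in range(5))
```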
Integration with flash-linear-attention
One of the most practical aspects of FlashKDA is its integration story. Once installed, FlashKDA is auto-dispatched from flash-linear-attention's chunk_kda, which means existing codebases using flash-linear-attention need no manual wiring to take advantage of the faster kernel. The integration is tracked in flash-linear-attention PR #852.
Installation is simple:
git clone https://github.com/MoonshotAI/FlashKDA.git flash-kda
cd flash-kda
git submodule update --init --recursive
pip install -v .
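After installation, a quick sanity check confirms the environment meets the stated requirements. This is a hedged sketch; it only assumes the module name flash_kda from the API described above.

```python
import torch
import flash_kda  # module name per the flash_kda.fwd API described earlier

print(torch.__version__, torch.version.cuda)   # expect PyTorch >= 2.4, CUDA >= 12.9
print(torch.cuda.get_device_capability())      # expect (9, 0) or higher, i.e. SM90+
print(hasattr(flash_kda, "fwd"))               # the forward-kernel entry point
```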
The correctness test suite (tests/test_fwd.py) runs exact-match verification against a PyTorch reference implementation and cross-validates against flash-linear-attention. This gives developers a reliable baseline for auditing kernel behavior before deploying to production.
Key Takeaways
- FlashKDA is Moonshot AI's open-source CUTLASS-based CUDA kernel for Kimi Delta Attention (KDA), delivering 1.72×–2.22× prefill speedups over the flash-linear-attention baseline on NVIDIA H20 GPUs.
- KDA extends Gated DeltaNet with fine-grained, channel-wise gating. It is the core attention mechanism behind Kimi Linear, a 48B-total / 3B-active-parameter hybrid model that reduces KV cache usage by up to 75% and achieves up to 6× higher decoding throughput at 1M context length.
- The kernel targets SM90+ hardware (NVIDIA Hopper: H100, H20 and newer), requires CUDA 12.9+ and PyTorch 2.4+, and currently supports a fixed head dimension of K = V = 128.
- Variable-length batching is natively supported via the cu_seqlens parameter, allowing multiple sequences of different lengths to be packed into a single kernel call, a critical feature for high-throughput inference serving.
- Once installed, FlashKDA is auto-dispatched from flash-linear-attention's chunk_kda, making it a drop-in performance upgrade for any existing codebase already using the flash-linear-attention library, with no architecture changes required.
Check out the GitHub Repo.
