
Qwen Team Releases FlashQLA: a High-Performance Linear Attention Kernel Library That Achieves Up to 3× Speedup on NVIDIA Hopper GPUs

The race to make large language models faster and cheaper to run has largely been fought at two levels: the model architecture and the hardware. But there is a third, often underappreciated frontier: the GPU kernel. A kernel is the low-level computational routine that actually executes a mathematical operation on the GPU. Writing a good one requires understanding not just the math, but the exact memory layout, instruction scheduling, and hardware quirks of the chip you are targeting. Most ML practitioners never write kernels directly; they rely on libraries like FlashAttention or Triton to do it for them.

Meet FlashQLA, QwenLM's contribution to this layer. Released under the MIT License and built on the TileLang compiler framework, it is a high-performance linear attention kernel library specifically optimized for the Gated Delta Network (GDN) attention mechanism, the linear attention architecture that powers the Qwen3.5 and Qwen3.6 model families.

What is Linear Attention and Why Does It Matter?

To understand what FlashQLA solves, it helps to understand what standard softmax attention costs. In a conventional Transformer, the attention mechanism has O(n²) complexity, meaning that doubling the sequence length quadruples the computation. This is the fundamental bottleneck that makes processing long documents, long code files, or long conversations expensive.
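To see where the quadratic term comes from, here is textbook single-head causal attention in PyTorch (a reference sketch for intuition, not any production kernel); the n × n score matrix is the O(n²) cost in both compute and memory:

```python
import torch

def softmax_attention(q, k, v):
    """Textbook single-head causal attention. q, k: (n, d); v: (n, d_v).
    Materializes an explicit (n, n) score matrix, so doubling n
    quadruples the work: the O(n^2) bottleneck."""
    scores = (q @ k.T) / q.shape[-1] ** 0.5                      # (n, n)
    # Causal mask: each token attends only to itself and earlier tokens.
    causal = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    weights = torch.softmax(scores.masked_fill(causal, float("-inf")), dim=-1)
    return weights @ v                                           # (n, d_v)
```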

Linear attention replaces the softmax with a formulation that reduces this to O(n) complexity, making it scale far more favorably with sequence length. The Gated Delta Network (GDN) is one such linear attention mechanism, and it has been integrated into Qwen's hybrid model architecture, where GDN layers alternate with standard full attention layers. This hybrid design aims to get the best of both worlds: the expressiveness of full attention where it is most needed, and the efficiency of linear attention everywhere else.

GDN uses what is known as a "gated" formulation: it applies an exponentially decaying gate to control how much past context is carried forward. This gate is key to how FlashQLA achieves its performance gains.
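For intuition, here is a minimal single-head sketch of one step of a gated delta-rule recurrence of the kind GDN builds on (following the published gated delta rule; the variable names and scalar per-token gates are illustrative assumptions, and real implementations process whole chunks with matrix multiplies rather than one token at a time):

```python
import torch

def gdn_step(S, q_t, k_t, v_t, alpha_t, beta_t):
    """One token of a gated delta-rule recurrence (reference sketch).

    S: (d_k, d_v) running state; q_t, k_t: (d_k,); v_t: (d_v,);
    alpha_t (gate) and beta_t (write strength) are scalars in (0, 1).
    """
    S = alpha_t * S                              # gate: exponentially decay old context
    S = S - beta_t * torch.outer(k_t, k_t @ S)   # delta rule: erase the stale value at key k_t
    S = S + beta_t * torch.outer(k_t, v_t)       # write the new key-value association
    o_t = q_t @ S                                # read out with the query: (d_v,)
    return o_t, S
```

Each token touches only the fixed-size (d_k × d_v) state, so total cost grows linearly with sequence length, and the `alpha_t * S` line is the exponential gate that FlashQLA later exploits.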

The Problem with Existing Kernels

Before FlashQLA, the standard implementation of GDN operations came from the Flash Linear Attention (FLA) library, which uses Triton kernels (Triton being OpenAI's Python-based GPU programming language). While Triton makes kernel authoring more accessible, it comes with trade-offs: the kernels it produces are not always optimally scheduled for specific hardware, particularly on NVIDIA's Hopper architecture (the H100 and H200 GPU generation).

The Hopper architecture introduced new features like warpgroup-level Tensor Core operations and asynchronous data pipelines that Triton cannot always exploit to their full potential. This is the gap FlashQLA is designed to fill.

What FlashQLA Does Differently

FlashQLA applies operator fusion and performance optimization to both the forward pass (used during inference and training) and the backward pass (used during training for gradient computation) of GDN Chunked Prefill. The result is a 2–3× speedup on forward passes and a 2× speedup on backward passes compared to the FLA Triton kernel across multiple scenarios on NVIDIA Hopper GPUs.
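"Chunked Prefill" here refers to processing the prompt in fixed-size chunks while carrying the recurrent state across chunk boundaries. Below is a naive forward-pass sketch reusing the `gdn_step` helper from earlier (real kernels replace the inner loop with fused Tensor Core matmuls over each chunk; the chunk size of 64 is an illustrative assumption):

```python
def gdn_chunked_prefill(q, k, v, alpha, beta, chunk=64):
    """Naive chunked prefill, forward only. q, k: (n, d_k); v: (n, d_v);
    alpha, beta: (n,). Only the small (d_k, d_v) state crosses chunk
    boundaries, which is what makes chunk-level processing possible."""
    n, d_k = k.shape
    S = torch.zeros(d_k, v.shape[1], dtype=q.dtype, device=q.device)
    out = torch.empty(n, v.shape[1], dtype=q.dtype, device=q.device)
    for start in range(0, n, chunk):             # one iteration per chunk
        for t in range(start, min(start + chunk, n)):
            out[t], S = gdn_step(S, q[t], k[t], v[t], alpha[t], beta[t])
    return out
```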

Three technical innovations drive these gains:

1. Gate-driven automatic intra-card context parallelism: Context parallelism (CP) refers to splitting a long sequence across multiple processing units so they can work on different parts simultaneously. FlashQLA exploits the exponential decay property of the GDN gate to make this split mathematically valid: because the gate decays exponentially, tokens far apart in a sequence have diminishing influence on each other. This allows FlashQLA to automatically enable intra-card CP under tensor parallelism (TP), long-sequence, and small-head-count settings, improving GPU Streaming Multiprocessor (SM) utilization without any manual configuration.

2. Hardware-friendly algebraic reformulation: FlashQLA reformulates parts of the mathematical computation of GDN Chunked Prefill's forward and backward passes to reduce overhead on three kinds of GPU hardware units: Tensor Cores (which handle matrix multiplications), CUDA Cores (which handle scalar and vector operations), and the Special Function Unit (SFU, which handles operations such as exponentials and square roots). Critically, this is achieved without sacrificing numerical precision, an important guarantee when the reformulated kernels are used for model training.

3. TileLang fused warp-specialized kernels: Rather than decomposing the computation into independent sequential kernels (too slow) or fusing everything into a single monolithic kernel (too rigid to optimize), FlashQLA takes a middle path. It uses TileLang to build several key fused kernels and manually implements warpgroup specialization, a technique that assigns different warpgroups (groups of 128 threads on Hopper) to specialized roles, such as one warpgroup moving data from global memory to shared memory while another concurrently runs Tensor Core matrix multiplications. This overlap of data movement, Tensor Core computation, and CUDA Core computation is what allows FlashQLA to approach the theoretical peak throughput of the hardware.

Benchmarks

FlashQLA was benchmarked against two baselines: the FLA Triton kernel (version 0.5.0, Triton 3.5.1) and FlashInfer (version 0.6.9), using TileLang 0.1.8, on NVIDIA H200 GPUs. The benchmarks used the head configurations from the Qwen3.5 and Qwen3.6 model families, with head dimensions hv ∈ {64, 48, 32, 24, 16, 8}, corresponding to tensor parallelism settings from TP1 through TP8.

The forward (FWD) benchmarks measure single-kernel latency for different models and TP settings under varying batch lengths. The backward (BWD) benchmarks examine the relationship between the total token count within a batch and the latency of a single update step.

https://qwen.ai/blog?id=flashqla

Key Takeaways

  • FlashQLA is a high-performance linear attention kernel library built by the Qwen team on TileLang, specifically optimized for the Gated Delta Network (GDN) Chunked Prefill forward and backward passes.
  • It achieves a 2–3× forward speedup and a 2× backward speedup over the FLA Triton kernel across multiple scenarios on NVIDIA Hopper GPUs (SM90+), with efficiency gains most pronounced in pretraining and edge-side agentic inference.
  • Three core innovations drive the performance gains: gate-driven automatic intra-card context parallelism, hardware-friendly algebraic reformulation that reduces Tensor Core, CUDA Core, and SFU overhead without losing numerical precision, and TileLang fused warp-specialized kernels that overlap data movement, Tensor Core computation, and CUDA Core computation.
  • GDN is a linear attention mechanism with O(n) complexity, used in Qwen's hybrid model architecture alongside standard full attention layers, making efficient GDN kernels critical for both training and long-context inference at scale.
  • FlashQLA is open-source under the MIT License and requires SM90 or above, CUDA 12.8+, and PyTorch 2.8+, with a simple pip install and both high-level and low-level Python APIs available for integration (a quick environment check is sketched below).
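Given those stated requirements, a quick pre-install environment check might look like the following. This uses only standard PyTorch calls and assumes nothing about FlashQLA's own API:

```python
import torch

# FlashQLA's stated requirements: SM90+ GPU, CUDA 12.8+, PyTorch 2.8+.
assert torch.cuda.is_available(), "No CUDA device visible"
major, minor = torch.cuda.get_device_capability()
assert (major, minor) >= (9, 0), f"Need SM90+ (Hopper), got SM{major}{minor}"
print("torch", torch.__version__, "| CUDA", torch.version.cuda,
      "| device", torch.cuda.get_device_name())
```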

Check out the GitHub Repo and Technical details.


