Nous Research Proposes Lighthouse Attention: A Training-Only Selection-Based Hierarchical Attention That Delivers 1.4–1.7× Pretraining Speedup at Long Context
Training large language models on long sequences has a well-known problem: attention is expensive. The scaled dot-product attention (SDPA) at the core of every transformer scales quadratically, Θ(N²), in both compute and memory with sequence length N. FlashAttention addressed this with IO-aware tiling that avoids materializing the full N×N attention matrix in high-bandwidth memory, reducing the memory footprint considerably, but the underlying Θ(N²) compute scaling remains. Researchers at Nous Research have introduced a new method called Lighthouse Attention that addresses this bottleneck specifically at pretraining time, achieving a 1.40× to 1.69× end-to-end wall-clock speedup against a cuDNN-backed SDPA baseline, with matching or lower final training loss.
The core problem with existing sparse attention methods
To understand why Lighthouse works the way it does, it helps to know what existing sparse attention methods do. Most prior work (NSA, HISA, DSA, MoBA) makes the same two design choices. First, they pool only the key and value side while leaving queries at full resolution (asymmetric compression). Second, their selection logic lives inside a custom attention kernel, which means teams cannot reuse the optimized dense-attention kernels that modern GPU tensor cores are built around.
There is also a concern specific to training that inference-only sparse methods don't face. An inference-time sparse method is evaluated only against its dense backbone, and it is at best as good as that backbone. A training-time sparse method faces a harder test: once training is done, will the resulting weights still produce a trustworthy dense-attention model at inference? Lighthouse treats that question as its central correctness criterion.
Lighthouse takes a different approach on both design choices. It pools queries, keys, and values symmetrically across a multi-level pyramid, and it places selection entirely outside the attention kernel. After selection, the system gathers the selected entries into a contiguous, dense sub-sequence and runs stock FlashAttention on it, the same kernel used by the dense baseline.

How the four-stage pipeline works
A Lighthouse attention layer wraps around, but does not modify, scaled dot-product attention. The pipeline has four stages.
In the first stage, average pooling constructs an L-level pyramid from Q, K, and V. With pooling factor p, level ℓ of the pyramid has N/p^ℓ tokens, each summarizing p^ℓ base positions. Crucially, the same pooling applies to all three projections, producing coherent (Q^(ℓ), K^(ℓ), V^(ℓ)) triples at every level. Total pyramid construction costs Θ(N) time and memory.
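A minimal sketch of this pyramid-construction stage, assuming average pooling with factor p at every level (the function names, tensor layout, and divisibility assumption are ours, not the reference implementation):

```python
import torch

def avg_pool(x, p):
    # x: [batch, heads, n, dim] -> [batch, heads, n // p, dim]; each output
    # entry summarizes p consecutive inputs (n assumed divisible by p)
    b, h, n, d = x.shape
    return x.reshape(b, h, n // p, p, d).mean(dim=3)

def build_pyramid(q, k, v, num_levels=3, p=4):
    """Level 0 is the base sequence of N tokens; level ell has N / p**ell
    entries, each summarizing p**ell base positions. The same pooling is
    applied to Q, K, and V so the triples stay aligned."""
    pyramid = [(q, k, v)]
    for _ in range(1, num_levels):
        q, k, v = avg_pool(q, p), avg_pool(k, p), avg_pool(v, p)
        pyramid.append((q, k, v))
    return pyramid
```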
In the second stage, a parameter-free scorer assigns each pyramid entry two scalar scores using per-head ℓ₂ norms: one as a query score (∥Q^(ℓ)_i∥₂) and one as a key score (∥K^(ℓ)_i∥₂). Coarser levels inherit scores from finer ones via max-pooling, so a coarse span picks up the importance of its strongest token. A fused chunked-bitonic top-K kernel then selects k entries jointly across all pyramid levels. One design detail worth noting: the coarsest pyramid level is always retained in full, since it is cheap and guarantees at least one contributor at every base position; the remaining selection budget is spent on finer levels.

Additionally, the chunked-bitonic design produces a stratified top-K rather than a strict global top-K: the score stream is partitioned into fixed-size chunks, each maintaining an in-register top-m buffer, so if the k globally highest-scoring entries clustered in a single chunk, some would be replaced by lower-scoring entries from other chunks. The result is more balanced attention coverage across the sequence, avoiding selection collapse onto a narrow span.
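A toy sketch of this scoring and selection path, written in plain PyTorch for readability rather than as the fused chunked-bitonic kernel. It shows only the key-score stream (the query-score stream from Q norms is analogous), combines an entry's own norm with the inherited finer-level scores via max (an assumption), and applies a chunk-local top-k per level rather than the joint cross-level selection the paper describes:

```python
import torch

def key_scores(pyramid):
    """pyramid[ell] = (Q, K, V), each [batch, heads, n_ell, dim]. Returns one
    score tensor per level: the entry's own L2 norm, max-combined with the
    max-pooled scores inherited from the finer level below it."""
    scores = []
    for ell, (q, k, v) in enumerate(pyramid):
        s = k.norm(dim=-1)                                   # [batch, heads, n_ell]
        if ell > 0:
            p = scores[-1].shape[-1] // s.shape[-1]
            inherited = scores[-1].unflatten(-1, (s.shape[-1], p)).amax(dim=-1)
            s = torch.maximum(s, inherited)
        scores.append(s)
    return scores

def stratified_topk(scores, k, chunk_size=1024):
    """Chunk-local top-k over a flat score stream [batch, heads, n] (n assumed
    divisible by chunk_size): each chunk keeps its own best entries, so the
    selection cannot collapse onto a single high-scoring span."""
    num_chunks = scores.shape[-1] // chunk_size
    per_chunk = max(1, k // num_chunks)
    local = scores.unflatten(-1, (num_chunks, chunk_size)).topk(per_chunk, dim=-1).indices
    offsets = torch.arange(num_chunks, device=scores.device) * chunk_size
    return (local + offsets[:, None]).flatten(-2)            # global indices, [batch, heads, ~k]
```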
The top-K step is discrete and non-differentiable: no straight-through estimator, no Gumbel softmax. Selection indices carry no gradient. Gradients flow only through the gathered Q, K, V entries into W_Q, W_K, W_V, so the projections learn to produce values that are useful when selected rather than scores that are good at getting picked.
In the third stage, the selected entries are gathered into a contiguous sub-sequence of length S = N/p^(L−1) + (L−1)·p·k and passed to standard FlashAttention. At N = 1,000,000 with L = 4, p = 4, k = 4,096, S ≈ 65,000, far smaller than N. A critical property of the gathering process is that it guarantees no "holes" or empty regions in the assembled sub-sequence. This matters specifically because Lighthouse also compresses queries: a gap in the sequence would mean those missing tokens have no gradient path during the backward pass and could cause training instabilities. Asymmetric methods that leave queries at full resolution don't face this problem, but Lighthouse's symmetric design requires that the gathered sub-sequence stays fully dense.
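A quick check of that length formula for the example quoted above:

```python
# Gathered sub-sequence length S = N / p**(L-1) + (L-1) * p * k
# for N = 1M, L = 4, p = 4, k = 4096 (the figures from the text).
N, L, p, k = 1_000_000, 4, 4, 4096
S = N // p ** (L - 1) + (L - 1) * p * k
print(S)  # 64777, i.e. roughly 65K tokens attended densely instead of 1M
```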
In the fourth stage, each output entry is scattered back to the p^ℓ base positions it represents via a deterministic integer-atomic scatter kernel, with a shift of p^ℓ − 1 to preserve causality. The per-position fan-in is bounded by L regardless of k.
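An illustrative sketch of the scatter-back stage under one plausible reading of that rule: a selected level-ℓ entry covering p^ℓ base positions writes its output to that window shifted forward by p^ℓ − 1, so it never reaches a position earlier than the last token it summarizes. Whether contributions are summed or averaged is not specified in the text; the averaging here is an assumption, and the Python loop stands in for the integer-atomic kernel.

```python
import torch

def scatter_back(level_outputs, level_indices, seq_len, p):
    """level_outputs[ell]: [n_selected, dim] attention outputs for the selected
    level-ell entries; level_indices[ell]: their positions within level ell."""
    dim = level_outputs[0].shape[-1]
    out = torch.zeros(seq_len, dim)
    fan_in = torch.zeros(seq_len, 1)
    for ell, (outs, idx) in enumerate(zip(level_outputs, level_indices)):
        span = p ** ell
        for o, i in zip(outs, idx.tolist()):
            start = i * span + (span - 1)      # causal shift of p**ell - 1
            end = min(start + span, seq_len)
            if start < seq_len:
                out[start:end] += o
                fan_in[start:end] += 1
        # distinct indices per level -> at most one contribution per position
        # per level, so fan-in is bounded by the number of levels L
    return out / fan_in.clamp(min=1)           # normalization is our assumption
```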

Why symmetric pooling changes the compute
Pooling queries alongside keys and values changes the computational character of the attention call from O(N S d) to O(S² d) at training time. Because S ≪ N at long contexts, this is what produces the latency advantage. Benchmarked on a single NVIDIA B200 at 512K context (bfloat16, B=1, H=8, head dimension 128, L=3, p=4, sparsity ≈ 1:64), Lighthouse is 21× faster on the forward pass and 17.3× faster on the combined forward+backward pass relative to cuDNN-backed SDPA.
From an asymptotic standpoint, setting L = log_p(N/k) gives a gathered sub-sequence size of S = Θ(k log N), which makes the dense FlashAttention call cost Θ(k² log² N d), polylogarithmic in N at fixed k. Combined with the linear-cost stages (pyramid construction, scoring, scatter-back), total per-layer compute is Θ(N d) at bounded k, the same asymptotic class as linear attention and SSMs, while preserving softmax attention's recall properties on the selected sub-sequence.
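A back-of-the-envelope check of that scaling, plugging L = log_p(N/k) (rounded to the nearest integer, our simplification) into the S formula from the pipeline description above:

```python
import math

p, k = 4, 4096
for N in (262_144, 1_048_576, 4_194_304, 16_777_216):
    L = max(2, round(math.log(N / k, p)))
    S = N / p ** (L - 1) + (L - 1) * p * k
    # S stays within a small constant factor of k * log2(N) as N grows 64x
    print(f"N={N:>10,}  L={L}  S={S:,.0f}  S/(k*log2 N)={S / (k * math.log2(N)):.2f}")
```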
Inference is a different constraint. Autoregressive decoding presents one query at a time, which violates the assumption that all queries co-occur in a single forward pass. Lighthouse is a training-only method, and the symmetric pooling design cannot be used directly at inference.
The two-stage training recipe and recoverability
The experimental setup used a 530M-parameter Llama-3-style decoder (d_model=1024, 30 layers, 8 heads, head dimension 128, FFN width 1536, byte-level tokenizer), trained on C4 at 98,304-token context with AdamW at learning rate 2×10⁻³, β1=0.9, β2=0.95, weight decay 0.1, linear warmup over 2k steps, gradient-norm clip 1, bfloat16, and FSDP. One implementation detail that matters for practitioners: of the 30 layers, layers {0, 1, 28, 29} retain dense SDPA throughout; only the other 26 layers use Lighthouse. The inner attention call inside those 26 Lighthouse layers uses the same cuDNN-backed SDPA kernel as the dense baseline.
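For reference, the same setup collected into one place (a plain dict sketch; the field names are ours, the values come from the description above):

```python
train_config = {
    "model": {
        "architecture": "Llama-3-style decoder",
        "params": "530M",
        "d_model": 1024,
        "layers": 30,
        "heads": 8,
        "head_dim": 128,
        "ffn_width": 1536,
        "tokenizer": "byte-level",
        "dense_sdpa_layers": [0, 1, 28, 29],  # kept dense; the other 26 layers use Lighthouse
    },
    "data": {"dataset": "C4", "context_length": 98_304},
    "optimizer": {
        "name": "AdamW",
        "lr": 2e-3,
        "betas": (0.9, 0.95),
        "weight_decay": 0.1,
        "warmup_steps": 2_000,      # linear warmup
        "grad_clip_norm": 1.0,
    },
    "precision": "bfloat16",
    "parallelism": "FSDP",
}
```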
The training approach is two-stage. Stage 1 trains with Lighthouse selection enabled for the majority of the step budget. Stage 2 resumes the Stage 1 checkpoint under dense SDPA (same optimizer state, same dataloader) for a short tail. If Stage 1 had hollowed out the model's dense-attention capability, Stage 2 recovery would fail.
It does not fail. Testing at a total budget of 16,000 steps (~50.3B tokens), three split points (10k+6k, 11k+5k, 12k+4k) were evaluated against a dense-from-scratch SDPA baseline. At each resume point the training loss spikes transiently by 1.12–1.57 nats as the model is first run through attention it was not trained against, then recovers within roughly 1,000–1,500 SDPA steps and crosses below the dense baseline. By step 16,000, all three resumed Lighthouse runs reach final losses of 0.6980–0.7102, against the dense baseline's 0.7237, while spending 22.5h to 27.0h wall-clock compared to 37.9h for dense-SDPA-from-scratch on the same token budget.
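A minimal sketch of that two-stage schedule, assuming a model whose attention layers expose a flag for toggling Lighthouse selection (`attention_layers` and `use_lighthouse` are hypothetical names, and the loop omits warmup, clipping, and checkpointing):

```python
def two_stage_schedule(model, optimizer, dataloader, total_steps=16_000, split=12_000):
    for step, batch in enumerate(dataloader):
        # Stage 1 (steps < split): Lighthouse selection on for most of the budget.
        # Stage 2 (steps >= split): resume under dense SDPA with the same
        # optimizer state and dataloader, converting the checkpoint back into
        # an ordinary full-attention model.
        for layer in model.attention_layers:
            layer.use_lighthouse = step < split
        loss = model(batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if step + 1 == total_steps:
            break
```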
Ablations and throughput
The full ablation grid covers scorer type, pooling factor p, number of pyramid levels L, and top-K budget k. Key findings: the projection-norm scorer is within ~0.01 of the dilated softmax-attention scorer in either direction (no uniform winner) but is roughly 9% cheaper in B200-hours, since it skips the attention pass over the pyramid entirely. Shallower pyramids (L=3) consistently outperform deeper ones (L=4, L=5) at matched budgets. Smaller k values produce lower post-resume loss across the tested range; the lowest-loss configuration in the grid is L=3, p=2, k=1536 with the dilated scorer, reaching a final loss of 0.6825, a counter-intuitive result the researchers attribute to hierarchical selection acting as a regularizer at this token-budget scale.
Stage-1 throughput across the ablation grid ranges from 84,000 to 126,000 tokens/s/GPU against roughly 46,000 for dense SDPA. The projection-norm scorer at L=3, p=4, k=1536 tops the range at 126,000 tokens/s/GPU by skipping the dilated-attention pass entirely.
Long-context retrieval
To complement the loss-based recoverability results, the research team ran a simplified Needle-in-a-Haystack (NIAH) evaluation: a single passkey digit hidden in random alphanumeric filler at depths of 0–100% across context lengths of 4K to 96K tokens, with retrieval scored as a one-token argmax over the ten digit tokens (random chance: 10%). Four Lighthouse configurations (varying k ∈ {1536, 2048} and scorer ∈ {dilated, norm} at L=3, p=4) were tested against the dense-SDPA-from-scratch baseline. Three of the four Lighthouse runs match or beat the dense baseline's mean retrieval rate of 0.72: k=2048 dilated reaches 0.76, k=1536 dilated reaches 0.73, and k=2048 norm matches the baseline at 0.72. Only k=1536 norm dips, to 0.65. A pattern emerges across the grid: larger k is the dominant axis for retrieval performance, and the norm scorer hurts retrieval more than it hurts training loss at the same k. The practical implication is that the optimal configuration depends on whether the downstream task is loss-driven or retrieval-driven.
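A sketch of that one-token scoring rule, assuming an HF-style causal LM interface (`digit_token_ids`, the model call, and the ordering of the digit ids are assumptions, not the paper's evaluation code):

```python
import torch

def niah_score(model, prompt_ids, answer_digit, digit_token_ids):
    """prompt_ids: [1, seq_len] tokenized haystack ending where the passkey
    should be produced; digit_token_ids: token ids for '0'..'9' in order."""
    with torch.no_grad():
        logits = model(prompt_ids).logits[0, -1]   # next-token logits
    digit_logits = logits[digit_token_ids]         # restrict argmax to the ten digits
    predicted = int(torch.argmax(digit_logits).item())
    return int(predicted == answer_digit)          # 1 if retrieved, else 0 (chance = 10%)
```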
Context parallelism scaling
For sequences beyond ~100K tokens, Lighthouse runs under context parallelism (CP). Pyramid pooling, scoring, and top-K run shard-locally on each rank with no inter-rank communication, since the coarsest pool window (e.g., 64 tokens) is orders of magnitude smaller than the shard size. The gathered sub-sequence is dense, so it participates in standard ring attention without sparse-aware collectives, something sparse-index-based methods cannot do without engineering specific to the sparse format. Context parallelism introduces roughly 10% per-rank throughput overhead from ring rotation, but the Lighthouse vs. SDPA speedup ratio is preserved. The method scales to 1M-token training across 32 Blackwell GPUs (4 nodes, CP degree 8) with no changes to the inner attention kernel.
Key Takeaways
- Nous Research's Lighthouse Attention pools Q, K, and V symmetrically across a multi-level pyramid (unlike NSA and HISA, which pool only K and V), cutting the attention call from O(N S d) to O(S² d) and making the expensive step stock FlashAttention on a small dense sub-sequence.
- It is a training-only method: a short dense-SDPA resumption at the end converts the checkpoint into an ordinary full-attention model that matches or beats dense-from-scratch at the same token budget (final loss 0.6980–0.7102 vs. 0.7237 baseline, 16k steps, ~50.3B tokens).
- At 512K context on a single B200, Lighthouse is 21× faster on the forward pass and 17.3× faster on forward+backward vs. cuDNN SDPA, translating to a 1.40×–1.69× end-to-end pretraining wall-clock speedup.
- The top-K selection step is deliberately non-differentiable (no straight-through estimator, no Gumbel softmax), so the projection matrices learn to produce values that are useful when selected, not to game a learnable scorer.
- Scales to 1M-token training across 32 Blackwell GPUs (4 nodes, CP degree 8) under context parallelism with no changes to the inner attention kernel, because the gathered sub-sequence is dense and participates in standard ring attention.
Check out the Paper (arXiv:2605.06554), GitHub repo, and technical details.
