
Nous Research Proposes Lighthouse Attention: A Training-Only Selection-Based Hierarchical Attention That Delivers 1.4–1.7× Pretraining Speedup at Long Context

Training large language models on long sequences has a well-known problem: attention is expensive. The scaled dot-product attention (SDPA) at the core of every transformer scales quadratically, Θ(N²), in both compute and memory with sequence length N. FlashAttention addressed this with IO-aware tiling that avoids materializing the full N×N attention matrix in high-bandwidth memory, cutting the memory footprint substantially, but the underlying Θ(N²) compute scaling remains. Researchers at Nous Research have introduced a new method called Lighthouse Attention that attacks this bottleneck specifically at pretraining time, achieving a 1.40× to 1.69× end-to-end wall-clock speedup against a cuDNN-backed SDPA baseline, with matching or lower final training loss.

The core problem with existing sparse attention methods

To understand why Lighthouse works the way it does, it helps to know what existing sparse attention methods do. Most prior work (NSA, HISA, DSA, MoBA) makes the same two design decisions. First, they pool only the key and value side while leaving queries at full resolution (asymmetric compression). Second, their selection logic lives inside a custom attention kernel, which means teams cannot reuse the optimized dense-attention kernels that modern GPU tensor cores are built around.

There is also a concern specific to training that inference-only sparse methods do not face. An inference-time sparse method is evaluated only against its dense backbone and is at best as good as that backbone. A training-time sparse method faces a harder test: once training is finished, will the resulting weights still produce a competent dense-attention model at inference? Lighthouse treats that question as its central correctness criterion.

Lighthouse takes a different approach on both design decisions. It pools queries, keys, and values symmetrically across a multi-level pyramid, and it places selection entirely outside the attention kernel. After selection, the system gathers the selected entries into a contiguous, dense sub-sequence and runs stock FlashAttention on it, the same kernel used by the dense baseline.

https://arxiv.org/pdf/2605.06554

How the four-stage pipeline works

A Lighthouse attention layer wraps around, but does not modify, scaled dot-product attention. The pipeline has four stages.

In the first stage, average pooling constructs an L-level pyramid from Q, K, and V. With pooling factor p, level ℓ of the pyramid has N/p^ℓ tokens, each summarizing p^ℓ base positions. Crucially, the same pooling applies to all three projections, producing coherent (Q^(ℓ), K^(ℓ), V^(ℓ)) triples at every level. Total pyramid construction costs Θ(N) time and memory.
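As a rough illustration, a minimal PyTorch sketch of the pyramid-construction stage could look like the following. The function name and the (batch×heads, N, d) tensor layout are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn.functional as F

def build_pyramid(q, k, v, num_levels: int, p: int):
    """Average-pool Q, K, V symmetrically; level l has N / p**l entries,
    each summarizing p**l base positions. Tensors are (batch*heads, N, d)."""
    pyramid = [(q, k, v)]
    for _ in range(1, num_levels):
        # avg_pool1d expects (batch, channels, length), so pool over the
        # sequence dimension and transpose back.
        q, k, v = [
            F.avg_pool1d(t.transpose(1, 2), kernel_size=p).transpose(1, 2)
            for t in (q, k, v)
        ]
        pyramid.append((q, k, v))
    return pyramid

# Example: N = 4096, p = 4, L = 3 -> levels with 4096, 1024, 256 entries.
q, k, v = (torch.randn(2, 4096, 128) for _ in range(3))
levels = build_pyramid(q, k, v, num_levels=3, p=4)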

In the second stage, a parameter-free scorer assigns each pyramid entry two scalar scores using per-head ℓ₂ norms: one as a query score (∥Q^(ℓ)_i∥₂) and one as a key score (∥K^(ℓ)_i∥₂). Coarser levels inherit scores from finer ones via max-pooling, so a coarse span picks up the importance of its strongest token. A fused chunked-bitonic top-K kernel then selects k entries jointly across all pyramid levels. One design detail worth noting: the coarsest pyramid level is always retained in full; it is cheap and guarantees at least one contributor at every base position, and the remaining selection budget is spent on finer levels. Additionally, the chunked-bitonic design produces a stratified top-K rather than a strict global top-K: the score stream is partitioned into fixed-size chunks, each maintaining an in-register top-m buffer, so if the k globally highest-scoring entries clustered in a single chunk, some would be replaced by lower-scoring entries from other chunks. The result is more balanced attention coverage across the sequence, avoiding selection collapse onto a narrow span.
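A simplified sketch of the scoring and stratified selection follows, under clearly stated assumptions: the fused chunked-bitonic kernel works in registers, the max-pool inheritance across levels is omitted, and combining query and key scores with a max is our simplification of the two-score design.

import torch

def select_entries(pyramid, k_budget: int, chunk: int = 1024):
    """Score each pyramid entry with per-head l2 norms and take a chunked
    (stratified) top-k, so no single region monopolizes the budget."""
    scores = []
    for q_l, k_l, _ in pyramid[:-1]:       # coarsest level is kept in full
        # Two scalar scores per entry (query-side and key-side norms);
        # combining them with max is an illustrative simplification.
        scores.append(torch.maximum(q_l.norm(dim=-1), k_l.norm(dim=-1)))
    flat = torch.cat(scores, dim=1)         # (batch, total_fine_entries)

    n_chunks = max(1, flat.shape[1] // chunk)
    per_chunk = max(1, k_budget // n_chunks)
    picked = []
    for c in range(n_chunks):
        lo = c * chunk
        hi = flat.shape[1] if c == n_chunks - 1 else lo + chunk
        m = min(per_chunk, hi - lo)
        picked.append(flat[:, lo:hi].topk(m, dim=1).indices + lo)
    return torch.cat(picked, dim=1)         # flat indices into fine levels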

The top-K step is discrete and non-differentiable: no straight-through estimator, no Gumbel softmax. Selection indices carry no gradient. Gradients flow only through the gathered Q, K, V entries into W_Q, W_K, W_V, so the projections learn to produce values that are useful when selected rather than scores that are good at selecting.
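That gradient-flow property is easy to verify in isolation with a toy snippet (illustrative, not from the paper's code): the top-k indices carry no gradient, while the gathered entries do.

import torch

w_k = torch.randn(64, 16, requires_grad=True)   # stand-in for projected keys
scores = w_k.detach().norm(dim=-1)              # parameter-free l2 scorer
idx = scores.topk(8).indices                    # discrete, no gradient path
gathered = w_k[idx]                             # gradients flow through here
gathered.sum().backward()
# Only the 8 selected rows receive gradient; the scorer itself got none.
print((w_k.grad.abs().sum(dim=-1) > 0).sum())   # -> tensor(8)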

In the third stage, the selected entries are gathered into a contiguous sub-sequence of length S = N/p^(L−1) + (L−1)·p·k and passed to standard FlashAttention. At N = 1,000,000 with L = 4, p = 4, k = 4,096, S ≈ 65,000, far smaller than N. A critical property of the gather is that it guarantees no "holes" or empty regions in the assembled sub-sequence. This matters specifically because Lighthouse also compresses queries: a gap in the sequence would mean those missing tokens have no gradient path during the backward pass and could cause training instabilities. Asymmetric methods that leave queries at full resolution do not face this problem, but Lighthouse's symmetric design requires that the gathered sub-sequence stays fully dense.
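Continuing the sketch, the gather and dense attention call could look like the code below. torch.nn.functional.scaled_dot_product_attention stands in for the stock FlashAttention kernel; the indexing is simplified (the real gather also appends the full coarsest level, and indices are assumed sorted so the causal mask stays meaningful).

import torch
import torch.nn.functional as F

def gather_and_attend(pyramid, selected):
    """selected: (batch, S) flat indices into the concatenated fine levels."""
    fine = pyramid[:-1]
    q_all = torch.cat([lvl[0] for lvl in fine], dim=1)
    k_all = torch.cat([lvl[1] for lvl in fine], dim=1)
    v_all = torch.cat([lvl[2] for lvl in fine], dim=1)

    # Contiguous, hole-free sub-sequence: every gathered query position
    # keeps a gradient path in the backward pass.
    idx = selected.unsqueeze(-1).expand(-1, -1, q_all.shape[-1])
    q_s, k_s, v_s = (torch.gather(t, 1, idx) for t in (q_all, k_all, v_all))

    # Stock dense attention on the small sub-sequence (cost O(S^2 d)).
    out = F.scaled_dot_product_attention(
        q_s.unsqueeze(1), k_s.unsqueeze(1), v_s.unsqueeze(1), is_causal=True
    )
    return out.squeeze(1)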

In the fourth stage, each output entry is scattered back to the p^ℓ base positions it represents via a deterministic integer-atomic scatter kernel, with a shift of p^ℓ − 1 to preserve causality. The per-position fan-in is bounded by L regardless of k.
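A scatter-back sketch under strong simplifying assumptions: fan-in averaging is our choice of combine rule, a Python loop stands in for the deterministic integer-atomic kernel, and the exact placement of the p^ℓ − 1 causal shift is noted in a comment rather than reproduced, since the paper's indexing is not spelled out here.

import torch

def scatter_back(out_s, positions, spans, seq_len):
    """out_s: (batch, S, d); positions[j] / spans[j]: base start index and
    width (p**l) covered by selected entry j. The real kernel additionally
    applies a causal shift of p**l - 1 before writing."""
    batch, _, d = out_s.shape
    y = out_s.new_zeros(batch, seq_len, d)
    fan_in = out_s.new_zeros(batch, seq_len, 1)
    for j, (pos, span) in enumerate(zip(positions, spans)):
        cols = torch.arange(pos, min(pos + span, seq_len))
        y[:, cols] += out_s[:, j : j + 1]        # broadcast over the span
        fan_in[:, cols] += 1.0
    return y / fan_in.clamp_min(1.0)             # fan-in bounded by L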

https://arxiv.org/pdf/2605.06554

Why symmetric pooling changes the compute

Pooling queries alongside keys and values changes the computational character of the attention call from O(N·S·d) to O(S²·d) at training time. Because S ≪ N at long contexts, this is what produces the latency advantage. Benchmarked on a single NVIDIA B200 at 512K context (bfloat16, B=1, H=8, head dimension 128, L=3, p=4, sparsity ≈ 1:64), Lighthouse is 21× faster on the forward pass and 17.3× faster on the combined forward+backward pass relative to cuDNN-backed SDPA.

From an asymptotic standpoint, setting L = log_p(N/k) gives a gathered sub-sequence size of S = Θ(k log N), which makes the dense FlashAttention call cost Θ(k² log² N · d), polylogarithmic in N at fixed k. Combined with the linear-cost stages (pyramid construction, scoring, scatter-back), total per-layer compute is Θ(T·d) at bounded k, the same asymptotic class as linear attention and SSMs, while preserving softmax attention's recall properties on the selected sub-sequence.
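In equation form, the sub-sequence length and the resulting attention cost described above are:

S = \frac{N}{p^{L-1}} + (L-1)\,p\,k,
\qquad
L = \log_p\!\left(\frac{N}{k}\right) \;\Rightarrow\; S = \Theta(k \log N),
\qquad
\text{attention cost} = \Theta(S^2 d) = \Theta\!\left(k^2 \log^2 N \cdot d\right).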

Inference is a different constraint. Autoregressive decoding presents one query at a time, which violates the assumption that all queries co-occur in a single forward pass. Lighthouse is a training-only method, and the symmetric pooling design cannot be used directly at inference.

The two-stage training recipe and recoverability

The experimental setup used a 530M-parameter Llama-3-style decoder (d_model=1024, 30 layers, 8 heads, head dimension 128, FFN width 1536, byte-level tokenizer), trained on C4 at 98,304-token context with AdamW at learning rate 2×10⁻³, β1=0.9, β2=0.95, weight decay 0.1, linear warmup over 2k steps, gradient-norm clip 1, bfloat16, and FSDP. One implementation detail that matters for practitioners: of the 30 layers, layers {0, 1, 28, 29} retain dense SDPA throughout; only the other 26 layers use Lighthouse. The inner attention call inside those 26 Lighthouse layers uses the same cuDNN-backed SDPA kernel as the dense baseline.
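For reference, the hyperparameters above collected as a single Python dict (field names are ours; values are taken from the text):

train_cfg = dict(
    model="530M Llama-3-style decoder",
    d_model=1024, layers=30, heads=8, head_dim=128, ffn_width=1536,
    tokenizer="byte-level", dataset="C4", context_len=98_304,
    optimizer="AdamW", lr=2e-3, betas=(0.9, 0.95), weight_decay=0.1,
    warmup_steps=2_000, grad_clip=1.0, dtype="bfloat16", parallelism="FSDP",
    dense_sdpa_layers=(0, 1, 28, 29),   # these keep dense SDPA throughout
)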

The training approach is two-stage. Stage 1 trains with Lighthouse selection enabled for the majority of the step budget. Stage 2 resumes the Stage 1 checkpoint under dense SDPA (same optimizer state, same dataloader) for a short tail. If Stage 1 had hollowed out the model's dense-attention capability, Stage 2 recovery would fail.

It does not fail. Tested at a total budget of 16,000 steps (~50.3B tokens), three split points (10k+6k, 11k+5k, 12k+4k) were evaluated against a dense-from-scratch SDPA baseline. At each resume point the training loss spikes transiently by 1.12–1.57 nats as the model is first run through attention it was not trained against, then recovers within roughly 1,000–1,500 SDPA steps and crosses below the dense baseline. By step 16,000, all three resumed Lighthouse runs reach final losses of 0.6980–0.7102, against the dense baseline's 0.7237, while spending 22.5h to 27.0h wall-clock compared with 37.9h for dense-SDPA-from-scratch at the same token budget.

Ablations and throughput

The full ablation grid covers scorer type, pooling factor p, number of pyramid levels L, and top-K budget k. Key findings: the projection-norm scorer is within ~0.01 of the dilated softmax-attention scorer in either direction (no uniform winner) but is roughly 9% cheaper in B200-hours, since it skips the attention pass over the pyramid entirely. Shallower pyramids (L=3) consistently outperform deeper ones (L=4, L=5) at matched budgets. Smaller k values produce lower post-resume loss across the tested range; the lowest-loss configuration in the grid is L=3, p=2, k=1536 with the dilated scorer, reaching a final loss of 0.6825, a counter-intuitive result the research team attributes to hierarchical selection acting as a regularizer at this token-budget scale.

Stage-1 throughput across the ablation grid ranges from 84,000 to 126,000 tokens/s/GPU against roughly 46,000 for dense SDPA. The projection-norm scorer at L=3, p=4, k=1536 tops the range at 126,000 tokens/s/GPU by skipping the dilated-attention pass entirely.

Long-context retrieval

To complement the loss-based recoverability results, the research team ran a simplified Needle-in-a-Haystack (NIAH) evaluation: a single passkey digit hidden in random alphanumeric filler at depths of 0–100% across context lengths of 4K to 96K tokens, with retrieval scored as a one-token argmax over the ten digit tokens (random chance: 10%). Four Lighthouse configurations (varying k ∈ {1536, 2048} and scorer ∈ {dilated, norm} at L=3, p=4) were tested against the dense-SDPA-from-scratch baseline. Three of the four Lighthouse runs match or beat the dense baseline's mean retrieval rate of 0.72: k=2048 dilated reaches 0.76, k=1536 dilated reaches 0.73, and k=2048 norm matches the baseline at 0.72. Only k=1536 norm dips, to 0.65. A pattern emerges across the grid: larger k is the dominant axis for retrieval performance, and the norm scorer hurts retrieval more than it hurts training loss at the same k. The practical implication is that the optimal configuration depends on whether the downstream task is loss-driven or retrieval-driven.

Context parallelism scaling

For sequences beyond ~100K tokens, Lighthouse runs under context parallelism (CP). Pyramid pooling, scoring, and top-K run shard-locally on each rank with no inter-rank communication, since the coarsest pool window (e.g., 64 tokens) is orders of magnitude smaller than the shard size. The gathered sub-sequence is dense, so it participates in standard ring attention without sparse-aware collectives, something sparse-index-based methods cannot do without engineering specific to the sparse format. Context parallelism introduces roughly 10% per-rank throughput overhead from ring rotation, but the Lighthouse vs. SDPA speedup ratio is preserved. The method scales to 1M-token training across 32 Blackwell GPUs (4 nodes, CP degree 8) with no changes to the inner attention kernel.

Marktechpost’s Visual Explainer

Lighthouse Attention

Nous Research  —  arXiv:2605.06554

TRAINING-ONLY

01  / The Problem

Why Long-Context Training Is Expensive

Every transformer uses scaled dot-product attention (SDPA), which computes a score between every token and every other token in the sequence. As sequence length N grows, this cost scales as Θ(N²) in both compute and memory; it doubles for every ~1.4× increase in context.

FlashAttention reduced this by using IO-aware tiling that avoids ever materializing the full N×N attention matrix in high-bandwidth memory, cutting the memory footprint substantially. But the underlying Θ(N²) compute scaling is unchanged; the wall is still there.

Θ(N²)
SDPA compute & memory scaling
1M
token context frontier models target
32
B200 GPUs needed for 1M-token training

The result: teams either train at shorter contexts than they want, or spend enormous compute budgets on attention alone. Lighthouse Attention is a method that wraps around standard SDPA during pretraining to reduce this cost, then gets removed so the final model is a normal dense-attention model at inference.

02  / Prior Work

What Existing Sparse Attention Gets Wrong

Several methods already try to reduce the attention cost by attending to only a subset of tokens. But most share two design decisions that create problems for pretraining.

⚠ Problem 1: Asymmetry

Methods like NSA, HISA, and InfLLM-v2 pool only keys and values but leave queries at full resolution. The hierarchy becomes a compressed memory rather than a true multi-scale representation. It also means the dense attention call stays O(N·S·d) instead of shrinking further.

⚠ Problem 2: Kernel Entanglement

Methods like NSA, DSA, HISA, and MoBA embed selection logic inside a custom attention kernel. This means they cannot reuse the optimized FlashAttention kernels that GPU tensor cores are built around. Every sparse method ships its own forward and backward kernels.

The hardest problem: An inference-only sparse method is at best as good as its dense backbone. A training-time sparse method must answer a harder question: once training is finished, will the resulting weights still work as a competent dense-attention model at inference? Most methods do not test this.

Lighthouse Attention treats this recoverability question as its central correctness criterion.

03  / The Method

Lighthouse Attention: Core Idea

Lighthouse is a selection-based hierarchical attention that wraps around, but does not modify, the attention kernel. It adds a pre-processing step that selects a small subset of tokens, runs stock FlashAttention on just that subset, and scatters the output back. At the end of training, you disable Lighthouse and keep the dense model.

Two key design differences from prior work:
✓  Queries, keys, and values are all pooled symmetrically (not just keys/values)
✓  Selection sits outside the attention kernel; FlashAttention runs on a normal dense sub-sequence
21×
faster forward pass vs SDPA at 512K context
17.3×
faster forward+backward at 512K context
1.69×
end-to-end pretraining wall-clock speedup

The method introduces no new learnable parameters and no auxiliary losses. The scoring function is parameter-free, and the top-K selection step is deliberately non-differentiable: no straight-through estimator or Gumbel softmax.

04  / Architecture

The Four-Stage Pipeline

A Lighthouse attention layer replaces the standard SDPA call with four stages. Stages 1 and 4 are custom kernels; stages 2 and 3 are standard PyTorch operations fused by torch.compile.

1
Pyramid Pool

Average-pool Q, K, and V symmetrically into an L-level pyramid with pooling factor p. Level ℓ has N/p^ℓ tokens, each summarizing p^ℓ base positions. Total cost: Θ(N). Crucially, the coarsest level is always retained in full to guarantee at least one contributor per base position.

2
Score + Top-K Selection

Each pyramid entry gets two scalar scores using its per-head ℓ₂ norm: one as a query score, one as a key score. A fused chunked-bitonic top-K kernel selects k entries jointly across all pyramid levels. This step is non-differentiable; indices carry no gradient.

3
Dense Gather + FlashAttention

Selected (Q, K, V) triples are gathered into a contiguous sub-sequence of length S = N/p^(L−1) + (L−1)·p·k, then passed to stock FlashAttention. No custom sparse kernel. The gathered sequence has no holes, which is critical because queries are also compressed.

4
Scatter-Back

Each output entry is scattered back to the p^ℓ base positions it represents via an integer-atomic scatter kernel. The output is fully dense. Per-position fan-in is bounded by L regardless of k.

05  / Key Design Choice

Why Symmetric Q/K/V Pooling Matters

Most prior hierarchical methods pool only K and V while leaving Q at full resolution. Lighthouse pools all three. This is not cosmetic; it changes the math of the attention call.

Method | Query side | Attention cost
NSA, HISA, InfLLM-v2 | Full resolution (N) | O(N · S · d)
Lighthouse | Pooled (S) | O(S² · d)

Because S ≪ N at long contexts, O(S²·d) is dramatically cheaper than O(N·S·d). At N = 1,000,000 with L=4, p=4, k=4096, S ≈ 65,000.

The no-holes guarantee: Compressing queries means every query position must have a gradient path. Lighthouse guarantees no gaps in the gathered sub-sequence, which prevents the training instabilities that could arise from tokens with missing gradients. Asymmetric methods that leave Q at full resolution do not face this problem.

At bounded k, setting L = log_p(N/k) gives total per-layer compute of Θ(T·d), the same asymptotic class as linear attention and SSMs, but with softmax attention's recall properties on the selected sub-sequence.

06  / Gradient Flow

Non-Differentiable Selection, Differentiable Training

The top-K step is discrete. Lighthouse deliberately does not approximate it with a straight-through estimator or Gumbel softmax. This is a conscious design choice.

What does NOT get gradients

The selection indices and the scoring function. The ℓ₂-norm scorer is never trained; it has no parameters and receives no gradient signal.

What DOES get gradients

Gradients flow through scatter-back → FlashAttention → gather into the gathered Q̃, K̃, Ṽ and on into W_Q, W_K, W_V.

The result: the projection matrices learn to produce values that are useful when selected, not scores that are good at selecting. This avoids the optimization problems (scorer collapse, scorer–attention misalignment, auxiliary-loss tuning) that learnable selectors in NSA and DSA are prone to.

Complexity comparison across attention families (per-layer compute at bounded k):

Dense softmax: Θ(T² · d)
Log-linear attention: Θ(T log T · d)
Lighthouse (bounded k): Θ(T · d)
Linear attention / SSMs: Θ(T · d)

07  / Training Recipe

Two-Stage Training and Recoverability

The central claim of Lighthouse is that sparse training does not break the model's ability to use dense attention at inference. The two-stage recipe is how that claim is validated.

1
Stage 1 — Lighthouse pretraining

Train for the majority of the step budget with Lighthouse selection active. This is the fast stage: ~2× higher throughput than dense SDPA.

2
Stage 2 — Dense SDPA resumption

Resume the Stage 1 checkpoint under standard dense SDPA with the same optimizer state and dataloader. The loss spikes transiently by 1.12–1.57 nats, then recovers within ~1,000–1,500 SDPA steps and crosses below the dense baseline.

Tested at 16,000 total steps (~50.3B tokens) on a 530M Llama-3-style model (d_model=1024, 30 layers, H=8, head dim 128, FFN 1536, byte-level tokenizer, C4 dataset, 98,304-token context) across three split points:

Split | B200-hrs | Tok/s (k) | Final Loss
Dense SDPA baseline | 303.2 | 45.6 | 0.7237
LH 12k + SDPA 4k | 214.7 | 74.7 | 0.7102
LH 11k + SDPA 5k | 219.6 | 75.4 | 0.7001
LH 10k + SDPA 6k | 228.0 | 75.0 | 0.6980

All three Lighthouse runs beat the dense baseline at matched token budgets.

08  / Implementation Detail

Not All Layers Use Lighthouse

An important detail for practitioners: in the 30-layer experimental model, layers {0, 1, 28, 29} retain dense SDPA throughout. Only the remaining 26 layers use Lighthouse. The inner attention call inside those Lighthouse layers uses the same cuDNN-backed SDPA kernel as the dense baseline.

This means Lighthouse is a partial replacement, not a full model-wide substitution. Keeping the first and last layers dense is a practical stabilization choice; those boundary layers often carry disproportionate importance for model behavior.

Optimizer setup: AdamW, lr 2×10⁻³, β₁=0.9, β₂=0.95, weight decay 0.1, linear warmup over 2k steps, gradient-norm clip 1, bfloat16, FSDP only.

Chunked-bitonic top-K: The kernel produces a stratified top-K, not a strict global top-K. The score stream is partitioned into fixed-size chunks; each chunk maintains an in-register buffer. If the globally highest-scoring entries clustered in a single chunk, some are replaced by lower-scoring entries from other chunks, ensuring every region of the sequence contributes tokens and preventing attention from collapsing onto a narrow span.

S = N / p^(L-1) + (L-1) * p * k

# Example: N=1M, L=4, p=4, k=4096
# S = 1,000,000/64 + 3*4*4096
# S = 15,625 + 49,152 ≈ 65,000  (vs 1,000,000 for full attention)

09  / Ablations

What the Hyperparameter Sweep Shows

The full ablation grid varied scorer type, pooling factor p, pyramid levels L, and top-K budget k. All configurations used the 10k+6k split at 98K context.

Config | Scorer | B200-hrs | Tok/s (k) | Final Loss
SDPA baseline | (dense) | 303.2 | 45.6 | 0.7237
L=3, p=2, k=1536 | Dilated | 203.9 | 93.9 | 0.6825
L=3, p=4, k=1536 | Dilated | 197.2 | 99.5 | 0.6881
L=3, p=4, k=1536 | Norm | 179.6 | 126.0 | 0.6946
L=3, p=2, k=4096 | Dilated | 215.7 | 83.5 | 0.6951

Key findings from the sweep:

Smaller k → better loss (counter-intuitive)
Shallower L=3 beats L=4, L=5
Norm scorer: 9% cheaper, similar quality
Every config beats the dense baseline

The counter-intuitive finding on k: loss decreases monotonically as k shrinks from 4,096 to 1,536. The authors attribute this to hierarchical selection acting as a regularizer at the 50.3B-token budget. Whether this reverses at larger budgets is left to future work.

10  / Retrieval Evaluation

Needle-in-a-Haystack Results

Beyond training loss, the paper evaluates long-context retrieval using a simplified Needle-in-a-Haystack (NIAH) test: a single passkey digit hidden in random alphanumeric filler at depths of 0–100% across context lengths of 4K–96K tokens. Retrieval is scored as a one-token argmax over the ten digit tokens. Random chance is 10%.

Configuration | Mean Retrieval Rate | vs Baseline
Dense SDPA baseline | 0.72 |
k=2048, Dilated scorer | 0.76 | +0.04
k=1536, Dilated scorer | 0.73 | +0.01
k=2048, Norm scorer | 0.72 | Matches
k=1536, Norm scorer | 0.65 | −0.07
Three of the four Lighthouse configurations match or beat the dense-from-scratch baseline on retrieval. The norm scorer hurts retrieval more than it hurts training loss at the same k. The practical implication: if your downstream task is retrieval-heavy, use a larger k and the dilated scorer. If optimizing for loss and throughput, the norm scorer with k=1536 is the better trade-off.

11  / Scaling

Context Parallelism at 1M Tokens

For sequences beyond ~100K tokens, the 530M model OOMs on a single B200 regardless of attention method (activations + gradients + optimizer state). Lighthouse extends cleanly to multi-GPU context parallelism (CP).

1
Shard-local pre-attention

Each rank holds a contiguous slice of the sequence. Pyramid pooling, scoring, and top-K all run shard-locally. The coarsest pool window (e.g., 64 tokens) is far smaller than the shard size (N/W ≈ 128K at N=1M, W=8), so no inter-rank communication is needed at this stage.

2
Standard ring attention

The gathered sub-sequence is dense, so it participates in standard ring attention with no sparse-aware collectives. KV shards rotate through the ring as in a fully dense long-context run. Sparse-index-based methods cannot do this; ring rotation requires a contiguous tensor, which their sparse outputs are not.

~10%
ring-rotation overhead in CP vs single-device
1M
token training context achieved
4×8
nodes × GPUs, CP degree 8

The Lighthouse vs. SDPA speedup ratio is fully preserved under matched CP geometry, carrying the advantage cleanly into the 1M-token regime.

12  / Limitations & Resources

Limitations and Open Directions

Key limitation: Symmetric Q/K/V pooling presumes all queries co-occur in a single forward pass. Autoregressive decoding presents one query at a time, which violates that assumption. Lighthouse is a training-only method and relies on the dense-SDPA resumption to produce an inference-ready model. The gathered sub-sequence cost is Θ(S²·d): sub-quadratic in N at fixed k, but not strictly linear. Regimes where k must scale with N remain uncharacterized.

Open directions from the paper:

Asymmetric sparse resumption (DSA / NSA / MoBA target)
Per-layer / per-head adaptive k
Vision, audio, video pyramid extensions
Serving integration (continuous batching, KV-cache)

Paper

arXiv:2605.06554
“Long Context Pre-Training with Lighthouse Attention”
Peng, Ghosh, Quesnelle — Nous Research

Code

github.com/ighoshsubho/lighthouse-attention
Patch on upstream torchtitan + 2 new files

Scorer variants: norm, dilated, gla, selectable from the config. The CP path requires the norm scorer.



Key Takeaways

  • Nous Research's Lighthouse Attention pools Q, K, and V symmetrically across a multi-level pyramid (unlike NSA and HISA, which pool only K and V), cutting the attention call from O(N·S·d) to O(S²·d) and making the expensive step stock FlashAttention on a small dense sub-sequence.
  • It's a training-only method: a short dense-SDPA resumption at the end converts the checkpoint into a normal full-attention model that matches or beats dense-from-scratch at the same token budget (final loss 0.6980–0.7102 vs. 0.7237 baseline, 16k steps, ~50.3B tokens).
  • At 512K context on a single B200, Lighthouse is 21× faster on the forward pass and 17.3× faster on forward+backward vs. cuDNN SDPA, translating to a 1.40×–1.69× end-to-end pretraining wall-clock speedup.
  • The top-K selection step is deliberately non-differentiable (no straight-through estimator, no Gumbel softmax), so the projection matrices learn to produce values that are useful when selected rather than to game a learnable scorer.
  • Scales to 1M-token training across 32 Blackwell GPUs (4 nodes, CP degree 8) under context parallelism with no changes to the inner attention kernel, because the gathered sub-sequence is dense and participates in standard ring attention.

Check out the Paper and the GitHub Repo.


