What We Learned Trying to Reinvent Attention

May 23, 2026

The Hypothesis

Standard attention computes Q·K similarity between every pair of tokens.
This is dense, quadratic, and information-blind: the mechanism doesn't know which past tokens matter for the current query — it has to compute to find out.

We had a different intuition. What if past tokens could declare their future relevance? Each token, when processed, would emit forward declarations: "future tokens of vocab X, Y, Z should attend to me." A new token would then route by direct lookup — no similarity computation, no top-K selection over the full context. Just vocab-indexed retrieval of pre-declared relevance.

We called this SPaRK: Sparse Push-based attention with Relevance Keys.
On paper, it had everything we wanted: distance-agnostic routing, bounded compute per token, explicit interpretable structure.

In practice, after six architectural iterations and fifteen training runs, it didn't work. Here is why — and what actually did.

The Core Mechanism

For each token at position t with vocab v, SPaRK does three things:

1. Declaration lookup: which past tokens declared v as a future attender? Those positions form the attention scope.

2. Inheritance: if position p is in scope, anything p attends to also becomes accessible (depth 1).

3. Cold start fallback: if no declarations match, use a sliding window of the most recent W tokens to ensure non-zero scope.
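The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the code used in the experiments: the class name DeclarationIndex, its method names, and the default window size are hypothetical, and we assume that "anything p attends to" means the scope that was stored for p when it was processed.

```python
from collections import defaultdict


class DeclarationIndex:
    """Toy vocab-indexed routing table (hypothetical names, illustration only)."""

    def __init__(self, window=4):
        self.window = window                # W: size of the cold-start sliding window
        self.by_vocab = defaultdict(set)    # vocab id -> positions that declared it
        self.scope_of = {}                  # position -> scope that position attended to
        self.length = 0                     # number of tokens processed so far

    def scope_for(self, vocab_id):
        t = self.length
        # Step 1: direct lookup of past positions that declared this vocab id.
        direct = set(self.by_vocab.get(vocab_id, ()))
        # Step 2: depth-1 inheritance from each directly matched position.
        inherited = set()
        for p in direct:
            inherited |= self.scope_of[p]
        scope = direct | inherited
        # Step 3: cold-start fallback to the W most recent positions.
        if not scope:
            scope = set(range(max(0, t - self.window), t))
        return scope

    def step(self, vocab_id, declared):
        """Process one token: compute its scope, then record its declarations."""
        t = self.length
        scope = self.scope_for(vocab_id)
        self.scope_of[t] = scope
        for v in declared:
            self.by_vocab[v].add(t)
        self.length += 1
        return scope


index = DeclarationIndex(window=1)
index.step(10, declared={20})        # position 0 declares vocab 20
index.step(20, declared={40})        # position 1 matches 0, declares vocab 40
index.step(30, declared=set())       # position 2 falls back to the window
index.step(30, declared=set())       # position 3 falls back to the window
print(index.step(40, declared=set()))  # position 4: scope is {0, 1}
```

In the usage example, position 4 reaches position 1 by direct lookup and position 0 by inheritance, even though both lie outside the one-token window. That is the distance-agnostic behavior the design was aiming for.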

After processing, the token emits its own forward declarations via a small projection head: hidden state → vocab logits → top-K declared.

The whole mechanism is trained end to end: a straight-through estimator (STE) passes gradients through the hard top-K selection.
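To make the straight-through idea concrete, here is a small sketch with the forward and backward passes written out by hand. It assumes one common STE variant, where the forward pass uses the hard 0/1 top-K mask and the backward pass uses the gradient of a softmax surrogate. The post does not say which surrogate was used, so the softmax choice and all function names here are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def topk_forward(logits, k):
    """Forward pass: hard 0/1 mask over the k largest logits."""
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    mask = [0.0] * len(logits)
    for i in chosen:
        mask[i] = 1.0
    return mask


def topk_backward_ste(logits, grad_mask):
    """Backward pass: act as if the forward output had been softmax(logits).

    The hard mask has zero gradient almost everywhere, so the estimator
    substitutes the softmax vector-Jacobian product (assumed surrogate).
    """
    s = softmax(logits)
    dot = sum(si * gi for si, gi in zip(s, grad_mask))
    return [si * (gi - dot) for si, gi in zip(s, grad_mask)]


vocab_logits = [0.1, 2.0, -1.0, 1.5]
declared = topk_forward(vocab_logits, k=2)                    # [0.0, 1.0, 0.0, 1.0]
grads = topk_backward_ste(vocab_logits, [1.0, 0.0, 0.0, 0.0])  # upstream wants vocab 0
```

Vocab 0 was not selected in the forward pass, yet its logit still receives a positive gradient. That is the point of the estimator: unselected entries can still learn to be selected.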

Initial Signal

The first proper implementation (v6) trained in 22 minutes, fourteen times faster than our dense baseline at the same scale (3M parameters, Wikipedia FR, context 256). Validation perplexity was close to the baseline, though slightly worse (47.6 for SPaRK vs 45).

At the training context length, retrieval performance was strong: a needle signal of +0.49 nats at ctx=256, compared to +0.29 for the dense baseline and +0.18 for a sliding window. The mechanism appeared to work.
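For readers who want to place these numbers on a common scale, the sketch below shows one plausible way to compute a needle signal in nats and to convert a perplexity gap into nats. The definition of the needle signal is an assumption on our part for illustration (the mean gain in answer-token log-likelihood when the needle is present in context), and both function names are hypothetical.

```python
import math


def needle_signal(logprobs_with_needle, logprobs_without_needle):
    """Mean log-likelihood gain (nats) on answer tokens from having the needle in context.

    Assumed definition: positive means the needle helps, negative means it distracts.
    """
    if len(logprobs_with_needle) != len(logprobs_without_needle):
        raise ValueError("both runs must score the same answer tokens")
    gains = [w - wo for w, wo in zip(logprobs_with_needle, logprobs_without_needle)]
    return sum(gains) / len(gains)


def perplexity_gap_nats(ppl_a, ppl_b):
    """Difference in mean per-token loss (nats) implied by two perplexities."""
    return math.log(ppl_a) - math.log(ppl_b)


# Made-up log-probabilities for two answer tokens, for illustration only.
print(needle_signal([-1.0, -2.0], [-1.5, -2.5]))   # 0.5
# Validation perplexity 47.6 vs 45 is a gap of about 0.056 nats per token.
print(perplexity_gap_nats(47.6, 45.0))
```

Under this reading, the perplexity gap between SPaRK v6 and the baseline is small relative to the retrieval signals reported above.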

Then we extrapolated.

The Failure Mode

At ctx=1024 (4× training context), SPaRK's mean retrieval signal dropped to +0.061, below both baseline (+0.098) and sliding window (+0.073).

At ctx=4096 (16× training), it went negative: -0.024. The needle was actively distracting the model rather than helping it.

We ran ablations. Five distinct architectural variants — push signatures with bottleneck, push signatures without bottleneck, NodeManager with explicit eviction, continuous Q·K routing in declaration space, vocab-id declarations with inheritance — all showed the same pattern: strong in-distribution, weak extrapolation.

The Underlying Pathology

The mechanism we observed across all variants:

decl_threshold drift. The learned threshold for what counts as a declaration consistently drifted toward negative values during training.
Lower threshold means more tokens pass the declaration test, which means the routing scope grows, which means the mechanism degenerates toward dense attention with extra overhead.
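A toy calculation makes the drift concrete. The sketch assumes that a past position counts as declaring a vocab id when its declaration logit for that id exceeds decl_threshold. The logit values are made up for illustration, and the helper name scope_fraction is hypothetical.

```python
def scope_fraction(logits_for_query_vocab, decl_threshold):
    """Fraction of past positions whose declaration logit clears the threshold."""
    passing = sum(1 for x in logits_for_query_vocab if x > decl_threshold)
    return passing / len(logits_for_query_vocab)


# One declaration logit per past position, all for the current query's vocab id.
logits = [-2.1, 0.7, -0.4, 1.9, -1.3, 0.2, -0.8, -3.0]

for threshold in (1.0, 0.0, -1.0, -4.0):
    print(threshold, scope_fraction(logits, threshold))
# 1.0 0.125
# 0.0 0.375
# -1.0 0.625
# -4.0 1.0
```

With the logits held fixed, moving the threshold alone takes the scope from one position in eight to every position, at which point the routing is dense attention in all but name.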

Given a choice between sparse routing and effective dense attention, the model reliably collapses the sparse mechanism. There is no learning signal that rewards sparsity per se — the next-token cross-entropy loss is happy with full attention, and full attention is what the model converges toward by abusing the threshold.

We tried forcing sparsity through auxiliary losses:

* Auxiliary BCE matching declarations to actual attention top-K: improved declaration accuracy (p@5), hurt retrieval signal. The gradient conflicted with the main objective.

* Hard top-K with no fallback: training collapsed. Val perplexity 109 versus baseline 45. Gradients too fragmented to learn meaningful routing.

* Knowledge distillation from a dense teacher: the student inherited the teacher's positional encoding behavior, undoing the extrapolation gains we found elsewhere.

The pattern is consistent. At our scale, the sparse routing mechanism adds parameters and complexity without corresponding learning pressure.
The model would rather be dense, and given any opportunity, it will be.

What Actually Worked

Two changes — neither novel architecturally — produced larger gains than any sparse routing variant:

1. ALiBi instead of RoPE for positional encoding. At ctx=4096, our RoPE-based variants showed a retrieval signal of -0.026 nats. ALiBi-based variants showed +0.096 nats. The same architecture with a different positional encoding gained about 0.12 nats on out-of-distribution context.
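As a reminder of what ALiBi changes, the sketch below builds the per-head linear bias that is added to the attention logits. It follows the standard formulation for a power-of-two number of heads. The function names are ours, and the head count and other settings used in these experiments are not stated here, so treat the values as illustrative.

```python
def alibi_slopes(num_heads):
    """Geometric per-head slopes: 2^(-8/n), 2^(-16/n), ..., 2^(-8)."""
    if num_heads <= 0 or (num_heads & (num_heads - 1)) != 0:
        raise ValueError("this sketch only handles power-of-two head counts")
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias(seq_len, slope):
    """Causal bias matrix: -slope * (i - j) for keys j <= i, -inf for future keys."""
    neg_inf = float("-inf")
    return [
        [-slope * (i - j) if j <= i else neg_inf for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)            # 0.5, 0.25, ..., 1/256
bias = alibi_bias(4, slopes[0])     # added to Q.K scores before the softmax
print(bias[3])                      # [-1.5, -1.0, -0.5, -0.0]
```

The bias depends only on the distance i - j, not on absolute position, so nothing in it is tied to the training context length. That property is the usual explanation for why ALiBi extrapolates to longer contexts.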

2. More training epochs. Going from 5 to 12 epochs more than doubled the mean retrieval signal (+0.110 to +0.276), with the same architecture and the same compute per step.

Our best model — a pure transformer with ALiBi, no sparse mechanism, trained for 12 epochs — achieved val_ppl 35.4 and mean retrieval signal of +0.312 nats. It outperformed every SPaRK variant on every metric we measured, including extrapolation to 16× training context where it retained +0.135 mean signal versus -0.024 for the architecture we designed specifically for that case.

The Lesson

We had an architectural intuition that was correct in form: declaration-based routing is a real alternative to similarity-based attention. The problem is not the idea. The problem is the conditions under which it could work.

At 3M parameters trained on continuous text with 256-token contexts, the model has no incentive to develop long-range routing. The local distribution suffices for next-token prediction. The mechanism we built is essentially answering a question no one is asking — the loss function doesn't reward sparse routing over dense attention, so the model defaults to dense and optimizes the rest.

For declaration-based routing to emerge, we would need either:

* Training data with structured long-range dependencies (code with distant definitions, multi-document QA, books with delayed references) where the loss directly benefits from long-range retrieval.

* Larger models with the capacity to develop the routing as an emergent capability rather than a directly supervised one.

* Explicit auxiliary supervision targeting routing — but our attempts at this consistently conflicted with the main objective.

At Lambert's current scale, none of these conditions are met. The practical answer is to focus compute on simple, well-understood architectures and accept that novel routing mechanisms need a different regime to demonstrate value.

Where This Leaves Us

We have invested dozens of hours into SPaRK. Looking at our current results and what additional investment would realistically yield, we see no clear case for pushing further.

The mechanism is not broken. It simply does not earn its complexity in the regime we operate in. The kinds of training setups that could plausibly reward declaration-based routing — significantly larger scale, or data with explicit long-range structure — are not where Lambert is today.

At Korollr, we choose where to spend our compute deliberately. For now, that is not here.

We will revisit declaration-based routing when we have the scale or data regime that could plausibly reward it. The intuition still seems right to us — distance-agnostic content-addressable retrieval is a real thing models should be able to learn. We just have not found the conditions where they choose to.

Acknowledgments

This work was conducted as part of Lambert's architecture research program at Korollr. We thank the long-context attention research community — especially the authors of EM-LLM, Infini-Attention, and NSA — whose work shaped how we framed and tested these ideas. As always, we prefer results over claims. The result this time is that our claim was wrong. We thought you should know.