Memory as Computation, Not Storage
April 20, 2026
The Problem with Standard Attention
In most language models, context is handled by attention over tokens. Every token can attend to every other token. This works, but it’s inefficient.
The model must simultaneously:
locate relevant information,
filter noise,
and reason.
As context grows, the problem becomes harder, not just larger.
A Different Approach
Instead of operating directly on long token sequences, we introduce a latent memory:
Context is compressed into a fixed set of slots (M_ctx)
The question is processed separately and interacts with this memory (M_q)
The decoder generates from both
This creates a separation:
memory formation (what matters)
memory usage (what’s relevant now)
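A minimal sketch of this separation, assuming a Perceiver-style single-head cross-attention compressor (the slot count, hidden size, and random placeholder weights are illustrative assumptions, not the exact architecture):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    """Single-head cross-attention: each query reads from keys_values."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (n_q, n_kv)
    return softmax(scores) @ keys_values            # (n_q, d)

rng = np.random.default_rng(0)
d_model, num_slots = 64, 32                         # hypothetical sizes; k = 32 as in the post
slot_queries = rng.normal(size=(num_slots, d_model))  # learned parameters in practice

# Memory formation: compress a variable-length context into k slots (M_ctx).
context = rng.normal(size=(512, d_model))           # 512 context tokens
M_ctx = cross_attend(slot_queries, context)         # (32, 64), regardless of context length

# Memory usage: the question reads from the compressed memory (M_q).
question = rng.normal(size=(16, d_model))           # 16 question tokens
M_q = cross_attend(question, M_ctx)                 # (16, 64)

# The decoder would then generate conditioned on both M_ctx and M_q.
print(M_ctx.shape, M_q.shape)
```

Note that `M_ctx` keeps the same shape whether the context has 512 tokens or 5,000, which is the bottleneck discussed below.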
Why This Matters
This architecture introduces a strong bottleneck:
512 tokens → 32 memory slots
variable context → fixed representation
The model is forced to:
discard irrelevant information
organize what remains
reuse it efficiently
In practice, this changes the learning dynamics significantly.
Early Observations
In internal experiments on small models:
We observe ≥5.5× improvements in efficiency (depending on setup)
Models generalize better at the same parameter count
Training appears more stable
We also note that the memory size (k) plays a key role.
A simple analysis suggests that reducing interactions from token-token to token-slot could scale roughly with:
context_length / number_of_slots
In our current setting, this ratio is 16.
We do not claim a 16× end-to-end improvement, but this ratio offers an upper-bound intuition for why gains appear.
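The back-of-the-envelope arithmetic can be written out directly (an illustrative interaction count, not a measured speedup):

```python
# Rough interaction-count comparison, using the post's setting:
# 512 context tokens compressed into 32 memory slots.
context_length = 512
num_slots = 32

token_token = context_length ** 2        # pairs in full self-attention over the context
token_slot = context_length * num_slots  # pairs when tokens attend to memory slots instead

ratio = token_token / token_slot         # simplifies to context_length / num_slots
print(token_token, token_slot, ratio)    # 262144 16384 16.0
```

This is only an upper bound on savings: the compressor itself costs compute, and attention is not the whole forward pass.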
What Memory Becomes
The interesting part is what the slots learn.
They don’t store text.
They seem to organize into:
semantic clusters
reusable abstractions
structured representations of context
In other words, memory becomes less like a cache — and more like a set of latent features.
Implications
If this holds:
small models may benefit disproportionately from structured memory
context length may become less critical than memory capacity
scaling may shift from parameters → representations
This aligns with our broader direction:
smaller models, better structure
Next Steps
We’re currently exploring:
scaling laws for slot-based memory
ablations on memory size and grouping
interaction with external memory systems
compatibility with sparse / event-driven architectures
As always, we prefer results over claims; more to come.