Memory as Computation, Not Storage
April 20, 2026
The Problem with Standard Attention
In most language models, context is handled by attention over tokens. Every token can attend to every other token. This works, but it’s inefficient.
The model must simultaneously:
locate relevant information,
filter noise,
and reason.
As context grows, the problem becomes harder, not just larger.
A Different Approach
Instead of operating directly on long token sequences, we introduce a latent memory:
Context is compressed into a fixed set of slots (M_ctx)
The question is processed separately and interacts with this memory (M_q)
The decoder generates from both
This creates a separation:
memory formation (what matters)
memory usage (what’s relevant now)
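A minimal sketch of this separation, assuming a Perceiver-style single-head cross-attention compressor (the slot count, hidden size, and random placeholder weights are illustrative assumptions, not the exact architecture):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    """Single-head cross-attention: each query reads from keys_values."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (n_q, n_kv)
    return softmax(scores) @ keys_values            # (n_q, d)

rng = np.random.default_rng(0)
d_model, num_slots = 64, 32                         # hypothetical sizes; k = 32 as in the post
slot_queries = rng.normal(size=(num_slots, d_model))  # learned parameters in practice

# Memory formation: compress a variable-length context into k slots (M_ctx).
context = rng.normal(size=(512, d_model))           # 512 context tokens
M_ctx = cross_attend(slot_queries, context)         # (32, 64), regardless of context length

# Memory usage: the question reads from the compressed memory (M_q).
question = rng.normal(size=(16, d_model))           # 16 question tokens
M_q = cross_attend(question, M_ctx)                 # (16, 64)

# The decoder would then generate conditioned on both M_ctx and M_q.
print(M_ctx.shape, M_q.shape)
```

Note that `M_ctx` keeps the same shape whether the context has 512 tokens or 5,000, which is the bottleneck discussed below.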
Why This Matters
This architecture introduces a strong bottleneck:
512 tokens → 32 memory slots
variable context → fixed representation
The model is forced to:
discard irrelevant information
organize what remains
reuse it efficiently
In practice, this changes the learning dynamics significantly.
Early Observations
In internal experiments on small models:
We observe ≥5.5× improvements in efficiency (depending on setup)
Models generalize better at the same parameter count
Training appears more stable
We also note that the memory size (k) plays a key role.
A simple analysis suggests that reducing interactions from token-token to token-slot could scale roughly with:
context_length / number_of_slots
In our current setting, this ratio is 16.
We do not claim a 16× end-to-end improvement, but this ratio offers an upper-bound intuition for why gains appear.
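The back-of-the-envelope arithmetic can be written out directly (an illustrative interaction count, not a measured speedup):

```python
# Rough interaction-count comparison, using the post's setting:
# 512 context tokens compressed into 32 memory slots.
context_length = 512
num_slots = 32

token_token = context_length ** 2        # pairs in full self-attention over the context
token_slot = context_length * num_slots  # pairs when tokens attend to memory slots instead

ratio = token_token / token_slot         # simplifies to context_length / num_slots
print(token_token, token_slot, ratio)    # 262144 16384 16.0
```

This is only an upper bound on savings: the compressor itself costs compute, and attention is not the whole forward pass.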
What Memory Becomes
The interesting part is what the slots learn.
They don’t store text.
They seem to organize into:
semantic clusters
reusable abstractions
structured representations of context
In other words, memory becomes less like a cache — and more like a set of latent features.
Implications
If this holds:
small models may benefit disproportionately from structured memory
context length may become less critical than memory capacity
scaling may shift from parameters → representations
This aligns with our broader direction:
smaller models, better structure
Next Steps
We’re currently exploring:
scaling laws for slot-based memory
ablations on memory size and grouping
interaction with external memory systems
compatibility with sparse / event-driven architectures
As always, we prefer results over claims; more to come.