
Memory as Computation, Not Storage

April 20, 2026

The Problem with Standard Attention

In most language models, context is handled by attention over tokens. Every token can attend to every other token. This works, but it’s inefficient.

The model must:

  • locate relevant information,

  • filter noise,

  • and reason, all at once.

As context grows, the problem becomes harder, not just larger.

A Different Approach

Instead of operating directly on long token sequences, we introduce a latent memory:

  • Context is compressed into a fixed set of slots (M_ctx)

  • The question is processed separately into its own representation (M_q), which interacts with this memory

  • The decoder generates from both

This creates a separation:

  • memory formation (what matters)

  • memory usage (what’s relevant now)
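The compression step can be sketched as cross-attention from a set of learned slot queries into the context. This is a minimal, hypothetical sketch: the class name, dimensions, and use of standard multi-head attention are illustrative assumptions, not the actual implementation.

```python
import torch
import torch.nn as nn

class SlotMemory(nn.Module):
    """Compress a variable-length context into k fixed memory slots
    via cross-attention. Illustrative sketch only; all names and
    sizes here are assumptions, not the described system's code."""

    def __init__(self, d_model=256, k=32, n_heads=4):
        super().__init__()
        # k learned query vectors, one per memory slot
        self.slots = nn.Parameter(torch.randn(k, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, ctx):
        # ctx: (batch, T, d_model) token embeddings of the context
        q = self.slots.unsqueeze(0).expand(ctx.size(0), -1, -1)
        # each slot attends over all context tokens; output is (batch, k, d_model)
        m_ctx, _ = self.attn(q, ctx, ctx)
        return m_ctx

mem = SlotMemory()
m_ctx = mem(torch.randn(2, 512, 256))   # 512 tokens in, 32 slots out
```

However the question is encoded, the decoder then attends to both M_ctx and the question representation instead of the raw token sequence, which is where the separation between memory formation and memory usage comes from.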

Why This Matters

This architecture introduces a strong bottleneck:

  • 512 tokens → 32 memory slots

  • variable context → fixed representation

The model is forced to:

  • discard irrelevant information

  • organize what remains

  • reuse it efficiently

In practice, this changes the learning dynamics significantly.

Early Observations

In internal experiments on small models:

  • We observe ≥5.5× improvements in efficiency (depending on setup)

  • Models generalize better at the same parameter count

  • Training appears more stable

We also note that the memory size (k) plays a key role.

A simple analysis suggests that reducing interactions from token-token to token-slot could scale roughly with:

context_length / number_of_slots

In our current setting, this ratio is 16.

We do not claim a 16× end-to-end improvement, but it provides an upper-bound intuition for why gains appear.
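The back-of-envelope count behind that ratio is simple: full self-attention touches every token pair, while slot attention touches only token-slot pairs. With the numbers above:

```python
context_length = 512
num_slots = 32

# full self-attention: every token attends to every token
token_token = context_length ** 2          # 262144 interactions

# slot attention: every token interacts with every slot
token_slot = context_length * num_slots    # 16384 interactions

ratio = token_token / token_slot
print(ratio)  # 16.0
```

This counts pairwise interactions only; it ignores constant factors and the cost of the question pathway and decoder, which is why it is an upper-bound intuition rather than an end-to-end speedup.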

What Memory Becomes

The interesting part is what the slots learn.

They don’t store text.

They seem to organize into:

  • semantic clusters

  • reusable abstractions

  • structured representations of context

In other words, memory becomes less like a cache — and more like a set of latent features.

Implications

If this holds:

  • small models may benefit disproportionately from structured memory

  • context length may become less critical than memory capacity

  • scaling may shift from parameters → representations

This aligns with our broader direction:

smaller models, better structure

Next Steps

We’re currently exploring:

  • scaling laws for slot-based memory

  • ablations on memory size and grouping

  • interaction with external memory systems

  • compatibility with sparse / event-driven architectures

As always, we prefer results over claims; more to come.