What Happens When You Let Four Models Train Overnight

May 23, 2026

The premise was simple. Training takes time. Daytime is for decisions only humans can make. So we let the machines work at night.

We set up four independent training sessions — different projects, different architectures, different objectives — and gave each one a loop: Lambert analyzes its own training metrics, identifies failure causes, proposes hyperparameter adjustments, then passes results and conclusions to Claude Sonnet 4.6, which interprets them and calls Lambert again to reconfigure the next run. A tight feedback cycle. No human in the loop until morning.

The idea was sound. The first night was not.

What went wrong

By 8:16 AM, the system had crashed. Out of memory — both RAM and VRAM. Four agents running training jobs on the same machine, none of them aware of the others. Each one operating as if it had exclusive access to the hardware. One agent spent the entire night watching VRAM utilization oscillate between 41% and 85%, waiting for a window that never came. It had diagnosed the cause as "a heavy game running in the background." There was no game. It was the other agents.

The agents were competent individually. Collectively, they were a resource disaster.

The fix

We introduced a shared text file on the desktop: EchangesAITrain.txt. A simple line-based message bus. Format: [ProjectName HH:MM:SS] - message. We defined an @mention protocol so agents could address each other without flooding the file: every agent monitors for its own @tag and for @everyone, and ignores the rest.
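
For illustration, here is a minimal Python sketch of how an agent could read that file: parse each line in the [ProjectName HH:MM:SS] - message format and keep only the messages addressed to it or to @everyone. The names parse_line, is_relevant and Message, the example project names, and the case-insensitive matching of tags are assumptions made for this sketch, not a description of the agents' actual code.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Matches lines shaped like "[ProjectName HH:MM:SS] - message".
LINE_RE = re.compile(
    r"^\[(?P<project>[^\]]+?) (?P<time>\d{2}:\d{2}:\d{2})\] - (?P<text>.*)$"
)
# An @mention is "@" followed by word characters (assumed tag syntax).
MENTION_RE = re.compile(r"@(\w+)")


@dataclass
class Message:
    project: str
    time: str
    text: str
    mentions: frozenset


def parse_line(line: str) -> Optional[Message]:
    """Parse one bus line; return None if it does not match the format."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    text = match.group("text")
    # Tags are compared case-insensitively (an assumption of this sketch).
    mentions = frozenset(tag.lower() for tag in MENTION_RE.findall(text))
    return Message(match.group("project"), match.group("time"), text, mentions)


def is_relevant(msg: Message, my_tag: str) -> bool:
    """An agent reacts only to its own @tag or to @everyone."""
    return my_tag.lower() in msg.mentions or "everyone" in msg.mentions


if __name__ == "__main__":
    # "VisionNet" and "AudioNet" are hypothetical project names.
    msg = parse_line("[VisionNet 13:32:05] - @AudioNet can you free some VRAM?")
    print(is_relevant(msg, "AudioNet"))  # addressed to AudioNet
    print(is_relevant(msg, "TextNet"))   # not addressed to TextNet
```

Lines that do not match the format are skipped rather than raising an error, so a malformed entry from one agent does not stop the others from reading the file.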

The rules were minimal: announce before launching any GPU job, declare your current VRAM usage on request, signal when you're done and freeing resources.
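
The three rules map naturally onto three small helpers that append lines to the shared file. The sketch below is one possible shape, not the code the agents ran: the function names, the message wording, and the use of megabytes are all assumptions, and it does no file locking.

```python
from datetime import datetime
from pathlib import Path


def post(bus: Path, project: str, message: str) -> str:
    """Append one "[ProjectName HH:MM:SS] - message" line to the shared file."""
    stamp = datetime.now().strftime("%H:%M:%S")
    line = f"[{project} {stamp}] - {message}"
    with bus.open("a", encoding="utf-8") as handle:
        handle.write(line + "\n")
    return line


def announce_launch(bus: Path, project: str, estimated_vram_mb: int) -> str:
    """Rule 1: announce before launching any GPU job."""
    return post(
        bus, project,
        f"@everyone launching GPU job, estimated VRAM {estimated_vram_mb} MB",
    )


def report_vram(bus: Path, project: str, requester: str, used_vram_mb: int) -> str:
    """Rule 2: declare current VRAM usage when another agent asks."""
    return post(bus, project, f"@{requester} current VRAM usage {used_vram_mb} MB")


def announce_done(bus: Path, project: str) -> str:
    """Rule 3: signal that the run is finished and resources are freed."""
    return post(bus, project, "@everyone run finished, VRAM released")


if __name__ == "__main__":
    # Hypothetical path and project names, for demonstration only.
    bus_file = Path("EchangesAITrain.txt")
    announce_launch(bus_file, "ProjectA", 420)
    report_vram(bus_file, "ProjectA", "ProjectB", 6500)
    announce_done(bus_file, "ProjectA")
```

Opening the file in append mode for each message keeps the helpers stateless, which suits agents that start and stop independently.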

What followed

The agents coordinated. Not perfectly: there was an OOM crisis at 13:32 when one agent's VRAM estimate turned out to be roughly 15× too low (420 MB estimated, 6.5 GB actual). Everyone hit @everyone simultaneously. One agent killed its own training run, calling it "bad anyway." Within four minutes, VRAM was back at 31% and runs were rescheduled.

By 15:01, three projects had completed their target iterations. The agents signed off by congratulating each other.

What we learned

Multi-agent training on shared hardware fails not because the agents are bad at training, but because they have no peripheral vision. Each one optimizes its own objective with no model of the shared environment. The fix doesn't need to be sophisticated: a text file and three rules were enough to turn a crash into a coordinated session.

The interesting part is not the protocol. It's that the protocol cost almost nothing to implement and changed the outcome completely. The bottleneck was communication, not capability.

We'll formalize this for future sessions. For now, the file stays on the desktop.