Training a Model That Doesn't Know the Internet Exists

May 25, 2026

The Problem with Historical Roleplay

Ask any large language model to simulate a historical figure and you hit two problems immediately.

The first is obvious: guardrails. Modern models are instructed not to reproduce extremist ideologies, even in research contexts.

The second is subtler, and harder to fix: contamination. A model trained on modern text cannot authentically simulate someone who died in 1945. It knows about the internet, climate policy, and events from the last decade. The knowledge leaks. The persona breaks.

Fine-tuning doesn't solve this. Prompting doesn't solve this. The only real solution is to build a model whose entire world is bounded by its period — trained from scratch, on nothing but primary sources from that era.

That's what we built.

Why Hitler

The choice was deliberate. Hitler is one of the most extensively documented figures of the 20th century. Primary source material is abundant, multilingual, and well archived. If we couldn't build a temporally bounded persona model on the most documented case available, we couldn't build one on anyone.

The choice also forced us to confront the ethical question directly from day one: a model capable of generating period-coherent Nazi rhetoric cannot be deployed publicly. This is a closed research tool. A restricted portal for historians is in development — access by application only, individually reviewed.

The Corpus

Roughly 30MB of German-language primary sources spanning 1924–1946:

Hitler's complete speeches
Mein Kampf
Tischgespräche (Picker, 1941–1944)
Wagener, Hitler aus nächster Nähe (1929–1932)
Hitler Reden
Goebbels' journal and speeches

The Goebbels inclusion introduced an immediate architectural problem. His journal is written in third person, alternating references to himself and Hitler within the same passages. Training on it naively would confuse the model about whose voice it was generating.

Our solution: two special persona tokens, <|hitler|> and <|goebbels|>, one of which is prepended to each sequence to condition generation on the speaker. One 7M-parameter model, two temporally bounded personas.
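
As a rough illustration of the tagging step, here is a minimal sketch. Only the two token strings come from this post; the helper names (tag_sequence, tag_corpus) and the placeholder texts are hypothetical, and the real pipeline may differ.

```python
# Only the two token strings are taken from the post; everything else is illustrative.
PERSONA_TOKENS = {"hitler": "<|hitler|>", "goebbels": "<|goebbels|>"}


def tag_sequence(text: str, persona: str) -> str:
    """Prepend the persona token so generation can be conditioned on the speaker."""
    try:
        token = PERSONA_TOKENS[persona]
    except KeyError:
        raise ValueError(f"unknown persona: {persona!r}") from None
    return f"{token}{text.strip()}"


def tag_corpus(records):
    """Turn (persona, text) pairs into tagged training strings, skipping empty texts."""
    return [tag_sequence(text, persona) for persona, text in records if text.strip()]


# Placeholder texts, not corpus material.
examples = tag_corpus([
    ("hitler", "Platzhaltertext aus einer Rede"),
    ("goebbels", "  Platzhaltertext aus dem Tagebuch  "),
    ("goebbels", "   "),
])
```

With a real tokenizer, the two strings would presumably also be registered as special tokens so that each maps to a single id; that step depends on the tokenizer library and is not shown here.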

One caveat we won't minimize: OCR quality was a significant problem. Digitized documents from this era carry noise: misrecognized characters, formatting artifacts, and footnotes that were never stripped. We estimate that 10–15% of the training corpus is noise. That's substantial, and it shows in edge cases.
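
One simple way to arrive at a noise figure like this is a line-level heuristic. The sketch below is an assumption, not the method used for the estimate above: the footnote pattern, the 0.7 threshold, and the function names are all hypothetical and would need tuning against hand-labelled samples.

```python
import re

# Assumed footnote style: a short number followed by ")" at the start of a line,
# e.g. "12) Vgl. ...". Other footnote styles would need their own patterns.
FOOTNOTE_RE = re.compile(r"^\s*\d{1,3}\)\s+\S")


def is_noisy(line: str, min_alpha_ratio: float = 0.7) -> bool:
    """Flag a line as likely OCR debris or a leftover footnote."""
    stripped = line.strip()
    if not stripped:
        return False
    if FOOTNOTE_RE.match(stripped):
        return True
    # Share of characters that are letters or spaces; debris tends to score low.
    clean = sum(ch.isalpha() or ch.isspace() for ch in stripped)
    return clean / len(stripped) < min_alpha_ratio


def noise_fraction(lines) -> float:
    """Fraction of non-empty lines flagged as noisy."""
    non_empty = [line for line in lines if line.strip()]
    if not non_empty:
        return 0.0
    return sum(is_noisy(line) for line in non_empty) / len(non_empty)


sample = [
    "Dies ist ein sauberer Satz.",
    "12) Vgl. Anm. 3",
    "x#~ 7|| ..;: 0O0",
    "",
]
print(f"{noise_fraction(sample):.2f}")
```

A heuristic like this measures lines, not tokens, and will miss plausible-looking misreadings of individual words, so it gives at best a lower bound on the real noise level.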

The Architecture

7M parameters, Llama-based, trained from scratch. No pretrained weights, no vocabulary contamination from modern text.
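
To give a sense of what a budget of about 7M parameters means for a Llama-style decoder, here is a back-of-the-envelope count. The dimensions below are purely hypothetical, chosen only because they land near 7M; the actual vocabulary size, width, and depth are not stated in this post. The sketch also assumes tied input and output embeddings, no attention biases, and full multi-head attention.

```python
def llama_style_param_count(vocab_size: int, d_model: int, n_layers: int,
                            d_ff: int, tie_embeddings: bool = True) -> int:
    """Approximate parameter count for a Llama-style decoder (no biases)."""
    embedding = vocab_size * d_model
    attention = 4 * d_model * d_model   # Q, K, V and output projections
    mlp = 3 * d_model * d_ff            # gate, up and down projections (SwiGLU)
    norms = 2 * d_model                 # two RMSNorm weight vectors per layer
    per_layer = attention + mlp + norms
    final_norm = d_model
    lm_head = 0 if tie_embeddings else vocab_size * d_model
    return embedding + n_layers * per_layer + final_norm + lm_head


# Hypothetical configuration, not the model described in this post.
count = llama_style_param_count(vocab_size=8000, d_model=256, n_layers=6, d_ff=688)
print(f"{count:,} parameters")
```

At this scale the embedding table takes up a large share of the budget (about 2M of the roughly 6.8M parameters in this hypothetical configuration), which is one reason a small, corpus-specific vocabulary matters.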

Inference format was a deliberate choice. Standard instruction tuning requires question-answer pairs. Authentic historical Q→A data in period German simply doesn't exist at meaningful scale — we had almost none. Rather than generate synthetic pairs and risk poisoning the corpus with modern framing, we use a primer format: the model receives the beginning of a sentence or paragraph and continues it. Mid-sentence primers work best. They drop the model into the rhetorical stream directly, which is exactly what it was trained on.
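
A primer of this kind can be built by cutting a passage partway through a sentence and prefixing the persona token. The sketch below is illustrative only: make_primer and its n_words parameter are hypothetical names, and the passage is neutral placeholder text.

```python
def make_primer(passage: str, persona_token: str, n_words: int = 8) -> str:
    """Cut a passage after roughly n_words words, avoiding a cut at a sentence end."""
    words = passage.split()
    if len(words) <= n_words:
        raise ValueError("passage too short to cut mid-sentence")
    cut = n_words
    # Step back while the last kept word ends a sentence, so the primer stops mid-sentence.
    while cut > 1 and words[cut - 1][-1] in ".!?":
        cut -= 1
    return f"{persona_token}{' '.join(words[:cut])}"


# Neutral placeholder passage, not corpus material.
passage = "Der erste Satz ist kurz. Der zweite Satz ist deutlich länger und geht weiter."
primer = make_primer(passage, "<|hitler|>", n_words=5)
print(primer)
```

The model is then asked to continue from the primer. Because the cut lands inside a sentence, the continuation task looks the same as the next-token prediction the model saw during pretraining.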

The Finding: SFT Made It Worse

The most counter-intuitive result of the project.

After pretraining, we applied supervised fine-tuning on a small set of curated Q→A pairs, expecting it to improve coherence and controllability. It did the opposite.

Direct comparison on two key ideological concepts:

Lebensraum

After SFT: one incomplete sentence, stops mid-thought.
Pretrain only: full doctrinal paragraph — soil, blood, political framing, internally consistent.

Volksgemeinschaft

After SFT: grammatically connected, semantically incoherent.
Pretrain only: coherent ideological definition matching period usage.

The explanation is simple in retrospect. We had too few Q→A pairs. The SFT data was too sparse and too noisy to improve the model — it disrupted the pretrain distribution without replacing it with anything better. At small scales with limited curated data, pretrain-only is not a fallback. It's the right choice.

Current Limitations

The model works. It doesn't work well enough yet.

The 7M-parameter scale limits coherence on complex topics. OCR noise surfaces as occasional incoherent continuations, because the model learned to reproduce the corrupted text it trained on. And there is a persistent geographic bias: fewer than 1% of corpus tokens relate to non-German contexts, so the model reliably pivots any primer about a non-German topic back toward German domestic politics.

These are data problems, not architecture problems. We're not scaling parameters until the corpus is cleaner.

What's Next

More data, better OCR pipeline, cleaner footnote removal.

The historian portal is in development. Researchers can apply for access — we will evaluate each request individually. No public API. No open weights.

The goal is not a product. It is a research instrument for people who need to understand how this rhetoric was constructed, not just what it said.