Meet Talkie-1930: A 13B Open-Weight LLM Trained on Pre-1931 English Text for Historical Reasoning and Generalization Research
What if a language model had never heard of the internet, smartphones, or even World War II? That’s not a hypothetical: it’s exactly what a team of researchers led by Nick Levine, David Duvenaud, and Alec Radford has built. They call it talkie, and it may be the most historically disciplined large language model ever released to the public.
Talkie is a 13-billion-parameter open-weight language model trained exclusively on pre-1931 English text. The project is developed by a non-profit team and introduces what the researchers call a “vintage language model”: an LM with a hard knowledge cutoff tied not to when it was trained, but to a specific moment in history.
What Exactly Is a Vintage Language Model?
To understand talkie, you first need to understand the idea behind it. Most modern LLMs, such as GPT-4, LLaMA, and Mistral, are trained on massive crawls of the contemporary web. Their knowledge reflects the world as it exists today, or as of their training cutoff date. A vintage language model flips this on its head: it is deliberately trained only on historical data so that its “worldview” is frozen at a particular point in the past.
For talkie, that cutoff is December 31, 1930, chosen precisely because that is the date by which works have entered the public domain in the United States, making pre-1931 text legally usable for training.
The model, formally named talkie-1930-13b-base, was trained on 260 billion tokens of historical pre-1931 English text, including books, newspapers, periodicals, scientific journals, patents, and case law. A separately post-trained conversational checkpoint, talkie-1930-13b-it, is also available for interactive use. The team has set up a 24/7 live demo at talkie-lm.com/chat where Claude Sonnet 4.6 continuously prompts the instruction-tuned model, allowing visitors to observe talkie’s voice and knowledge in real time.
Why a Model From 1930?
This isn’t a nostalgia project. The research team has identified several concrete, technically meaningful use cases that make talkie interesting to the AI research community.
1. Contamination-free generalization experiments: Benchmark contamination, where test data inadvertently leaks into training data, is one of the most persistent and underappreciated problems in modern LLM evaluation. Because talkie was trained only on pre-1931 text, it is contamination-free by construction with respect to any modern benchmark. This opens up a clean experimental setting for studying how well an LM can generalize beyond its pre-training data. For example, the team tested whether talkie could learn Python, a language that did not exist in 1930, by providing a few in-context demonstration examples. Using the HumanEval benchmark, they found that while vintage models dramatically underperform web-trained models, they are “slowly but steadily improving at this task with scale.”
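To make that setup concrete, here is a minimal sketch of how such a few-shot probe could be run against the base checkpoint with Hugging Face transformers. The repository id, demonstration snippets, and decoding settings are illustrative assumptions; the team’s actual HumanEval harness is not reproduced here.

```python
# Minimal sketch of a few-shot, HumanEval-style probe (illustrative assumptions;
# the talkie team's actual prompts and harness are not published in this article).
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "talkie-1930-13b-base"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

# In-context demonstrations of Python, a language the model has never seen in training.
FEW_SHOT = '''def add(a, b):
    """Return the sum of a and b."""
    return a + b

def is_even(n):
    """Return True if n is even."""
    return n % 2 == 0

'''

# One HumanEval-style problem: the model must complete the function body.
problem = '''def reverse_string(s):
    """Return the string s reversed."""
'''

inputs = tokenizer(FEW_SHOT + problem, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
# Keep only the newly generated tokens (the candidate function body).
completion = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(completion)
```

A full evaluation would sample many such completions per problem and score pass@k against the HumanEval unit tests.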
2. Evaluating forecasting and temporal surprise: Inspired by Calcifer Computing’s work on Temporal Language Models, the research team used talkie to measure the surprisingness (in bits per byte) of historical event descriptions from the New York Times‘s “On This Day” feature. Events after 1930, talkie’s knowledge cutoff, are consistently more surprising to the model, with the effect most pronounced for 1950s and 1960s events, followed by a plateau. This creates a principled setup for studying how forecasting ability scales with model size and how performance decays over longer temporal horizons.
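The bits-per-byte metric itself is straightforward to compute from a causal LM’s log-likelihood: total negative log-likelihood in nats, divided by ln 2 times the number of UTF-8 bytes. The sketch below illustrates that computation with transformers; the model id and event texts are assumptions, not the team’s evaluation code.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def bits_per_byte(model, tokenizer, text: str) -> float:
    """Surprisal of `text` in bits per UTF-8 byte under a causal LM."""
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # labels=input_ids yields the mean cross-entropy in nats per predicted token
        out = model(**enc, labels=enc["input_ids"])
    n_predicted = enc["input_ids"].shape[1] - 1      # the first token has no prediction
    total_nll_nats = out.loss.item() * n_predicted
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (math.log(2) * n_bytes)

# Usage (model id assumed; event texts paraphrased for illustration):
# model = AutoModelForCausalLM.from_pretrained("talkie-1930-13b-base", device_map="auto")
# tokenizer = AutoTokenizer.from_pretrained("talkie-1930-13b-base")
# pre_cutoff  = bits_per_byte(model, tokenizer, "1918: The armistice ending the Great War is signed.")
# post_cutoff = bits_per_byte(model, tokenizer, "1969: Apollo 11 lands the first humans on the Moon.")
# Post-1930 events should score consistently higher (more surprising).
```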
3. LLM identity and persona formation: Because talkie was trained on a fundamentally different distribution than any modern model, it opens up questions about what shapes an LLM’s “identity.” Modern LLMs, regardless of their provider, all share a common ancestor in web data, whether through direct training or through distillation and synthetic data pipelines. Talkie breaks that lineage entirely, giving researchers a tool to examine which behaviors and capabilities are universal to language modeling and which are artifacts of training on the contemporary web.
The Training Pipeline: What Makes This Hard
Building a vintage language model is not as simple as filtering a modern dataset by date. The talkie research team ran into several non-trivial engineering challenges.
Temporal leakage is the most significant. If any post-1930 text slips into the training corpus, through misdated documents or old texts with anachronistic editorial introductions, the model’s historical fidelity is compromised. An earlier 7B version of talkie clearly knew about the Roosevelt presidency and New Deal legislation, revealing imperfect filtering. The team built a document-level n-gram-based anachronism classifier to filter the corpus, but acknowledges this is still imperfect: the 13B version retains some awareness of World War II and the postwar order.
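The article does not spell out the classifier’s features, but the document-level idea can be sketched roughly as follows: flag any document containing n-grams that essentially never occur in pre-1931 English. The term list and threshold below are purely illustrative assumptions, not the team’s actual classifier.

```python
import re

# Illustrative blocklist of n-grams that essentially do not occur in pre-1931 English.
# The talkie team's real n-gram classifier is not published in this article.
ANACHRONISTIC_NGRAMS = {
    "world war ii", "the new deal", "united nations",
    "nuclear weapon", "cold war", "moon landing",
}

def is_anachronistic(document: str, threshold: int = 1) -> bool:
    """Flag a document if it contains at least `threshold` post-1930 n-grams."""
    text = re.sub(r"\s+", " ", document.lower())
    hits = sum(1 for ngram in ANACHRONISTIC_NGRAMS if ngram in text)
    return hits >= threshold

# corpus = [...]  # OCR-transcribed documents
# clean_corpus = [doc for doc in corpus if not is_anachronistic(doc)]
```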
Data quality is another major obstacle. Because there was no digital publishing in 1930, every token in talkie’s training corpus had to be transcribed from physical sources via optical character recognition (OCR). In controlled experiments, the team found that training on text transcribed by conventional OCR systems yielded only 30% of the learning efficiency of a model trained on human-transcribed versions of the same texts. Simple regex cleaning improved that to 70%, but a significant gap remained. To close it, they are building a dedicated vintage OCR system fine-tuned for historical document layouts.
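As an illustration of what “simple regex cleaning” can mean for OCR’d historical text, here is a hedged sketch of typical fixes: rejoining hyphenated line breaks, collapsing hard wraps, and stripping noise characters. These specific rules are assumptions, not the team’s published pipeline.

```python
import re

def clean_ocr_text(text: str) -> str:
    """Illustrative regex cleanup for common OCR artifacts in scanned pre-1931 text.
    These rules are assumptions about typical fixes, not the talkie team's actual pipeline."""
    # Rejoin words hyphenated across line breaks: "lan-\nguage" -> "language"
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse hard line breaks inside paragraphs while keeping blank-line paragraph breaks
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Normalize archaic long-s characters when the OCR engine preserves them
    text = text.replace("ſ", "s")
    # Strip stray characters OCR tends to hallucinate from page noise and rules
    text = re.sub(r"[|~^`]+", "", text)
    # Normalize runs of spaces and tabs
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()
```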
Vintage post-training, the instruction-tuning phase, required building an entirely new pipeline from scratch. Using modern instruction-response pairs would inject contemporary expectations into the model’s behavior. Instead, the team generated instruction-response pairs from structured historical texts: etiquette manuals, letter-writing manuals, cookbooks, dictionaries, encyclopedias, and poetry and fable collections. They then ran online direct preference optimization (DPO) using Claude Sonnet 4.6 as a judge, improving talkie’s average instruction-following score from 2.0 to 3.4 on a five-point scale. A final round of supervised fine-tuning used rejection-sampled multi-turn synthetic chats generated between Claude Opus 4.6 and talkie.
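The preference-optimization step relies on the standard DPO objective; the sketch below implements that loss over per-sequence log-probabilities, with the judge labeling and online sampling loop left out. The beta value is an illustrative default.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor, policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss (Rafailov et al., 2023).

    Each argument is a tensor of per-sequence log p(response | prompt): from the trainable
    policy and from a frozen reference model, for the judge-preferred ("chosen") and
    judge-rejected responses. In talkie's pipeline the preference labels come from
    Claude Sonnet 4.6 acting as judge; that labeling step is not shown here.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    return -F.logsigmoid(logits).mean()

# In an online loop: sample two responses from the current policy for each vintage
# instruction, have the judge pick a winner, then take a gradient step on dpo_loss.
```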
Benchmarks: How Does a 1930 Model Stack Up?
To provide meaningful context, the research team trained a “modern twin,” an architecturally identical 13B model trained on modern web data (FineWeb), and compared it against talkie. Unsurprisingly, talkie underperforms its modern counterpart on standard LM evaluations. However, when controlling for question anachronism (filtering out questions that reference concepts that would not exist in 1930), the performance gap roughly halves. The research team notes encouraging parity on core language understanding and numeracy tasks, and attributes the remaining gap primarily to OCR noise and subject-matter distribution differences.
Key Takeaways
- Talkie is a 13B open-weight “vintage language model” trained on 260 billion tokens of exclusively pre-1931 English text, making it the largest known vintage LM, with a hard knowledge cutoff of December 31, 1930.
- Benchmark contamination is eliminated by design. Because talkie has never seen modern data, it serves as a uniquely clean testbed for generalization experiments, including whether a model with no knowledge of digital computers can learn to write Python code from in-context examples alone.
- Building a vintage LM is harder than filtering by date. The research team had to solve temporal leakage (post-1930 data slipping in), OCR noise that reduced training efficiency to just 30% of human-transcribed text, and the construction of a post-training pipeline built entirely from pre-1931 sources like etiquette manuals and encyclopedias.
- Two checkpoints are publicly available under Apache 2.0: talkie-1930-13b-base for raw completions and talkie-1930-13b-it for conversation. Running them locally requires a CUDA GPU with at least 28 GB of VRAM (see the loading sketch after this list).
- Bigger models are coming. The research team is targeting a GPT-3-level vintage model by summer 2026, with a corpus they estimate can scale to over a trillion tokens, potentially enough to match the capability of the original ChatGPT, frozen in 1930.
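For readers who want to try the checkpoints locally, the snippet below is a minimal loading sketch. The Hugging Face Hub repo ids are assumptions inferred from the checkpoint names above, and half precision is used so the 13B weights (roughly 26 GB in fp16) fit within the stated 28 GB VRAM budget.

```python
# Minimal local-inference sketch; the Hub repo ids below are assumptions based on the
# checkpoint names in this article, not confirmed paths.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "talkie-1930-13b-it"  # or "talkie-1930-13b-base" for raw completions

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,   # ~26 GB of weights for 13B parameters
    device_map="auto",           # requires a CUDA GPU with at least 28 GB VRAM
)

prompt = "Pray, what is the proper form for a letter of introduction?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.8)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```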
Check out the Model Weights, Repo, and Technical details.
