
Meet OpenMythos: An Open-Source PyTorch Reconstruction of Claude Mythos Where 770M Parameters Match a 1.3B Transformer

Anthropic has never published a technical paper on Claude Mythos. That has not stopped the research community from theorizing. A new open-source project called OpenMythos, released on GitHub by Kye Gomez, attempts something ambitious: a first-principles theoretical reconstruction of what the Claude Mythos architecture might actually be, built entirely in PyTorch and grounded in peer-reviewed research.

The project is not a leaked model, a fine-tune, or a distillation. It is a hypothesis rendered in code, and the hypothesis is specific enough to be falsifiable, which is what makes it interesting.

The Main Claim: Claude Mythos Is a Recurrent-Depth Transformer

OpenMythos proposes that Claude Mythos belongs to a class of architectures called Recurrent-Depth Transformers (RDTs), also referred to in the literature as Looped Transformers. The concept is meaningfully different from standard transformer stacks.

In a conventional transformer (GPT, LLaMA, Mistral), the model passes input through a sequence of unique layers, one after another, each with its own independent weights. More capability generally means more layers and more parameters. In a Recurrent-Depth Transformer, a fixed set of weights is applied iteratively across T loop steps within a single forward pass. The same weights run multiple times. Reasoning depth is not a function of how many parameters are stored, but of how many iterations are run at inference time.
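The contrast can be sketched in a few lines of PyTorch. This is purely illustrative (toy dimensions, a plain encoder layer standing in for the real block), not the OpenMythos code:

```python
import torch
import torch.nn as nn

def count_params(m):
    return sum(p.numel() for p in m.parameters())

d_model = 256

# Standard transformer: 16 distinct layers, 16 separate copies of the weights.
stacked = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
    for _ in range(16)
)

# Recurrent-depth: one shared block applied 16 times inside the forward pass.
shared = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)

x = torch.randn(2, 8, d_model)
h = x
for _ in range(16):  # depth comes from the iteration count, not stored layers
    h = shared(h)

print(count_params(stacked) / count_params(shared))  # 16.0: same depth, 1/16 the weights
```

Both paths apply 16 layers' worth of computation; only the stacked version pays for 16 layers' worth of parameters.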

Think of it less like reading a book and more like refining a draft: the model returns to the same computational block repeatedly, improving its internal representation with each pass.

How the Architecture is Structured

OpenMythos instantiates this as a three-part structure: Prelude → Recurrent Block → Coda. The Prelude and Coda are standard transformer layers that run exactly once. The Recurrent Block is the computational core, looped up to T=16 times.
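A minimal sketch of that three-part layout, under assumed module names and sizes (the actual OpenMythos classes may differ, and the re-injection here is simplified to an additive skip):

```python
import torch
import torch.nn as nn

class RecurrentDepthModel(nn.Module):
    """Hypothetical Prelude -> Recurrent Block -> Coda skeleton."""
    def __init__(self, d_model=256, nhead=4, T=16):
        super().__init__()
        self.T = T
        self.prelude = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.core = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.coda = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)

    def forward(self, x, T=None):
        e = self.prelude(x)            # runs exactly once
        h = torch.zeros_like(e)
        for _ in range(T or self.T):   # the same core weights, looped up to T times
            h = self.core(h + e)       # simplified re-injection of the encoding
        return self.coda(h)            # runs exactly once

model = RecurrentDepthModel()
y = model(torch.randn(2, 8, 256))
```

Note that `T` is a forward-pass argument, not a fixed architectural constant: the loop count can be chosen at inference time.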

At each loop step t, the hidden state is updated using the following rule:

h_{t+1} = A·h_t + B·e + Transformer(h_t, e)

Here, h_t is the hidden state after loop iteration t, and e is the encoded input from the Prelude, re-injected at every step. The re-injection is deliberate: without it, the hidden state would drift away from the original input signal across deep loops. The learned matrices A and B govern how much of the previous hidden state and of the encoded input carry forward at each step.
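The update rule translates directly into code. The sketch below is an assumption-laden toy (the `Transformer(h, e)` conditioning is approximated with an additive combination, and A and B are plain linear maps), not the repository's implementation:

```python
import torch
import torch.nn as nn

class RecurrentStep(nn.Module):
    """One loop step: h_{t+1} = A*h_t + B*e + Transformer(h_t, e)."""
    def __init__(self, d_model=256, nhead=4):
        super().__init__()
        self.A = nn.Linear(d_model, d_model, bias=False)  # carries the hidden state forward
        self.B = nn.Linear(d_model, d_model, bias=False)  # re-injects the encoded input
        self.block = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)

    def forward(self, h, e):
        # Condition the block on both the current state and the Prelude encoding.
        return self.A(h) + self.B(e) + self.block(h + e)

step = RecurrentStep()
e = torch.randn(2, 8, 256)   # encoded input from the Prelude
h = torch.zeros_like(e)
for t in range(16):          # T = 16 loop iterations
    h = step(h, e)           # e is re-injected at every step
```

Because `e` enters every iteration through B and the block input, the state stays anchored to the original signal even at depth 16.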

The FFN inside the Recurrent Block is not a standard feedforward layer. OpenMythos replaces it with a Mixture-of-Experts (MoE) layer following the design introduced in DeepSeekMoE: a large pool of fine-grained routed experts, with only a sparse top-K subset activated per token, alongside a small set of always-active shared experts that absorb common cross-domain patterns. Crucially, the router selects distinct expert subsets at each loop depth, meaning each iteration is computationally distinct despite sharing the same base weights. MoE provides domain breadth; looping provides reasoning depth.
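One way to make routing depth-dependent is to feed a depth embedding into the router, as in this hedged sketch (the mechanism, names, and sizes are illustrative guesses; the dense loop over experts is for clarity, where a real implementation dispatches only to the selected ones):

```python
import torch
import torch.nn as nn

def ffn(d):
    return nn.Sequential(nn.Linear(d, 2 * d), nn.GELU(), nn.Linear(2 * d, d))

class DepthAwareMoE(nn.Module):
    """DeepSeekMoE-style FFN sketch: fine-grained routed experts plus
    always-active shared experts, with routing conditioned on loop depth."""
    def __init__(self, d=128, n_experts=8, top_k=2, n_shared=1, max_depth=16):
        super().__init__()
        self.top_k = top_k
        self.experts = nn.ModuleList(ffn(d) for _ in range(n_experts))
        self.shared = nn.ModuleList(ffn(d) for _ in range(n_shared))
        self.depth_emb = nn.Embedding(max_depth, d)
        self.router = nn.Linear(d, n_experts)

    def forward(self, x, depth):
        # Routing logits see both the token and the loop depth, so different
        # iterations can activate different expert subsets.
        logits = self.router(x + self.depth_emb(torch.tensor(depth)))
        weights, idx = logits.softmax(-1).topk(self.top_k, dim=-1)
        out = sum(s(x) for s in self.shared)   # shared experts: always on
        for k in range(self.top_k):            # sparse routed experts
            for j, expert in enumerate(self.experts):
                mask = (idx[..., k] == j).unsqueeze(-1)
                out = out + mask * weights[..., k : k + 1] * expert(x)
        return out

moe = DepthAwareMoE()
x = torch.randn(2, 4, 128)
y0, y5 = moe(x, depth=0), moe(x, depth=5)  # routing differs by depth
```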

Attention defaults to Multi-Latent Attention from DeepSeek-V2, which caches a compressed low-rank KV latent rather than full key/value tensors, yielding a 10–20× reduction in KV memory at production scale.
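The caching idea can be shown in isolation: compress each token's representation into one small latent, cache that, and reconstruct keys and values from it on demand. Dimensions below are illustrative assumptions, not DeepSeek-V2's actual sizes:

```python
import torch
import torch.nn as nn

class LatentKVCache(nn.Module):
    """Multi-Latent Attention caching sketch: store one low-rank latent per
    token instead of full K and V tensors."""
    def __init__(self, d_model=512, d_latent=64):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)  # compress
        self.up_k = nn.Linear(d_latent, d_model, bias=False)  # reconstruct keys
        self.up_v = nn.Linear(d_latent, d_model, bias=False)  # reconstruct values

    def compress(self, x):
        return self.down(x)   # this small latent is what gets cached

    def expand(self, c):
        return self.up_k(c), self.up_v(c)

cache = LatentKVCache()
x = torch.randn(1, 128, 512)
c = cache.compress(x)         # cache c instead of full K and V
k, v = cache.expand(c)

# Per-token memory ratio of full K+V versus the cached latent:
print((2 * 512) / 64)         # 16.0x smaller KV cache in this toy configuration
```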

Reasoning in Continuous Latent Space

One of the most important properties of this architecture is that reasoning happens entirely in continuous latent space. There is no intermediate token emission between loop steps; the model does not produce text mid-thought and then re-read it. This is structurally distinct from chain-of-thought prompting, where reasoning is externalized as token sequences, and has been formally analyzed in both Saunshi et al. (2025) and COCONUT (2024).

Saunshi et al. (2025) formally show that each loop iteration in an RDT is functionally equivalent to one step of chain-of-thought, but operating over real-valued vectors rather than discrete tokens. Continuous latent thoughts can also encode multiple alternative next steps simultaneously, enabling something closer to breadth-first search over the reasoning space within a single forward pass.

This also explains a concrete capability advantage. A standard transformer trained on 5-hop reasoning chains fails when tested on 10-hop chains at inference time; it has no mechanism to extend its depth beyond what it saw during training. A Recurrent-Depth Transformer handles this naturally: running more inference-time loops extends the reasoning chain without any retraining. Harder problems receive more compute; simpler ones exit early.
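Depth extrapolation falls out of weight-tying almost for free, as this toy sketch shows (dimensions, the zero-initialized state, and the additive re-injection are illustrative assumptions):

```python
import torch
import torch.nn as nn

# The same weight-tied block can be unrolled for more steps at inference
# than were ever used in training.
block = nn.TransformerEncoderLayer(64, nhead=4, batch_first=True)

def run(x, n_loops):
    h = torch.zeros_like(x)
    for _ in range(n_loops):
        h = block(h + x)   # re-inject the input each iteration
    return h

x = torch.randn(2, 8, 64)
h_train = run(x, n_loops=5)   # depth seen during training
h_deep = run(x, n_loops=10)   # extended reasoning depth, no retraining
```

A stacked transformer has no analogous knob: its depth is frozen into the parameter file.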

Solving the Stability Problem

Training looped models has historically been brittle. The hidden state h_t can grow unboundedly across iterations, a failure mode known as residual explosion. OpenMythos addresses this using a Linear Time-Invariant (LTI) injection constraint borrowed from the Parcae architecture (Prairie et al., 2026): the spectral radius of A, denoted ρ(A), is enforced to be less than 1 by construction, guaranteeing stability regardless of learning rate or gradient noise.
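One simple way to enforce ρ(A) < 1 by construction is to rescale the raw weight by its largest singular value and a fixed γ < 1, since the spectral radius is bounded by the spectral norm. This is a sketch of the stability idea under that assumption, not the Parcae implementation:

```python
import torch
import torch.nn as nn

class ContractiveCarry(nn.Module):
    """Carry matrix A with spectral norm (hence spectral radius) <= gamma < 1,
    enforced by rescaling rather than by regularization."""
    def __init__(self, d, gamma=0.99):
        super().__init__()
        self.raw = nn.Parameter(torch.randn(d, d))
        self.gamma = gamma

    def matrix(self):
        sigma_max = torch.linalg.matrix_norm(self.raw, ord=2)  # largest singular value
        return self.gamma * self.raw / sigma_max               # spectral norm <= gamma

    def forward(self, h):
        return h @ self.matrix().T

carry = ContractiveCarry(32)
h = torch.randn(4, 32)
for _ in range(100):   # repeated application cannot explode
    h = carry(h)
# Without input re-injection the state contracts toward zero instead of diverging.
```

Because the bound holds by construction, no learning rate or gradient spike can push A back into the unstable regime.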

A second failure mode exists at the other extreme: beyond a certain loop depth, excessive recurrence degrades predictions, with the hidden state drifting past the solution and into noise. This is the 'overthinking' problem. Adaptive Computation Time (ACT) halting addresses it with a learned scalar per position that dynamically decides when to stop looping. Positions that are harder to process receive more computation; tokens that have already converged halt early.
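A simplified ACT loop looks like the following. This sketch omits the full ACT machinery (the real mechanism also weights intermediate states by halting mass and adds a ponder cost to the loss); the threshold and sizes are assumptions:

```python
import torch
import torch.nn as nn

class ACTLoop(nn.Module):
    """Per-position halting: a learned scalar accumulates each iteration, and
    a position stops looping once its cumulative halting probability is high."""
    def __init__(self, d=64, max_steps=16, threshold=0.99):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.halt = nn.Linear(d, 1)
        self.max_steps, self.threshold = max_steps, threshold

    def forward(self, e):
        h = torch.zeros_like(e)
        cum_halt = torch.zeros(e.shape[:2])   # accumulated halting prob per position
        steps = torch.zeros(e.shape[:2])      # how many loops each position ran
        for _ in range(self.max_steps):
            active = cum_halt < self.threshold   # positions still computing
            if not active.any():
                break
            h = torch.where(active.unsqueeze(-1), self.block(h + e), h)
            cum_halt = cum_halt + torch.sigmoid(self.halt(h)).squeeze(-1) * active
            steps = steps + active
        return h, steps

loop = ACTLoop()
h, steps = loop(torch.randn(2, 8, 64))   # steps varies per position
```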

Finally, Depth-Wise LoRA adapters introduce a small rank-r adaptation matrix at each iteration depth, giving each loop step slightly distinct behavior without adding substantial parameters, bridging the gap between pure weight-tying and fully distinct layers.
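Depth-wise LoRA amounts to a shared linear layer plus one rank-r correction per loop depth. The rank, sizes, and zero-initialization below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DepthWiseLoRA(nn.Module):
    """One shared linear layer plus a rank-r adapter per loop depth, so each
    iteration behaves slightly differently while nearly all weights stay tied."""
    def __init__(self, d=256, rank=4, max_depth=16):
        super().__init__()
        self.shared = nn.Linear(d, d)
        self.down = nn.Parameter(torch.randn(max_depth, d, rank) * 0.01)
        self.up = nn.Parameter(torch.zeros(max_depth, rank, d))  # zero-init: adapters start as a no-op

    def forward(self, x, depth):
        delta = (x @ self.down[depth]) @ self.up[depth]  # rank-r correction for this depth
        return self.shared(x) + delta

layer = DepthWiseLoRA()
x = torch.randn(2, 256)
y = layer(x, depth=3)

per_depth = 256 * 4 + 4 * 256   # adapter params added per loop depth
base = 256 * 256 + 256          # shared weight + bias
print(per_depth / base)         # ~0.03: a few percent of the shared layer per depth
```

The zero-initialized `up` matrices mean every depth starts identical to the shared layer; training then lets each depth drift only as far as a rank-r update allows.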

Why Parameter Efficiency Matters

The Parcae paper (Prairie et al., 2026) provides empirical grounding for the efficiency claim. At 770M parameters, an RDT matches a 1.3B standard transformer trained on identical data: roughly half the parameters for equal downstream quality. Optimal recurrence and optimal token count both follow power laws with consistent exponents across scales, establishing the first predictable scaling laws for looped training.

The implication is significant: reasoning depth scales with inference-time compute, not stored parameter count. This reframes one of the dominant assumptions in the scaling debate. The relevant axis is not parameter count at training, but loop depth at inference.

What OpenMythos Contributes

OpenMythos delivers four concrete research artifacts: a fully configurable PyTorch implementation of the RDT hypothesis with MoE FFN and Multi-Latent Attention; LTI-stable recurrent injection integrated as a first-class training primitive; depth-wise LoRA adapters enabling per-iteration behavioral differentiation; and a reproducible research baseline for studying looped transformer dynamics and inference-time reasoning depth.

Whether or not Mythos is actually an RDT, OpenMythos gives the research community something concrete and runnable: an implementation of an architecture class the literature increasingly suggests is underexplored, and one that may represent a fundamentally different path to capable AI than simply training bigger models.


Check out the full code and notebook here.


The post Meet OpenMythos: An Open-Source PyTorch Reconstruction of Claude Mythos Where 770M Parameters Match a 1.3B Transformer appeared first on MarkTechPost.
