A Coding Tutorial on OpenMythos: Recurrent-Depth Transformers with Depth Extrapolation, Adaptive Computation, and Mixture-of-Experts Routing
In this tutorial, we explore the implementation of OpenMythos, a conceptual reconstruction of the Claude Mythos architecture that enables deeper reasoning through iterative computation rather than increased parameter count. We build and analyze models using both GQA and MLA attention mechanisms, examine memory efficiency through KV-cache comparisons, and validate stability via the spectral properties of the recurrent update. We then train the model on a structured parity task and study how increasing loop depth at inference improves performance without retraining. Along the way, we also explore adaptive computation via ACT halting and track expert utilization in the MoE layers, providing a comprehensive, hands-on understanding of this emerging architecture.
import subprocess, sys
try:
    import open_mythos  # noqa: F401
except ImportError:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q",
                           "open-mythos"])

import math, time, copy
from collections import Counter, defaultdict
import numpy as np
import torch, torch.nn as nn, torch.nn.functional as F
import matplotlib.pyplot as plt
from open_mythos.main import (
    OpenMythos, MythosConfig,
    ACTHalting, MoEFFN,
)

torch.manual_seed(0); np.random.seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"▸ device = {device} | torch = {torch.__version__}")
def make_config(attn_type: str, *, dim=128, n_heads=4, n_experts=4,
                max_loops=8, seq_len=128, vocab=256):
    base = dict(
        vocab_size=vocab, dim=dim, n_heads=n_heads,
        max_seq_len=seq_len, max_loop_iters=max_loops,
        prelude_layers=1, coda_layers=1,
        n_experts=n_experts, n_shared_experts=1,
        n_experts_per_tok=2, expert_dim=dim // 2,
        lora_rank=8, attn_type=attn_type,
    )
    if attn_type == "gqa":
        return MythosConfig(**base, n_kv_heads=2)
    return MythosConfig(
        **base, n_kv_heads=n_heads,
        kv_lora_rank=32, q_lora_rank=64,
        qk_rope_head_dim=16, qk_nope_head_dim=16, v_head_dim=16,
    )

cfg_gqa = make_config("gqa")
cfg_mla = make_config("mla")
m_gqa = OpenMythos(cfg_gqa).to(device)
m_mla = OpenMythos(cfg_mla).to(device)

print("\n─── Part 1 ─ model sizes ──────────────────────────────")
print(f"GQA params : {sum(p.numel() for p in m_gqa.parameters()):>10,}")
print(f"MLA params : {sum(p.numel() for p in m_mla.parameters()):>10,}")
We install and import all the required dependencies and initialize the environment for running OpenMythos. We construct configurations for both the GQA and MLA attention mechanisms and instantiate their respective models. We also compare their parameter counts to see how the architectural differences affect model scale.
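To build intuition for the trade-off these configurations encode, here is a generic back-of-the-envelope calculation of per-token, per-layer KV-cache size for standard multi-head attention, GQA, and an MLA-style compressed latent cache. The formulas are illustrative and not tied to OpenMythos internals; fp16 storage (2 bytes per element) is assumed.

```python
# Per-token, per-layer KV-cache bytes (fp16 = 2 bytes per element).
# Generic attention arithmetic, not OpenMythos-specific.
def kv_bytes_mha(n_heads, head_dim, bytes_per=2):
    return 2 * n_heads * head_dim * bytes_per        # full K and V for every head

def kv_bytes_gqa(n_kv_heads, head_dim, bytes_per=2):
    return 2 * n_kv_heads * head_dim * bytes_per     # K and V for shared KV heads only

def kv_bytes_mla(kv_lora_rank, rope_dim, bytes_per=2):
    return (kv_lora_rank + rope_dim) * bytes_per     # one compressed latent + RoPE key part

dim, n_heads, n_kv_heads = 128, 4, 2                 # matches make_config's defaults
head_dim = dim // n_heads                            # 32
print(kv_bytes_mha(n_heads, head_dim))               # 512 bytes/token
print(kv_bytes_gqa(n_kv_heads, head_dim))            # 256 bytes/token
print(kv_bytes_mla(kv_lora_rank=32, rope_dim=16))    # 96 bytes/token
```

With these numbers, halving the KV heads halves the cache, while the MLA-style latent compresses it by more than 5× relative to full MHA, which is the effect we measure empirically in the next section.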
def cache_bytes(kv: dict) -> int:
    total = 0
    for entry in kv.values():
        for t in entry.values():
            total += t.element_size() * t.numel()
    return total

x = torch.randint(0, 256, (1, 64), device=device)
ck_gqa, ck_mla = {}, {}
with torch.no_grad():
    m_gqa(x, n_loops=4, kv_cache=ck_gqa)
    m_mla(x, n_loops=4, kv_cache=ck_mla)
gqa_kb = cache_bytes(ck_gqa) / 1024
mla_kb = cache_bytes(ck_mla) / 1024
print("\n─── Part 2 ─ KV-cache footprint (1×64 tokens, 4 loops) ─")
print(f"GQA cache : {gqa_kb:6.2f} KB ({len(ck_gqa)} layer-keys)")
print(f"MLA cache : {mla_kb:6.2f} KB ({len(ck_mla)} layer-keys)")
print(f"ratio     : MLA is ≈{gqa_kb / max(mla_kb, 1e-9):.2f}× smaller")
def show_stability(model, tag):
    A = model.recurrent.injection.get_A()
    print(f"{tag:3s} ρ(A): min={A.min():.4f} max={A.max():.4f} "
          f"mean={A.mean():.4f} stable={bool((A < 1).all() and (A > 0).all())}")

print("\n─── Part 3 ─ spectral radius at init ──────────────────")
show_stability(m_gqa, "GQA")
show_stability(m_mla, "MLA")

opt = torch.optim.Adam(m_mla.parameters(), lr=1.0)
for _ in range(30):
    loss = m_mla(torch.randint(0, 256, (2, 16), device=device),
                 n_loops=2).square().mean()
    opt.zero_grad(); loss.backward(); opt.step()
show_stability(m_mla, "MLA after abusive training (lr=1.0, 30 steps)")
We compute and compare the KV-cache memory footprint of the GQA and MLA attention variants during forward passes. We then inspect the stability of the recurrent component by checking the spectral radius of the injection matrix A, and we stress-test the model under extreme training conditions to confirm that stability is preserved.
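The intuition behind the ρ(A) < 1 check can be sketched with a scalar recurrence: when the gate value is below 1, repeated application of h ← a·h + u converges toward the fixed point u / (1 − a); above 1, the state diverges geometrically. This is a toy illustration of the stability criterion, not OpenMythos's actual injection update.

```python
# Toy scalar analogue of the recurrent injection: h <- a*h + u.
# |a| < 1 keeps the iterated state bounded; |a| > 1 blows it up.
def iterate(a, u=1.0, steps=50):
    h = 0.0
    for _ in range(steps):
        h = a * h + u
    return h

print(iterate(0.9))   # converges toward u / (1 - a) = 10
print(iterate(1.1))   # diverges: magnitude grows geometrically with steps
```

This is why keeping every entry of A strictly inside (0, 1) lets us run far more loops at inference than were used during training without the hidden state exploding.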
VOCAB = 64
SEQ_LEN = 24

def make_batch(batch=64, seq_len=SEQ_LEN):
    x = torch.randint(1, 3, (batch, seq_len), device=device)
    bits = x - 1
    parity = bits.cumsum(dim=1) % 2
    y = parity + 1
    return x, y

cfg = MythosConfig(
    vocab_size=VOCAB, dim=64, n_heads=4, n_kv_heads=2,
    max_seq_len=SEQ_LEN + 4, max_loop_iters=16,
    prelude_layers=1, coda_layers=1,
    n_experts=4, n_shared_experts=1, n_experts_per_tok=2,
    expert_dim=32, lora_rank=4, attn_type="gqa",
    act_threshold=0.99,
)
model = OpenMythos(cfg).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
T_TRAIN = 3

print("\n─── Part 5 ─ training (T_train = 3) ───────────────────")
print(f"params: {sum(p.numel() for p in model.parameters()):,}")
losses = []
t0 = time.time()
for step in range(600):
    x, y = make_batch(64)
    logits = model(x, n_loops=T_TRAIN)
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), y.reshape(-1))
    opt.zero_grad(); loss.backward()
    opt.step()
    losses.append(loss.item())
    if step % 100 == 0 or step == 599:
        with torch.no_grad():
            acc = (logits.argmax(-1) == y).float().mean().item()
        print(f"step {step:3d}  loss={loss.item():.4f}  acc@T3={acc:.3f}")
print(f"training wallclock: {time.time() - t0:.1f}s")
We define a cumulative parity task to train our model on a structured sequential problem. We initialize the OpenMythos model with a fixed loop depth and train it with cross-entropy loss, tracking loss and accuracy to gauge how well the model learns under constrained depth.
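The label construction in make_batch can be sanity-checked in plain Python: each position's target is the running XOR of the bits seen so far, shifted into {1, 2} to match the token range (a standalone check mirroring the tensor code above).

```python
# Reference implementation of the cumulative parity labels produced by
# make_batch: tokens are drawn from {1, 2}, mapped to bits {0, 1}, and the
# target at position i is the running XOR of bits 0..i, shifted back by +1.
def parity_targets(tokens):
    out, running = [], 0
    for t in tokens:
        running ^= (t - 1)       # map {1, 2} -> {0, 1} and accumulate parity
        out.append(running + 1)  # map parity {0, 1} -> target tokens {1, 2}
    return out

print(parity_targets([1, 2, 1, 1, 2, 2, 1, 2]))
# → [1, 2, 2, 2, 1, 2, 2, 1]
```

Note that the input here is the same pattern used as the generation prompt in Part 9, so these targets also tell us what a perfectly trained model should continue with.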
model.eval()
T_sweep = [1, 2, 3, 4, 6, 8, 10, 12, 14, 16]
accs = []
with torch.no_grad():
    x_eval, y_eval = make_batch(512)
    for T in T_sweep:
        logits = model(x_eval, n_loops=T)
        accs.append((logits.argmax(-1) == y_eval).float().mean().item())

print("\n─── Part 6 ─ depth extrapolation (T_train=3) ──────────")
for T, a in zip(T_sweep, accs):
    bar = "█" * int(a * 40)
    marker = "  ← trained here" if T == T_TRAIN else ""
    print(f"T={T:2d}  acc={a:.3f}  {bar}{marker}")

halt_trace: list[torch.Tensor] = []
orig_halt = model.recurrent.act.forward

def halt_hook(self, h):
    p = orig_halt(h)
    halt_trace.append(p.detach().cpu())
    return p

model.recurrent.act.forward = halt_hook.__get__(model.recurrent.act, ACTHalting)
with torch.no_grad():
    x_h, _ = make_batch(1)
    _ = model(x_h, n_loops=16)
model.recurrent.act.forward = orig_halt
halts = torch.stack(halt_trace, dim=0)[:, 0].numpy()
print("\n─── Part 7 ─ ACT halting matrix (loops × positions) ───")
print(f"shape: {halts.shape} | "
      f"mean halt-prob per loop: "
      f"{', '.join(f'{v:.2f}' for v in halts.mean(1))}")
We evaluate the trained model by varying the number of inference loops to test depth extrapolation, observing how increasing loop depth improves accuracy without any retraining. We also instrument the ACT mechanism to capture halting probabilities at every sequence position and iteration.
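Conceptually, ACT halting accumulates a per-position halting probability on each loop and stops updating a position once the running sum crosses the threshold. The sketch below is a simplified scalar version of that rule (full ACT also distributes a remainder at the final step, which we omit here):

```python
# Simplified ACT halting rule for a single position: count how many loop
# iterations run before the accumulated halt probability reaches threshold.
def act_halt_loop(halt_probs, threshold=0.99):
    cum, used_loops = 0.0, 0
    for p in halt_probs:
        used_loops += 1
        cum += p
        if cum >= threshold:   # position halts; later loops skip it
            break
    return used_loops

print(act_halt_loop([0.2, 0.3, 0.4, 0.5, 0.6]))  # halts on loop 4 (0.2+0.3+0.4+0.5 ≥ 0.99)
```

With act_threshold=0.99 as in our config, positions that the model finds easy emit high halt probabilities early and consume few loops, which is exactly the pattern visible in the halting matrix we just printed.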
expert_hits = Counter()
orig_moe = model.recurrent.block.ffn.forward

def moe_hook(self, x):
    flat = x.view(-1, x.shape[-1])
    logits = self.router(flat) + self.router_bias
    scores = F.softmax(logits, dim=-1)
    _, idx = scores.topk(self.topk, dim=-1)
    for e in idx.flatten().tolist():
        expert_hits[e] += 1
    return orig_moe(x)

model.recurrent.block.ffn.forward = moe_hook.__get__(
    model.recurrent.block.ffn, MoEFFN)
with torch.no_grad():
    x_m, _ = make_batch(32)
    _ = model(x_m, n_loops=T_TRAIN)
model.recurrent.block.ffn.forward = orig_moe

print("\n─── Part 8 ─ MoE expert utilization ───────────────────")
total = sum(expert_hits.values())
for eid in range(cfg.n_experts):
    share = expert_hits.get(eid, 0) / max(total, 1)
    print(f"expert {eid}: {share*100:5.2f}% of topk slots")
prompt = torch.tensor([[1, 2, 1, 1, 2, 2, 1, 2]], device=device)
print("\n─── Part 9 ─ generation ───────────────────────────────")
print(f"prompt (parity pattern): {prompt.tolist()[0]}")
for T_gen in [1, 4, 12]:
    with torch.no_grad():
        out = model.generate(prompt, max_new_tokens=8,
                             n_loops=T_gen, temperature=0.1, top_k=2)
    print(f"T_gen={T_gen:2d} → {out.tolist()[0]}")

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].plot(losses)
axes[0].set_title("Training loss (parity task)")
axes[0].set_xlabel("step"); axes[0].set_ylabel("cross-entropy")
axes[0].grid(alpha=0.3)
axes[1].plot(T_sweep, accs, "o-", linewidth=2, markersize=8)
axes[1].axvline(T_TRAIN, color="red", linestyle="--",
                label=f"T_train = {T_TRAIN}")
axes[1].set_title("Depth extrapolation: accuracy vs inference loops")
axes[1].set_xlabel("n_loops at inference"); axes[1].set_ylabel("accuracy")
axes[1].legend(); axes[1].grid(alpha=0.3); axes[1].set_ylim(0, 1.05)
im = axes[2].imshow(halts, aspect="auto", cmap="viridis",
                    vmin=0, vmax=halts.max())
axes[2].set_title("ACT halting probability\n(loop t × position)")
axes[2].set_xlabel("position"); axes[2].set_ylabel("loop iteration t")
plt.colorbar(im, ax=axes[2], fraction=0.046, pad=0.04)
plt.tight_layout()
plt.savefig("openmythos_tutorial.png", dpi=120, bbox_inches="tight")
plt.show()
We analyze expert utilization in the MoE layer by tracking how tokens are routed across experts. We then generate sequences at different loop depths to observe their effect on the outputs. Finally, we visualize the training loss, depth-extrapolation performance, and ACT halting behavior in a set of plots.
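The tally performed by moe_hook boils down to top-k expert selection per token. Here is a standalone pure-Python sketch of that routing step with illustrative hand-picked scores (no real router weights involved):

```python
# Schematic top-k MoE routing: every token picks its k highest-scoring
# experts, and we count how often each expert's slot is used.
from collections import Counter

def route_topk(scores_per_token, k=2):
    hits = Counter()
    for scores in scores_per_token:
        # indices of the k largest scores (ties broken by lower index)
        topk = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
        for e in topk:
            hits[e] += 1
    return hits

scores = [[0.1, 0.7, 0.2, 0.0],   # token 0 -> experts 1, 2
          [0.5, 0.1, 0.3, 0.1],   # token 1 -> experts 0, 2
          [0.2, 0.2, 0.5, 0.1]]   # token 2 -> experts 2, 0
print(route_topk(scores))         # expert 2 claims the most slots
```

A healthy router spreads these counts roughly evenly; a heavily skewed tally like the one above is the load-imbalance signature that Part 8's per-expert share printout is meant to surface.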
In conclusion, we demonstrated that OpenMythos effectively leverages looped computation to achieve depth extrapolation, letting the model improve accuracy simply by increasing the number of inference-time loops. We saw that the recurrent mechanism stays stable even under extreme training conditions, and that MLA attention substantially reduces KV-cache memory usage compared to GQA. We also observed how ACT enables dynamic computation across sequence positions and how MoE routing distributes the workload across experts. Overall, this architecture offers a compelling path toward compute-adaptive reasoning, where we trade extra inference compute for better performance without modifying the model's parameters.
The post A Coding Tutorial on OpenMythos: Recurrent-Depth Transformers with Depth Extrapolation, Adaptive Computation, and Mixture-of-Experts Routing appeared first on MarkTechPost.
