A Coding Tutorial on OpenMythos: Recurrent-Depth Transformers with Depth Extrapolation, Adaptive Computation, and Mixture-of-Experts Routing
In this tutorial, we explore the implementation of OpenMythos, a conceptual reconstruction of the Claude Mythos architecture that enables deeper reasoning through iterative computation rather than increased parameter count. We build and analyze models using both GQA and MLA attention mechanisms, examine memory efficiency through KV-cache comparisons, and validate stability via the spectral properties of the recurrent update. We then train the model on a structured parity task and study how increasing loop depth at inference improves performance without retraining. Along the way, we also explore adaptive computation via ACT halting and track expert utilization in the MoE layers, providing a comprehensive, hands-on understanding of this emerging architecture.
import subprocess, sys
try:
    import open_mythos  # noqa: F401
except ImportError:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q",
                           "open-mythos"])

import math, time, copy
from collections import Counter, defaultdict
import numpy as np
import torch, torch.nn as nn, torch.nn.functional as F
import matplotlib.pyplot as plt
from open_mythos.main import (
    OpenMythos, MythosConfig,
    ACTHalting, MoEFFN,
)

torch.manual_seed(0); np.random.seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"▸ device = {device} | torch = {torch.__version__}")
def make_config(attn_type: str, *, dim=128, n_heads=4, n_experts=4,
                max_loops=8, seq_len=128, vocab=256):
    base = dict(
        vocab_size=vocab, dim=dim, n_heads=n_heads,
        max_seq_len=seq_len, max_loop_iters=max_loops,
        prelude_layers=1, coda_layers=1,
        n_experts=n_experts, n_shared_experts=1,
        n_experts_per_tok=2, expert_dim=dim // 2,
        lora_rank=8, attn_type=attn_type,
    )
    if attn_type == "gqa":
        return MythosConfig(**base, n_kv_heads=2)
    return MythosConfig(
        **base, n_kv_heads=n_heads,
        kv_lora_rank=32, q_lora_rank=64,
        qk_rope_head_dim=16, qk_nope_head_dim=16, v_head_dim=16,
    )

cfg_gqa = make_config("gqa")
cfg_mla = make_config("mla")
m_gqa = OpenMythos(cfg_gqa).to(device)
m_mla = OpenMythos(cfg_mla).to(device)

print("\n─── Part 1 ─ model sizes ──────────────────────────────")
print(f"GQA params : {sum(p.numel() for p in m_gqa.parameters()):>10,}")
print(f"MLA params : {sum(p.numel() for p in m_mla.parameters()):>10,}")
We install and import all the required dependencies and initialize the environment for running OpenMythos. We construct configurations for both the GQA and MLA attention mechanisms and instantiate their respective models. We also compare their parameter counts to see how the architectural differences affect model scale.
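To build intuition for the trade-off these configurations encode, here is a generic back-of-the-envelope calculation of per-token, per-layer KV-cache size for standard multi-head attention, GQA, and an MLA-style compressed latent cache. The formulas are illustrative and not tied to OpenMythos internals; fp16 storage (2 bytes per element) is assumed.

```python
# Per-token, per-layer KV-cache bytes (fp16 = 2 bytes per element).
# Generic attention arithmetic, not OpenMythos-specific.
def kv_bytes_mha(n_heads, head_dim, bytes_per=2):
    return 2 * n_heads * head_dim * bytes_per        # full K and V for every head

def kv_bytes_gqa(n_kv_heads, head_dim, bytes_per=2):
    return 2 * n_kv_heads * head_dim * bytes_per     # K and V for shared KV heads only

def kv_bytes_mla(kv_lora_rank, rope_dim, bytes_per=2):
    return (kv_lora_rank + rope_dim) * bytes_per     # one compressed latent + RoPE key part

dim, n_heads, n_kv_heads = 128, 4, 2                 # matches make_config's defaults
head_dim = dim // n_heads                            # 32
print(kv_bytes_mha(n_heads, head_dim))               # 512 bytes/token
print(kv_bytes_gqa(n_kv_heads, head_dim))            # 256 bytes/token
print(kv_bytes_mla(kv_lora_rank=32, rope_dim=16))    # 96 bytes/token
```

With these numbers, halving the KV heads halves the cache, while the MLA-style latent compresses it by more than 5× relative to full MHA, which is the effect we measure empirically in the next section.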
def cache_bytes(kv: dict) -> int:
    total = 0
    for entry in kv.values():
        for t in entry.values():
            total += t.element_size() * t.numel()
    return total

x = torch.randint(0, 256, (1, 64), device=device)
ck_gqa, ck_mla = {}, {}
with torch.no_grad():
    m_gqa(x, n_loops=4, kv_cache=ck_gqa)
    m_mla(x, n_loops=4, kv_cache=ck_mla)
gqa_kb = cache_bytes(ck_gqa) / 1024
mla_kb = cache_bytes(ck_mla) / 1024
print("\n─── Part 2 ─ KV-cache footprint (1×64 tokens, 4 loops) ─")
print(f"GQA cache : {gqa_kb:6.2f} KB ({len(ck_gqa)} layer-keys)")
print(f"MLA cache : {mla_kb:6.2f} KB ({len(ck_mla)} layer-keys)")
print(f"ratio     : MLA is ≈{gqa_kb / max(mla_kb, 1e-9):.2f}× smaller")
def show_stability(model, tag):
    A = model.recurrent.injection.get_A()
    print(f"{tag:3s} ρ(A): min={A.min():.4f} max={A.max():.4f} "
          f"mean={A.mean():.4f} stable={bool((A < 1).all() and (A > 0).all())}")

print("\n─── Part 3 ─ spectral radius at init ──────────────────")
show_stability(m_gqa, "GQA")
show_stability(m_mla, "MLA")

opt = torch.optim.Adam(m_mla.parameters(), lr=1.0)
for _ in range(30):
    loss = m_mla(torch.randint(0, 256, (2, 16), device=device),
                 n_loops=2).square().mean()
    opt.zero_grad(); loss.backward(); opt.step()
show_stability(m_mla, "MLA after abusive training (lr=1.0, 30 steps)")
We compute and compare the KV-cache memory footprint of the GQA and MLA attention variants during forward passes. We then inspect the stability of the recurrent component by checking the spectral radius of the injection matrix A, and we stress-test the model under extreme training conditions to confirm that stability is preserved.
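The intuition behind the ρ(A) < 1 check can be sketched with a scalar recurrence: when the gate value is below 1, repeated application of h ← a·h + u converges toward the fixed point u / (1 − a); above 1, the state diverges geometrically. This is a toy illustration of the stability criterion, not OpenMythos's actual injection update.

```python
# Toy scalar analogue of the recurrent injection: h <- a*h + u.
# |a| < 1 keeps the iterated state bounded; |a| > 1 blows it up.
def iterate(a, u=1.0, steps=50):
    h = 0.0
    for _ in range(steps):
        h = a * h + u
    return h

print(iterate(0.9))   # converges toward u / (1 - a) = 10
print(iterate(1.1))   # diverges: magnitude grows geometrically with steps
```

This is why keeping every entry of A strictly inside (0, 1) lets us run far more loops at inference than were used during training without the hidden state exploding.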
VOCAB = 64
SEQ_LEN = 24

def make_batch(batch=64, seq_len=SEQ_LEN):
    x = torch.randint(1, 3, (batch, seq_len), device=device)
    bits = x - 1
    parity = bits.cumsum(dim=1) % 2
    y = parity + 1
    return x, y

cfg = MythosConfig(
    vocab_size=VOCAB, dim=64, n_heads=4, n_kv_heads=2,
    max_seq_len=SEQ_LEN + 4, max_loop_iters=16,
    prelude_layers=1, coda_layers=1,
    n_experts=4, n_shared_experts=1, n_experts_per_tok=2,
    expert_dim=32, lora_rank=4, attn_type="gqa",
    act_threshold=0.99,
)
model = OpenMythos(cfg).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
T_TRAIN = 3

print("\n─── Part 5 ─ training (T_train = 3) ───────────────────")
print(f"params: {sum(p.numel() for p in model.parameters()):,}")
losses = []
t0 = time.time()
for step in range(600):
    x, y = make_batch(64)
    logits = model(x, n_loops=T_TRAIN)
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), y.reshape(-1))
    opt.zero_grad(); loss.backward()
    opt.step()
    losses.append(loss.item())
    if step % 100 == 0 or step == 599:
        with torch.no_grad():
            acc = (logits.argmax(-1) == y).float().mean().item()
        print(f"step {step:3d}  loss={loss.item():.4f}  acc@T3={acc:.3f}")
print(f"training wallclock: {time.time() - t0:.1f}s")
We define a cumulative parity task to train our model on a structured sequential problem. We initialize the OpenMythos model with a fixed loop depth and train it with cross-entropy loss, tracking loss and accuracy to gauge how well the model learns under constrained depth.
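The label construction in make_batch can be sanity-checked in plain Python: each position's target is the running XOR of the bits seen so far, shifted into {1, 2} to match the token range (a standalone check mirroring the tensor code above).

```python
# Reference implementation of the cumulative parity labels produced by
# make_batch: tokens are drawn from {1, 2}, mapped to bits {0, 1}, and the
# target at position i is the running XOR of bits 0..i, shifted back by +1.
def parity_targets(tokens):
    out, running = [], 0
    for t in tokens:
        running ^= (t - 1)       # map {1, 2} -> {0, 1} and accumulate parity
        out.append(running + 1)  # map parity {0, 1} -> target tokens {1, 2}
    return out

print(parity_targets([1, 2, 1, 1, 2, 2, 1, 2]))
# → [1, 2, 2, 2, 1, 2, 2, 1]
```

Note that the input here is the same pattern used as the generation prompt in Part 9, so these targets also tell us what a perfectly trained model should continue with.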
model.eval()
T_sweep = [1, 2, 3, 4, 6, 8, 10, 12, 14, 16]
accs = []
with torch.no_grad():
    x_eval, y_eval = make_batch(512)
    for T in T_sweep:
        logits = model(x_eval, n_loops=T)
        accs.append((logits.argmax(-1) == y_eval).float().mean().item())

print("\n─── Part 6 ─ depth extrapolation (T_train=3) ──────────")
for T, a in zip(T_sweep, accs):
    bar = "█" * int(a * 40)
    marker = "  ← trained here" if T == T_TRAIN else ""
    print(f"T={T:2d}  acc={a:.3f}  {bar}{marker}")

halt_trace: list[torch.Tensor] = []
orig_halt = model.recurrent.act.forward

def halt_hook(self, h):
    p = orig_halt(h)
    halt_trace.append(p.detach().cpu())
    return p

model.recurrent.act.forward = halt_hook.__get__(model.recurrent.act, ACTHalting)
with torch.no_grad():
    x_h, _ = make_batch(1)
    _ = model(x_h, n_loops=16)
model.recurrent.act.forward = orig_halt
halts = torch.stack(halt_trace, dim=0)[:, 0].numpy()
print("\n─── Part 7 ─ ACT halting matrix (loops × positions) ───")
print(f"shape: {halts.shape} | "
      f"mean halt-prob per loop: "
      f"{', '.join(f'{v:.2f}' for v in halts.mean(1))}")
We evaluate the trained model by varying the number of inference loops to test depth extrapolation, observing how increasing loop depth improves accuracy without any retraining. We also instrument the ACT mechanism to capture halting probabilities at every sequence position and iteration.
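Conceptually, ACT halting accumulates a per-position halting probability on each loop and stops updating a position once the running sum crosses the threshold. The sketch below is a simplified scalar version of that rule (full ACT also distributes a remainder at the final step, which we omit here):

```python
# Simplified ACT halting rule for a single position: count how many loop
# iterations run before the accumulated halt probability reaches threshold.
def act_halt_loop(halt_probs, threshold=0.99):
    cum, used_loops = 0.0, 0
    for p in halt_probs:
        used_loops += 1
        cum += p
        if cum >= threshold:   # position halts; later loops skip it
            break
    return used_loops

print(act_halt_loop([0.2, 0.3, 0.4, 0.5, 0.6]))  # halts on loop 4 (0.2+0.3+0.4+0.5 ≥ 0.99)
```

With act_threshold=0.99 as in our config, positions that the model finds easy emit high halt probabilities early and consume few loops, which is exactly the pattern visible in the halting matrix we just printed.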
expert_hits = Counter()
orig_moe = model.recurrent.block.ffn.forward

def moe_hook(self, x):
    flat = x.view(-1, x.shape[-1])
    logits = self.router(flat) + self.router_bias
    scores = F.softmax(logits, dim=-1)
    _, idx = scores.topk(self.topk, dim=-1)
    for e in idx.flatten().tolist():
        expert_hits[e] += 1
    return orig_moe(x)

model.recurrent.block.ffn.forward = moe_hook.__get__(
    model.recurrent.block.ffn, MoEFFN)
with torch.no_grad():
    x_m, _ = make_batch(32)
    _ = model(x_m, n_loops=T_TRAIN)
model.recurrent.block.ffn.forward = orig_moe

print("\n─── Part 8 ─ MoE expert utilization ───────────────────")
total = sum(expert_hits.values())
for eid in range(cfg.n_experts):
    share = expert_hits.get(eid, 0) / max(total, 1)
    print(f"expert {eid}: {share*100:5.2f}% of topk slots")
prompt = torch.tensor([[1, 2, 1, 1, 2, 2, 1, 2]], device=device)
print("\n─── Part 9 ─ generation ───────────────────────────────")
print(f"prompt (parity pattern): {prompt.tolist()[0]}")
for T_gen in [1, 4, 12]:
    with torch.no_grad():
        out = model.generate(prompt, max_new_tokens=8,
                             n_loops=T_gen, temperature=0.1, top_k=2)
    print(f"T_gen={T_gen:2d} → {out.tolist()[0]}")

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].plot(losses)
axes[0].set_title("Training loss (parity task)")
axes[0].set_xlabel("step"); axes[0].set_ylabel("cross-entropy")
axes[0].grid(alpha=0.3)
axes[1].plot(T_sweep, accs, "o-", linewidth=2, markersize=8)
axes[1].axvline(T_TRAIN, color="red", linestyle="--",
                label=f"T_train = {T_TRAIN}")
axes[1].set_title("Depth extrapolation: accuracy vs inference loops")
axes[1].set_xlabel("n_loops at inference"); axes[1].set_ylabel("accuracy")
axes[1].legend(); axes[1].grid(alpha=0.3); axes[1].set_ylim(0, 1.05)
im = axes[2].imshow(halts, aspect="auto", cmap="viridis",
                    vmin=0, vmax=halts.max())
axes[2].set_title("ACT halting probability\n(loop t × position)")
axes[2].set_xlabel("position"); axes[2].set_ylabel("loop iteration t")
plt.colorbar(im, ax=axes[2], fraction=0.046, pad=0.04)
plt.tight_layout()
plt.savefig("openmythos_tutorial.png", dpi=120, bbox_inches="tight")
plt.show()
We analyze expert utilization in the MoE layer by tracking how tokens are routed across experts. We then generate sequences at different loop depths to observe their effect on the outputs. Finally, we visualize the training loss, depth-extrapolation performance, and ACT halting behavior in a set of plots.
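The tally performed by moe_hook boils down to top-k expert selection per token. Here is a standalone pure-Python sketch of that routing step with illustrative hand-picked scores (no real router weights involved):

```python
# Schematic top-k MoE routing: every token picks its k highest-scoring
# experts, and we count how often each expert's slot is used.
from collections import Counter

def route_topk(scores_per_token, k=2):
    hits = Counter()
    for scores in scores_per_token:
        # indices of the k largest scores (ties broken by lower index)
        topk = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
        for e in topk:
            hits[e] += 1
    return hits

scores = [[0.1, 0.7, 0.2, 0.0],   # token 0 -> experts 1, 2
          [0.5, 0.1, 0.3, 0.1],   # token 1 -> experts 0, 2
          [0.2, 0.2, 0.5, 0.1]]   # token 2 -> experts 2, 0
print(route_topk(scores))         # expert 2 claims the most slots
```

A healthy router spreads these counts roughly evenly; a heavily skewed tally like the one above is the load-imbalance signature that Part 8's per-expert share printout is meant to surface.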
In conclusion, we demonstrated that OpenMythos effectively leverages looped computation to achieve depth extrapolation, letting the model improve accuracy simply by increasing the number of inference-time loops. We saw that the recurrent mechanism stays stable even under extreme training conditions, and that MLA attention substantially reduces KV-cache memory usage compared to GQA. We also observed how ACT enables dynamic computation across sequence positions and how MoE routing distributes the workload across experts. Overall, this architecture offers a compelling path toward compute-adaptive reasoning, where we trade extra inference compute for better performance without modifying the model's parameters.
The post A Coding Tutorial on OpenMythos: Recurrent-Depth Transformers with Depth Extrapolation, Adaptive Computation, and Mixture-of-Experts Routing appeared first on MarkTechPost.
