Build Recurrent-Depth Transformers with OpenMythos for MLA, GQA, Sparse MoE, and Loop-Scaled Reasoning

In this tutorial, we explore OpenMythos by building a recurrent-depth transformer workflow that runs end-to-end in Google Colab. We create both MLA and GQA model variants, compare their parameter counts, and check the stability of the recurrent injection matrix via its spectral radius. We then move from simple forward and generation checks to a synthetic compositional reasoning task, in which the model learns to predict the sum of a chain of digits modulo a fixed value. Through this setup, we study how recurrent loops let a single model reuse its parameters for deeper computation.

import subprocess, sys

def pip(*args):
    # check=False: a failed install does not raise, so the fallback below can run
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *args], check=False)

try:
    import open_mythos  # noqa: F401
except Exception:
    pip("open-mythos")
    try:
        import open_mythos  # noqa: F401
    except Exception:
        # The PyPI package is missing or not importable: install from GitHub
        pip("git+https://github.com/kyegomez/OpenMythos.git")

import math, random, time
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import matplotlib.pyplot as plt
from open_mythos.main import OpenMythos, MythosConfig

SEED = 42
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)
# Use CUDA when available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device} | Torch: {torch.__version__}")

We install OpenMythos and fall back to the GitHub source if installing from PyPI fails. We import the required Python, PyTorch, NumPy, and plotting libraries for model building, training, and visualization. We also set a fixed random seed and use CUDA when available, so the tutorial runs efficiently in Colab.

def build_model(attn_type: str = "mla", max_loop_iters: int = 8) -> tuple:
    """Build a small OpenMythos model. Two attention variants are supported.
    MLA: Multi-head Latent Attention (compressed KV cache, DeepSeek-V2 style)
    GQA: Grouped-Query Attention (fewer KV heads than Q heads)
    """
    base = dict(
        vocab_size        = 64,
        dim               = 128,
        n_heads           = 4,
        max_seq_len       = 32,
        max_loop_iters    = max_loop_iters,
        prelude_layers    = 1,
        coda_layers       = 1,
        n_experts         = 4,
        n_shared_experts  = 1,
        n_experts_per_tok = 2,
        expert_dim        = 64,
        lora_rank         = 8,
        attn_type         = attn_type,
    )
    if attn_type == "gqa":
        cfg = MythosConfig(**base, n_kv_heads=2)
    else:
        cfg = MythosConfig(
            **base, n_kv_heads=4,
            kv_lora_rank=32, q_lora_rank=32,
            qk_rope_head_dim=16, qk_nope_head_dim=16, v_head_dim=16,
        )
    model = OpenMythos(cfg).to(device)
    return model, cfg

model_mla, cfg_mla = build_model("mla")
model_gqa, cfg_gqa = build_model("gqa")

def n_params(m): return sum(p.numel() for p in m.parameters())

print(f"\n[MLA] params: {n_params(model_mla):>10,}")
print(f"[GQA] params: {n_params(model_gqa):>10,}")

def spectral_radius(model):
    """Largest absolute eigenvalue of the recurrent injection matrix A."""
    A = model.recurrent.injection.get_A().detach().cpu()
    if A.dim() == 1:
        # Diagonal A stored as a vector: the eigenvalues are the entries themselves
        rho = A.abs().max().item()
    else:
        rho = torch.linalg.eigvals(A.float()).abs().max().item()
    return rho

print(f"\nρ(A) MLA: {spectral_radius(model_mla):.4f}   (should be < 1)")
print(f"ρ(A) GQA: {spectral_radius(model_gqa):.4f}   (should be < 1)")

# Smoke test: one forward pass and a short generation with the MLA model
ids = torch.randint(0, cfg_mla.vocab_size, (2, 16), device=device)
with torch.no_grad():
    logits = model_mla(ids, n_loops=4)
    gen    = model_mla.generate(ids, max_new_tokens=4, n_loops=8)
print(f"\nForward logits shape:  {tuple(logits.shape)}")
print(f"Generation shape:      {tuple(gen.shape)}")

We define a reusable model factory that builds small OpenMythos models with either MLA or GQA attention. We compare both variants by checking their parameter counts and the spectral radius of the recurrent injection matrix, which should stay below 1 for the recurrence to remain stable. We then run a quick forward pass and generation test to confirm that the MLA model produces logits and generated tokens correctly.
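
To see why a spectral radius below 1 matters, the following minimal sketch iterates a simplified linear recurrence outside of OpenMythos. It assumes a diagonal injection matrix (the `A.dim() == 1` case in `spectral_radius`) and a plain update `h <- A*h + b`; the real recurrent block is more involved, and the helpers `spectral_radius_diag` and `iterate` are hypothetical names used only for illustration.

```python
# Toy illustration (not OpenMythos code): why rho(A) < 1 keeps a loop stable.
# Assumes a diagonal A, stored as a list of its diagonal entries, and a
# simplified linear recurrence h <- A*h + b applied once per loop.

def spectral_radius_diag(a_diag):
    """For a diagonal matrix, the spectral radius is the largest |entry|."""
    return max(abs(a) for a in a_diag)

def iterate(a_diag, b, n_loops):
    """Apply h <- A*h + b elementwise for n_loops steps, starting from zero."""
    h = [0.0] * len(a_diag)
    for _ in range(n_loops):
        h = [a * x + bi for a, x, bi in zip(a_diag, h, b)]
    return h

stable   = [0.5, -0.9]   # rho = 0.9 < 1
unstable = [0.5, 1.1]    # rho = 1.1 > 1
b = [1.0, 1.0]

for name, a in [("stable", stable), ("unstable", unstable)]:
    h8, h64 = iterate(a, b, 8), iterate(a, b, 64)
    print(f"{name}: rho={spectral_radius_diag(a):.2f}  "
          f"max|h| after 8 loops={max(map(abs, h8)):.3f}, "
          f"after 64 loops={max(map(abs, h64)):.3f}")
```

With the stable matrix the state settles toward a fixed point as loops are added, while the unstable one grows without bound, which is why the tutorial prints ρ(A) after every epoch.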

PAD, START, EQ = 0, 1, 2
DIGIT_BASE     = 10                  # digit d is encoded as token DIGIT_BASE + d
M              = 7                   # modulus of the sum
SEQ_LEN        = cfg_mla.max_seq_len
MIN_LEN, MAX_LEN = 2, 5              # chain lengths seen during training

def make_example(chain_len: int):
    """Return (padded token list, target token) for one random digit chain."""
    digits = [random.randint(0, M-1) for _ in range(chain_len)]
    target = sum(digits) % M
    toks = [START] + [DIGIT_BASE + d for d in digits] + [EQ]
    toks = toks + [PAD] * (SEQ_LEN - len(toks))
    return toks[:SEQ_LEN], DIGIT_BASE + target

class ChainDataset(Dataset):
    def __init__(self, n, lo, hi):
        self.items = [make_example(random.randint(lo, hi)) for _ in range(n)]
    def __len__(self): return len(self.items)
    def __getitem__(self, i):
        x, y = self.items[i]
        return torch.tensor(x, dtype=torch.long), torch.tensor(y, dtype=torch.long)

train_loader = DataLoader(ChainDataset(3000, MIN_LEN, MAX_LEN), batch_size=64, shuffle=True)
test_loader  = DataLoader(ChainDataset(400,  MIN_LEN, MAX_LEN), batch_size=64)
# Out-of-distribution: chains longer than anything seen in training
ood_loader   = DataLoader(ChainDataset(400,  MAX_LEN+1, MAX_LEN+3), batch_size=64)

We create a synthetic compositional task in which the model predicts the sum of digit tokens modulo 7. We define the token scheme, sequence structure, and dataset class that generates random digit-chain examples. We then build training and test loaders with chains of 2 to 5 digits, plus an out-of-distribution loader with chains of 6 to 8 digits, so we can evaluate both in-distribution performance and depth extrapolation.
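
A worked example makes the token scheme concrete. The sketch below reuses the tutorial's constants but fixes the digits instead of sampling them and shortens `SEQ_LEN` to 8 for readability; `encode_chain` is a hypothetical helper that mirrors `make_example`.

```python
# Worked example of the token scheme, with fixed digits instead of random ones.
# Constants mirror the tutorial; SEQ_LEN is shortened here for readability.
PAD, START, EQ = 0, 1, 2
DIGIT_BASE, M, SEQ_LEN = 10, 7, 8

def encode_chain(digits):
    """Return (padded input tokens, target token) for a list of digits in [0, M)."""
    toks = [START] + [DIGIT_BASE + d for d in digits] + [EQ]
    toks = toks + [PAD] * (SEQ_LEN - len(toks))
    target = DIGIT_BASE + sum(digits) % M
    return toks[:SEQ_LEN], target

toks, target = encode_chain([3, 5, 6])
print(toks)    # [1, 13, 15, 16, 2, 0, 0, 0]
print(target)  # 10, because (3 + 5 + 6) % 7 == 0
```

The target is itself a digit token, so the model answers with the same vocabulary it reads.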

model       = model_mla
TRAIN_LOOPS = 4
EPOCHS      = 6
opt   = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=EPOCHS)

def loss_at_eq(logits, x, y):
    """Read the logits at the EQ token, which predict the answer token that follows it."""
    eq_pos = (x == EQ).int().argmax(dim=1)   # index of the first EQ in each row
    # Build the batch index on the same device as x to avoid a device mismatch
    pred   = logits[torch.arange(x.size(0), device=x.device), eq_pos]
    return F.cross_entropy(pred, y), pred

train_losses = []
print("\n--- Training ---")
t0 = time.time()
for ep in range(EPOCHS):
    model.train(); running = 0.0
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        logits = model(x, n_loops=TRAIN_LOOPS)
        loss, _ = loss_at_eq(logits, x, y)
        opt.zero_grad(); loss.backward()
        # Gradient clipping; max_norm=1.0 is an assumed value, not taken from the text
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        opt.step()
        running += loss.item()
    sched.step()
    train_losses.append(running / len(train_loader))
    print(f"  epoch {ep+1}/{EPOCHS}  loss={train_losses[-1]:.4f}  ρ(A)={spectral_radius(model):.3f}")
print(f"Train time: {time.time()-t0:.1f}s")

@torch.no_grad()
def accuracy(loader, n_loops):
    model.eval(); correct = total = 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        logits = model(x, n_loops=n_loops)
        _, pred = loss_at_eq(logits, x, y)
        correct += (pred.argmax(-1) == y).sum().item()
        total   += y.size(0)
    return correct / total

LOOP_GRID = [1, 2, 4, 6, 8]
print("\n--- Loop-count scaling (same weights, varying compute) ---")
in_dist_acc  = [accuracy(test_loader, L) for L in LOOP_GRID]
ood_acc      = [accuracy(ood_loader,  L) for L in LOOP_GRID]
for L, a, o in zip(LOOP_GRID, in_dist_acc, ood_acc):
    print(f"  n_loops={L}: in-dist acc={a:.3f}   OOD (longer chains) acc={o:.3f}")

We train the MLA model with a fixed number of recurrent loops and optimize it with AdamW and a cosine learning-rate schedule. We compute the loss at the EQ token position, clip gradients, track the training loss, and monitor recurrent stability after each epoch. We then evaluate inference-time loop scaling by testing the same trained model with different loop counts.
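
The loss is computed at a single position per sequence, which is easy to miss in the batched tensor code. The following pure-Python sketch walks through the same idea for one sequence: locate the EQ token, take the logits at that position, and apply softmax cross-entropy against the answer token. The six-token vocabulary, the logit values, and the helper names `eq_position` and `cross_entropy` are made up for illustration.

```python
import math

EQ = 2  # same EQ token id as in the tutorial

def eq_position(tokens):
    """Index of the first EQ token (the tutorial gets this with argmax over x == EQ)."""
    return tokens.index(EQ)

def cross_entropy(logits_row, target):
    """Softmax cross-entropy for one position: -log softmax(logits)[target]."""
    m = max(logits_row)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits_row))
    return log_z - logits_row[target]

# Hypothetical sequence over a 6-token vocabulary: START, two digits, EQ, PAD, PAD
tokens = [1, 4, 5, 2, 0, 0]
logits = [[0.0] * 6 for _ in tokens]          # one row of logits per position
logits[3] = [0.1, 0.2, 0.3, 3.0, 0.1, 0.2]    # only the row at EQ enters the loss

pos = eq_position(tokens)
loss = cross_entropy(logits[pos], target=3)
print(f"EQ at position {pos}, loss at that position = {loss:.4f}")
```

Positions before and after EQ contribute nothing to the loss, so the padding never needs a separate mask in this setup.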

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(range(1, EPOCHS+1), train_losses, marker="o")
axes[0].set_title("Training loss"); axes[0].set_xlabel("epoch"); axes[0].set_ylabel("CE loss")
axes[0].grid(alpha=0.3)
axes[1].plot(LOOP_GRID, in_dist_acc, marker="s", label="in-distribution")
axes[1].plot(LOOP_GRID, ood_acc,     marker="^", label="longer chains (OOD depth)")
axes[1].set_title("Inference-time loop scaling")
axes[1].set_xlabel("# recurrent loops at inference"); axes[1].set_ylabel("test accuracy")
axes[1].legend(); axes[1].grid(alpha=0.3)
plt.tight_layout(); plt.show()

# Qualitative check on a single example
chain_len = 4
toks, true_tok = make_example(chain_len)
digits = [t - DIGIT_BASE for t in toks if t >= DIGIT_BASE]
# Trim the padding so the generated token directly follows EQ,
# which is the position the model was trained to predict.
prompt = torch.tensor([toks[:chain_len + 2]], device=device)
with torch.no_grad():
    gen = model.generate(prompt, max_new_tokens=1, n_loops=8)
predicted = gen[0, -1].item()
print(f"\nDemo: digits={digits}, target=({'+'.join(map(str, digits))}) % {M} = {sum(digits)%M}")
print(f"      true token={true_tok} (digit {true_tok-DIGIT_BASE})  |  "
      f"predicted token={predicted} (digit {predicted-DIGIT_BASE if predicted>=DIGIT_BASE else '?'})")
print("\nDone. Key takeaway: at inference, increasing n_loops trades compute for")
print("reasoning depth on the same fixed-parameter model; that is the RDT premise.")

We visualize the training loss curve and compare in-distribution accuracy with longer-chain out-of-distribution accuracy across loop counts. We also run a small qualitative generation example to check whether the trained model predicts the correct modulo-sum digit. We conclude by showing that increasing the number of recurrent loops gives the same fixed-parameter model more reasoning depth at inference time.
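
The idea of trading loops for depth can be illustrated with a deliberately simple analogy. In the sketch below, one shared step function is reused for every loop, and each call folds one more digit into a running sum. This is only a toy stand-in for a weight-tied recurrent block, not a description of what OpenMythos computes internally; `shared_step` and `run` are hypothetical names.

```python
# Toy analogy only: one shared "step" reused n_loops times, as a stand-in for
# a weight-tied recurrent block. This is NOT how OpenMythos computes its answer.
M = 7

def shared_step(state, digits):
    """One loop iteration: fold the next unread digit into the running sum mod M."""
    total, pos = state
    if pos < len(digits):
        return ((total + digits[pos]) % M, pos + 1)
    return state  # nothing left to read: extra loops leave the state unchanged

def run(digits, n_loops):
    """Apply the same step n_loops times and return the running sum."""
    state = (0, 0)
    for _ in range(n_loops):  # same function every time, no new parameters per loop
        state = shared_step(state, digits)
    return state[0]

digits = [3, 5, 6, 2, 4, 1]   # true answer: 21 % 7 == 0
for n_loops in [1, 2, 4, 6, 8]:
    result = run(digits, n_loops)
    print(f"n_loops={n_loops}: result={result}  correct={result == sum(digits) % M}")
```

In this toy, a chain of six digits needs at least six loops, and extra loops beyond that do no harm. Whether a trained model shows a similar pattern is exactly what the loop-scaling plot is meant to test.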

In conclusion, we saw how OpenMythos combines recurrent-depth transformer design, attention variants, sparse MoE components, and inference-time loop scaling into a compact experimental pipeline. We trained the model on a controlled reasoning task, evaluated it on both in-distribution and longer out-of-distribution chains, and visualized how accuracy changes as we increase the number of recurrent loops. This helped us see how recurrent depth can trade extra inference computation for stronger reasoning behavior without changing the model's learned parameters.

The post Build Recurrent-Depth Transformers with OpenMythos for MLA, GQA, Sparse MoE, and Loop-Scaled Reasoning appeared first on MarkTechPost.