
How to Implement Functional Components of Transformer and Mini-GPT Model from Scratch Using Tinygrad to Understand Deep Learning Internals


In this tutorial, we explore how to build neural networks from scratch using Tinygrad while staying fully hands-on with tensors, autograd, attention mechanisms, and transformer architectures. We progressively build every component ourselves, from basic tensor operations to multi-head attention, transformer blocks, and, finally, a working mini-GPT model. At each stage, we observe how Tinygrad's simplicity helps us understand what happens under the hood when models train, optimize, and fuse kernels for performance. Check out the FULL CODES here.

import subprocess, sys, os
print("Installing dependencies...")
subprocess.check_call(["apt-get", "install", "-qq", "clang"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "git+https://github.com/tinygrad/tinygrad.git"])


import numpy as np
from tinygrad import Tensor, nn, Device
from tinygrad.nn import optim
import time


print(f"🚀 Using device: {Device.DEFAULT}")
print("=" * 60)


print("\n📚 PART 1: Tensor Operations & Autograd")
print("-" * 60)


x = Tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
y = Tensor([[2.0, 0.0], [1.0, 2.0]], requires_grad=True)


z = (x @ y).sum() + (x ** 2).mean()
z.backward()


print(f"x:\n{x.numpy()}")
print(f"y:\n{y.numpy()}")
print(f"z (scalar): {z.numpy()}")
print(f"∂z/∂x:\n{x.grad.numpy()}")
print(f"∂z/∂y:\n{y.grad.numpy()}")

We set up Tinygrad in our Colab environment and immediately begin experimenting with tensors and automatic differentiation. We create a small computation graph and observe how gradients flow through matrix operations. As we print the outputs, we gain an intuitive understanding of how Tinygrad handles backpropagation under the hood. Check out the FULL CODES here.
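
To make the autograd result concrete, we can derive the gradients by hand: for z = (x @ y).sum() + (x ** 2).mean(), ∂z/∂x = 1·yᵀ + 2x/n and ∂z/∂y = xᵀ·1, where 1 is a matrix of ones and n is the number of elements in x. The short NumPy check below is our own addition (not part of the original notebook) and simply compares these closed-form gradients with what Tinygrad computed.

# Hand-derived gradients for z = (x @ y).sum() + (x ** 2).mean() -- illustrative check, not in the original script
x_np, y_np = x.numpy(), y.numpy()
ones = np.ones_like(x_np)
manual_dx = ones @ y_np.T + 2 * x_np / x_np.size   # gradient of the sum term plus the mean term
manual_dy = x_np.T @ ones                          # gradient of the sum term w.r.t. y
print("∂z/∂x matches:", np.allclose(x.grad.numpy(), manual_dx))
print("∂z/∂y matches:", np.allclose(y.grad.numpy(), manual_dy))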

print("\n\n🧠 PART 2: Building Custom Layers")
print("-" * 60)


class MultiHeadAttention:
    def __init__(self, dim, num_heads):
        self.num_heads = num_heads
        self.dim = dim
        self.head_dim = dim // num_heads
        self.qkv = Tensor.glorot_uniform(dim, 3 * dim)
        self.out = Tensor.glorot_uniform(dim, dim)

    def __call__(self, x):
        B, T, C = x.shape[0], x.shape[1], x.shape[2]
        qkv = x.reshape(B * T, C).dot(self.qkv).reshape(B, T, 3, self.num_heads, self.head_dim)
        q, k, v = qkv[:, :, 0], qkv[:, :, 1], qkv[:, :, 2]
        # move heads in front of the sequence axis so attention runs over time: (B, T, H, D) -> (B, H, T, D)
        q, k, v = q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
        scale = (self.head_dim ** -0.5)
        attn = (q @ k.transpose(-2, -1)) * scale   # (B, H, T, T) attention scores
        attn = attn.softmax(axis=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, C)
        return out.reshape(B * T, C).dot(self.out).reshape(B, T, C)


class TransformerBlock:
    def __init__(self, dim, num_heads):
        self.attn = MultiHeadAttention(dim, num_heads)
        self.ff1 = Tensor.glorot_uniform(dim, 4 * dim)
        self.ff2 = Tensor.glorot_uniform(4 * dim, dim)
        self.ln1_w = Tensor.ones(dim)
        self.ln2_w = Tensor.ones(dim)

    def __call__(self, x):
        x = x + self.attn(self._layernorm(x, self.ln1_w))
        ff = x.reshape(-1, x.shape[-1])
        ff = ff.dot(self.ff1).gelu().dot(self.ff2)
        x = x + ff.reshape(x.shape)
        return self._layernorm(x, self.ln2_w)

    def _layernorm(self, x, w):
        mean = x.mean(axis=-1, keepdim=True)
        var = ((x - mean) ** 2).mean(axis=-1, keepdim=True)
        return w * (x - mean) / (var + 1e-5).sqrt()

We design our own multi-head attention module and a transformer block entirely from scratch. We implement the projections, attention scores, softmax, feedforward layers, and layer normalization manually. As we run this code, we see how each component contributes to a transformer layer's overall behavior. Check out the FULL CODES here.
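
As a quick sanity check, which we add here for illustration (it is not part of the original code), we can run a random batch through one TransformerBlock and confirm that the residual connections preserve the (batch, sequence, dim) shape:

# Shape check for a single transformer block (illustrative sketch)
test_block = TransformerBlock(dim=64, num_heads=4)
test_x = Tensor.randn(2, 8, 64)              # (batch=2, seq_len=8, dim=64)
test_y = test_block(test_x)
print("Block input shape: ", test_x.shape)   # (2, 8, 64)
print("Block output shape:", test_y.shape)   # residual connections keep the shape unchanged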

print("\n🤖 PART 3: Mini-GPT Architecture")
print("-" * 60)


class MiniGPT:
    def __init__(self, vocab_size=256, dim=128, num_heads=4, num_layers=2, max_len=32):
        self.vocab_size = vocab_size
        self.dim = dim
        self.tok_emb = Tensor.glorot_uniform(vocab_size, dim)
        self.pos_emb = Tensor.glorot_uniform(max_len, dim)
        self.blocks = [TransformerBlock(dim, num_heads) for _ in range(num_layers)]
        self.ln_f = Tensor.ones(dim)
        self.head = Tensor.glorot_uniform(dim, vocab_size)

    def __call__(self, idx):
        B, T = idx.shape[0], idx.shape[1]
        tok_emb = self.tok_emb[idx.flatten()].reshape(B, T, self.dim)
        pos_emb = self.pos_emb[:T].reshape(1, T, self.dim)
        x = tok_emb + pos_emb
        for block in self.blocks:
            x = block(x)
        mean = x.mean(axis=-1, keepdim=True)
        var = ((x - mean) ** 2).mean(axis=-1, keepdim=True)
        x = self.ln_f * (x - mean) / (var + 1e-5).sqrt()
        return x.reshape(B * T, self.dim).dot(self.head).reshape(B, T, self.vocab_size)

    def get_params(self):
        params = [self.tok_emb, self.pos_emb, self.ln_f, self.head]
        for block in self.blocks:
            params.extend([block.attn.qkv, block.attn.out, block.ff1, block.ff2, block.ln1_w, block.ln2_w])
        return params


model = MiniGPT(vocab_size=256, dim=64, num_heads=4, num_layers=2, max_len=16)
params = model.get_params()
total_params = sum(p.numel() for p in params)
print(f"Model initialized with {total_params:,} parameters")

We assemble the full MiniGPT architecture from the components built earlier. We embed tokens, add positional information, stack several transformer blocks, and project the final outputs back to vocabulary logits. As we initialize the model, we begin to appreciate how a compact transformer can be built with surprisingly few moving parts. Check out the FULL CODES here.
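
Before training, it helps to run a single forward pass on random token ids; the short check below is our own addition and simply confirms that the model maps (B, T) integer inputs to (B, T, vocab_size) logits:

# One untrained forward pass on random token ids (illustrative sketch)
dummy_idx = Tensor(np.random.randint(0, 256, (2, 16)), dtype='int32')  # (B=2, T=16), T <= max_len
dummy_logits = model(dummy_idx)
print("Input ids shape:", dummy_idx.shape)     # (2, 16)
print("Logits shape:   ", dummy_logits.shape)  # (2, 16, 256), one score per vocabulary entry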

print("\n\n🏋 PART 4: Training Loop")
print("-" * 60)


def gen_data(batch_size, seq_len):
   x = np.random.randint(0, 256, (batch_size, seq_len))
   y = np.roll(x, 1, axis=1)
   y[:, 0] = x[:, 0]
   return Tensor(x, dtype='int32'), Tensor(y, dtype='int32')


optimizer = optim.Adam(params, lr=0.001)
losses = []


print("Training to predict the previous token in the sequence...")
with Tensor.train():
    for step in range(20):
        start = time.time()
        x_batch, y_batch = gen_data(batch_size=16, seq_len=16)
        logits = model(x_batch)
        B, T, V = logits.shape[0], logits.shape[1], logits.shape[2]
        loss = logits.reshape(B * T, V).sparse_categorical_crossentropy(y_batch.reshape(B * T))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        losses.append(loss.numpy())
        elapsed = time.time() - start
        if step % 5 == 0:
            print(f"Step {step:3d} | Loss: {loss.numpy():.4f} | Time: {elapsed*1000:.1f}ms")


print("\n\n⚡ PART 5: Lazy Evaluation & Kernel Fusion")
print("-" * 60)


N = 512
a = Tensor.randn(N, N)
b = Tensor.randn(N, N)


print("Creating computation: (A @ B.T + A).sum()")
lazy_result = (a @ b.T + a).sum()
print("→ No computation done yet (lazy evaluation)")


print("\nCalling .realize() to execute...")
start = time.time()
realized = lazy_result.realize()
elapsed = time.time() - start


print(f"✓ Computed in {elapsed*1000:.2f}ms")
print(f"Result: {realized.numpy():.4f}")
print("\nNote: Operations were fused into optimized kernels!")

We train the MiniGPT model on simple synthetic data and watch the loss decrease over the steps. We also explore Tinygrad's lazy execution model by building a computation that only runs, as fused kernels, when it is realized. As we monitor the timings, we understand how kernel fusion improves performance. Check out the FULL CODES here.
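
With the loss going down, a tiny greedy sampling loop, sketched below as our own addition (the original tutorial stops at training), shows how the same forward pass can generate tokens autoregressively; after only 20 training steps the output is mostly noise, but the loop illustrates the mechanics:

# Minimal greedy generation sketch (illustrative; assumes max_len=16 as configured above)
def generate(gpt, start_token=42, steps=10, max_len=16):
    seq = [start_token]
    for _ in range(steps):
        ctx = Tensor(np.array([seq[-max_len:]]), dtype='int32')  # keep at most max_len tokens of context
        logits = gpt(ctx)                                        # (1, T, vocab_size)
        next_tok = int(logits[0, -1].numpy().argmax())           # greedy pick at the last position
        seq.append(next_tok)
    return seq

print("Sampled token ids:", generate(model))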

print("\n\n🔧 PART 6: Custom Operations")
print("-" * 60)


def custom_activation(x):
   return x * x.sigmoid()


x = Tensor([[-2.0, -1.0, 0.0, 1.0, 2.0]], requires_grad=True)
y = custom_activation(x)
loss = y.sum()
loss.backward()


print(f"Input:    {x.numpy()}")
print(f"Swish(x): {y.numpy()}")
print(f"Gradient: {x.grad.numpy()}")


print("\n\n" + "=" * 60)
print("✅ Tutorial Complete!")
print("=" * 60)
print("""
Key Concepts Covered:
1. Tensor operations with automatic differentiation
2. Custom neural network layers (Attention, Transformer)
3. Building a mini-GPT language model from scratch
4. Training loop with Adam optimizer
5. Lazy evaluation and kernel fusion
6. Custom activation functions
""")

We implement a custom activation function (Swish) and verify that gradients propagate correctly through it. We then print a summary of all the major concepts covered in the tutorial. As we finish, we reflect on how each part builds our ability to understand, modify, and extend deep learning internals using Tinygrad.
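
As one last check, added here for illustration, the Swish gradient has the closed form d/dx [x·σ(x)] = σ(x) + x·σ(x)(1 − σ(x)), so we can compare it against what autograd produced above:

# Compare autograd's Swish gradient with the analytic formula (illustrative check)
sig = 1.0 / (1.0 + np.exp(-x.numpy()))               # sigmoid computed in NumPy
analytic_grad = sig + x.numpy() * sig * (1.0 - sig)  # d/dx [x * sigmoid(x)]
print("Analytic gradient:", analytic_grad)
print("Matches autograd: ", np.allclose(x.grad.numpy(), analytic_grad, atol=1e-5))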

In conclusion, we reinforce our understanding of how neural networks really operate beneath modern abstractions, and we experience firsthand how Tinygrad empowers us to tinker with every internal detail. We have built a transformer, trained it on synthetic data, experimented with lazy evaluation and kernel fusion, and even created custom operations, all within a minimal, transparent framework. Finally, we recognize how this workflow prepares us for deeper experimentation, whether we extend the model, integrate real datasets, or continue exploring Tinygrad's low-level capabilities.


Check out the FULL CODES here. Feel free to visit our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don't forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! Are you on Telegram? Now you can join us on Telegram as well.

