A Coding Implementation to Build a Transformer-Based Regression Language Model to Predict Continuous Values from Text

We will construct a Regression Language Model (RLM), a mannequin that predicts steady numerical values immediately from textual content sequences on this coding implementation. Instead of classifying or producing textual content, we concentrate on coaching a transformer-based structure that learns quantitative relationships hidden inside pure language descriptions. We begin by producing artificial text-to-number information, tokenizing it effectively, after which prepare a light-weight Transformer encoder to map linguistic cues to real-valued targets. By the top, we not solely perceive how RLMs may be applied from scratch but in addition visualize their studying conduct and check their generalization on unseen examples. Check out the FULL CODES here.

Copy Code

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.information import Dataset, DataLoader
import matplotlib.pyplot as plt
from collections import Counter
import re


torch.manual_seed(42)
np.random.seed(42)


print(" Regression Language Model (RLM) Tutorial")
print("=" * 60)

We start by importing important libraries, comparable to PyTorch, NumPy, and Matplotlib, to construct and visualize our Regression Language Model. We set random seeds to guarantee reproducibility and initialize the setting, thereby guaranteeing constant outcomes every time the tutorial is run. Check out the FULL CODES here.

Copy Code

def generate_synthetic_data(n_samples=2000):
   """Generate artificial text-to-number regression information"""
  
   templates = [
       ("The temperature is {} degrees", lambda x: x),
       ("I rate this {} out of ten", lambda x: x),
       ("The price is {} dollars", lambda x: x),
       ("Confidence level: {}", lambda x: x / 100),
       ("Speed of {} kilometers per hour", lambda x: x / 10),
       ("{} percent complete", lambda x: x / 100),
       ("Scored {} points in the game", lambda x: x / 10),
       ("The distance is {} meters", lambda x: x),
   ]
  
   information = []
   for _ in vary(n_samples):
       template, rework = templates[np.random.randint(len(templates))]
       worth = np.random.uniform(0, 100)
       textual content = template.format(spherical(worth, 1))
       goal = rework(worth)
       information.append((textual content, goal))
  
   return information

We create a artificial dataset that pairs pure language sentences with corresponding numerical values. By utilizing different templates comparable to temperatures, rankings, and percentages, we make sure the mannequin learns various textual content–quantity relationships. This managed setup helps us simulate practical regression duties with out counting on exterior information. Check out the FULL CODES here.

Copy Code

class SimpleTokenizer:
   def __init__(self):
       self.word2idx = {"<PAD>": 0, "<UNK>": 1}
       self.idx2word = {0: "<PAD>", 1: "<UNK>"}
       self.vocab_size = 2
  
   def match(self, texts):
       """Build vocabulary from texts"""
       phrases = []
       for textual content in texts:
           phrases.prolong(re.findall(r'w+|[^ws]', textual content.decrease()))
      
       word_counts = Counter(phrases)
       for phrase, _ in word_counts.most_common():
           if phrase not in self.word2idx:
               self.word2idx[word] = self.vocab_size
               self.idx2word[self.vocab_size] = phrase
               self.vocab_size += 1
  
   def encode(self, textual content, max_len=20):
       """Convert textual content to token indices"""
       phrases = re.findall(r'w+|[^ws]', textual content.decrease())
       indices = [self.word2idx.get(w, 1) for w in words]
      
       if len(indices) < max_len:
           indices += [0] * (max_len - len(indices))
       else:
           indices = indices[:max_len]
      
       return indices

We design a easy tokenizer to convert uncooked textual content into numerical tokens that the mannequin can course of. It builds a vocabulary from all distinctive phrases and maps every to an index, dealing with unknown phrases and padding mechanically. This step ensures our textual inputs are reworked into constant, machine-readable sequences for coaching. Check out the FULL CODES here.

Copy Code

class RLMDataset(Dataset):
   def __init__(self, information, tokenizer, max_len=20):
       self.information = information
       self.tokenizer = tokenizer
       self.max_len = max_len
  
   def __len__(self):
       return len(self.information)
  
   def __getitem__(self, idx):
       textual content, goal = self.information[idx]
       tokens = self.tokenizer.encode(textual content, self.max_len)
       return torch.tensor(tokens), torch.tensor([target], dtype=torch.float32)


class RegressionLanguageModel(nn.Module):
   def __init__(self, vocab_size, embed_dim=128, num_heads=4, num_layers=2,
                dropout=0.1, max_len=20):
       tremendous().__init__()
      
       self.token_embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
       self.position_embedding = nn.Embedding(max_len, embed_dim)
      
       encoder_layer = nn.TransformerEncoderLayer(
           d_model=embed_dim,
           nhead=num_heads,
           dim_feedforward=embed_dim * 4,
           dropout=dropout,
           batch_first=True
       )
       self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
      
       self.fc1 = nn.Linear(embed_dim, 64)
       self.relu = nn.ReLU()
       self.dropout = nn.Dropout(dropout)
       self.fc2 = nn.Linear(64, 1)
      
       self.max_len = max_len
  
   def ahead(self, x):
       batch_size, seq_len = x.form
      
       positions = torch.arange(0, seq_len, system=x.system).unsqueeze(0).broaden(batch_size, -1)
      
       token_embed = self.token_embedding(x)
       pos_embed = self.position_embedding(positions)
       embeddings = token_embed + pos_embed
      
       padding_mask = (x == 0)
      
       encoded = self.transformer(embeddings, src_key_padding_mask=padding_mask)
      
       mask_expanded = (~padding_mask).unsqueeze(-1).float()
       summed = (encoded * mask_expanded).sum(dim=1)
       pooled = summed / mask_expanded.sum(dim=1)
      
       x = self.fc1(pooled)
       x = self.relu(x)
       x = self.dropout(x)
       output = self.fc2(x)
      
       return output

We bundle our textual content–quantity pairs into a PyTorch Dataset, the place we tokenize every sentence and return tensors prepared for batching. We then construct a Transformer-based RLM: token and positional embeddings movement via a multi-layer encoder, we mean-pool non-padded tokens, and feed the end result to a small MLP head for regression. In impact, we permit the encoder to be taught numerical cues from language, whereas the pinnacle maps them to a single steady worth. Check out the FULL CODES here.

Copy Code

def train_rlm(mannequin, train_loader, val_loader, epochs=15, lr=0.001):  
   criterion = nn.MSELoss()
   optimizer = optim.Adam(mannequin.parameters(), lr=lr)
  
   train_losses, val_losses = [], []
  
   print(f"n Training on {system}")
   print("-" * 60)
  
   for epoch in vary(epochs):
       mannequin.prepare()
       train_loss = 0
       for tokens, targets in train_loader:
           tokens, targets = tokens.to(system), targets.to(system)
          
           optimizer.zero_grad()
           outputs = mannequin(tokens)
           loss = criterion(outputs, targets)
           loss.backward()
           optimizer.step()
          
           train_loss += loss.merchandise()
      
       train_loss /= len(train_loader)
       train_losses.append(train_loss)
      
       mannequin.eval()
       val_loss = 0
       with torch.no_grad():
           for tokens, targets in val_loader:
               tokens, targets = tokens.to(system), targets.to(system)
               outputs = mannequin(tokens)
               loss = criterion(outputs, targets)
               val_loss += loss.merchandise()
      
       val_loss /= len(val_loader)
       val_losses.append(val_loss)
      
       print(f"Epoch {epoch+1:2nd}/{epochs} | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f}")
  
   return train_losses, val_losses

We prepare the mannequin utilizing Adam and MSE loss on a GPU, if out there, iterating over mini-batches to backpropagate and replace weights. We change to analysis mode for validation on the finish of every epoch, monitor coaching and validation losses, and print progress so we are able to see the training dynamics in real-time. Check out the FULL CODES here.

Copy Code

print("n Generating artificial information...")
information = generate_synthetic_data(2000)
split_idx = int(0.8 * len(information))
train_data, val_data = information[:split_idx], information[split_idx:]
print(f"Train samples: {len(train_data)}, Val samples: {len(val_data)}")


print("n Building tokenizer...")
tokenizer = SimpleTokenizer()
tokenizer.match([text for text, _ in train_data])
print(f"Vocabulary dimension: {tokenizer.vocab_size}")


train_dataset = RLMDataset(train_data, tokenizer)
val_dataset = RLMDataset(val_data, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)


print("n Building Regression Language Model...")
mannequin = RegressionLanguageModel(vocab_size=tokenizer.vocab_size)
print(f"Model parameters: {sum(p.numel() for p in mannequin.parameters()):,}")


train_losses, val_losses = train_rlm(mannequin, train_loader, val_loader)


plt.determine(figsize=(10, 4))
plt.plot(train_losses, label='Train Loss', linewidth=2)
plt.plot(val_losses, label='Val Loss', linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Loss (MSE)')
plt.title('RLM Training Progress')
plt.legend()
plt.grid(True, alpha=0.3)
plt.present()


print("n Testing Predictions:")
print("-" * 60)
test_examples = [
   "The temperature is 25.5 degrees",
   "I rate this 8.0 out of ten",
   "The price is 45.0 dollars",
   "75.0 percent complete"
]


with torch.no_grad():
   for textual content in test_examples:
       tokens = torch.tensor([tokenizer.encode(text)]).to(system)
       prediction = mannequin(tokens).merchandise()
       print(f"Input: {textual content}")
       print(f"Predicted worth: {prediction:.4f}n")


print(" RLM Tutorial Complete!")

We generate and cut up artificial information, match our tokenizer, wrap the whole lot in PyTorch datasets/loaders, and construct the Transformer-based RLM. We prepare the mannequin, visualize loss curves to confirm studying, after which run a few natural-language check prompts to see the expected steady values. With that, we full the end-to-end RLM pipeline.

In conclusion, we efficiently designed, skilled, and evaluated a Regression Language Model able to predicting steady values from textual inputs. We observe how combining positional embeddings, transformer encoders, and a easy regression head permits the mannequin to seize the numerical semantics embedded in language. By producing artificial information, visualizing coaching progress, and testing predictions, we display how RLMs bridge the hole between language understanding and numerical reasoning.

Check out the FULL CODES here. Feel free to take a look at our GitHub Page for Tutorials, Codes and Notebooks. Also, be happy to comply with us on Twitter and don’t overlook to be a part of our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

The put up A Coding Implementation to Build a Transformer-Based Regression Language Model to Predict Continuous Values from Text appeared first on MarkTechPost.