
A Coding Guide to Master Self-Supervised Learning with Lightly AI for Efficient Data Curation and Active Learning

In this tutorial, we explore the power of self-supervised learning using the Lightly AI framework. We begin by building a SimCLR model to learn meaningful image representations without labels, then generate and visualize embeddings using UMAP and t-SNE. We then dive into coreset selection strategies to curate data intelligently, simulate an active learning workflow, and finally assess the benefits of transfer learning through a linear probe evaluation. Throughout this hands-on guide, we work step by step in Google Colab, training, visualizing, and comparing coreset-based and random sampling to understand how self-supervised learning can significantly improve data efficiency and model performance. Check out the FULL CODES here.

!pip uninstall -y numpy
!pip install numpy==1.26.4
!pip install -q lightly torch torchvision matplotlib scikit-learn umap-learn


import torch
import torch.nn as nn
import torchvision
from torch.utils.data import DataLoader, Subset
from torchvision import transforms
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors
import umap


from lightly.loss import NTXentLoss
from lightly.models.modules import SimCLRProjectionHead
from lightly.transforms import SimCLRTransform
from lightly.data import LightlyDataset


print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

We begin by setting up the environment, ensuring compatibility by pinning the NumPy version and installing essential libraries such as Lightly, PyTorch, and UMAP. We then import all the modules needed to build, train, and visualize our self-supervised learning model, and confirm that PyTorch and CUDA are ready for GPU acceleration. Check out the FULL CODES here.

class SimCLRModel(nn.Module):
    """SimCLR model with a ResNet backbone"""
    def __init__(self, backbone, hidden_dim=512, out_dim=128):
        super().__init__()
        self.backbone = backbone
        self.backbone.fc = nn.Identity()
        self.projection_head = SimCLRProjectionHead(
            input_dim=512, hidden_dim=hidden_dim, output_dim=out_dim
        )

    def forward(self, x):
        features = self.backbone(x).flatten(start_dim=1)
        z = self.projection_head(features)
        return z

    def extract_features(self, x):
        """Extract backbone features without projection"""
        with torch.no_grad():
            return self.backbone(x).flatten(start_dim=1)

We define our SimCLRModel, which uses a ResNet backbone to learn visual representations without labels. We remove the classification head and add a projection head that maps features into a contrastive embedding space. The model's extract_features method lets us obtain raw feature embeddings directly from the backbone for downstream analysis. Check out the FULL CODES here.
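As a quick sanity check, the short sketch below (not part of the original tutorial) instantiates the model with a randomly initialized ResNet-18 and confirms that the forward pass returns 128-dimensional projections while extract_features returns 512-dimensional backbone features; the dummy batch size and input resolution are illustrative assumptions.

import torch
import torchvision

# Minimal sketch: verify the two output spaces of SimCLRModel.
backbone = torchvision.models.resnet18(weights=None)  # weights=None == untrained, newer torchvision API
model = SimCLRModel(backbone)

x = torch.randn(4, 3, 32, 32)           # dummy CIFAR-sized batch (assumption)
print(model(x).shape)                   # torch.Size([4, 128]) - projection space
print(model.extract_features(x).shape)  # torch.Size([4, 512]) - backbone features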

def load_dataset(train=True):
    """Load CIFAR-10 dataset"""
    ssl_transform = SimCLRTransform(input_size=32, cj_prob=0.8)

    eval_transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
    ])

    base_dataset = torchvision.datasets.CIFAR10(
        root='./data', train=train, download=True
    )

    class SSLDataset(torch.utils.data.Dataset):
        def __init__(self, dataset, transform):
            self.dataset = dataset
            self.transform = transform

        def __len__(self):
            return len(self.dataset)

        def __getitem__(self, idx):
            img, label = self.dataset[idx]
            return self.transform(img), label

    ssl_dataset = SSLDataset(base_dataset, ssl_transform)

    eval_dataset = torchvision.datasets.CIFAR10(
        root='./data', train=train, download=True, transform=eval_transform
    )

    return ssl_dataset, eval_dataset

In this step, we load the CIFAR-10 dataset and apply separate transformations for the self-supervised and evaluation stages. We create a custom SSLDataset class that generates two augmented views of each image for contrastive learning, while the evaluation dataset uses normalized images for downstream tasks. This setup helps the model learn robust representations that are invariant to visual changes. Check out the FULL CODES here.
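The snippet below is a small optional check (not in the original post) of what the SSL dataset yields; it assumes, as the training loop later does, that SimCLRTransform returns a list of two augmented tensors per image.

# Minimal sketch: inspect one SSL sample and one evaluation sample.
ssl_dataset, eval_dataset = load_dataset(train=True)

views, label = ssl_dataset[0]
print(type(views), len(views))         # expected: a list of 2 augmented views
print(views[0].shape, views[1].shape)  # each view is a 3x32x32 tensor

image, label = eval_dataset[0]
print(image.shape, label)              # normalized 3x32x32 tensor and class index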

def train_ssl_model(model, dataloader, epochs=5, device='cuda'):
    """Train SimCLR model"""
    model.to(device)
    criterion = NTXentLoss(temperature=0.5)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.06, momentum=0.9, weight_decay=5e-4)

    print("\n=== Self-Supervised Training ===")
    for epoch in range(epochs):
        model.train()
        total_loss = 0
        for batch_idx, batch in enumerate(dataloader):
            views = batch[0]
            view1, view2 = views[0].to(device), views[1].to(device)

            z1 = model(view1)
            z2 = model(view2)
            loss = criterion(z1, z2)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

            if batch_idx % 50 == 0:
                print(f"Epoch {epoch+1}/{epochs} | Batch {batch_idx} | Loss: {loss.item():.4f}")

        avg_loss = total_loss / len(dataloader)
        print(f"Epoch {epoch+1} Complete | Avg Loss: {avg_loss:.4f}")

    return model

Here, we train our SimCLR model in a self-supervised fashion using the NT-Xent contrastive loss, which encourages similar representations for augmented views of the same image. We optimize the model with stochastic gradient descent (SGD) and track the loss across epochs to monitor learning progress. This stage teaches the model to extract meaningful visual features without relying on labeled data. Check out the FULL CODES here.
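To make the objective concrete, here is an illustrative re-implementation of the NT-Xent idea. The tutorial relies on lightly's NTXentLoss, so treat this as a conceptual sketch rather than the library's exact computation (details such as negative masking may differ).

import torch
import torch.nn.functional as F

def ntxent_sketch(z1, z2, temperature=0.5):
    """Conceptual NT-Xent: each embedding's positive is the other view of the
    same image; all remaining embeddings in the batch act as negatives."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, D) unit vectors
    sim = z @ z.T / temperature                         # scaled cosine similarities
    n = z1.shape[0]
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    sim.fill_diagonal_(float('-inf'))                   # exclude self-similarity
    return F.cross_entropy(sim, targets)

z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(ntxent_sketch(z1, z2))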

def generate_embeddings(model, dataset, device='cuda', batch_size=256):
    """Generate embeddings for the entire dataset"""
    model.eval()
    model.to(device)

    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False, num_workers=2)

    embeddings = []
    labels = []

    print("\n=== Generating Embeddings ===")
    with torch.no_grad():
        for images, targets in dataloader:
            images = images.to(device)
            features = model.extract_features(images)
            embeddings.append(features.cpu().numpy())
            labels.append(targets.numpy())

    embeddings = np.vstack(embeddings)
    labels = np.concatenate(labels)

    print(f"Generated {embeddings.shape[0]} embeddings with dimension {embeddings.shape[1]}")
    return embeddings, labels


def visualize_embeddings(embeddings, labels, method='umap', n_samples=5000):
    """Visualize embeddings using UMAP or t-SNE"""
    print(f"\n=== Visualizing Embeddings with {method.upper()} ===")

    if len(embeddings) > n_samples:
        indices = np.random.choice(len(embeddings), n_samples, replace=False)
        embeddings = embeddings[indices]
        labels = labels[indices]

    if method == 'umap':
        reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, metric='cosine')
    else:
        reducer = TSNE(n_components=2, perplexity=30, metric='cosine')

    embeddings_2d = reducer.fit_transform(embeddings)

    plt.figure(figsize=(12, 10))
    scatter = plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1],
                          c=labels, cmap='tab10', s=5, alpha=0.6)
    plt.colorbar(scatter)
    plt.title(f'CIFAR-10 Embeddings ({method.upper()})')
    plt.xlabel('Component 1')
    plt.ylabel('Component 2')
    plt.tight_layout()
    plt.savefig(f'embeddings_{method}.png', dpi=150)
    print(f"Saved visualization to embeddings_{method}.png")
    plt.show()


def select_coreset(embeddings, labels, budget=1000, strategy='diversity'):
    """
    Select a coreset using different strategies:
    - diversity: Maximum diversity using k-center greedy
    - balanced: Class-balanced selection
    """
    print(f"\n=== Coreset Selection ({strategy}) ===")

    if strategy == 'balanced':
        selected_indices = []
        n_classes = len(np.unique(labels))
        per_class = budget // n_classes

        for cls in range(n_classes):
            cls_indices = np.where(labels == cls)[0]
            chosen = np.random.choice(cls_indices, min(per_class, len(cls_indices)), replace=False)
            selected_indices.extend(chosen)

        return np.array(selected_indices)

    elif strategy == 'diversity':
        selected_indices = []
        remaining_indices = set(range(len(embeddings)))

        first_idx = np.random.randint(len(embeddings))
        selected_indices.append(first_idx)
        remaining_indices.remove(first_idx)

        for _ in range(budget - 1):
            if not remaining_indices:
                break

            remaining = list(remaining_indices)
            selected_emb = embeddings[selected_indices]
            remaining_emb = embeddings[remaining]

            distances = np.min(
                np.linalg.norm(remaining_emb[:, None] - selected_emb, axis=2), axis=1
            )

            max_dist_idx = np.argmax(distances)
            selected_idx = remaining[max_dist_idx]
            selected_indices.append(selected_idx)
            remaining_indices.remove(selected_idx)

        print(f"Selected {len(selected_indices)} samples")
        return np.array(selected_indices)

We extract high-quality feature embeddings from our trained backbone, cache them together with their labels, and project them to 2D with UMAP or t-SNE to watch the cluster structure emerge visually. Next, we curate data with a coreset selector, either class-balanced or diversity-driven (k-center greedy), to prioritize the most informative, non-redundant samples for downstream training. This pipeline helps us both see what the model learns and select what matters most. Check out the FULL CODES here.
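As an aside, the diversity strategy above recomputes all pairwise distances between the remaining and selected points at every step, which grows expensive as the coreset expands. The sketch below (an alternative, not part of the original post) keeps a running minimum-distance array so each greedy step touches every embedding only once; the seed handling is an illustrative choice.

import numpy as np

def kcenter_greedy_fast(embeddings, budget, seed=0):
    """Same k-center greedy idea, but with an incremental nearest-distance
    update: each iteration is O(N * D) instead of O(N * |selected| * D)."""
    rng = np.random.default_rng(seed)
    n = len(embeddings)
    selected = [int(rng.integers(n))]
    min_dist = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    for _ in range(budget - 1):
        idx = int(np.argmax(min_dist))  # farthest point from the current coreset
        selected.append(idx)
        new_dist = np.linalg.norm(embeddings - embeddings[idx], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return np.array(selected)

# Hypothetical usage: indices = kcenter_greedy_fast(embeddings, budget=1000)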

def evaluate_linear_probe(model, train_subset, test_dataset, device='cuda'):
    """Train a linear classifier on frozen features"""
    model.eval()

    train_loader = DataLoader(train_subset, batch_size=128, shuffle=True, num_workers=2)
    test_loader = DataLoader(test_dataset, batch_size=256, shuffle=False, num_workers=2)

    classifier = nn.Linear(512, 10).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(classifier.parameters(), lr=0.001)

    for epoch in range(10):
        classifier.train()
        for images, targets in train_loader:
            images, targets = images.to(device), targets.to(device)

            with torch.no_grad():
                features = model.extract_features(images)

            outputs = classifier(features)
            loss = criterion(outputs, targets)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    classifier.eval()
    correct = 0
    total = 0

    with torch.no_grad():
        for images, targets in test_loader:
            images, targets = images.to(device), targets.to(device)
            features = model.extract_features(images)
            outputs = classifier(features)
            _, predicted = outputs.max(1)
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()

    accuracy = 100. * correct / total
    return accuracy


def main():
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f"Using device: {device}")

    ssl_dataset, eval_dataset = load_dataset(train=True)
    _, test_dataset = load_dataset(train=False)

    ssl_subset = Subset(ssl_dataset, range(10000))
    ssl_loader = DataLoader(ssl_subset, batch_size=128, shuffle=True, num_workers=2, drop_last=True)

    backbone = torchvision.models.resnet18(pretrained=False)
    model = SimCLRModel(backbone)
    model = train_ssl_model(model, ssl_loader, epochs=5, device=device)

    eval_subset = Subset(eval_dataset, range(10000))
    embeddings, labels = generate_embeddings(model, eval_subset, device=device)

    visualize_embeddings(embeddings, labels, method='umap')

    coreset_indices = select_coreset(embeddings, labels, budget=1000, strategy='diversity')
    coreset_subset = Subset(eval_dataset, coreset_indices)

    print("\n=== Active Learning Evaluation ===")
    coreset_acc = evaluate_linear_probe(model, coreset_subset, test_dataset, device=device)
    print(f"Coreset Accuracy (1000 samples): {coreset_acc:.2f}%")

    random_indices = np.random.choice(len(eval_subset), 1000, replace=False)
    random_subset = Subset(eval_dataset, random_indices)
    random_acc = evaluate_linear_probe(model, random_subset, test_dataset, device=device)
    print(f"Random Accuracy (1000 samples): {random_acc:.2f}%")

    print(f"\nCoreset improvement: +{coreset_acc - random_acc:.2f}%")

    print("\n=== Tutorial Complete! ===")
    print("Key takeaways:")
    print("1. Self-supervised learning creates meaningful representations without labels")
    print("2. Embeddings capture semantic similarity between images")
    print("3. Smart data selection (coreset) outperforms random sampling")
    print("4. Active learning reduces labeling costs while maintaining accuracy")


if __name__ == "__main__":
    main()

We freeze the backbone and train a lightweight linear probe to quantify how good the learned features are, then evaluate accuracy on the test set. In the main pipeline, we pretrain with SimCLR, generate embeddings, visualize them, pick a diverse coreset, and compare linear-probe performance against a random subset, thereby directly measuring the value of smart data curation.
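If you want an even quicker read on representation quality without the PyTorch probe, a common shortcut (not in the original post) is to fit a logistic-regression classifier directly on the cached embeddings; the split ratio and solver settings below are illustrative assumptions.

# Minimal sketch: a scikit-learn probe on the arrays returned by
# generate_embeddings(). Assumes `embeddings` and `labels` are already in memory.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(embeddings, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"Logistic-regression probe accuracy: {probe.score(X_te, y_te):.3f}")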

In conclusion, we have seen how self-supervised learning enables representation learning without manual annotations and how coreset-based data selection improves model generalization with fewer samples. By training a SimCLR model, generating embeddings, curating data, and evaluating through active learning, we experience the end-to-end process of a modern self-supervised workflow. We conclude that by combining intelligent data curation with learned representations, we can build models that are both resource-efficient and performance-optimized, setting a strong foundation for scalable machine learning applications.



