Building an Advanced Convolutional Neural Network with Attention for DNA Sequence Classification and Interpretability

In this tutorial, we take a hands-on approach to building an advanced convolutional neural network for DNA sequence classification. We focus on simulating real biological tasks, such as promoter prediction, splice site detection, and regulatory element identification. By combining one-hot encoding, multi-scale convolutional layers, and an attention mechanism, we design a model that not only learns complex motifs but also provides interpretability. As we progress, we generate synthetic data, train with robust callbacks, and visualize results to make sure we fully understand the strengths and limitations of our approach.
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import random
np.random.seed(42)
tf.random.set_seed(42)
random.seed(42)
We begin by importing the libraries for deep learning, data handling, and visualization, and we set random seeds to ensure reproducibility so that our experiments run consistently every time.
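As an aside, newer TensorFlow releases bundle these three seed calls into one utility. A minimal equivalent sketch, assuming TensorFlow 2.7 or later:
import tensorflow as tf

# One-liner alternative (TF >= 2.7): seeds Python's random module,
# NumPy, and TensorFlow's global generator in a single call
tf.keras.utils.set_random_seed(42)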
class DNASequenceClassifier:
    def __init__(self, sequence_length=200, num_classes=2):
        self.sequence_length = sequence_length
        self.num_classes = num_classes
        self.model = None
        self.history = None

    def one_hot_encode(self, sequences):
        # Encode each sequence as a (length, 4) matrix with one channel per base
        mapping = {'A': 0, 'T': 1, 'G': 2, 'C': 3}
        encoded = np.zeros((len(sequences), self.sequence_length, 4))
        for i, seq in enumerate(sequences):
            for j, nucleotide in enumerate(seq[:self.sequence_length]):
                if nucleotide in mapping:
                    encoded[i, j, mapping[nucleotide]] = 1
        return encoded

    def attention_layer(self, inputs, name="attention"):
        # Score every position, normalize the scores with softmax, then
        # reweight the feature maps so informative positions dominate
        attention_weights = layers.Dense(1, activation='tanh', name=f"{name}_weights")(inputs)
        attention_weights = layers.Flatten()(attention_weights)
        attention_weights = layers.Activation('softmax', name=f"{name}_softmax")(attention_weights)
        attention_weights = layers.RepeatVector(inputs.shape[-1])(attention_weights)
        attention_weights = layers.Permute([2, 1])(attention_weights)
        attended = layers.Multiply(name=f"{name}_multiply")([inputs, attention_weights])
        return layers.GlobalMaxPooling1D()(attended)
    def build_model(self):
        inputs = layers.Input(shape=(self.sequence_length, 4), name="dna_input")
        conv_layers = []
        # Multi-scale kernels capture motifs of different widths
        filter_sizes = [3, 7, 15, 25]
        for i, filter_size in enumerate(filter_sizes):
            conv = layers.Conv1D(
                filters=64,
                kernel_size=filter_size,
                activation='relu',
                padding='same',
                name=f"conv_{filter_size}"
            )(inputs)
            conv = layers.BatchNormalization(name=f"bn_conv_{filter_size}")(conv)
            conv = layers.Dropout(0.2, name=f"dropout_conv_{filter_size}")(conv)
            attended = self.attention_layer(conv, name=f"attention_{filter_size}")
            conv_layers.append(attended)
        if len(conv_layers) > 1:
            merged = layers.Concatenate(name="concat_multiscale")(conv_layers)
        else:
            merged = conv_layers[0]
        dense = layers.Dense(256, activation='relu', name="dense_1")(merged)
        dense = layers.BatchNormalization(name="bn_dense_1")(dense)
        dense = layers.Dropout(0.5, name="dropout_dense_1")(dense)
        dense = layers.Dense(128, activation='relu', name="dense_2")(dense)
        dense = layers.BatchNormalization(name="bn_dense_2")(dense)
        dense = layers.Dropout(0.3, name="dropout_dense_2")(dense)
        if self.num_classes == 2:
            outputs = layers.Dense(1, activation='sigmoid', name="output")(dense)
            loss = 'binary_crossentropy'
            metrics = ['accuracy', 'precision', 'recall']
        else:
            outputs = layers.Dense(self.num_classes, activation='softmax', name="output")(dense)
            loss = 'categorical_crossentropy'
            metrics = ['accuracy']
        self.model = keras.Model(inputs=inputs, outputs=outputs, name="DNA_CNN_Classifier")
        optimizer = keras.optimizers.Adam(
            learning_rate=0.001,
            beta_1=0.9,
            beta_2=0.999,
            epsilon=1e-7
        )
        self.model.compile(
            optimizer=optimizer,
            loss=loss,
            metrics=metrics
        )
        return self.model
    def generate_synthetic_data(self, n_samples=10000):
        sequences = []
        labels = []
        # Positives carry known regulatory motifs (e.g., the TATA box);
        # some negatives get low-complexity runs as decoys
        positive_motifs = ['TATAAA', 'CAAT', 'GGGCGG', 'TTGACA']
        negative_motifs = ['AAAAAAA', 'TTTTTTT', 'CCCCCCC', 'GGGGGGG']
        nucleotides = ['A', 'T', 'G', 'C']
        for i in range(n_samples):
            sequence = ''.join(random.choices(nucleotides, k=self.sequence_length))
            if i < n_samples // 2:
                motif = random.choice(positive_motifs)
                pos = random.randint(0, self.sequence_length - len(motif))
                sequence = sequence[:pos] + motif + sequence[pos + len(motif):]
                label = 1
            else:
                if random.random() < 0.3:
                    motif = random.choice(negative_motifs)
                    pos = random.randint(0, self.sequence_length - len(motif))
                    sequence = sequence[:pos] + motif + sequence[pos + len(motif):]
                label = 0
            sequences.append(sequence)
            labels.append(label)
        return sequences, np.array(labels)
    def train(self, X_train, y_train, X_val, y_val, epochs=50, batch_size=32):
        # Early stopping and learning-rate decay keep training robust
        callbacks = [
            keras.callbacks.EarlyStopping(
                monitor='val_loss',
                patience=10,
                restore_best_weights=True
            ),
            keras.callbacks.ReduceLROnPlateau(
                monitor='val_loss',
                factor=0.5,
                patience=5,
                min_lr=1e-6
            )
        ]
        self.history = self.model.fit(
            X_train, y_train,
            validation_data=(X_val, y_val),
            epochs=epochs,
            batch_size=batch_size,
            callbacks=callbacks,
            verbose=1
        )
        return self.history
    def evaluate_and_visualize(self, X_test, y_test):
        y_pred_proba = self.model.predict(X_test)
        y_pred = (y_pred_proba > 0.5).astype(int).flatten()
        print("Classification Report:")
        print(classification_report(y_test, y_pred))
        fig, axes = plt.subplots(2, 2, figsize=(15, 10))
        axes[0, 0].plot(self.history.history['loss'], label='Training Loss')
        axes[0, 0].plot(self.history.history['val_loss'], label='Validation Loss')
        axes[0, 0].set_title('Training History - Loss')
        axes[0, 0].set_xlabel('Epoch')
        axes[0, 0].set_ylabel('Loss')
        axes[0, 0].legend()
        axes[0, 1].plot(self.history.history['accuracy'], label='Training Accuracy')
        axes[0, 1].plot(self.history.history['val_accuracy'], label='Validation Accuracy')
        axes[0, 1].set_title('Training History - Accuracy')
        axes[0, 1].set_xlabel('Epoch')
        axes[0, 1].set_ylabel('Accuracy')
        axes[0, 1].legend()
        cm = confusion_matrix(y_test, y_pred)
        sns.heatmap(cm, annot=True, fmt='d', ax=axes[1, 0], cmap='Blues')
        axes[1, 0].set_title('Confusion Matrix')
        axes[1, 0].set_ylabel('Actual')
        axes[1, 0].set_xlabel('Predicted')
        axes[1, 1].hist(y_pred_proba[y_test == 0], bins=50, alpha=0.7, label='Negative', density=True)
        axes[1, 1].hist(y_pred_proba[y_test == 1], bins=50, alpha=0.7, label='Positive', density=True)
        axes[1, 1].set_title('Prediction Score Distribution')
        axes[1, 1].set_xlabel('Prediction Score')
        axes[1, 1].set_ylabel('Density')
        axes[1, 1].legend()
        plt.tight_layout()
        plt.show()
        return y_pred, y_pred_proba
We define a DNASequenceClassifier that encodes sequences, learns multi-scale motifs with CNNs, and applies an attention mechanism for interpretability. We build and compile the model, generate synthetic motif-rich data, and then train with robust callbacks, visualizing performance to assess classification quality.
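Because every attention softmax is a named layer (following the f-strings in build_model, e.g., "attention_3_softmax"), we can probe it after training. Below is a minimal, hypothetical sketch, not part of the class above, that reads the per-position attention weights for one encoded sequence and plots them; it assumes a trained classifier instance and reuses the imports from the top of the script:
# Hypothetical interpretability probe for a trained DNASequenceClassifier.
# Each attention branch exposes a softmax layer named
# "attention_<filter_size>_softmax" whose output is one weight per position.
def plot_attention_weights(classifier, encoded_seq, filter_size=3):
    probe = keras.Model(
        inputs=classifier.model.input,
        outputs=classifier.model.get_layer(f"attention_{filter_size}_softmax").output
    )
    # weights has shape (1, sequence_length)
    weights = probe.predict(encoded_seq[np.newaxis, ...])
    plt.figure(figsize=(12, 3))
    plt.plot(weights[0])
    plt.title(f"Attention over positions (filter size {filter_size})")
    plt.xlabel("Sequence position")
    plt.ylabel("Attention weight")
    plt.show()
In principle, peaks in this curve should line up with the planted motifs such as TATAAA, which is exactly the interpretability payoff the attention branches are meant to provide.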
def main():
    print("\nAdvanced DNA Sequence Classification with CNN")
    print("=" * 50)
    classifier = DNASequenceClassifier(sequence_length=200, num_classes=2)
    print("Generating synthetic DNA sequences...")
    sequences, labels = classifier.generate_synthetic_data(n_samples=10000)
    print("Encoding DNA sequences...")
    X = classifier.one_hot_encode(sequences)
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.2, random_state=42, stratify=labels
    )
    X_train, X_val, y_train, y_val = train_test_split(
        X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
    )
    print(f"Training set: {X_train.shape}")
    print(f"Validation set: {X_val.shape}")
    print(f"Test set: {X_test.shape}")
    print("Building CNN model...")
    model = classifier.build_model()
    model.summary()
    print("Training model...")
    classifier.train(X_train, y_train, X_val, y_val, epochs=30, batch_size=64)
    print("Evaluating model...")
    y_pred, y_pred_proba = classifier.evaluate_and_visualize(X_test, y_test)
    print("\nTraining and evaluation complete!")

if __name__ == "__main__":
    main()
We wrap up the workflow in the main() function, where we generate synthetic DNA data, encode it, split it into training, validation, and test sets, then build, train, and evaluate our CNN model. We conclude by visualizing performance and confirming that the classification pipeline runs successfully from start to finish.
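Once main() has run, the trained classifier can score novel sequences as well. A short usage sketch, assuming you run the pipeline interactively (e.g., in a notebook) so that classifier remains in scope; the sequence below is made up for illustration:
# Hypothetical usage after training: score a new 200 bp sequence.
# The example is synthetic, with a TATAAA motif planted at the start.
new_seq = "TATAAA" + ''.join(random.choices(['A', 'T', 'G', 'C'], k=194))
encoded = classifier.one_hot_encode([new_seq])   # shape: (1, 200, 4)
score = classifier.model.predict(encoded)[0, 0]  # sigmoid output in [0, 1]
print(f"P(positive) = {score:.3f}")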
In conclusion, we successfully demonstrate how a carefully designed CNN with attention can classify DNA sequences with high accuracy and interpretability. We see how synthetic biological motifs help validate the model's capacity for pattern recognition, and how visualization techniques provide meaningful insights into training dynamics and predictions. Through this journey, we enhance our ability to integrate deep learning architectures with biological data, laying the groundwork for applying these methods to real-world genomics research.