How to Create a Bioinformatics AI Agent Using Biopython for DNA and Protein Analysis

In this tutorial, we show how to construct a sophisticated but accessible Bioinformatics AI Agent utilizing Biopython and common Python libraries, designed to run seamlessly in Google Colab. By combining sequence retrieval, molecular evaluation, visualization, a number of sequence alignment, phylogenetic tree development, and motif searches into a single streamlined class, the tutorial offers a hands-on method to discover the complete spectrum of organic sequence evaluation. Users can begin with built-in pattern sequences such because the SARS-CoV-2 Spike protein, Human Insulin precursor, and E. coli 16S rRNA, or fetch customized sequences straight from NCBI. With built-in visualization instruments powered by Plotly and Matplotlib, researchers and college students alike can shortly carry out complete DNA and protein analyses while not having prior setup past a Colab pocket book. Check out the FULL CODES here.

Copy Code

!pip set up biopython pandas numpy matplotlib seaborn plotly requests beautifulsoup4 scipy scikit-learn networkx
!apt-get replace
!apt-get set up -y clustalw


import os
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.specific as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from Bio import SeqIO, Entrez, Align, Phylo
from Bio.Seq import Seq
from Bio.SeqDocument import SeqDocument
from Bio.SeqUtils import gc_fraction, molecular_weight
from Bio.SeqUtils.ProtParam import ProteinAnalysis
from Bio.Blast import NCBIWWW, NCBIXML
from Bio.Phylo.TreeDevelopment import DistanceCalculator, DistanceTreeConstructor
import warnings
warnings.filterwarnings('ignore')


Entrez.e-mail = "[email protected]"

We start by putting in important bioinformatics and information science libraries, together with ClustalW for sequence alignment. We then import Biopython modules, visualization instruments, and supporting packages, whereas organising Entrez with our e-mail to fetch sequences from NCBI. This ensures our Colab surroundings is totally ready for superior sequence evaluation. Check out the FULL CODES here.

Copy Code

class BioPythonAIAgent:
   def __init__(self, e-mail="[email protected]"):
       self.e-mail = e-mail
       Entrez.e-mail = e-mail
       self.sequences = {}
       self.analysis_results = {}
       self.alignments = {}
       self.timber = {}
  
   def fetch_sequence_from_ncbi(self, accession_id, db="nucleotide", rettype="fasta"):
       attempt:
           deal with = Entrez.efetch(db=db, id=accession_id, rettype=rettype, retmode="textual content")
           report = SeqIO.learn(deal with, "fasta")
           deal with.shut()
           self.sequences[accession_id] = report
           return report
       besides Exception as e:
           print(f"Error fetching sequence: {str(e)}")
           return None
  
   def create_sample_sequences(self):
       covid_spike = "MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT"
      
       human_insulin = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
      
       e_coli_16s = "AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAGCAGCTTGCTGCTTTGCTGACGAGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCATAATGTCGCAAGACCAAAGAGGGGGACCTTCGGGCCTCTTGCCATCGGATGTGCCCAGATGGGATTAGCTAGTAGGTGGGGTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGCGTTAAGGTTAATAACCTTGGCGATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACA"
      
       sample_sequences = [
           ("COVID_Spike", covid_spike, "SARS-CoV-2 Spike Protein"),
           ("Human_Insulin", human_insulin, "Human Insulin Precursor"),
           ("E_coli_16S", e_coli_16s, "E. coli 16S rRNA")
       ]
      
       for seq_id, seq_str, desc in sample_sequences:
           report = SeqDocument(Seq(seq_str), id=seq_id, description=desc)
           self.sequences[seq_id] = report
      
       return sample_sequences
  
   def analyze_sequence(self, sequence_id=None, sequence=None):
       if sequence_id and sequence_id in self.sequences:
           seq_record = self.sequences[sequence_id]
           seq = seq_record.seq
           description = seq_record.description
       elif sequence:
           seq = Seq(sequence)
           description = "Custom sequence"
       else:
           return None
      
       evaluation = {
           'size': len(seq),
           'composition': {}
       }
      
       for base in ['A', 'T', 'G', 'C']:
           evaluation['composition'][base] = seq.rely(base)
      
       if 'A' in evaluation['composition'] and 'T' in evaluation['composition']:
           evaluation['gc_content'] = spherical(gc_fraction(seq) * 100, 2)
           attempt:
               evaluation['molecular_weight'] = spherical(molecular_weight(seq, seq_type='DNA'), 2)
           besides:
               evaluation['molecular_weight'] = len(seq) * 650
      
       attempt:
           if len(seq) % 3 == 0:
               protein = seq.translate()
               evaluation['translation'] = str(protein)
               evaluation['stop_codons'] = protein.rely('*')
              
               if '*' not in str(protein)[:-1]:
                   prot_analysis = ProteinAnalysis(str(protein)[:-1])
                   evaluation['protein_mw'] = spherical(prot_analysis.molecular_weight(), 2)
                   evaluation['isoelectric_point'] = spherical(prot_analysis.isoelectric_point(), 2)
                   evaluation['protein_composition'] = prot_analysis.get_amino_acids_percent()
       besides:
           cross
      
       key = sequence_id if sequence_id else "customized"
       self.analysis_results[key] = evaluation
      
       return evaluation
  
   def visualize_composition(self, sequence_id):
       if sequence_id not in self.analysis_results:
           return
      
       evaluation = self.analysis_results[sequence_id]
      
       fig = make_subplots(
           rows=2, cols=2,
           specs=[[{"type": "pie"}, {"type": "bar"}],
                  [{"colspan": 2}, None]],
           subplot_titles=("Nucleotide Composition", "Base Count", "Sequence Properties")
       )
      
       labels = checklist(evaluation['composition'].keys())
       values = checklist(evaluation['composition'].values())
      
       fig.add_trace(
           go.Pie(labels=labels, values=values, identify="Composition"),
           row=1, col=1
       )
      
       fig.add_trace(
           go.Bar(x=labels, y=values, identify="Count", marker_color=['red', 'blue', 'green', 'orange']),
           row=1, col=2
       )
      
       properties = ['Length', 'GC%', 'MW (kDa)']
       prop_values = [
           analysis['length'],
           evaluation.get('gc_content', 0),
           evaluation.get('molecular_weight', 0) / 1000
       ]
      
       fig.add_trace(
           go.Scatter(x=properties, y=prop_values, mode='markers+strains',
                     marker=dict(measurement=10, colour='purple'), identify="Properties"),
           row=2, col=1
       )
      
       fig.update_layout(
           title=f"Comprehensive Analysis: {sequence_id}",
           showlegend=False,
           top=600
       )
      
       fig.present()
  
   def perform_multiple_sequence_alignment(self, sequence_ids):
       if len(sequence_ids) < 2:
           return None
      
       sequences = []
       for seq_id in sequence_ids:
           if seq_id in self.sequences:
               sequences.append(self.sequences[seq_id])
      
       if len(sequences) < 2:
           return None
      
       from Bio.Align import PairwiseAligner
       aligner = PairwiseAligner()
       aligner.match_score = 2
       aligner.mismatch_score = -1
       aligner.open_gap_score = -2
       aligner.extend_gap_score = -0.5
      
       alignments = []
       for i in vary(len(sequences)):
           for j in vary(i+1, len(sequences)):
               alignment = aligner.align(sequences[i].seq, sequences[j].seq)[0]
               alignments.append(alignment)
      
       return alignments
  
   def create_phylogenetic_tree(self, alignment_key=None, sequences=None):
       if alignment_key and alignment_key in self.alignments:
           alignment = self.alignments[alignment_key]
       elif sequences:
           information = []
           for i, seq in enumerate(sequences):
               report = SeqDocument(Seq(seq), id=f"seq_{i}")
               information.append(report)
           SeqIO.write(information, "temp.fasta", "fasta")
          
           attempt:
               clustalw_cline = ClustalwCommandline("clustalw2", infile="temp.fasta")
               stdout, stderr = clustalw_cline()
               alignment = AlignIO.learn("temp.aln", "clustal")
               os.take away("temp.fasta")
               os.take away("temp.aln")
               os.take away("temp.dnd")
           besides:
               return None
       else:
           return None
      
       calculator = DistanceCalculator('identification')
       dm = calculator.get_distance(alignment)
      
       constructor = DistanceTreeConstructor()
       tree = constructor.upgma(dm)
      
       tree_key = f"tree_{len(self.timber)}"
       self.timber[tree_key] = tree
      
       return tree
  
   def visualize_tree(self, tree):
       fig, ax = plt.subplots(figsize=(10, 6))
       Phylo.draw(tree, axes=ax)
       plt.title("Phylogenetic Tree")
       plt.tight_layout()
       plt.present()
  
   def protein_structure_analysis(self, sequence_id):
       if sequence_id not in self.sequences:
           return None
      
       seq = self.sequences[sequence_id].seq
      
       attempt:
           if len(seq) % 3 == 0:
               protein = seq.translate()
               if '*' not in str(protein)[:-1]:
                   prot_analysis = ProteinAnalysis(str(protein)[:-1])
                  
                   structure_analysis = {
                       'molecular_weight': prot_analysis.molecular_weight(),
                       'isoelectric_point': prot_analysis.isoelectric_point(),
                       'amino_acid_percent': prot_analysis.get_amino_acids_percent(),
                       'secondary_structure': prot_analysis.secondary_structure_fraction(),
                       'flexibility': prot_analysis.flexibility(),
                       'gravy': prot_analysis.gravy()
                   }
                  
                   return structure_analysis
       besides:
           cross
      
       return None
  
   def comparative_analysis(self, sequence_ids):
       outcomes = []
      
       for seq_id in sequence_ids:
           if seq_id in self.analysis_results:
               evaluation = self.analysis_results[seq_id].copy()
               evaluation['sequence_id'] = seq_id
               outcomes.append(evaluation)
      
       df = pd.DataFrame(outcomes)
      
       if len(df) > 1:
           fig = make_subplots(
               rows=2, cols=2,
               subplot_titles=("Length Comparison", "GC Content", "Molecular Weight", "Composition Heatmap")
           )
          
           fig.add_trace(
               go.Bar(x=df['sequence_id'], y=df['length'], identify="Length"),
               row=1, col=1
           )
          
           if 'gc_content' in df.columns:
               fig.add_trace(
                   go.Scatter(x=df['sequence_id'], y=df['gc_content'], mode='markers+strains', identify="GC%"),
                   row=1, col=2
               )
          
           if 'molecular_weight' in df.columns:
               fig.add_trace(
                   go.Bar(x=df['sequence_id'], y=df['molecular_weight'], identify="MW"),
                   row=2, col=1
               )
          
           fig.update_layout(title="Comparative Sequence Analysis", top=600)
           fig.present()
      
       return df
  
   def codon_usage_analysis(self, sequence_id):
       if sequence_id not in self.sequences:
           return None
      
       seq = self.sequences[sequence_id].seq
      
       if len(seq) % 3 != 0:
           return None
      
       codons = {}
       for i in vary(0, len(seq) - 2, 3):
           codon = str(seq[i:i+3])
           codons[codon] = codons.get(codon, 0) + 1
      
       codon_df = pd.DataFrame(checklist(codons.gadgets()), columns=['Codon', 'Count'])
       codon_df = codon_df.sort_values('Count', ascending=False)
      
       fig = px.bar(codon_df.head(20), x='Codon', y='Count',
                    title=f"Top 20 Codon Usage - {sequence_id}")
       fig.present()
      
       return codon_df
  
   def motif_search(self, sequence_id, motif_pattern):
       if sequence_id not in self.sequences:
           return []
      
       seq = str(self.sequences[sequence_id].seq)
       positions = []
      
       for i in vary(len(seq) - len(motif_pattern) + 1):
           if seq[i:i+len(motif_pattern)] == motif_pattern:
               positions.append(i)
      
       return positions
  
   def gc_content_window(self, sequence_id, window_size=100):
       if sequence_id not in self.sequences:
           return None
      
       seq = self.sequences[sequence_id].seq
       gc_values = []
       positions = []
      
       for i in vary(0, len(seq) - window_size + 1, window_size//4):
           window = seq[i:i+window_size]
           gc_values.append(gc_fraction(window) * 100)
           positions.append(i + window_size//2)
      
       fig = go.Figure()
       fig.add_trace(go.Scatter(x=positions, y=gc_values, mode='strains+markers',
                               identify=f'GC Content (window={window_size})'))
       fig.update_layout(
           title=f"GC Content Sliding Window Analysis - {sequence_id}",
           xaxis_title="Position",
           yaxis_title="GC Content (%)"
       )
       fig.present()
      
       return positions, gc_values
  
   def run_comprehensive_analysis(self, sequence_ids):
       outcomes = {}
      
       for seq_id in sequence_ids:
           if seq_id in self.sequences:
               evaluation = self.analyze_sequence(seq_id)
               self.visualize_composition(seq_id)
              
               gc_analysis = self.gc_content_window(seq_id)
               codon_analysis = self.codon_usage_analysis(seq_id)
              
               outcomes[seq_id] = {
                   'basic_analysis': evaluation,
                   'gc_window': gc_analysis,
                   'codon_usage': codon_analysis
               }
      
       if len(sequence_ids) > 1:
           comparative_df = self.comparative_analysis(sequence_ids)
           outcomes['comparative'] = comparative_df
      
       return outcomes

We outline a BioPython AIAgent that permits us to fetch or create sequences, run core analyses (composition, GC%, translation, and protein properties), and visualize outcomes interactively. We additionally carry out pairwise alignments, construct phylogenetic timber, scan motifs, profile codon utilization, analyze GC with sliding home windows, and evaluate a number of sequences—then bundle every part into one complete pipeline. Check out the FULL CODES here.

Copy Code

agent = BioPythonAIAgent()


sample_seqs = agent.create_sample_sequences()


for seq_id, _, _ in sample_seqs:
   agent.analyze_sequence(seq_id)


outcomes = agent.run_comprehensive_analysis(['COVID_Spike', 'Human_Insulin', 'E_coli_16S'])


print("BioPython AI Agent Tutorial Complete!")
print("Available sequences:", checklist(agent.sequences.keys()))
print("Available strategies:", [method for method in dir(agent) if not method.startswith('_')])

We instantiate the BioPythonAIAgent, generate pattern sequences (COVID Spike, Human Insulin, and E. coli 16S), and run a full evaluation pipeline. The outputs affirm that our agent efficiently performs nucleotide, codon, and GC-content analyses whereas additionally making ready comparative visualizations. Finally, we print the checklist of accessible sequences and supported strategies, indicating that the agent’s full analytical capabilities are actually prepared for use. Check out the FULL CODES here.

Copy Code

agent.visualize_composition('COVID_Spike')
agent.gc_content_window('E_coli_16S', window_size=50)
agent.codon_usage_analysis('COVID_Spike')


comparative_df = agent.comparative_analysis(['COVID_Spike', 'Human_Insulin', 'E_coli_16S'])
print(comparative_df)


motif_positions = agent.motif_search('COVID_Spike', 'ATG')
print(f"ATG begin codons discovered at positions: {motif_positions}")


tree = agent.create_phylogenetic_tree(sequences=[
   str(agent.sequences['COVID_Spike'].seq[:300]),
   str(agent.sequences['Human_Insulin'].seq[:300]),
   str(agent.sequences['E_coli_16S'].seq[:300])
])


if tree:
   agent.visualize_tree(tree)

We visualize nucleotide composition, scan E. coli 16S GC% in sliding home windows, and profile codon utilization for the COVID Spike sequence. We then evaluate sequences side-by-side, search for the “ATG” motif, and construct/plot a fast phylogenetic tree from the primary 300 nt of every sequence.

In conclusion, we now have a totally purposeful BioPython AI Agent able to dealing with a number of layers of sequence evaluation, from fundamental nucleotide composition to codon utilization profiling, GC-content sliding home windows, motif searches, and even comparative analyses throughout species. The integration of visualization and phylogenetic tree development offers each intuitive and in-depth insights into genetic information. Whether for tutorial initiatives, bioinformatics training, or analysis prototyping, this Colab-friendly workflow showcases how open-source instruments like Biopython could be harnessed with fashionable AI-inspired pipelines to simplify and speed up organic information exploration.

Check out the FULL CODES here. Feel free to try our GitHub Page for Tutorials, Codes and Notebooks. Also, be happy to comply with us on Twitter and don’t neglect to be part of our 100k+ ML SubReddit and Subscribe to our Newsletter.

The submit How to Create a Bioinformatics AI Agent Using Biopython for DNA and Protein Analysis appeared first on MarkTechPost.