How to Create a Bioinformatics AI Agent Using Biopython for DNA and Protein Analysis
In this tutorial, we show how to build an advanced yet accessible Bioinformatics AI Agent using Biopython and common Python libraries, designed to run in Google Colab. By combining sequence retrieval, molecular analysis, visualization, multiple sequence alignment, phylogenetic tree construction, and motif searches into a single class, the tutorial offers a hands-on way to explore the main steps of biological sequence analysis. Users can start with built-in sample sequences such as the SARS-CoV-2 Spike protein, the Human Insulin precursor, and E. coli 16S rRNA, or fetch custom sequences directly from NCBI. With built-in visualization tools powered by Plotly and Matplotlib, researchers and students alike can quickly run DNA and protein analyses with no setup beyond a Colab notebook.
!pip install biopython pandas numpy matplotlib seaborn plotly requests beautifulsoup4 scipy scikit-learn networkx
!apt-get update
!apt-get install -y clustalw
import os
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from Bio import SeqIO, Entrez, Align, Phylo
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqUtils import gc_fraction, molecular_weight
from Bio.SeqUtils.ProtParam import ProteinAnalysis
from Bio.Blast import NCBIWWW, NCBIXML
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
import warnings
warnings.filterwarnings('ignore')
# Placeholder address: replace it with your own before querying NCBI
Entrez.email = "your_email@example.com"
We begin by installing the essential bioinformatics and data science libraries, along with ClustalW for sequence alignment. We then import the Biopython modules, visualization tools, and supporting packages, and set Entrez.email so that NCBI can identify our requests when we fetch sequences. This prepares the Colab environment for the analyses that follow.
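Before running the rest of the notebook, it can help to confirm that the installation step actually succeeded. The following is a minimal sketch using only the standard library; the helper name `check_environment` is hypothetical and not part of the tutorial's code, and the list of names to check is an assumption based on the install commands above.

```python
import importlib.util
import shutil


def check_environment(modules, binaries):
    """Report which Python modules and command-line tools can be found."""
    report = {}
    for name in modules:
        try:
            # find_spec returns None when a top-level module is not installed
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False
        report[name] = found
    for name in binaries:
        # shutil.which returns None when the executable is not on PATH
        report[name] = shutil.which(name) is not None
    return report


if __name__ == "__main__":
    status = check_environment(["Bio", "plotly", "pandas"], ["clustalw"])
    for name, ok in status.items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

If `clustalw` is reported as missing, the phylogenetic tree step that relies on it will not be able to produce an alignment.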
class BioPythonAIAgent:
    def __init__(self, email="your_email@example.com"):
        # NCBI asks Entrez clients to identify themselves with a contact address
        self.email = email
        Entrez.email = email
        self.sequences = {}
        self.analysis_results = {}
        self.alignments = {}
        self.trees = {}

    def fetch_sequence_from_ncbi(self, accession_id, db="nucleotide", rettype="fasta"):
        """Download one record from NCBI and cache it under its accession ID."""
        try:
            handle = Entrez.efetch(db=db, id=accession_id, rettype=rettype, retmode="text")
            record = SeqIO.read(handle, "fasta")
            handle.close()
            self.sequences[accession_id] = record
            return record
        except Exception as e:
            print(f"Error fetching sequence: {str(e)}")
            return None

    def create_sample_sequences(self):
        """Register three built-in demo sequences (two proteins and one rRNA gene)."""
        covid_spike = "MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT"
        human_insulin = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
        e_coli_16s = "AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAGCAGCTTGCTGCTTTGCTGACGAGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCATAATGTCGCAAGACCAAAGAGGGGGACCTTCGGGCCTCTTGCCATCGGATGTGCCCAGATGGGATTAGCTAGTAGGTGGGGTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGCGTTAAGGTTAATAACCTTGGCGATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACA"
        sample_sequences = [
            ("COVID_Spike", covid_spike, "SARS-CoV-2 Spike Protein"),
            ("Human_Insulin", human_insulin, "Human Insulin Precursor"),
            ("E_coli_16S", e_coli_16s, "E. coli 16S rRNA")
        ]
        for seq_id, seq_str, desc in sample_sequences:
            record = SeqRecord(Seq(seq_str), id=seq_id, description=desc)
            self.sequences[seq_id] = record
        return sample_sequences

    def analyze_sequence(self, sequence_id=None, sequence=None):
        """Compute length, base counts, GC%, molecular weight and, where possible, translation statistics."""
        if sequence_id and sequence_id in self.sequences:
            seq_record = self.sequences[sequence_id]
            seq = seq_record.seq
            description = seq_record.description
        elif sequence:
            seq = Seq(sequence)
            description = "Custom sequence"
        else:
            return None
        analysis = {
            'length': len(seq),
            'composition': {}
        }
        for base in ['A', 'T', 'G', 'C']:
            analysis['composition'][base] = seq.count(base)
        if 'A' in analysis['composition'] and 'T' in analysis['composition']:
            analysis['gc_content'] = round(gc_fraction(seq) * 100, 2)
            try:
                analysis['molecular_weight'] = round(molecular_weight(seq, seq_type='DNA'), 2)
            except Exception:
                # Non-DNA letters (e.g. protein input) make molecular_weight() raise;
                # fall back to a rough estimate of 650 Da per position
                analysis['molecular_weight'] = len(seq) * 650
        try:
            # Translation is only attempted when the length is a whole number of codons
            if len(seq) % 3 == 0:
                protein = seq.translate()
                analysis['translation'] = str(protein)
                analysis['stop_codons'] = protein.count('*')
                if '*' not in str(protein)[:-1]:
                    prot_analysis = ProteinAnalysis(str(protein)[:-1])
                    analysis['protein_mw'] = round(prot_analysis.molecular_weight(), 2)
                    analysis['isoelectric_point'] = round(prot_analysis.isoelectric_point(), 2)
                    analysis['protein_composition'] = prot_analysis.get_amino_acids_percent()
        except Exception:
            pass
        key = sequence_id if sequence_id else "custom"
        self.analysis_results[key] = analysis
        return analysis

    def visualize_composition(self, sequence_id):
        """Plot base composition and summary properties for an already analyzed sequence."""
        if sequence_id not in self.analysis_results:
            return
        analysis = self.analysis_results[sequence_id]
        fig = make_subplots(
            rows=2, cols=2,
            specs=[[{"type": "pie"}, {"type": "bar"}],
                   [{"colspan": 2}, None]],
            subplot_titles=("Nucleotide Composition", "Base Count", "Sequence Properties")
        )
        labels = list(analysis['composition'].keys())
        values = list(analysis['composition'].values())
        fig.add_trace(
            go.Pie(labels=labels, values=values, name="Composition"),
            row=1, col=1
        )
        fig.add_trace(
            go.Bar(x=labels, y=values, name="Count", marker_color=['red', 'blue', 'green', 'orange']),
            row=1, col=2
        )
        properties = ['Length', 'GC%', 'MW (kDa)']
        prop_values = [
            analysis['length'],
            analysis.get('gc_content', 0),
            analysis.get('molecular_weight', 0) / 1000
        ]
        fig.add_trace(
            go.Scatter(x=properties, y=prop_values, mode='markers+lines',
                       marker=dict(size=10, color='purple'), name="Properties"),
            row=2, col=1
        )
        fig.update_layout(
            title=f"Comprehensive Analysis: {sequence_id}",
            showlegend=False,
            height=600
        )
        fig.show()

    def perform_multiple_sequence_alignment(self, sequence_ids):
        """Align every pair of the given sequences and return the best alignment for each pair."""
        if len(sequence_ids) < 2:
            return None
        sequences = []
        for seq_id in sequence_ids:
            if seq_id in self.sequences:
                sequences.append(self.sequences[seq_id])
        if len(sequences) < 2:
            return None
        from Bio.Align import PairwiseAligner
        aligner = PairwiseAligner()
        aligner.match_score = 2
        aligner.mismatch_score = -1
        aligner.open_gap_score = -2
        aligner.extend_gap_score = -0.5
        alignments = []
        for i in range(len(sequences)):
            for j in range(i+1, len(sequences)):
                # Keep only the first (top-scoring) alignment of each pair
                alignment = aligner.align(sequences[i].seq, sequences[j].seq)[0]
                alignments.append(alignment)
        return alignments

    def create_phylogenetic_tree(self, alignment_key=None, sequences=None):
        """Build a UPGMA tree from a stored alignment or from raw sequence strings aligned with ClustalW."""
        # Local imports: neither name is imported at the top of the notebook
        import subprocess
        from Bio import AlignIO
        if alignment_key and alignment_key in self.alignments:
            alignment = self.alignments[alignment_key]
        elif sequences:
            records = []
            for i, seq in enumerate(sequences):
                record = SeqRecord(Seq(seq), id=f"seq_{i}")
                records.append(record)
            SeqIO.write(records, "temp.fasta", "fasta")
            try:
                # Call the ClustalW binary directly instead of the ClustalwCommandline wrapper.
                # The apt package installs it as "clustalw"; it writes temp.aln and temp.dnd.
                subprocess.run(["clustalw", "-infile=temp.fasta"], check=True, capture_output=True)
                alignment = AlignIO.read("temp.aln", "clustal")
                os.remove("temp.fasta")
                os.remove("temp.aln")
                os.remove("temp.dnd")
            except Exception:
                return None
        else:
            return None
        calculator = DistanceCalculator('identity')
        dm = calculator.get_distance(alignment)
        constructor = DistanceTreeConstructor()
        tree = constructor.upgma(dm)
        tree_key = f"tree_{len(self.trees)}"
        self.trees[tree_key] = tree
        return tree

    def visualize_tree(self, tree):
        fig, ax = plt.subplots(figsize=(10, 6))
        # do_show=False defers rendering so the title is applied before the figure is shown
        Phylo.draw(tree, axes=ax, do_show=False)
        plt.title("Phylogenetic Tree")
        plt.tight_layout()
        plt.show()

    def protein_structure_analysis(self, sequence_id):
        """Translate a stored nucleotide sequence and compute protein-level properties."""
        if sequence_id not in self.sequences:
            return None
        seq = self.sequences[sequence_id].seq
        try:
            if len(seq) % 3 == 0:
                protein = seq.translate()
                if '*' not in str(protein)[:-1]:
                    prot_analysis = ProteinAnalysis(str(protein)[:-1])
                    structure_analysis = {
                        'molecular_weight': prot_analysis.molecular_weight(),
                        'isoelectric_point': prot_analysis.isoelectric_point(),
                        'amino_acid_percent': prot_analysis.get_amino_acids_percent(),
                        'secondary_structure': prot_analysis.secondary_structure_fraction(),
                        'flexibility': prot_analysis.flexibility(),
                        'gravy': prot_analysis.gravy()
                    }
                    return structure_analysis
        except Exception:
            pass
        return None

    def comparative_analysis(self, sequence_ids):
        """Collect stored analysis results into a DataFrame and plot them side by side."""
        results = []
        for seq_id in sequence_ids:
            if seq_id in self.analysis_results:
                analysis = self.analysis_results[seq_id].copy()
                analysis['sequence_id'] = seq_id
                results.append(analysis)
        df = pd.DataFrame(results)
        if len(df) > 1:
            fig = make_subplots(
                rows=2, cols=2,
                subplot_titles=("Length Comparison", "GC Content", "Molecular Weight", "Composition Heatmap")
            )
            fig.add_trace(
                go.Bar(x=df['sequence_id'], y=df['length'], name="Length"),
                row=1, col=1
            )
            if 'gc_content' in df.columns:
                fig.add_trace(
                    go.Scatter(x=df['sequence_id'], y=df['gc_content'], mode='markers+lines', name="GC%"),
                    row=1, col=2
                )
            if 'molecular_weight' in df.columns:
                fig.add_trace(
                    go.Bar(x=df['sequence_id'], y=df['molecular_weight'], name="MW"),
                    row=2, col=1
                )
            fig.update_layout(title="Comparative Sequence Analysis", height=600)
            fig.show()
        return df

    def codon_usage_analysis(self, sequence_id):
        """Count non-overlapping triplets from position 0; returns None unless the length is a multiple of 3."""
        if sequence_id not in self.sequences:
            return None
        seq = self.sequences[sequence_id].seq
        if len(seq) % 3 != 0:
            return None
        codons = {}
        for i in range(0, len(seq) - 2, 3):
            codon = str(seq[i:i+3])
            codons[codon] = codons.get(codon, 0) + 1
        codon_df = pd.DataFrame(list(codons.items()), columns=['Codon', 'Count'])
        codon_df = codon_df.sort_values('Count', ascending=False)
        fig = px.bar(codon_df.head(20), x='Codon', y='Count',
                     title=f"Top 20 Codon Usage - {sequence_id}")
        fig.show()
        return codon_df

    def motif_search(self, sequence_id, motif_pattern):
        """Return the 0-based start positions of every exact occurrence of motif_pattern."""
        if sequence_id not in self.sequences:
            return []
        seq = str(self.sequences[sequence_id].seq)
        positions = []
        for i in range(len(seq) - len(motif_pattern) + 1):
            if seq[i:i+len(motif_pattern)] == motif_pattern:
                positions.append(i)
        return positions

    def gc_content_window(self, sequence_id, window_size=100):
        """Plot GC% in overlapping windows that advance by a quarter of the window size."""
        if sequence_id not in self.sequences:
            return None
        seq = self.sequences[sequence_id].seq
        gc_values = []
        positions = []
        for i in range(0, len(seq) - window_size + 1, window_size//4):
            window = seq[i:i+window_size]
            gc_values.append(gc_fraction(window) * 100)
            # Report each window by its centre position
            positions.append(i + window_size//2)
        fig = go.Figure()
        fig.add_trace(go.Scatter(x=positions, y=gc_values, mode='lines+markers',
                                 name=f'GC Content (window={window_size})'))
        fig.update_layout(
            title=f"GC Content Sliding Window Analysis - {sequence_id}",
            xaxis_title="Position",
            yaxis_title="GC Content (%)"
        )
        fig.show()
        return positions, gc_values

    def run_comprehensive_analysis(self, sequence_ids):
        """Run the per-sequence analyses and plots, then the cross-sequence comparison."""
        results = {}
        for seq_id in sequence_ids:
            if seq_id in self.sequences:
                analysis = self.analyze_sequence(seq_id)
                self.visualize_composition(seq_id)
                gc_analysis = self.gc_content_window(seq_id)
                codon_analysis = self.codon_usage_analysis(seq_id)
                results[seq_id] = {
                    'basic_analysis': analysis,
                    'gc_window': gc_analysis,
                    'codon_usage': codon_analysis
                }
        if len(sequence_ids) > 1:
            comparative_df = self.comparative_analysis(sequence_ids)
            results['comparative'] = comparative_df
        return results
We define a BioPythonAIAgent class that lets us fetch or create sequences, run core analyses (composition, GC%, translation, and protein properties), and visualize the results interactively. It also performs pairwise alignments, builds phylogenetic trees, scans for motifs, profiles codon usage, analyzes GC content with sliding windows, and compares multiple sequences, then bundles everything into one pipeline.
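To make the sliding-window GC step concrete, here is a minimal pure-Python sketch of the same idea, without Biopython or Plotly. The function names `gc_percent` and `sliding_gc` are hypothetical, and the sketch counts only the letters G and C, which is simpler than what Biopython's `gc_fraction` does with ambiguous symbols. The default step of a quarter window mirrors the class method; the `max(1, ...)` guard for very small windows is an addition of this sketch.

```python
def gc_percent(fragment):
    """GC percentage of a nucleotide string (0.0 for an empty string)."""
    if not fragment:
        return 0.0
    fragment = fragment.upper()
    gc = fragment.count("G") + fragment.count("C")
    return 100.0 * gc / len(fragment)


def sliding_gc(sequence, window_size=100, step=None):
    """Return (window centre positions, GC% per window) for overlapping windows."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if step is None:
        # A quarter-window step gives 75% overlap between neighbouring windows
        step = max(1, window_size // 4)
    positions, values = [], []
    for start in range(0, len(sequence) - window_size + 1, step):
        window = sequence[start:start + window_size]
        positions.append(start + window_size // 2)
        values.append(gc_percent(window))
    return positions, values


if __name__ == "__main__":
    centres, gc = sliding_gc("GGGGAAAA", window_size=4, step=2)
    print(centres, gc)
```

A sequence shorter than the window produces two empty lists, which is also how the class method behaves before plotting.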
agent = BioPythonAIAgent()
sample_seqs = agent.create_sample_sequences()
for seq_id, _, _ in sample_seqs:
    agent.analyze_sequence(seq_id)
results = agent.run_comprehensive_analysis(['COVID_Spike', 'Human_Insulin', 'E_coli_16S'])
print("BioPython AI Agent Tutorial Complete!")
print("Available sequences:", list(agent.sequences.keys()))
print("Available methods:", [method for method in dir(agent) if not method.startswith('_')])
We instantiate the BioPythonAIAgent, generate the sample sequences (COVID Spike, Human Insulin, and E. coli 16S), and run the full analysis pipeline. The run exercises the nucleotide, codon, and GC-content analyses and prepares the comparative visualizations. Note that COVID_Spike and Human_Insulin are stored as amino-acid sequences, so the nucleotide-oriented metrics computed for them (base counts, GC%, codon usage) are not biologically meaningful. Finally, we print the list of available sequences and supported methods, confirming that the agent's analytical capabilities are ready to use.
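Because the sample set mixes protein and nucleotide sequences while the pipeline applies DNA-oriented metrics to all of them, a small guard can help decide which analyses make sense for a given input. The sketch below is a rough heuristic based only on the letters present; the function name `guess_sequence_type` is hypothetical and not part of the tutorial's class, and a short peptide made up solely of A, C, G, and T would be misclassified as DNA.

```python
DNA_LETTERS = set("ACGTN")
RNA_LETTERS = set("ACGUN")


def guess_sequence_type(sequence):
    """Heuristically classify a sequence string as 'dna', 'rna', 'protein', or 'unknown'."""
    letters = set(sequence.upper())
    if not letters:
        return "unknown"
    # Strings using only A/C/G/T/N are treated as DNA (checked before RNA)
    if letters <= DNA_LETTERS:
        return "dna"
    if letters <= RNA_LETTERS:
        return "rna"
    # Anything with other letters is assumed to be a protein
    return "protein"


if __name__ == "__main__":
    for example in ["ATGCGTNA", "AUGGCU", "MALWMRLLPLLALLALWGPDPAAA"]:
        print(example, "->", guess_sequence_type(example))
```

One possible use is to call this before `codon_usage_analysis` or `gc_content_window` and skip those steps when the result is "protein".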
agent.visualize_composition('COVID_Spike')
agent.gc_content_window('E_coli_16S', window_size=50)
agent.codon_usage_analysis('COVID_Spike')
comparative_df = agent.comparative_analysis(['COVID_Spike', 'Human_Insulin', 'E_coli_16S'])
print(comparative_df)
motif_positions = agent.motif_search('COVID_Spike', 'ATG')
print(f"ATG start codons found at positions: {motif_positions}")
tree = agent.create_phylogenetic_tree(sequences=[
    str(agent.sequences['COVID_Spike'].seq[:300]),
    str(agent.sequences['Human_Insulin'].seq[:300]),
    str(agent.sequences['E_coli_16S'].seq[:300])
])
if tree:
    agent.visualize_tree(tree)
We visualize nucleotide composition, scan E. coli 16S GC% in sliding windows, and profile codon usage for the COVID Spike sequence. We then compare the sequences side by side, search for the “ATG” motif, and build and plot a quick phylogenetic tree from at most the first 300 positions of each sequence. Because COVID_Spike is stored as a protein, the “ATG” search on it matches the residues A, T, G in a row rather than a start codon.
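The agent's motif search matches an exact substring. For protein samples such as the Spike sequence, a pattern-based search is often more useful. The sketch below is one possible extension using Python's `re` module; the function name `find_motif` is hypothetical, and the example pattern is the common N-glycosylation sequon (N, then any residue except P, then S or T, then any residue except P), written as a regular expression.

```python
import re


def find_motif(sequence, pattern):
    """Return 0-based start positions of every (possibly overlapping) regex match."""
    # The zero-width lookahead lets the scan advance one position at a time,
    # so matches that overlap each other are all reported
    lookahead = re.compile(f"(?=(?:{pattern}))")
    return [match.start() for match in lookahead.finditer(sequence)]


if __name__ == "__main__":
    spike_start = "MFVFLVLLPLVSSQCVNLTTRTQ"  # first 23 residues of the sample Spike sequence
    print(find_motif(spike_start, r"N[^P][ST][^P]"))  # prints [16]
    print(find_motif("AAAA", "AA"))  # prints [0, 1, 2]
```

Passing a plain string such as "ATG" as the pattern returns the same positions as the exact-match loop in the class, as long as the string contains no regex metacharacters.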
In conclusion, we now have a fully functional BioPython AI Agent capable of handling multiple layers of sequence analysis, from basic nucleotide composition to codon usage profiling, GC-content sliding windows, motif searches, and comparative analyses across species. The integration of visualization and phylogenetic tree construction provides both intuitive and in-depth insights into genetic data. Whether for academic projects, bioinformatics education, or research prototyping, this Colab-friendly workflow shows how open-source tools like Biopython can be combined with modern AI-inspired pipelines to simplify and accelerate biological data exploration.
The post How to Create a Bioinformatics AI Agent Using Biopython for DNA and Protein Analysis appeared first on MarkTechPost.