How to Build a Single-Cell RNA-seq Analysis Pipeline with Scanpy for PBMC Clustering, Annotation, and Trajectory Discovery
In this tutorial, we carry out a sophisticated single-cell RNA-seq evaluation workflow utilizing Scanpy on the PBMC-3k benchmark dataset. We begin by loading the dataset, inspecting its construction, and making use of high quality management checks to consider gene counts, complete counts, mitochondrial content material, and ribosomal gene alerts. We then filter low-quality cells and genes, detect potential doublets with Scrublet, normalize the information, apply log transformation, and determine extremely variable genes for downstream evaluation. Also, we rating cell-cycle phases, regress out undesirable technical variation, scale the information, and scale back dimensionality utilizing PCA, UMAP, and t-SNE. We additionally cluster cells with the Leiden algorithm, determine marker genes, annotate cell populations utilizing canonical PBMC markers, discover trajectory construction with PAGA and diffusion pseudotime, calculate a customized interferon-response rating, and lastly save the totally analyzed AnnData object for future use.
!pip set up -q scanpy leidenalg python-igraph scrublet
import scanpy as sc
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=80, facecolor="white", figsize=(5, 5))
sc.logging.print_header()
adata = sc.datasets.pbmc3k()
adata.var_names_make_unique()
print(adata)
adata.var["mt"] = adata.var_names.str.startswith("MT-")
adata.var["ribo"] = adata.var_names.str.startswith(("RPS", "RPL"))
sc.pp.calculate_qc_metrics(
adata, qc_vars=["mt", "ribo"], percent_top=None, log1p=False, inplace=True
)
sc.pl.violin(
adata,
["n_genes_by_counts", "total_counts", "pct_counts_mt"],
jitter=0.4, multi_panel=True,
)
sc.pl.scatter(adata, x="total_counts", y="pct_counts_mt")
sc.pl.scatter(adata, x="total_counts", y="n_genes_by_counts")
We set up the required single-cell evaluation libraries and import Scanpy, NumPy, Pandas, Matplotlib, and warning controls. We load the PBMC-3k benchmark dataset, make gene names distinctive, and examine the AnnData object construction. We then calculate high quality management metrics for mitochondrial and ribosomal genes and visualize count-level high quality patterns utilizing violin and scatter plots.
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
adata = adata[adata.obs.n_genes_by_counts < 2500, :].copy()
adata = adata[adata.obs.pct_counts_mt < 5, :].copy()
sc.pp.scrublet(adata)
print("Predicted doublets:", int(adata.obs["predicted_doublet"].sum()))
adata = adata[~adata.obs["predicted_doublet"], :].copy()
adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(adata)
adata.uncooked = adata
adata = adata[:, adata.var.highly_variable].copy()
We filter out low-quality cells and hardly ever detected genes to enhance the reliability of the dataset. We use Scrublet by way of Scanpy to determine predicted doublets and take away them earlier than deeper evaluation. We then protect uncooked counts, normalize expression values, apply log transformation, choose extremely variable genes, and maintain solely probably the most informative options.
s_genes = ["MCM5","PCNA","TYMS","FEN1","MCM2","MCM4","RRM1","UNG","GINS2",
"MCM6","CDCA7","DTL","PRIM1","UHRF1","HELLS","RFC2","NASP",
"RAD51AP1","GMNN","WDR76","SLBP","CCNE2","UBR7","POLD3","MSH2",
"ATAD2","RAD51","RRM2","CDC45","CDC6","EXO1","TIPIN","DSCC1",
"BLM","CASP8AP2","USP1","CLSPN","POLA1","CHAF1B","E2F8"]
g2m_genes = ["HMGB2","CDK1","NUSAP1","UBE2C","BIRC5","TPX2","TOP2A","NDC80",
"CKS2","NUF2","CKS1B","MKI67","TMPO","CENPF","TACC3","SMC4",
"CCNB2","CKAP2L","CKAP2","AURKB","BUB1","KIF11","ANP32E",
"TUBB4B","GTSE1","KIF20B","HJURP","CDCA3","CDC20","TTK",
"CDC25C","KIF2C","RANGAP1","NCAPD2","DLGAP5","CDCA2","CDCA8",
"ECT2","KIF23","HMMR","AURKA","PSRC1","ANLN","LBR","CKAP5",
"CENPE","NEK2","G2E3","CBX5","CENPA"]
s_genes = [g for g in s_genes if g in adata.var_names]
g2m_genes = [g for g in g2m_genes if g in adata.var_names]
sc.tl.score_genes_cell_cycle(adata, s_genes=s_genes, g2m_genes=g2m_genes)
sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"])
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver="arpack")
sc.pl.pca_variance_ratio(adata, log=True, n_pcs=50)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
sc.tl.umap(adata)
sc.tl.tsne(adata, n_pcs=40)
We outline S-phase and G2/M-phase marker genes and retain solely these current within the dataset. We rating every cell for cell-cycle section, regress out undesirable variation from complete counts and mitochondrial proportion, and scale the information for downstream modeling. We then run PCA, examine defined variance, assemble the neighborhood graph, and generate UMAP and t-SNE embeddings.
sc.tl.leiden(adata, decision=0.5, taste="igraph", n_iterations=2, directed=False)
sc.pl.umap(adata, shade="leiden", legend_loc="on knowledge", title="Leiden clusters")
sc.pl.tsne(adata, shade="leiden", legend_loc="on knowledge", title="t-SNE clusters")
sc.tl.rank_genes_groups(adata, "leiden", methodology="wilcoxon")
sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False)
consequence = adata.uns["rank_genes_groups"]
teams = consequence["names"].dtype.names
top_df = pd.DataBody({g: consequence["names"][g][:10] for g in teams})
print("nTop 10 markers per cluster:n", top_df)
marker_genes = {
"B-cell": ["CD79A", "MS4A1"],
"CD8 T-cell": ["CD8A", "CD8B"],
"CD4 T-cell": ["IL7R", "CD4"],
"NK": ["GNLY", "NKG7"],
"CD14 Monocyte": ["CD14", "LYZ"],
"FCGR3A Monocyte": ["FCGR3A", "MS4A7"],
"Dendritic": ["FCER1A", "CST3"],
"Megakaryocyte": ["PPBP"],
}
sc.pl.dotplot(adata, marker_genes, groupby="leiden", standard_scale="var")
sc.pl.stacked_violin(adata, marker_genes, groupby="leiden", swap_axes=True)
We apply Leiden clustering to group cells primarily based on the neighborhood graph and visualize the clusters on UMAP and t-SNE plots. We carry out differential expression evaluation utilizing the Wilcoxon take a look at to determine the highest marker genes for every cluster. We then use canonical PBMC marker genes to assist cell-type annotation by way of dot plots and stacked violin plots.
sc.tl.paga(adata, teams="leiden")
sc.pl.paga(adata, shade="leiden", threshold=0.1)
sc.tl.umap(adata, init_pos="paga")
sc.pl.umap(adata, shade="leiden", legend_loc="on knowledge")
sc.tl.diffmap(adata)
sc.pp.neighbors(adata, n_neighbors=10, use_rep="X_diffmap")
adata.uns["iroot"] = np.flatnonzero(adata.obs["leiden"] == adata.obs["leiden"].cat.classes[0])[0]
sc.tl.dpt(adata)
sc.pl.umap(adata, shade=["leiden", "dpt_pseudotime"], legend_loc="on knowledge")
ifn_genes = ["ISG15", "IFI6", "IFIT1", "IFIT3", "MX1", "OAS1", "STAT1", "IRF7"]
ifn_genes = [g for g in ifn_genes if g in adata.raw.var_names]
sc.tl.score_genes(adata, gene_list=ifn_genes, score_name="IFN_score")
sc.pl.umap(adata, shade="IFN_score", cmap="viridis")
adata.write("pbmc3k_analyzed.h5ad")
print("n
Analysis full — saved to pbmc3k_analyzed.h5ad")
print(adata)
We run PAGA to mannequin connectivity between Leiden clusters and reinitialize UMAP utilizing the PAGA graph to receive a clearer trajectory construction. We compute diffusion maps and diffusion pseudotime to discover attainable development patterns throughout cell states. We additionally calculate an interferon-response gene-set rating, visualize it on UMAP, and save the ultimate analyzed object as an .h5ad file.
In conclusion, we constructed an end-to-end Scanpy pipeline for single-cell RNA-seq evaluation, reworking uncooked PBMC knowledge into interpretable organic insights. We cleaned and preprocessed the dataset, eliminated noisy cells and doublets, chosen informative genes, and generated significant embeddings to visualize mobile construction. We then used Leiden clustering and differential expression evaluation to uncover marker genes and join clusters to identified immune cell varieties. By including PAGA, diffusion pseudotime, and customized gene-set scoring, we prolonged the workflow past primary clustering and confirmed how Scanpy helps deeper organic interpretation. At final, we had a saved .h5ad object that incorporates the processed knowledge, annotations, scores, clusters, and visible evaluation outcomes, prepared for downstream exploration or reporting.
Check out the Full Codes with Notebook here. Also, be happy to comply with us on Twitter and don’t overlook to be part of our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
Need to companion with us for selling your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar and many others.? Connect with us
The submit How to Build a Single-Cell RNA-seq Analysis Pipeline with Scanpy for PBMC Clustering, Annotation, and Trajectory Discovery appeared first on MarkTechPost.
