
How to Build an End-to-End Data Science Workflow with Machine Learning, Interpretability, and Gemini AI Assistance?


In this tutorial, we walk through an advanced end-to-end data science workflow where we combine traditional machine learning with the power of Gemini. We begin by preparing and modeling the diabetes dataset, then we dive into evaluation, feature importance, and partial dependence. Along the way, we bring in Gemini as our AI data scientist to explain results, answer exploratory questions, and highlight risks. By doing this, we build a predictive model while also enhancing our insights and decision-making through natural language interaction. Check out the FULL CODES here.

!pip install -qU google-generativeai scikit-learn matplotlib pandas numpy
from getpass import getpass
import os, json, numpy as np, pandas as pd, matplotlib.pyplot as plt


if not os.environ.get("GOOGLE_API_KEY"):
   os.environ["GOOGLE_API_KEY"] = getpass("🔑 Enter your Gemini API key (hidden): ")


import google.generativeai as genai
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
LLM = genai.GenerativeModel("gemini-1.5-flash")


def ask_llm(prompt, sys=None):
   p = prompt if sys is None else f"System:\n{sys}\n\nUser:\n{prompt}"
   r = LLM.generate_content(p)
   return (getattr(r, "text", "") or "").strip()


from sklearn.datasets import load_diabetes
raw = load_diabetes(as_frame=True)
df  = raw.frame.rename(columns={"target":"disease_progression"})
print("Shape:", df.shape); display(df.head())


from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, QuantileTransformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import Pipeline


X = df.drop(columns=["disease_progression"]); y = df["disease_progression"]
num_cols = X.columns.tolist()
pre = ColumnTransformer(
   [("scale", StandardScaler(), num_cols),
    ("rank",  QuantileTransformer(n_quantiles=min(200, len(X)), output_distribution="normal"), num_cols)],
   remainder="drop", verbose_feature_names_out=False)
model = HistGradientBoostingRegressor(max_depth=3, learning_rate=0.07,
                                      l2_regularization=0.0, max_iter=500,
                                      early_stopping=True, validation_fraction=0.15)
pipe  = Pipeline([("prep", pre), ("hgbt", model)])


Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.20, random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_mse = -cross_val_score(pipe, Xtr, ytr, scoring="neg_mean_squared_error", cv=cv).mean()
cv_rmse = float(cv_mse ** 0.5)
pipe.fit(Xtr, ytr)

We load the diabetes dataset, preprocess the features, and build a robust pipeline using scaling, quantile transformation, and gradient boosting. We split the data, perform cross-validation to estimate RMSE, and then fit the final model to see how well it generalizes. Check out the FULL CODES here.
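
Note that because both branches of the ColumnTransformer receive every numeric column, the model actually sees two copies of each feature (one standardized, one rank-normalized). A minimal sanity-check sketch of that assumption, reusing the fitted pipe from above:

# The preprocessed matrix should be twice as wide as the raw one (10 -> 20),
# since the "scale" and "rank" branches are concatenated column-wise.
Xt = pipe.named_steps["prep"].transform(Xtr)
print("Features before preprocessing:", Xtr.shape[1], "| after:", Xt.shape[1])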

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

pred_tr = pipe.predict(Xtr); pred_te = pipe.predict(Xte)
rmse_tr = mean_squared_error(ytr, pred_tr) ** 0.5
rmse_te = mean_squared_error(yte, pred_te) ** 0.5
mae_te  = mean_absolute_error(yte, pred_te)
r2_te   = r2_score(yte, pred_te)
print(f"CV RMSE={cv_rmse:.2f} | Train RMSE={rmse_tr:.2f} | Test RMSE={rmse_te:.2f} | Test MAE={mae_te:.2f} | R²={r2_te:.3f}")


plt.figure(figsize=(5,4))
plt.scatter(pred_te, yte - pred_te, s=12)
plt.axhline(0, lw=1); plt.xlabel("Predicted"); plt.ylabel("Residual"); plt.title("Residuals (Test)")
plt.show()


from sklearn.inspection import permutation_importance
imp = permutation_importance(pipe, Xte, yte, scoring="neg_mean_squared_error", n_repeats=10, random_state=0)
imp_df = pd.DataFrame({"feature": X.columns, "importance": imp.importances_mean}).sort_values("importance", ascending=False)
display(imp_df.head(10))


plt.figure(figsize=(6,4))
top10 = imp_df.head(10).iloc[::-1]
plt.barh(top10["feature"], top10["importance"])
plt.title("Permutation Importance (Top 10)"); plt.xlabel("Δ(MSE)"); plt.tight_layout(); plt.show()

We evaluate our model by computing train, test, and cross-validation metrics, and visualize residuals to inspect prediction errors. We then calculate permutation importance to identify which features drive the model most, and display the top contributors using a clear bar plot. Check out the FULL CODES here.
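
Permutation importance is stochastic, so the mean alone can overstate how settled the ranking is. A small follow-up sketch (using the importances_std field that permutation_importance also returns) adds error bars across the 10 repeats:

# Map each feature's std across repeats by name, so the sorted order of
# imp_df doesn't misalign the error bars.
std_map = pd.Series(imp.importances_std, index=X.columns)
top10 = imp_df.head(10).iloc[::-1]
plt.figure(figsize=(6,4))
plt.barh(top10["feature"], top10["importance"], xerr=std_map[top10["feature"]].values)
plt.title("Permutation Importance ± std (Top 10)"); plt.xlabel("Δ(MSE)")
plt.tight_layout(); plt.show()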

def compute_pdp(pipe, Xref: pd.DataFrame, feat: str, grid=40):
   xs = np.linspace(np.percentile(Xref[feat], 5), np.percentile(Xref[feat], 95), grid)
   Xtmp = Xref.copy()
   ys = []
   for v in xs:
       Xtmp[feat] = v
       ys.append(pipe.predict(Xtmp).mean())
   return xs, np.array(ys)


top_feats = imp_df["feature"].head(3).tolist()
plt.figure(figsize=(6,4))
for f in top_feats:
   xs, ys = compute_pdp(pipe, Xte.copy(), f, grid=40)
   plt.plot(xs, ys, label=f)
plt.legend(); plt.xlabel("Feature value"); plt.ylabel("Predicted target"); plt.title("Manual PDP (Top 3)")
plt.tight_layout(); plt.show()




report_obj = {
   "dataset": {"rows": int(df.form[0]), "cols": int(df.form[1]-1), "goal": "disease_progression"},
   "metrics": {"cv_rmse": float(cv_rmse), "train_rmse": float(rmse_tr),
               "test_rmse": float(rmse_te), "test_mae": float(mae_te), "r2": float(r2_te)},
   "top_importances": imp_df.head(10).to_dict(orient="information")
}
print(json.dumps(report_obj, indent=2))


sys_msg = ("You are a senior information scientist. Return: (1) ≤120-word govt abstract, "
          "(2) key dangers/assumptions bullets, (3) 5 prioritized subsequent experiments w/ rationale, "
          "(4) quick-win function engineering concepts as Python pseudocode.")
abstract = ask_llm(f"Dataset + metrics + importances:n{json.dumps(report_obj)}", sys=sys_msg)
print("n📊 Gemini Executive Briefn" + "-"*80 + f"n{abstract}n")

We compute manual partial dependence for the top three features and visualize how changing each one affects the predictions. We then assemble a compact JSON report of dataset statistics, metrics, and importances, and ask Gemini to generate an executive brief that includes risks, next experiments, and quick-win feature engineering ideas. Check out the FULL CODES here.
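
As a cross-check on the manual loop, scikit-learn's built-in partial dependence utility should trace essentially the same curves. A minimal sketch with the same pipe, Xte, and top_feats:

# PartialDependenceDisplay lays out one panel per feature inside the given axes.
from sklearn.inspection import PartialDependenceDisplay

fig, ax = plt.subplots(figsize=(9, 3))
PartialDependenceDisplay.from_estimator(pipe, Xte, features=top_feats, ax=ax)
plt.tight_layout(); plt.show()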

SAFE_GLOBALS = {"pd": pd, "np": np}
def run_generated_pandas(code: str, df_local: pd.DataFrame):
   banned = ["__", "import", "open(", "exec(", "eval(", "os.", "sys.", "pd.read", "to_csv", "to_pickle", "to_sql"]
   if any(b in code for b in banned): raise ValueError("Unsafe code rejected.")
   loc = {"df": df_local.copy()}
   exec(code, SAFE_GLOBALS, loc)
   return {k: v for k, v in loc.items() if k != "df"}


def eda_qa(question: str):
   prompt = f"""You are a Python+Pandas analyst. DataFrame `df` columns:
{list(df.columns)}. Write a SHORT pandas snippet (no comments/prints) that computes the answer to:
"{question}". Use only pd/np/df; assign the final result to a variable named `answer`."""
   code = ask_llm(prompt, sys="Return only code. No prose.")
   try:
       out = run_generated_pandas(code, df)
       return code, out.get("answer", None)
   except Exception as e:
       return code, f"[Execution error: {e}]"


questions = [
   "What is the Pearson correlation between BMI and disease_progression?",
   "Show mean target by tertiles of BMI (low/med/high).",
   "Which single feature correlates most with the target (absolute value)?"
]
for q in questions:
   code, ans = eda_qa(q)
   print("nQ:", q, "nCode:n", code, "nAnswer:n", ans)

We build a safe sandbox to execute the pandas code that Gemini generates for exploratory data analysis. We then ask natural-language questions about correlations and feature relationships, let Gemini write the pandas snippets, and automatically run them to get direct answers from the dataset. Check out the FULL CODES here.
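
Generated snippets can run cleanly yet answer a subtly different question, so it is worth verifying at least one result by hand. A one-line check for the first question (the sklearn diabetes frame names the column bmi, lowercase):

# Independent verification of the LLM-generated correlation answer.
manual_r = df["bmi"].corr(df["disease_progression"])
print(f"Manual check — Pearson r(bmi, disease_progression) = {manual_r:.3f}")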

critique = ask_llm(
   f"""Metrics: {report_obj['metrics']}
Top importances: {report_obj['top_importances']}
Identify risks around leakage, overfitting, calibration, OOD robustness, and fairness (even proxy-only).
Propose quick checks (concise Python sketches)."""
)
print("\n🧪 Gemini Risk & Robustness Review\n" + "-"*80 + f"\n{critique}\n")


def what_if(pipe, Xref: pd.DataFrame, feat: str, delta: float = 0.05):
   x0 = Xref.median(numeric_only=True).to_dict()
   x1, x2 = x0.copy(), x0.copy()
   if feat not in x1: return np.nan
   x2[feat] = x1[feat] + delta
   X1 = pd.DataFrame([x1], columns=X.columns)
   X2 = pd.DataFrame([x2], columns=X.columns)
   return float(pipe.predict(X2)[0] - pipe.predict(X1)[0])


for f in top_feats:
   print(f"Estimated Δtarget if {f} will increase by +0.05 ≈ {what_if(pipe, Xte, f, 0.05):.2f}")


print("n✅ Done: Train → Explain → Query with Gemini → Review dangers → What-if evaluation. "
     "Swap the dataset or tweak mannequin params to prolong this pocket book.")

We ask Gemini to review our model for risks like leakage, overfitting, and fairness, and get quick Python checks as suggestions. We then run simple “what-if” analyses to see how small changes in top features affect predictions, helping us interpret the model’s behavior more clearly.
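
A natural extension, sketched below rather than taken from the original notebook, is to sweep a range of perturbations instead of a single +0.05 step, turning the point estimate into a local sensitivity curve around the median patient profile:

# Sweep each top feature over ±0.1 around the median profile and plot
# the resulting change in the model's prediction.
deltas = np.linspace(-0.1, 0.1, 21)
plt.figure(figsize=(5,4))
for f in top_feats:
   plt.plot(deltas, [what_if(pipe, Xte, f, d) for d in deltas], label=f)
plt.axhline(0, lw=1); plt.legend()
plt.xlabel("Δ feature (from median profile)"); plt.ylabel("Δ predicted target")
plt.title("What-if sensitivity sweep"); plt.tight_layout(); plt.show()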

In conclusion, we see how seamlessly we can blend machine learning pipelines with Gemini’s reasoning to make data science more interactive and insightful. We train, evaluate, and interpret a model, then ask Gemini to summarize findings, suggest improvements, and critique risks. Through this journey, we establish a workflow that achieves both predictive performance and interpretability, while also benefiting from an AI collaborator in our data analysis process.


Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

