Building a Reliable End-to-End Machine Learning Pipeline Using MLE-Agent and Ollama Locally

We start this tutorial by showing how we can combine MLE-Agent with Ollama to create a fully local, API-free machine learning workflow. We set up a reproducible environment in Google Colab, generate a small synthetic dataset, and then guide the agent to draft a training script. To make it robust, we sanitize common errors, ensure correct imports, and add a guaranteed fallback script. This way, we keep the workflow smooth while still benefiting from automation.
import os, re, time, textwrap, subprocess, sys
from pathlib import Path

def sh(cmd, check=True, env=None, cwd=None):
    """Run a shell command, stream its output, and optionally raise on failure."""
    print(f"$ {cmd}")
    p = subprocess.run(cmd, shell=True, env={**os.environ, **(env or {})} if env else None,
                       cwd=cwd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    print(p.stdout)
    if check and p.returncode != 0:
        raise RuntimeError(p.stdout)
    return p.stdout
We define a helper function sh that we use to run shell commands. We print the command, capture its output, and raise an error if it fails so that we can monitor execution in real time.
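As a quick illustration (a hypothetical call, not part of the pipeline itself), this is how we use the helper:

out = sh("python --version")       # prints the command, echoes its output, returns stdout
out = sh("exit 1", check=False)    # with check=False, a failing command does not raise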
WORK=Path("/content material/mle_colab_demo"); WORK.mkdir(dad and mom=True, exist_ok=True)
PROJ=WORK/"proj"; PROJ.mkdir(exist_ok=True)
DATA=WORK/"information.csv"; MODEL=WORK/"mannequin.joblib"; PREDS=WORK/"preds.csv"
SAFE=WORK/"train_safe.py"; RAW=WORK/"agent_train_raw.py"; FINAL=WORK/"prepare.py"
MODEL_NAME=os.environ.get("OLLAMA_MODEL","llama3.2:1b")
sh("pip -q set up --upgrade pip")
sh("pip -q set up mle-agent==0.4.* scikit-learn pandas numpy joblib")
sh("curl -fsSL https://ollama.com/set up.sh | sh")
sv = subprocess.Popen("ollama serve", shell=True)
time.sleep(4); sh(f"ollama pull {MODEL_NAME}")
We set up our Colab workspace paths and filenames, then install the exact Python dependencies we need. We install and launch Ollama locally, pull the chosen model, and keep the server running so we can generate code without any external API keys.
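The fixed time.sleep(4) is a best-effort wait; as a more robust alternative, a minimal sketch (assuming the default Ollama port and its /api/tags listing endpoint) polls the server until it answers before pulling the model:

import time, urllib.request
for _ in range(30):
    try:
        urllib.request.urlopen("http://127.0.0.1:11434/api/tags", timeout=2)
        break                      # server is up, safe to pull the model
    except Exception:
        time.sleep(1)              # not ready yet; retry for up to ~30 seconds
else:
    raise RuntimeError("Ollama server did not start in time")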
import numpy as np, pandas as pd
np.random.seed(0)
n=500; X=np.random.rand(n,4); y=(X@np.array([0.4,-0.2,0.1,0.5])+0.15*np.random.randn(n)>0.55).astype(int)
pd.DataFrame(np.c_[X,y], columns=["f1","f2","f3","f4","target"]).to_csv(DATA, index=False)
env = {"OPENAI_API_KEY":"", "ANTHROPIC_API_KEY":"", "GEMINI_API_KEY":"",
       "OLLAMA_HOST":"http://127.0.0.1:11434", "MLE_LLM_ENGINE":"ollama", "MLE_MODEL":MODEL_NAME}
immediate=f"""Return ONE fenced python code block solely.
Write prepare.py that reads {DATA}; 80/20 cut up (random_state=42, stratify);
Pipeline: SimpleImputer + StandardScaler + LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42);
Print ROC-AUC & F1; print sorted coefficient magnitudes; save mannequin to {MODEL} and preds to {PREDS};
Use solely sklearn, pandas, numpy, joblib; no further textual content."""
def extract(txt:str)->str|None:
    txt=re.sub(r"\x1B\[[0-?]*[ -/]*[@-~]", "", txt)  # strip ANSI escape sequences
    m=re.search(r"```(?:python)?\s*([\s\S]*?)```", txt, re.I)
    if m: return m.group(1).strip()
    if txt.strip().lower().startswith("python"): return txt.strip()[6:].strip()
    m=re.search(r"(?:^|\n)(from\s+[^\n]+|import\s+[^\n]+)([\s\S]*)", txt)
    return (m.group(1)+m.group(2)).strip() if m else None
out = sh(f'printf %s "{prompt}" | mle chat', check=False, cwd=str(PROJ), env=env)
code = extract(out) or sh(f'printf %s "{prompt}" | ollama run {MODEL_NAME}', check=False, env=env)
code = (extract(code) or code) if code else ""  # the ollama fallback output may itself be fenced
Path(RAW).write_text(code or "", encoding="utf-8")
We generate a tiny labeled dataset and set environment variables so we can drive MLE-Agent through Ollama locally. We craft a strict prompt for train.py and define an extract helper that pulls only the fenced Python code. We then ask MLE-Agent (falling back to ollama run if needed) and save the raw generated script to disk for sanitization.
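To sanity-check the extractor (hypothetical inputs, not part of the original flow), we can feed it both a fenced reply and a bare import-style reply:

fence = "`" * 3                     # build the fence so this snippet avoids literal backticks
assert extract(f"Sure:\n{fence}python\nprint('hi')\n{fence}") == "print('hi')"
assert extract("Use this:\nimport math\nprint(math.pi)") == "import math\nprint(math.pi)"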
def sanitize(src:str)->str:
    if not src: return ""
    s = src
    s = re.sub(r"\r", "", s)
    s = re.sub(r"^python\b", "", s.strip(), flags=re.I).strip()
    # Auto-fix common scikit-learn import mistakes the model tends to make
    fixes = {
        r"from\s+sklearn\.pipeline\s+import\s+SimpleImputer": "from sklearn.impute import SimpleImputer",
        r"from\s+sklearn\.preprocessing\s+import\s+SimpleImputer": "from sklearn.impute import SimpleImputer",
        r"from\s+sklearn\.pipeline\s+import\s+StandardScaler": "from sklearn.preprocessing import StandardScaler",
        r"from\s+sklearn\.preprocessing\s+import\s+ColumnTransformer": "from sklearn.compose import ColumnTransformer",
        r"from\s+sklearn\.pipeline\s+import\s+ColumnTransformer": "from sklearn.compose import ColumnTransformer",
    }
    for pat, rep in fixes.items(): s = re.sub(pat, rep, s)
    # Prepend any essential imports the generated script forgot
    if "SimpleImputer" in s and "from sklearn.impute import SimpleImputer" not in s:
        s = "from sklearn.impute import SimpleImputer\n" + s
    if "StandardScaler" in s and "from sklearn.preprocessing import StandardScaler" not in s:
        s = "from sklearn.preprocessing import StandardScaler\n" + s
    if "ColumnTransformer" in s and "from sklearn.compose import ColumnTransformer" not in s:
        s = "from sklearn.compose import ColumnTransformer\n" + s
    if "train_test_split" in s and "from sklearn.model_selection import train_test_split" not in s:
        s = "from sklearn.model_selection import train_test_split\n" + s
    if "joblib" in s and "import joblib" not in s:
        s = "import joblib\n" + s
    return s

san = sanitize(code)
safe = textwrap.dedent(f"""
import pandas as pd, numpy as np, joblib
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score
from sklearn.compose import ColumnTransformer
DATA=Path("{DATA}"); MODEL=Path("{MODEL}"); PREDS=Path("{PREDS}")
df=pd.read_csv(DATA); X=df.drop(columns=["target"]); y=df["target"].astype(int)
num=X.columns.tolist()
pre=ColumnTransformer([("num",Pipeline([("imp",SimpleImputer()),("sc",StandardScaler())]),num)])
clf=LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42)
pipe=Pipeline([("pre",pre),("clf",clf)])
Xtr,Xte,ytr,yte=train_test_split(X,y,test_size=0.2,random_state=42,stratify=y)
pipe.fit(Xtr,ytr)
proba=pipe.predict_proba(Xte)[:,1]; pred=(proba>=0.5).astype(int)
print("ROC-AUC:",round(roc_auc_score(yte,proba),4)); print("F1:",round(f1_score(yte,pred),4))
coef=pd.Series(pipe.named_steps["clf"].coef_.ravel(), index=num).abs().sort_values(ascending=False)
print("Top coefficients by |magnitude|:\\n", coef.to_string())
joblib.dump(pipe,MODEL)
pd.DataFrame({{"y_true":yte.reset_index(drop=True),"y_prob":proba,"y_pred":pred}}).to_csv(PREDS,index=False)
print("Saved:",MODEL,PREDS)
""").strip()
We sanitize the agent-generated script by stripping stray prefixes and auto-fixing common scikit-learn import mistakes, then we prepend any missing essential imports so it runs cleanly. We also prepare a safe, fully deterministic fallback train.py that we can run even if the agent's code is imperfect, ensuring we always train, evaluate, and persist artifacts reliably.
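For example (a made-up fragment purely to show the behavior), a script importing SimpleImputer from the wrong module is rewritten in place:

bad = "from sklearn.pipeline import SimpleImputer\nimp = SimpleImputer()"
print(sanitize(bad))
# from sklearn.impute import SimpleImputer
# imp = SimpleImputer()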
chosen = san if ("import " in san and "sklearn" in san and "read_csv" in san) else safe
Path(SAFE).write_text(safe, encoding="utf-8")
Path(FINAL).write_text(chosen, encoding="utf-8")
print("\n=== Using train.py (first 800 chars) ===\n", chosen[:800], "\n...")
sh(f"python {FINAL}")
print("\nArtifacts:", [str(p) for p in WORK.glob('*')])
print("\nDone. Outputs in", WORK)
We decide whether to run the sanitized agent code or fall back to the safe script, then save both for reference. We execute the chosen train.py, print a preview of its contents, and then list all generated artifacts to confirm the workflow completed successfully.
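As an optional last step (a sketch reusing the paths defined above), we can reload the persisted pipeline and predictions to confirm the artifacts are usable:

import joblib, pandas as pd
pipe = joblib.load(MODEL)                      # the fitted sklearn Pipeline
X = pd.read_csv(DATA).drop(columns=["target"])
print("Sample predictions:", pipe.predict(X)[:5])
print(pd.read_csv(PREDS).head())               # persisted y_true / y_prob / y_pred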
We conclude by running the sanitized or safe version of the training script, evaluating ROC-AUC and F1, printing coefficient magnitudes, and saving all artifacts. Through this process, we demonstrate how we can integrate local LLMs with traditional ML pipelines while preserving reliability and safety. The result is a hands-on framework that lets us control execution, avoid external keys, and still leverage automation for real-world model training.