A Coding Implementation on Microsoft SkillOpt for Instrumented Prompt Optimization, Skill Evolution Analysis, and Baseline Comparison

In this tutorial, we implement an instrumented workflow for Microsoft SkillOpt. We set up the SkillOpt repository, connect it to OpenAI-compatible model access, configure the optimizer and target models, and run the SearchQA optimization pipeline with a controlled sample limit to keep costs manageable. We first evaluate the original seed skill as a baseline, then run a real optimization loop in which SkillOpt improves the skill through rollout, reflection, aggregation, selection, updating, and validation-based gating. Along the way, we inspect the training history, visualize changes in accuracy, review edit-budget behavior, track cumulative token usage, and compare the evolved skill with the original baseline.

SkillOpt Environment Setup

import os, re, json, glob, subprocess, pathlib, difflib
# Prefer the Colab secret store; fall back to the environment outside Colab.
try:
   from google.colab import userdata
   OPENAI_KEY = userdata.get("OPENAI_API_KEY")
except Exception:
   OPENAI_KEY = os.environ.get("OPENAI_API_KEY", "")
OPENAI_KEY = OPENAI_KEY or "sk-PASTE-YOUR-KEY-HERE"
assert OPENAI_KEY.startswith("sk-"), "Set a real OpenAI key (Colab Secrets -> OPENAI_API_KEY)."
OPTIMIZER_MODEL = "gpt-4o"
TARGET_MODEL    = "gpt-4o-mini"
RUN  = "outputs/searchqa_adv"
LIMIT = 24  # sample cap that keeps API costs manageable
RUN_KNOBS = dict(num_epochs=2, batch_size=8, minibatch=4, merge_batch=4,
                workers=2, lr=4, lr_sched="cosine", limit=LIMIT)
# Clone and install SkillOpt only once per Colab session.
if not pathlib.Path("/content/SkillOpt/scripts/train.py").exists():
   subprocess.run("git clone --depth 1 https://github.com/microsoft/SkillOpt.git",
                  shell=True, cwd="/content")
   subprocess.run('pip -q install -e . && pip -q install "openai>=1.0" pandas matplotlib',
                  shell=True, cwd="/content/SkillOpt")
os.chdir("/content/SkillOpt")
# Point the Azure-style settings at the OpenAI-compatible endpoint.
os.environ["AZURE_OPENAI_ENDPOINT"]  = "https://api.openai.com/v1"
os.environ["AZURE_OPENAI_API_KEY"]   = OPENAI_KEY
os.environ["AZURE_OPENAI_AUTH_MODE"] = "openai_compatible"
SPLIT = "data/searchqa_id_split"
CFG   = "configs/searchqa/default.yaml"
# Flags shared by every eval and train invocation.
COMMON = ["--azure_openai_endpoint","https://api.openai.com/v1",
         "--cfg-options","model.backend=azure_openai",
         "model.azure_openai_auth_mode=openai_compatible"]

We prepare the full Colab environment for running SkillOpt. We load the OpenAI API key, define the optimizer and target models, clone the SkillOpt repository, and install the required dependencies. We also configure the OpenAI-compatible backend so the SkillOpt scripts can communicate with the selected models.
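One detail worth noting in the setup cell: the assert only checks the `sk-` prefix, so the placeholder string itself passes. Below is a minimal sketch of a stricter pre-flight check. The helper name `looks_like_real_key` and the minimum length of 20 characters are assumptions made for illustration; neither comes from SkillOpt or from OpenAI's documentation.

```python
PLACEHOLDER = "sk-PASTE-YOUR-KEY-HERE"

def looks_like_real_key(key: str, min_len: int = 20) -> bool:
    """Heuristic only: reject empty values, the tutorial placeholder, and very short strings."""
    if not key or key == PLACEHOLDER:
        return False
    # min_len is an arbitrary illustrative threshold, not an official key format rule.
    return key.startswith("sk-") and len(key) >= min_len

print(looks_like_real_key(PLACEHOLDER))        # placeholder is rejected
print(looks_like_real_key("sk-" + "x" * 40))   # dummy string with a plausible shape
```

A check like this fails fast before any paid API call is made, which is cheaper than discovering an authentication error in the middle of the baseline evaluation.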

Baseline Skill Evaluation

def run_cli(args, tag):
   # Stream the subprocess output live and keep a copy for parsing.
   print("\n" + "#"*80 + f"\n# {tag}\n# $ " + " ".join(args) + "\n" + "#"*80)
   p = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
   buf = []
   for line in p.stdout:
       print(line, end="")
       buf.append(line)
   p.wait()
   return "".join(buf)
def parse_acc(txt):
   # Prefer the final "Results:" line; otherwise fall back to the last "hard=" value.
   m = re.search(r"Results:\s*hard=([\d.]+)\s+soft=([\d.]+)", txt)
   if m: return {"hard": float(m.group(1)), "soft": float(m.group(2))}
   g = re.findall(r"hard=([\d.]+)", txt)
   return {"hard": float(g[-1]), "soft": None} if g else None
seed = "skillopt/envs/searchqa/skills/initial.md"
if not pathlib.Path(seed).exists():
   # Fallback seed skill if the repository layout differs.
   seed = "baseline_skill.md"
   pathlib.Path(seed).write_text("You answer questions from the given context.\n")
base_out = run_cli(["python","scripts/eval_only.py","--config",CFG,
                   "--skill",seed,"--split","valid_unseen","--split_dir",SPLIT,
                   "--target_model",TARGET_MODEL,*COMMON,
                   "env.workers=1",f"env.limit={LIMIT}"],
                  "BASELINE EVAL (env seed skill, no training)")
base = parse_acc(base_out)

We define helper functions to run SkillOpt commands and extract evaluation accuracy from the output. We then locate the initial seed skill used by the SearchQA environment and evaluate it on the unseen validation split. This gives us a baseline result before any optimization or training takes place.
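To see how the accuracy parser behaves without spending any tokens, here is a self-contained sketch that runs the same two regular expressions over made-up log text. The log lines are invented and shaped only to match the patterns in `parse_acc`; the exact output format of `eval_only.py` is not confirmed here.

```python
import re

def parse_acc(txt):
    """Extract hard/soft accuracy; fall back to the last 'hard=' value if no Results line."""
    m = re.search(r"Results:\s*hard=([\d.]+)\s+soft=([\d.]+)", txt)
    if m:
        return {"hard": float(m.group(1)), "soft": float(m.group(2))}
    g = re.findall(r"hard=([\d.]+)", txt)
    return {"hard": float(g[-1]), "soft": None} if g else None

# Invented log snippets (assumption: not copied from a real SkillOpt run).
full_log = "step 1 hard=0.25\nResults: hard=0.5 soft=0.75\n"
partial_log = "epoch 0 hard=0.25\nepoch 1 hard=0.375\n"

print(parse_acc(full_log))           # uses the Results line
print(parse_acc(partial_log))        # falls back to the last hard= value
print(parse_acc("no metrics here"))  # nothing to parse
```

Because the function returns `None` when nothing matches, downstream code should guard against a missing baseline, as the final comparison cell does with `if base and trained`.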

Training And Visualization

k = RUN_KNOBS
train_out = run_cli(["python","scripts/train.py","--config",CFG,"--split_dir",SPLIT,
   "--optimizer_model",OPTIMIZER_MODEL,"--target_model",TARGET_MODEL,"--out_root",RUN,
   *COMMON,
   "train.train_size=0",
   f"train.num_epochs={k['num_epochs']}", f"train.batch_size={k['batch_size']}",
   f"gradient.minibatch_size={k['minibatch']}", f"gradient.merge_batch_size={k['merge_batch']}",
   f"gradient.analyst_workers={k['workers']}",
   f"optimizer.learning_rate={k['lr']}", f"optimizer.lr_scheduler={k['lr_sched']}",
   "optimizer.use_slow_update=true", "optimizer.use_meta_skill=true",
   f"env.workers={k['workers']}", f"env.limit={k['limit']}"],
   "TRAIN (rollout->reflect->aggregate->select->update->gate; slow-update + meta-skill)")
import pandas as pd, matplotlib.pyplot as plt
hist = json.loads(pathlib.Path(f"{RUN}/history.json").read_text())
df = pd.json_normalize(hist)
print("\nhistory.json columns:", list(df.columns))
def col(*cands):
   # Return the first column whose lowercased name contains one of the candidates.
   for c in cands:
       for actual in df.columns:
           if c in actual.lower(): return actual
   return None
c_step = col("step")
x = df[c_step] if c_step else range(len(df))
c_tr, c_va = col("train_acc","train_hard","train"), col("val_acc","val_hard","valid","val")
c_lr, c_tok = col("edit_budget","lr","learning_rate","budget"), col("token","cost")
fig, ax = plt.subplots(1, 3, figsize=(16,4))
# Panel 1: train and validation accuracy against the seed baseline.
if c_tr: ax[0].plot(x, df[c_tr], "o-", label="train acc")
if c_va: ax[0].plot(x, df[c_va], "s-", label="val acc (gate)")
if base and base["hard"] is not None: ax[0].axhline(base["hard"], ls="--", c="gray", label="baseline (seed)")
ax[0].set_title("Skill accuracy over steps"); ax[0].set_xlabel("step"); ax[0].legend(); ax[0].grid(alpha=.3)
# Panel 2: edit budget / learning-rate schedule.
if c_lr: ax[1].plot(x, df[c_lr], "d-", c="purple")
ax[1].set_title("Edit-budget / LR schedule (cosine)"); ax[1].set_xlabel("step"); ax[1].grid(alpha=.3)
# Panel 3: running total of token usage.
if c_tok: ax[2].plot(x, pd.to_numeric(df[c_tok],errors="coerce").cumsum(), c="darkorange")
ax[2].set_title("Cumulative token usage"); ax[2].set_xlabel("step"); ax[2].grid(alpha=.3)
plt.tight_layout(); plt.savefig(f"{RUN}/training_dashboard.png", dpi=120); plt.show()

We run the main SkillOpt training loop with the chosen optimizer and target models. We configure the key training settings such as epochs, batch size, minibatch size, learning rate, slow update, meta-skill, and data limit. We then read the training history and visualize accuracy, edit-budget behavior, and cumulative token usage on a dashboard.
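The middle dashboard panel plots the edit budget under a cosine schedule. As a rough intuition for the shape to expect, the sketch below applies the textbook cosine-annealing formula to a starting budget of 4, mirroring `lr=4` in `RUN_KNOBS`. This is a generic illustration and not SkillOpt's scheduler: the function name `cosine_budget`, the floor of 1.0, and the six-step horizon are all assumptions.

```python
import math

def cosine_budget(step: int, total_steps: int, start: float = 4.0, floor: float = 1.0) -> float:
    """Generic cosine decay from `start` at step 0 down to `floor` at the last step."""
    if total_steps <= 1:
        return start
    # Clamp the step into range, then map it to a progress value in [0, 1].
    progress = min(max(step, 0), total_steps - 1) / (total_steps - 1)
    return floor + 0.5 * (start - floor) * (1.0 + math.cos(math.pi * progress))

# Six steps chosen purely for illustration.
schedule = [round(cosine_budget(s, 6), 2) for s in range(6)]
print(schedule)
```

If the plotted curve in the dashboard looks very different from this shape (for example, flat or increasing), that is a cue to check which column the `col(...)` matcher actually picked up from `history.json`.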

Inspecting Skill Evolution

snaps = sorted(glob.glob(f"{RUN}/skills/skill_v*.md"))
best  = pathlib.Path(f"{RUN}/best_skill.md").read_text()
print("\n" + "="*80 + f"\nSKILL EVOLUTION: {len(snaps)} snapshots; diff v0 -> best_skill\n" + "="*80)
if snaps:
   # Unified diff between the first snapshot and the final best skill.
   diff = difflib.unified_diff(pathlib.Path(snaps[0]).read_text().splitlines(),
                               best.splitlines(), snaps[0].split('/')[-1], "best_skill.md", lineterm="")
   print("\n".join(list(diff)[:120]) or "(no textual diff captured)")
# Everything from the first SLOW_UPDATE marker to the end of the skill.
prot = re.search(r"(SLOW_UPDATE.*?)$", best, re.S)
print("\n--- protected SLOW_UPDATE block ---\n",
     prot.group(1)[:1500] if prot else "(none; appears after an epoch boundary)")
patch = (sorted(glob.glob(f"{RUN}/steps/step_*/patches/*.json")) or [None])[0]
analy = (sorted(glob.glob(f"{RUN}/steps/step_*/analysis/*")) or [None])[0]
print("\n" + "="*80 + "\nTEXTUAL GRADIENT: one aggregated patch (clipped to edit budget):\n" + "="*80)
print(pathlib.Path(patch).read_text()[:1500] if patch else "(no patch files)")
print("\n--- one raw Reflect-stage analysis ---\n",
     pathlib.Path(analy).read_text()[:1000] if analy else "(no analysis files)")
for name in ("slow_update", "meta_skill"):
   files = sorted(glob.glob(f"{RUN}/{name}/epoch_*/*"))
   print(f"\n[{name}] {len(files)} artifact(s):", [pathlib.Path(f).name for f in files[:6]])

We inspect how the skill evolves during the optimization process. We compare the first saved skill snapshot with the final best skill, check whether a protected slow-update block appears, and review one generated patch and one reflection analysis. We also list the slow-update and meta-skill artifacts created during epoch-level training.
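The snapshot comparison relies on `difflib.unified_diff` from the standard library. The sketch below runs it on two toy skill texts so the output format is clear before reading a real diff. The evolved text and the file name `skill_v0.md` are invented for illustration; only the seed sentence comes from the fallback skill in the baseline cell.

```python
import difflib

seed_skill = "You answer questions from the given context.\n"
# Hypothetical evolved skill, written by hand for this example.
evolved_skill = (
    "You answer questions from the given context.\n"
    "Quote the shortest span that answers the question.\n"
)

diff = list(difflib.unified_diff(
    seed_skill.splitlines(), evolved_skill.splitlines(),
    fromfile="skill_v0.md", tofile="best_skill.md", lineterm="",
))
print("\n".join(diff))

# Skip the two file-header lines, then count added and removed lines.
body = diff[2:]
added = [line[1:] for line in body if line.startswith("+")]
removed = [line[1:] for line in body if line.startswith("-")]
print(f"{len(added)} line(s) added, {len(removed)} removed")
```

Counting added and removed lines per snapshot is a quick way to check whether the edits stay small, which is what an edit budget is meant to enforce.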

Final Evaluation Comparison

best_out = run_cli(["python","scripts/eval_only.py","--config",CFG,
                   "--skill",f"{RUN}/best_skill.md","--split","valid_unseen","--split_dir",SPLIT,
                   "--target_model",TARGET_MODEL,*COMMON,"env.workers=1",f"env.limit={LIMIT}"],
                  "FINAL TEST EVAL (best_skill)")
trained = parse_acc(best_out)
print("\n" + "="*80 + "\nRESULT (hard = exact match, the gated metric)\n" + "="*80)
print(f"baseline seed skill : {base}")
print(f"trained best_skill  : {trained}")
if base and trained:
   print(f"hard-match lift     : {trained['hard'] - base['hard']:+.4f}")
print(f"\nDeployable artifact: {RUN}/best_skill.md  ({len(best)} chars)")

We evaluate the final optimized best_skill.md file on the unseen validation split. We compare the trained skill's hard-match score with the original baseline score to measure the improvement. We finish by printing the final lift and the path to the deployable optimized skill artifact.
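When reading the lift, it helps to translate it back into questions. With `LIMIT = 24`, a single question moves hard accuracy by 1/24 (about 0.0417), assuming the score is reported as a fraction between 0 and 1. The sketch below uses a hypothetical helper, `hard_lift`, and invented scores; none of the numbers are results from an actual run.

```python
def hard_lift(base, trained, n_samples):
    """Return (lift, changed_questions), or None when either score is missing."""
    if not base or not trained:
        return None
    if base.get("hard") is None or trained.get("hard") is None:
        return None
    lift = trained["hard"] - base["hard"]
    # Assumes accuracy is a fraction in [0, 1], so lift * n_samples is a question count.
    return lift, round(lift * n_samples)

# Invented scores for illustration only.
print(hard_lift({"hard": 0.5, "soft": 0.625}, {"hard": 0.625, "soft": 0.75}, 24))
print(hard_lift(None, {"hard": 0.625, "soft": None}, 24))
```

On a sample this small, a lift of one or two questions can come from run-to-run variation, so raising `LIMIT` is worth considering before drawing conclusions from the comparison.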

Conclusion

In conclusion, we built a complete SkillOpt experiment that goes beyond simply launching a training command. We measured the baseline seed skill, optimized it using a stronger model as the optimizer and a smaller model as the target agent, and inspected how the skill evolved across training steps through saved snapshots, patches, reflections, slow updates, and meta-skill artifacts. We also generated a training dashboard that helps us understand whether the optimization process is improving performance and how much token usage accumulates during the run. By the end, we have a deployable best_skill.md file, a final evaluation on the unseen validation split, and a clear comparison between the original and optimized skills.

The post A Coding Implementation on Microsoft SkillOpt for Instrumented Prompt Optimization, Skill Evolution Analysis, and Baseline Comparison appeared first on MarkTechPost.
