
How to Build Traceable and Evaluated LLM Workflows Using Promptflow, Prompty, and OpenAI

In this tutorial, we build a complete, production-style LLM workflow using Promptflow inside a Colab environment. We begin by setting up a reliable keyring backend to avoid OS dependency issues and securely configure our OpenAI connection. From there, we establish a clean workspace and define a structured Prompty file that acts as the core LLM component of our pipeline. We then design a class-based flex flow that combines deterministic preprocessing with LLM reasoning, allowing us to inject computed hints into model responses. We also enable tracing to monitor each execution step, run both single and batch queries, and generate outputs in a structured format. Finally, we extend the system with an evaluation pipeline that uses an LLM-as-a-judge to score responses against expected answers.

!pip install -q keyrings.alt


import keyring
from keyrings.alt.file import PlaintextKeyring

# Use a plaintext file-based keyring so Promptflow can store connection
# secrets without an OS keyring service (which Colab lacks).
keyring.set_keyring(PlaintextKeyring())


import os
from promptflow.client import PFClient
from promptflow.connections import OpenAIConnection


pf = PFClient()
CONN = "open_ai_connection"
try:
    pf.connections.get(name=CONN)
    print(f"Using existing connection '{CONN}'")
except Exception:
    pf.connections.create_or_update(
        OpenAIConnection(name=CONN, api_key=os.environ["OPENAI_API_KEY"])
    )
    print(f"Created connection '{CONN}'")

We begin by installing a fallback keyring backend to avoid dependency issues in environments like Colab. We then initialize the Promptflow client and check whether an OpenAI connection already exists. If not, we create one using the API key from the environment, ensuring a reusable and consistent connection setup.
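As a quick sanity check, we can list the connections Promptflow has registered locally. This is a minimal sketch; it assumes the `pf.connections` API of the local `PFClient` shown above, and attribute names may differ across Promptflow versions.

# Optional sanity check: enumerate locally registered Promptflow connections.
# Assumes `pf` is the PFClient created above.
for conn in pf.connections.list():
    print("found connection:", conn.name)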

!pip set up -q "promptflow>=1.13.0" "promptflow-tracing" "promptflow-tools" openai


import os, sys, json, getpass, textwrap, importlib
from pathlib import Path


if "OPENAI_API_KEY" not in os.environ:
   os.environ["OPENAI_API_KEY"] = getpass.getpass("Paste your OpenAI API key: ")


WORK_DIR = Path("/content material/pf_demo"); WORK_DIR.mkdir(exist_ok=True, dad and mom=True)
os.chdir(WORK_DIR); sys.path.insert(0, str(WORK_DIR))


from promptflow.shopper import PFClient
from promptflow.connections import OpenAIConnection
from promptflow.tracing import start_trace


pf = PFClient()
CONN = "open_ai_connection"
strive:
   pf.connections.get(title=CONN); print(f"Using present connection '{CONN}'")
besides Exception:
   pf.connections.create_or_update(OpenAIConnection(title=CONN, api_key=os.environ["OPENAI_API_KEY"]))
   print(f"Created connection '{CONN}'")

We install all required Promptflow libraries and set up the project's working directory. We securely capture the OpenAI API key if it isn't already set and configure the environment accordingly. We then reinitialize the Promptflow client and make sure the connection is properly established for downstream use.
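Before wiring the key into any flow, it can help to confirm it works with a direct call through the official `openai` client installed above. This is a minimal sketch; the test prompt is arbitrary, and `gpt-4o-mini` matches the model we use later.

# Optional: verify the API key with a one-off chat completion (openai>=1.0 client).
# The client reads OPENAI_API_KEY from the environment automatically.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Reply with the single word: pong"}],
    max_tokens=5,
)
print(resp.choices[0].message.content)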

(WORK_DIR / "researcher.prompty").write_text("""---
title: Researcher
description: Concise analysis assistant.
mannequin:
 api: chat
 configuration:
   sort: openai
   connection: open_ai_connection
   mannequin: gpt-4o-mini
 parameters:
   temperature: 0.2
   max_tokens: 350
inputs:
 query: {sort: string}
 trace:     {sort: string, default: ""}
pattern:
 query: "What is the velocity of sunshine in vacuum?"
 trace: ""
---
system:
You are a exact analysis assistant. Answer in 1-3 sentences. If a `trace` is given, weave it in.


consumer:
Q: {{query}}
{% if trace %}Hint: {{trace}}{% endif %}
""")


(WORK_DIR / "move.py").write_text(textwrap.dedent('''
   from pathlib import Path
   from promptflow.tracing import hint
   from promptflow.core import Prompty


   BASE = Path(__file__).dad or mum


   @hint
   def safe_calc(expression: str) -> str:
       """A tiny deterministic 'device' the assistant can lean on."""
       if not set(expression) <= set("0123456789+-*/(). "):
           return "unsafe"
       strive: return str(eval(expression))
       besides Exception as e: return f"error:{e}"


   class ResearchAssistant:
       """Class-based flex move. __init__ args develop into move init parameters."""
       def __init__(self, mannequin: str = "gpt-4o-mini"):
           self.mannequin = mannequin
           self.llm = Prompty.load(supply=BASE / "researcher.prompty")


       @hint
       def __call__(self, query: str) -> dict:
           trace = ""
           if "*" in query or "+" in query:
               tokens = [t for t in question.replace("?","").split() if any(c.isdigit() for c in t)]
               expr = "".be part of(tokens)
               if expr:
                   trace = f"computed: {expr} = {safe_calc(expr)}"


           reply = self.llm(query=query, trace=trace)


           return {"query": query, "reply": str(reply).strip(), "hint_used": trace}
'''))


(WORK_DIR / "move.flex.yaml").write_text(
   "$schema: https://azuremlschemas.azureedge.web/promptflow/newest/Flow.schema.jsonn"
   "entry: move:ResearchAssistantn"
)

We define a Prompty file that structures how the LLM should behave as a concise research assistant. We then create a class-based flow that combines a deterministic calculation tool with an LLM call, enabling hybrid reasoning. Finally, we register this flow through a YAML configuration, making it executable within the Promptflow framework.
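Before invoking the full flex flow, we can smoke-test the Prompty on its own, reusing `Prompty.load` exactly as `flow.py` does. A minimal sketch under the file layout created above:

# Optional smoke test: load and call the Prompty directly, outside the flow class.
from promptflow.core import Prompty

researcher = Prompty.load(source=WORK_DIR / "researcher.prompty")
print(researcher(question="What is the speed of light in vacuum?", hint=""))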

try: start_trace()
except Exception as e: print("trace ui unavailable on Colab — traces still recorded:", e)


# Reload flow.py so edits to the module are picked up in the same session.
import flow as _flow; importlib.reload(_flow)
agent = _flow.ResearchAssistant(model="gpt-4o-mini")


print("\n=== Single call ===")
print(json.dumps(agent(question="In one sentence, what is photosynthesis?"), indent=2))
print(json.dumps(agent(question="What is 21 * 19 ?"), indent=2))


data = [
    {"question": "What is the capital of France?",          "expected": "Paris"},
    {"question": "Chemical symbol for gold?",               "expected": "Au"},
    {"question": "Who wrote the play Hamlet?",              "expected": "Shakespeare"},
    {"question": "What is 12 * 11 ?",                       "expected": "132"},
    {"question": "Boiling point of water at sea level (C)?","expected": "100"},
    {"question": "Largest planet in our solar system?",     "expected": "Jupiter"},
]
data_path = WORK_DIR / "data.jsonl"
data_path.write_text("\n".join(json.dumps(r) for r in data))


print("\n=== Batch run ===")
base_run = pf.run(
    flow=str(WORK_DIR / "flow.flex.yaml"),
    data=str(data_path),
    column_mapping={"question": "${data.question}"},
    stream=True,
)
print(pf.get_details(base_run))

We enable tracing to capture execution details and instantiate our research assistant flow. We test the system with individual queries to verify both natural-language and arithmetic handling. We then prepare a dataset and launch a batch run in Promptflow, collecting structured outputs for further evaluation.
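Since `pf.get_details` returns a pandas DataFrame, we can slice and persist the batch outputs for later inspection. A sketch assuming Promptflow's `inputs.*`/`outputs.*` column naming convention; the CSV filename is arbitrary.

# Inspect and persist the batch results as a DataFrame.
details = pf.get_details(base_run)
cols = [c for c in details.columns if c.startswith(("inputs.", "outputs."))]
print(details[cols].head())
details.to_csv(WORK_DIR / "batch_results.csv", index=False)  # keep a copy on disk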

(WORK_DIR / "choose.prompty").write_text("""---
title: Judge
mannequin:
 api: chat
 configuration:
   sort: openai
   connection: open_ai_connection
   mannequin: gpt-4o-mini
 parameters:
   temperature: 0
   max_tokens: 150
   response_format: {sort: json_object}
inputs:
 query: {sort: string}
 reply:   {sort: string}
 anticipated: {sort: string}
---
system:
You are an exacting grader. Decide whether or not the assistant's reply comprises the anticipated truth (case-insensitive, permitting cheap phrasing/synonyms). Reply ONLY as JSON: {"rating": 0 or 1, "purpose": "..."}.


consumer:
Question: {{query}}
Expected: {{anticipated}}
Answer:   {{reply}}
""")


(WORK_DIR / "eval_flow.py").write_text(textwrap.dedent('''
   import json
   from pathlib import Path
   from promptflow.tracing import hint
   from promptflow.core import Prompty


   BASE = Path(__file__).dad or mum


   class Evaluator:
       def __init__(self):
           self.choose = Prompty.load(supply=BASE / "choose.prompty")


       @hint
       def __call__(self, query: str, reply: str, anticipated: str) -> dict:
           uncooked = self.choose(query=query, reply=reply, anticipated=anticipated)
           if isinstance(uncooked, str):
               strive: uncooked = json.hundreds(uncooked)
               besides Exception: uncooked = {"rating": 0, "purpose": f"unparseable:{uncooked[:80]}"}
           return {"rating": int(uncooked.get("rating", 0)), "purpose": str(uncooked.get("purpose",""))}


       def __aggregate__(self, line_results):
           """Run-level aggregation. Whatever this returns exhibits up in pf.get_metrics()."""
           scores = [r["score"] for r in line_results if r]
           return {
               "accuracy": (sum(scores) / len(scores)) if scores else 0.0,
               "handed":   sum(scores),
               "complete":    len(scores),
           }
'''))


(WORK_DIR / "eval.flex.yaml").write_text(
   "$schema: https://azuremlschemas.azureedge.web/promptflow/newest/Flow.schema.jsonn"
   "entry: eval_flow:Evaluatorn"
)


print("n=== Evaluation run ===")
eval_run = pf.run(
   move=str(WORK_DIR / "eval.flex.yaml"),
   knowledge=str(data_path),
   run=base_run,
   column_mapping={
       "query": "${knowledge.query}",
       "anticipated": "${knowledge.anticipated}",
       "reply":   "${run.outputs.reply}",
   },
   stream=True,
)


eval_details = pf.get_details(eval_run)
print(eval_details)


print("n=== Aggregated metrics (from __aggregate__) ===")
print(json.dumps(pf.get_metrics(eval_run), indent=2))


import pandas as pd
if "outputs.rating" in eval_details.columns:
   s = pd.to_numeric(eval_details["outputs.score"], errors="coerce").fillna(0)
   print(f"Manual accuracy: {s.imply():.2%}  ({int(s.sum())}/{len(s)})")

We create a judging Prompty that evaluates model outputs against expected answers using structured JSON responses. We implement an evaluator class that parses results, computes scores, and defines an aggregation method for overall metrics. We then run the evaluation pipeline, link it to the base run, and compute accuracy both through Promptflow's metrics and a manual fallback.
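To read the judge's verdicts next to the original questions and answers, we can join the two runs' detail frames. This sketch assumes both runs preserve line order, so a plain index join lines the rows up; the column names follow the convention used above.

# Join per-line judge scores back onto the base run's questions and answers.
base_df = pf.get_details(base_run)
merged = base_df[["inputs.question", "outputs.answer"]].join(
    eval_details[["outputs.score", "outputs.reason"]]
)
for _, row in merged.iterrows():
    print(f"[{row['outputs.score']}] {row['inputs.question']} -> {row['outputs.reason']}")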

In conclusion, we built a robust, modular LLM pipeline that goes beyond basic prompt-response interactions. We integrated deterministic tools, structured prompting, and reusable flow components to create a system that is both transparent and scalable. Through batch execution and linked evaluation runs, we established a clear feedback loop that lets us measure performance with accuracy metrics and detailed reasoning. Tracing and aggregation make it straightforward to debug, monitor, and improve the system. Altogether, this workflow demonstrates how to design reliable, end-to-end LLM applications with strong foundations in structure, evaluation, and reproducibility.

