An Implementation of Fully Traced and Evaluated Local LLM Pipeline Using Opik for Transparent, Measurable, and Reproducible AI Workflows

In this tutorial, we implement a complete workflow for building, tracing, and evaluating an LLM pipeline using Opik. We structure the system step by step, starting with a lightweight model, adding prompt-based planning, creating a dataset, and finally running automated evaluations. As we move through each snippet, we see how Opik helps us track every function span, visualize the pipeline’s behavior, and measure output quality with clear, reproducible metrics. By the end, we have a fully instrumented QA system that we can extend, compare, and monitor with ease.

!pip install -q opik transformers accelerate torch


import torch
from transformers import pipeline
import textwrap


import opik
from opik import Opik, Prompt, track
from opik.evaluation import evaluate
from opik.evaluation.metrics import Equals, LevenshteinRatio


# Use the first GPU if one is available; -1 tells the pipeline to run on CPU
device = 0 if torch.cuda.is_available() else -1
print("Using device:", "cuda" if device == 0 else "cpu")


opik.configure()
PROJECT_NAME = "opik-hf-tutorial"

We set up our environment by installing the required libraries and initializing Opik. We load the core modules, detect the device, and configure our project so that every trace flows into the correct workspace. We lay the foundation for the rest of the tutorial.

llm = pipeline(
   "text-generation",
   model="distilgpt2",
   device=device,
)


def hf_generate(prompt: str, max_new_tokens: int = 80) -> str:
   result = llm(
       prompt,
       max_new_tokens=max_new_tokens,
       do_sample=True,
       temperature=0.3,
       pad_token_id=llm.tokenizer.eos_token_id,
   )[0]["generated_text"]
   # The pipeline echoes the prompt, so return only the continuation
   return result[len(prompt):].strip()

We load a lightweight Hugging Face model and create a small helper function to generate text cleanly. We prepare the LLM to run locally without external APIs. The low temperature keeps outputs fairly stable, although sampling is still enabled, so runs are not identical unless a seed is fixed. This gives us a self-contained generation layer for the rest of the pipeline.
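The slicing at the end of the helper relies on the text-generation pipeline returning the prompt followed by the continuation. The sketch below isolates that step in a hypothetical helper, `strip_prompt` (not part of the tutorial code), and adds a guard for the case where the prompt is not echoed back. It uses plain strings, so it runs without a model.

```python
def strip_prompt(generated_text: str, prompt: str) -> str:
    """Return only the continuation of a text-generation output.

    Hypothetical helper: it mirrors the slicing in hf_generate, but first
    checks that the output really starts with the prompt.
    """
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fall back to the full text if the prompt was not echoed back.
    return generated_text.strip()


# Simulated pipeline output: the prompt followed by a continuation.
prompt = "Question: What is Opik?\nAnswer:"
echoed = prompt + " An open-source LLM evaluation platform. "
print(strip_prompt(echoed, prompt))
```

The guard matters only if the pipeline is configured to return the continuation alone; with the settings used in this tutorial, the plain slice and the guarded version behave the same.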

plan_prompt = Prompt(
   name="hf_plan_prompt",
   prompt=textwrap.dedent("""
       You are an assistant that creates a plan to answer a question
       using ONLY the given context.


       Context:
       {{context}}


       Question:
       {{query}}


       Return exactly 3 bullet points as a plan.
   """).strip(),
)


answer_prompt = Prompt(
   name="hf_answer_prompt",
   prompt=textwrap.dedent("""
       You answer based only on the given context.


       Context:
       {{context}}


       Question:
       {{query}}


       Plan:
       {{plan}}


       Answer the question in 2–4 concise sentences.
   """).strip(),
)

We define two structured prompts using Opik’s Prompt class. We control the planning phase and the answering phase through clear templates. This helps us maintain consistency and observe how structured prompting affects model behavior.
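To make the double-brace placeholders concrete, here is a minimal sketch of how such a template can be filled. The function `render_template` is a hypothetical stand-in written for illustration; it is not Opik's implementation, and the real `Prompt.format` may handle edge cases differently.

```python
import re


def render_template(template: str, **variables: str) -> str:
    """Fill {{name}} placeholders (hypothetical stand-in, not Opik's code)."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in variables:
            # Fail loudly instead of leaving an unfilled placeholder behind.
            raise KeyError(f"missing template variable: {key}")
        return str(variables[key])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


template = "Context:\n{{context}}\n\nQuestion:\n{{query}}"
rendered = render_template(
    template,
    context="Opik traces LLM calls.",
    query="What does Opik trace?",
)
print(rendered)
```

Raising on a missing variable is a design choice in this sketch: a prompt with a leftover placeholder usually produces a confusing model output rather than an obvious error.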

DOCS = {
   "overview": """
       Opik is an open-source platform for debugging, evaluating,
       and monitoring LLM and RAG applications. It provides tracing,
       datasets, experiments, and evaluation metrics.
   """,
   "tracing": """
       Tracing in Opik logs nested spans, LLM calls, token usage,
       feedback scores, and metadata to inspect complex LLM pipelines.
   """,
   "evaluation": """
       Opik evaluations are defined by datasets, evaluation tasks,
       scoring metrics, and experiments that aggregate scores,
       helping detect regressions or issues.
   """,
}


@track(project_name=PROJECT_NAME, type="tool", name="retrieve_context")
def retrieve_context(query: str) -> str:
   # Simple keyword routing: the first matching rule decides the document
   q = query.lower()
   if "trace" in q or "span" in q:
       return DOCS["tracing"]
   if "metric" in q or "dataset" in q or "evaluate" in q:
       return DOCS["evaluation"]
   return DOCS["overview"]

We build a tiny document store and a retrieval function that Opik tracks as a tool. We let the pipeline select context based on the user’s question. This allows us to simulate a minimal RAG-style workflow without needing an actual vector database.
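Keyword routing of this kind is sensitive to word forms: the string "tracing" does not contain "trace", and "evaluation" does not contain "evaluate". The sketch below is a hypothetical variant, `route_topic`, that matches on word stems instead. The stems are an assumption made for illustration and are not taken from the tutorial code; the function returns the document key rather than the text so the routing is easy to inspect.

```python
def route_topic(question: str) -> str:
    """Pick a document key by keyword stem (hypothetical routing variant)."""
    q = question.lower()
    # "trac" matches "trace", "traces", and "tracing".
    if "trac" in q or "span" in q:
        return "tracing"
    # "evaluat" matches "evaluate" and "evaluation".
    if "metric" in q or "dataset" in q or "evaluat" in q:
        return "evaluation"
    return "overview"


for question in [
    "What kind of platform is Opik?",
    "What does tracing in Opik log?",
    "What are the components of an Opik evaluation?",
]:
    print(f"{question!r} -> {route_topic(question)}")
```

Short stems trade precision for recall ("trac" also matches "track" or "extract"), which is acceptable for a three-document toy store but would not scale to a real corpus.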

@track(project_name=PROJECT_NAME, type="llm", name="plan_answer")
def plan_answer(context: str, query: str) -> str:
   rendered = plan_prompt.format(context=context, query=query)
   return hf_generate(rendered, max_new_tokens=80)


@track(project_name=PROJECT_NAME, type="llm", name="answer_from_plan")
def answer_from_plan(context: str, query: str, plan: str) -> str:
   rendered = answer_prompt.format(
       context=context,
       query=query,
       plan=plan,
   )
   return hf_generate(rendered, max_new_tokens=120)


@track(project_name=PROJECT_NAME, type="general", name="qa_pipeline")
def qa_pipeline(query: str) -> str:
   # Each call below is recorded as a nested span under qa_pipeline
   context = retrieve_context(query)
   plan = plan_answer(context, query)
   answer = answer_from_plan(context, query, plan)
   return answer


print("Sample answer:\n", qa_pipeline("What does Opik help developers do?"))

We bring together planning, reasoning, and answering in a fully traced LLM pipeline. We capture each step with Opik’s decorators so we can analyze spans in the dashboard. By testing the pipeline, we confirm that all components integrate smoothly.

client = Opik()


dataset = client.get_or_create_dataset(
   name="HF_Opik_QA_Dataset",
   description="Small QA dataset for HF + Opik tutorial",
)


dataset.insert([
   {
       "question": "What kind of platform is Opik?",
       "context": DOCS["overview"],
       "reference": "Opik is an open-source platform for debugging, evaluating and monitoring LLM and RAG applications.",
   },
   {
       "question": "What does tracing in Opik log?",
       "context": DOCS["tracing"],
       "reference": "Tracing logs nested spans, LLM calls, token usage, feedback scores, and metadata.",
   },
   {
       "question": "What are the components of an Opik evaluation?",
       "context": DOCS["evaluation"],
       "reference": "An Opik evaluation uses datasets, evaluation tasks, scoring metrics and experiments that aggregate scores.",
   },
])

We create and populate a dataset inside Opik that our evaluation will use. We insert several question–answer pairs that cover different aspects of Opik. This dataset will serve as the ground truth for our QA evaluation later.
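Because the evaluation task later reads each item by key, a mismatched key name would only surface as an error in the middle of an experiment. A small pre-insert check can catch this early. The helper `find_invalid_items` and the sample items below are hypothetical and only sketch the idea.

```python
REQUIRED_KEYS = {"question", "context", "reference"}


def find_invalid_items(items):
    """Return (index, missing_keys) pairs for items lacking a required key."""
    problems = []
    for index, item in enumerate(items):
        missing = REQUIRED_KEYS - set(item)
        if missing:
            problems.append((index, sorted(missing)))
    return problems


# Illustrative items: the second one uses "query" instead of "question".
sample_items = [
    {"question": "What is Opik?", "context": "...", "reference": "..."},
    {"query": "What does tracing log?", "context": "...", "reference": "..."},
]
print(find_invalid_items(sample_items))
```

Running the check before `dataset.insert(...)` keeps malformed rows out of the stored dataset, where they are harder to notice.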

equals_metric = Equals()
lev_metric = LevenshteinRatio()


def evaluation_task(item: dict) -> dict:
   # Run the traced pipeline on the dataset question
   output = qa_pipeline(item["question"])
   # The metrics read the "output" and "reference" keys
   return {
       "output": output,
       "reference": item["reference"],
   }

We define the evaluation task and select two metrics, Equals and LevenshteinRatio, to measure model quality. We ensure the task produces outputs in the exact format required for scoring. This connects our pipeline to Opik’s evaluation engine.
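To see what the two kinds of score measure, the sketch below computes an exact-match score and an edit-distance similarity by hand. The function names are hypothetical, and the normalization used here (one minus distance divided by the longer length) is an assumption; Opik's LevenshteinRatio may normalize differently, so the numbers are not guaranteed to match.

```python
def exact_match(output: str, reference: str) -> float:
    """1.0 when the strings are identical, otherwise 0.0."""
    return 1.0 if output == reference else 0.0


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def similarity_ratio(output: str, reference: str) -> float:
    """Edit-distance similarity in [0, 1]; assumed normalization, see lead-in."""
    longest = max(len(output), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(output, reference) / longest


print(exact_match("kitten", "sitting"))
print(levenshtein_distance("kitten", "sitting"))
print(round(similarity_ratio("kitten", "sitting"), 3))
```

Exact match is a very strict criterion for free-form generation, which is why a graded similarity score is useful alongside it.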

evaluation_result = evaluate(
   dataset=dataset,
   task=evaluation_task,
   scoring_metrics=[equals_metric, lev_metric],
   experiment_name="HF_Opik_QA_Experiment",
   project_name=PROJECT_NAME,
   task_threads=1,  # run items one at a time
)


print("\nExperiment URL:", evaluation_result.experiment_url)

We run the evaluation experiment using Opik’s evaluate function. We keep the execution sequential for stability in Colab. Once complete, we receive a link to view the experiment details inside the Opik dashboard.
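Conceptually, a sequential evaluation run is a loop: take each dataset item, call the task, and score its output with every metric. The sketch below shows that loop with a stub task in place of the real pipeline. All names (`run_evaluation`, `stub_task`, `STUB_ANSWERS`) are hypothetical, and this is a simplified picture rather than Opik's actual engine.

```python
def run_evaluation(items, task, metrics):
    """Run the task on each item in order and score it with every metric."""
    rows = []
    for item in items:
        result = task(item)
        scores = {
            name: metric(result["output"], result["reference"])
            for name, metric in metrics.items()
        }
        rows.append({"question": item["question"], "scores": scores})
    return rows


# Stub answers standing in for model output (illustrative only).
STUB_ANSWERS = {"What is Opik?": "An open-source platform."}


def stub_task(item):
    output = STUB_ANSWERS.get(item["question"], "")
    return {"output": output, "reference": item["reference"]}


items = [
    {"question": "What is Opik?", "reference": "An open-source platform."},
    {"question": "What does tracing log?", "reference": "Nested spans."},
]
metrics = {"exact_match": lambda output, reference: float(output == reference)}

rows = run_evaluation(items, stub_task, metrics)
for row in rows:
    print(row)
```

Processing one item at a time, as with `task_threads=1`, keeps memory use and log output predictable when the task wraps a local model.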

agg = evaluation_result.aggregate_evaluation_scores()


print("\nAggregated scores:")
for metric_name, stats in agg.aggregated_scores.items():
   print(metric_name, "=>", stats)

We aggregate and print the evaluation scores to understand how well our pipeline performs. We inspect the metric results to see where outputs align with references and where improvements are needed. This closes the loop on our fully instrumented LLM workflow.
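Aggregation here means collapsing per-item scores into summary statistics per metric. The sketch below shows one plausible way to do that with the standard library. The helper `aggregate_scores` is hypothetical, the statistics it reports are an assumption about what is useful, and the score values are placeholders for illustration, not results from this tutorial.

```python
from statistics import mean


def aggregate_scores(rows):
    """Collapse per-item metric scores into summary statistics per metric."""
    by_metric = {}
    for row in rows:
        for name, value in row.items():
            by_metric.setdefault(name, []).append(value)
    return {
        name: {
            "mean": mean(values),
            "min": min(values),
            "max": max(values),
            "count": len(values),
        }
        for name, values in by_metric.items()
    }


# Placeholder scores for illustration only, not measured results.
placeholder_rows = [
    {"equals": 0.0, "levenshtein_ratio": 0.25},
    {"equals": 1.0, "levenshtein_ratio": 1.0},
    {"equals": 0.0, "levenshtein_ratio": 0.5},
]

summary = aggregate_scores(placeholder_rows)
for metric_name, stats in summary.items():
    print(metric_name, "=>", stats)
```

Reporting the minimum alongside the mean helps spot a single badly answered item that an average would hide.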

In conclusion, we set up a small but fully functional LLM evaluation ecosystem powered entirely by Opik and a local model. We observe how traces, prompts, datasets, and metrics come together to give us clear visibility into the model’s reasoning process. As we finalize our evaluation and review the aggregated scores, we appreciate how Opik lets us iterate quickly, experiment systematically, and validate improvements in a structured and reliable way.

The post An Implementation of Fully Traced and Evaluated Local LLM Pipeline Using Opik for Transparent, Measurable, and Reproducible AI Workflows appeared first on MarkTechPost.