Building Reflective Prompt Optimization with GEPA: Multi-Component Prompts, Structured Feedback, and Held-Out Validation
In this tutorial, we use GEPA as a reflective prompt-evolution framework to improve how a language model solves arithmetic word problems. We start with a weak seed prompt, create a small deterministic benchmark, define a structured evaluator, and pass actionable feedback to GEPA so it can understand why a candidate prompt fails. We also use a multi-component prompt setup in which the instruction field and the output-format rules evolve together. By the end, we compare the baseline prompt with the optimized prompt on a held-out validation set and examine how the evolutionary process changes performance.
Installing GEPA and LiteLLM and Configuring the Task and Reflection Models
!pip install -q gepa litellm

import os, re, json, random, getpass, textwrap
import litellm
import gepa.optimize_anything as oa
from gepa.optimize_anything import (
    optimize_anything, GEPAConfig, EngineConfig, ReflectionConfig,
)

litellm.suppress_debug_info = True

# Ask for the key only if it is not already set in the environment.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

TASK_LM = "openai/gpt-4o-mini"    # model that solves the word problems
REFLECTION_LM = "openai/gpt-4.1"  # model that proposes improved prompts
MAX_METRIC_CALLS = 100            # evaluation budget for the optimizer
We install GEPA and LiteLLM, then import the libraries required for prompt optimization and model calls. We read the OpenAI API key from the environment, or prompt for it without echoing it, and define two models: a task model that solves the problems and a reflection model that improves the prompt. We also set the maximum metric-call budget to keep the optimization process under control.
Building a Deterministic Arithmetic Benchmark Dataset
def make_problems(n, seed=0):
    # A seeded RNG makes the benchmark reproducible across runs.
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        t = rng.choice(["discount", "travel", "wallet", "chain"])
        if t == "discount":
            unit = rng.choice([40, 60, 80, 120])
            qty = rng.choice([5, 6, 8, 10])
            disc = rng.choice([10, 20, 25, 50])
            total = unit * qty
            gold = total - total * disc // 100
            q = (f"A shop sells notebooks at {unit} rupees each. You buy {qty} "
                 f"notebooks and get a {disc}% discount on the whole bill. "
                 f"How many rupees do you pay in total?")
        elif t == "travel":
            s1, h1 = rng.choice([40, 50, 60]), rng.choice([2, 3])
            s2, h2 = rng.choice([30, 45, 70]), rng.choice([1, 2, 3])
            gold = s1 * h1 + s2 * h2
            q = (f"A car drives at {s1} km/h for {h1} hours, then at {s2} km/h "
                 f"for {h2} hours. What is the total distance travelled, in km?")
        elif t == "wallet":
            tens = rng.choice([3, 5, 7, 9])
            fifties = rng.choice([2, 4, 6])
            spent = rng.choice([50, 80, 110, 150])
            gold = tens * 10 + fifties * 50 - spent
            q = (f"You have {tens} ten-rupee notes and {fifties} fifty-rupee "
                 f"notes. You spend {spent} rupees. How many rupees are left?")
        else:  # "chain"
            x = rng.choice([6, 9, 12, 15])
            y = rng.choice([4, 7, 10])
            z = rng.choice([3, 8, 11])
            gold = x * 2 - y + z
            q = (f"Start with the number {x}. Double it, then subtract {y}, "
                 f"then add {z}. What number do you end with?")
        out.append({"question": q, "answer": gold})
    return out

all_problems = make_problems(18, seed=42)
random.Random(1).shuffle(all_problems)
trainset = all_problems[:12]
valset = all_problems[12:]
print(f"Dataset: {len(trainset)} train / {len(valset)} val problems\n")
We create a small deterministic dataset of 18 arithmetic word problems covering discounts, travel distance, wallet calculations, and chained operations. We generate the correct answer for each problem programmatically, which keeps the benchmark reliable and easy to evaluate. We then shuffle the examples with a fixed seed and split them into a 12-problem training set for optimization and a 6-problem validation set for testing generalization.
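As a quick sanity check on the benchmark, the sketch below confirms that the integer division in the discount branch never truncates a fractional amount, and that two generators built from the same seed produce the same draws. The option lists are copied from the generator above; the helper name `discount_gold` is ours and exists only for this illustration.

```python
import itertools
import random

# Option lists copied from the "discount" branch of make_problems.
UNITS = [40, 60, 80, 120]
QTYS = [5, 6, 8, 10]
DISCOUNTS = [10, 20, 25, 50]


def discount_gold(unit, qty, disc):
    """Hypothetical helper: same arithmetic as the discount branch."""
    total = unit * qty
    return total - total * disc // 100


# Collect every combination where the discount would not be a whole number.
inexact = [
    (u, q, d)
    for u, q, d in itertools.product(UNITS, QTYS, DISCOUNTS)
    if (u * q * d) % 100 != 0
]
print("combinations with a fractional discount:", len(inexact))

# Two generators with the same seed yield the same sequence of choices.
a, b = random.Random(42), random.Random(42)
same = [a.choice(UNITS) for _ in range(5)] == [b.choice(UNITS) for _ in range(5)]
print("same seed gives same draws:", same)

# Worked example: 5 notebooks at 80 rupees with 25% off -> 400 - 100.
print(discount_gold(80, 5, 25))
```

Because every unit price is a multiple of 20 and every discount divides evenly into it, the gold answers are exact integers, so an integer-only answer format does not penalize a correct model.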
Defining the Evaluator and Structured Feedback for GEPA
def build_system_prompt(candidate: dict) -> str:
    # Join the two evolving components into a single system prompt.
    return (f"{candidate['instructions']}\n\n"
            f"OUTPUT FORMAT RULES:\n{candidate['format_rules']}")

def call_task_lm(system_prompt: str, question: str) -> str:
    # Retry up to three times; on the last failure return an error marker.
    for attempt in range(3):
        try:
            r = litellm.completion(
                model=TASK_LM,
                messages=[{"role": "system", "content": system_prompt},
                          {"role": "user", "content": question}],
                temperature=0, max_tokens=600, timeout=60,
            )
            return r["choices"][0]["message"]["content"] or ""
        except Exception as e:
            if attempt == 2:
                return f"[LM_ERROR] {e}"
    return ""

def parse_answers(text: str):
    # fmt_val: integer on a '#### <integer>' line; last_val: last integer anywhere.
    formatted = re.search(r"####\s*(-?\d+)", text)
    all_nums = re.findall(r"-?\d+", text)
    fmt_val = int(formatted.group(1)) if formatted else None
    last_val = int(all_nums[-1]) if all_nums else None
    return fmt_val, last_val

def evaluate(candidate: dict, example: dict):
    system = build_system_prompt(candidate)
    raw = call_task_lm(system, example["question"])
    gold = example["answer"]
    fmt_val, last_val = parse_answers(raw)
    if fmt_val is not None and fmt_val == gold:
        score, fb = 1.0, "Correct and correctly formatted."
    elif fmt_val is not None and fmt_val != gold:
        score, fb = 0.0, (f"WRONG ANSWER. You output '#### {fmt_val}' but the "
                          f"correct answer is {gold}. Re-check the arithmetic and "
                          f"the order of the steps.")
    elif last_val == gold:
        score, fb = 0.5, (f"Right number ({gold}) but FORMAT VIOLATION: the final "
                          f"line was not exactly '#### {gold}'. Always end with a "
                          f"line of the form '#### <integer>' and nothing else.")
    else:
        score, fb = 0.0, (f"WRONG. Correct answer is {gold}. The model's final "
                          f"number was {last_val}. Likely a multi-step reasoning "
                          f"slip; show each step and verify before answering.")
    oa.log(f"score={score} gold={gold} parsed_fmt={fmt_val} parsed_last={last_val}")
    # Side information is what the reflection model reads to diagnose failures.
    side_info = {
        "feedback": fb,
        "problem": example["question"],
        "gold_answer": gold,
        "model_output": raw[:500],
    }
    return score, side_info

def eval_set(candidate, dataset, label=""):
    scores, exact = [], 0
    for ex in dataset:
        s, _ = evaluate(candidate, ex)
        scores.append(s)
        if s == 1.0:
            exact += 1
    acc = exact / len(dataset)
    avg = sum(scores) / len(dataset)
    print(f"  [{label}] avg_score={avg:.3f}  exact_correct+formatted={exact}/{len(dataset)}")
    return avg, acc
We define how the candidate prompt is converted into a system prompt and how the task model receives each question. We also create the evaluator that parses the model output, checks whether the final answer follows the required #### <integer> format, and assigns a score: 1.0 for a correct, correctly formatted answer, 0.5 for a correct number without the required format, and 0.0 otherwise. We return structured feedback as actionable side information so that GEPA can determine whether the issue is incorrect reasoning, poor formatting, or both.
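To see the three scoring tiers without calling a model, the following sketch applies the same parsing and scoring rules to a few hand-written outputs. The function `score_output` and the sample strings are ours, added for illustration; only the regular expressions and the tier logic mirror the evaluator above.

```python
import re


def parse_answers(text):
    """Return (integer on the '####' line or None, last integer in the text or None)."""
    formatted = re.search(r"####\s*(-?\d+)", text)
    all_nums = re.findall(r"-?\d+", text)
    fmt_val = int(formatted.group(1)) if formatted else None
    last_val = int(all_nums[-1]) if all_nums else None
    return fmt_val, last_val


def score_output(raw, gold):
    """Hypothetical helper: the evaluator's scoring tiers without the LLM call."""
    fmt_val, last_val = parse_answers(raw)
    if fmt_val is not None:
        # A '####' line is present, so it alone decides the score.
        return 1.0 if fmt_val == gold else 0.0
    # No '####' line: half credit if the last number in the text is correct.
    return 0.5 if last_val == gold else 0.0


gold = 300
samples = {
    "correct + formatted": "400 - 100 = 300\n#### 300",
    "correct, no format": "The total is 300 rupees.",
    "wrong but formatted": "400 - 100 = 310\n#### 310",
    "no number at all": "I cannot solve this.",
}
for name, raw in samples.items():
    print(f"{name:22s} -> {score_output(raw, gold)}")
```

One consequence of the fallback rule is that half credit depends on the correct value being the last integer in the output. An answer such as "You pay 300 rupees for 5 notebooks" would score 0.0, which is one more reason the optimized prompt needs to enforce a strict final line.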
Configuring GEPA and Running the Prompt Optimization
seed_candidate = {
    "instructions": "Solve the math problem.",
    "format_rules": "Give the answer.",
}

print("=== BASELINE (seed prompt) ===")
print("Train:"); base_train = eval_set(seed_candidate, trainset, "train")
print("Val:  "); base_val = eval_set(seed_candidate, valset, "val")
print()

# Natural-language description of what the optimizer should achieve.
objective = (
    "Evolve a system prompt (the 'instructions' and 'format_rules' fields) so a "
    "small LLM reliably solves multi-step arithmetic word problems AND always "
    "ends with a line of exactly the form '#### <integer>'. Maximize the score."
)

# Context for the reflection model: scoring rules and known failure modes.
background = (
    "Scoring: 1.0 = correct number in the exact '#### <int>' format; 0.5 = correct "
    "number but wrong/missing format; 0.0 = wrong number. Common failures are (a) not "
    "emitting the '####' line, and (b) order-of-operations or multi-step slips. The "
    "winning prompt should force explicit step-by-step work, a verification step, and "
    "a strict final-answer line."
)

config = GEPAConfig(
    engine=EngineConfig(
        max_metric_calls=MAX_METRIC_CALLS,
        max_workers=4,
        parallel=True,
        display_progress_bar=True,
        seed=0,
    ),
    reflection=ReflectionConfig(
        reflection_lm=REFLECTION_LM,
    ),
)

print("=== RUNNING GEPA (this calls the LLMs; ~1-4 min) ===")
result = optimize_anything(
    seed_candidate=seed_candidate,
    evaluator=evaluate,
    dataset=trainset,
    valset=valset,
    objective=objective,
    background=background,
    config=config,
)
We begin with a weak seed prompt and evaluate its baseline performance on both the training and validation sets. We then define the optimization objective, the background scoring rules, and the GEPA configuration, including parallel evaluation and the reflection model. Finally, we run optimize_anything so GEPA can evolve the instruction and format-rule fields using the evaluator feedback.
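To build intuition for how feedback can drive a two-field candidate forward under a call budget, here is a deliberately simplified toy loop. It is not GEPA's actual search strategy: `fake_evaluate` and `fake_reflect` are hypothetical stand-ins for the real evaluator and the reflection model, and the loop simply accepts a proposal whenever it scores higher.

```python
def fake_evaluate(candidate):
    """Hypothetical scorer: rewards mentioning steps and the '####' line."""
    score, feedback = 0.0, []
    if "step" in candidate["instructions"].lower():
        score += 0.5
    else:
        feedback.append("instructions: ask for step-by-step work")
    if "####" in candidate["format_rules"]:
        score += 0.5
    else:
        feedback.append("format_rules: require a final '#### <integer>' line")
    return score, feedback


def fake_reflect(candidate, feedback):
    """Hypothetical proposer: patches the field named in the first feedback item."""
    proposal = dict(candidate)
    if feedback:
        field, hint = feedback[0].split(": ", 1)
        proposal[field] = f"{candidate[field]} ({hint})"
    return proposal


def toy_optimize(seed, budget):
    """Greedy accept-if-better loop bounded by a metric-call budget."""
    best = seed
    best_score, feedback = fake_evaluate(seed)
    calls = 1
    while calls < budget and feedback:
        proposal = fake_reflect(best, feedback)
        score, new_feedback = fake_evaluate(proposal)
        calls += 1
        if score > best_score:
            best, best_score, feedback = proposal, score, new_feedback
    return best, best_score, calls


seed = {"instructions": "Solve the math problem.", "format_rules": "Give the answer."}
best, best_score, calls = toy_optimize(seed, budget=10)
print(best_score, calls)
print(best["instructions"])
print(best["format_rules"])
```

The point of the toy is the shape of the loop: each field can be repaired independently because the feedback names the component at fault, which is the motivation for returning structured side information rather than a bare score.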
Comparing the Baseline and GEPA-Optimized Prompts on the Validation Set
best = result.best_candidate

print("\n" + "=" * 78)
print("OPTIMIZED CANDIDATE")
print("=" * 78)
print("\n--- instructions ---\n" + textwrap.fill(best["instructions"], 96))
print("\n--- format_rules ---\n" + textwrap.fill(best["format_rules"], 96))

print("\n" + "=" * 78)
print("BEFORE vs AFTER (held-out validation set)")
print("=" * 78)
print("Seed prompt:"); _ = eval_set(seed_candidate, valset, "val-seed")
print("GEPA prompt:"); _ = eval_set(best, valset, "val-gepa")
print(f"\nBaseline val avg_score : {base_val[0]:.3f}")

print("\n" + "=" * 78)
print("EVOLUTION HISTORY (candidate index -> val score, parents)")
print("=" * 78)
# getattr with defaults keeps the report working if a field is missing.
cands = getattr(result, "candidates", [])
vscores = getattr(result, "val_aggregate_scores", [])
parents = getattr(result, "parents", [None] * len(cands))
for i, sc in enumerate(vscores):
    par = parents[i] if i < len(parents) else None
    tag = "  <-- BEST" if i < len(cands) and cands[i] == best else ""
    print(f"  cand {i:2d}: val_score={sc:.3f}  parents={par}{tag}")

print(f"\nTotal metric calls used : {getattr(result, 'total_metric_calls', 'n/a')}")
print(f"Full validation evals   : {getattr(result, 'num_full_val_evals', 'n/a')}")
print("\nDone. Try raising MAX_METRIC_CALLS or swapping REFLECTION_LM for a stronger model.")
We extract the best prompt found by GEPA and print its optimized instruction and format-rule components. We compare the seed prompt and the GEPA-optimized prompt on the held-out validation set to check whether the improvement transfers to unseen examples. We also inspect the evolution history, validation scores, parent relationships, and total metric calls to understand how the prompt improved over the course of optimization.
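The history printout can be condensed into a lineage from the seed to the best candidate. The sketch below assumes that entry i of the parents list holds the parent indices of candidate i, with None marking the seed. That layout is an assumption on our part, and the scores and parent lists are made-up illustration data, not results from a real run.

```python
def best_index(val_scores):
    """Index of the highest validation score (first one wins on ties)."""
    return max(range(len(val_scores)), key=lambda i: val_scores[i])


def lineage(parents, idx):
    """Follow the first parent of each candidate back to the seed."""
    chain = [idx]
    while True:
        current = parents[chain[-1]] or []
        usable = [p for p in current if p is not None]
        # Stop at the seed, or if the data would loop back on itself.
        if not usable or usable[0] in chain:
            break
        chain.append(usable[0])
    return chain[::-1]


# Hypothetical history: four candidates, candidate 0 is the seed.
val_scores = [0.25, 0.50, 0.42, 0.92]
parents = [[None], [0], [0], [1]]

top = best_index(val_scores)
path = lineage(parents, top)
print("best candidate:", top)
print("lineage:", " -> ".join(str(i) for i in path))
print("scores along lineage:", [val_scores[i] for i in path])
```

Reading the scores along the lineage shows which proposal produced the largest jump, which is usually the candidate whose prompt text is most worth inspecting by hand.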
In conclusion, we used GEPA to show how prompt optimization can move beyond manual trial and error. We created a complete workflow in which a task model solves examples, an evaluator scores the outputs, and a reflection model uses detailed feedback to propose better prompts. We also tested the optimized prompt on unseen validation problems, which helps us assess whether the improvement generalizes rather than merely fitting the training set. In doing so, we built a practical example of reflective prompt evolution in which structured feedback, strict evaluation, and iterative refinement work together to produce a stronger, more reliable prompt.
The post Building Reflective Prompt Optimization with GEPA: Multi-Component Prompts, Structured Feedback, and Held-Out Validation appeared first on MarkTechPost.
