A Complete Workflow for Automated Prompt Optimization Using Gemini Flash, Few-Shot Selection, and Evolutionary Instruction Search
In this tutorial, we shift from traditional prompt crafting to a more systematic, programmable approach by treating prompts as tunable parameters rather than static text. Instead of guessing which instruction or example works best, we build an optimization loop around Gemini 2.0 Flash that experiments, evaluates, and automatically selects the strongest prompt configuration. As the implementation unfolds, we watch our model's accuracy improve step by step, demonstrating how prompt engineering becomes far more powerful when we orchestrate it with data-driven search rather than intuition.
import google.generativeai as genai
import json
import random
from typing import List, Dict, Tuple, Optional
from dataclasses import dataclass
import numpy as np
from collections import Counter

def setup_gemini(api_key: str = None):
    if api_key is None:
        api_key = input("Enter your Gemini API key: ").strip()
    genai.configure(api_key=api_key)
    model = genai.GenerativeModel('gemini-2.0-flash-exp')
    print("✓ Gemini 2.0 Flash configured")
    return model

@dataclass
class Example:
    text: str
    sentiment: str

    def to_dict(self):
        return {"text": self.text, "sentiment": self.sentiment}

@dataclass
class Prediction:
    sentiment: str
    reasoning: str = ""
    confidence: float = 1.0
We import all required libraries and define the setup_gemini helper to configure Gemini 2.0 Flash. We also create the Example and Prediction data classes to represent dataset entries and model outputs in a clean, structured way.
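Before wiring anything to the API, a quick sanity check shows how these data classes are meant to be used; the sample text and label below are made up purely for illustration:

# Minimal sketch: the data classes in isolation (values are illustrative only)
sample = Example(text="The battery lasts all day, very impressed!", sentiment="positive")
print(sample.to_dict())                          # {'text': '...', 'sentiment': 'positive'}
fallback = Prediction(sentiment="neutral")
print(fallback.sentiment, fallback.confidence)   # neutral 1.0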
def create_dataset() -> Tuple[List[Example], List[Example]]:
    train_data = [
        Example("This movie was absolutely fantastic! Best film of the year.", "positive"),
        Example("Terrible experience, waste of time and money.", "negative"),
        Example("The product works as expected, nothing special.", "neutral"),
        Example("I'm blown away by the quality and attention to detail!", "positive"),
        Example("Disappointing and overpriced. Would not recommend.", "negative"),
        Example("It's okay, does the job but could be better.", "neutral"),
        Example("Incredible customer service and amazing results!", "positive"),
        Example("Complete garbage, broke after one use.", "negative"),
        Example("Average product, met my basic expectations.", "neutral"),
        Example("Revolutionary! This changed everything for me.", "positive"),
        Example("Frustrating bugs and poor design choices.", "negative"),
        Example("Decent quality for the price point.", "neutral"),
        Example("Exceeded all my expectations, truly remarkable!", "positive"),
        Example("Worst purchase I've ever made, avoid at all costs.", "negative"),
        Example("It's fine, nothing to complain about really.", "neutral"),
        Example("Absolutely stellar performance, 5 stars!", "positive"),
        Example("Broken and unusable, total disaster.", "negative"),
        Example("Meets requirements, standard quality.", "neutral"),
    ]
    val_data = [
        Example("Absolutely love it, couldn't be happier!", "positive"),
        Example("Broken on arrival, very upset.", "negative"),
        Example("Works fine, no major issues.", "neutral"),
        Example("Outstanding performance and great value!", "positive"),
        Example("Regret buying this, total letdown.", "negative"),
        Example("Adequate for basic use.", "neutral"),
    ]
    return train_data, val_data

class PromptTemplate:
    def __init__(self, instruction: str = "", examples: List[Example] = None):
        self.instruction = instruction
        self.examples = examples or []

    def format(self, text: str) -> str:
        prompt_parts = []
        if self.instruction:
            prompt_parts.append(self.instruction)
        if self.examples:
            prompt_parts.append("\nExamples:")
            for ex in self.examples:
                prompt_parts.append(f"\nText: {ex.text}")
                prompt_parts.append(f"Sentiment: {ex.sentiment}")
        prompt_parts.append(f"\nText: {text}")
        prompt_parts.append("Sentiment:")
        return "\n".join(prompt_parts)

    def clone(self):
        return PromptTemplate(self.instruction, self.examples.copy())
We generate a small but diverse sentiment dataset for training and validation using the create_dataset function. We then define PromptTemplate, which lets us assemble an instruction, few-shot examples, and the current query into a single prompt string. We treat the template as a programmable object so we can swap instructions and examples during optimization.
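To make the template's behavior concrete, a small sketch like the one below (with a made-up few-shot example and query) prints the exact prompt string that will be sent to the model:

# Sketch: inspect the prompt assembled by PromptTemplate.format (inputs are illustrative)
demo_template = PromptTemplate(
    instruction="Classify the sentiment: positive, negative, or neutral.",
    examples=[Example("Great value for money.", "positive")]
)
print(demo_template.format("The screen flickers constantly."))
# Prints the instruction, the few-shot block, and the new query ending with "Sentiment:"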
class SentimentModel:
    def __init__(self, model, prompt_template: PromptTemplate):
        self.model = model
        self.prompt_template = prompt_template

    def predict(self, text: str) -> Prediction:
        prompt = self.prompt_template.format(text)
        try:
            response = self.model.generate_content(prompt)
            result = response.text.strip().lower()
            for sentiment in ['positive', 'negative', 'neutral']:
                if sentiment in result:
                    return Prediction(sentiment=sentiment, reasoning=result)
            return Prediction(sentiment='neutral', reasoning=result)
        except Exception as e:
            return Prediction(sentiment='neutral', reasoning=str(e))

    def evaluate(self, dataset: List[Example]) -> float:
        correct = 0
        for example in dataset:
            pred = self.predict(example.text)
            if pred.sentiment == example.sentiment:
                correct += 1
        return (correct / len(dataset)) * 100
We wrap Gemini in the SentimentModel class so we can call it like a regular classifier. We format prompts via the template, call generate_content, and post-process the text to extract one of three sentiments. We also add an evaluate method so we can measure accuracy over any dataset with a single call.
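If you are following along interactively, a minimal sketch like this exercises the classifier end to end; note that it makes live Gemini API calls and assumes you have an API key at hand:

# Sketch: use SentimentModel like an ordinary classifier (makes live API calls)
model = setup_gemini()                     # prompts for your API key if none is passed
train_data, val_data = create_dataset()
clf = SentimentModel(model, PromptTemplate(instruction="Classify sentiment as positive, negative, or neutral."))
print(clf.predict("The packaging was damaged but the product itself works fine.").sentiment)
print(f"Validation accuracy: {clf.evaluate(val_data):.1f}%")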
class PromptOptimizer:
    def __init__(self, model):
        self.model = model
        self.instruction_candidates = [
            "Analyze the sentiment of the following text. Classify as positive, negative, or neutral.",
            "Classify the sentiment: positive, negative, or neutral.",
            "Determine if this text expresses positive, negative, or neutral sentiment.",
            "What is the emotional tone? Answer: positive, negative, or neutral.",
            "Sentiment classification (positive/negative/neutral):",
            "Evaluate sentiment and respond with exactly one word: positive, negative, or neutral.",
        ]

    def select_best_examples(self, train_data: List[Example], val_data: List[Example], n_examples: int = 3) -> List[Example]:
        best_examples = None
        best_score = -1.0  # start below any possible score so the first candidate set is always kept
        for _ in range(10):
            examples_by_sentiment = {
                'positive': [e for e in train_data if e.sentiment == 'positive'],
                'negative': [e for e in train_data if e.sentiment == 'negative'],
                'neutral': [e for e in train_data if e.sentiment == 'neutral']
            }
            selected = []
            for sentiment in ['positive', 'negative', 'neutral']:
                if examples_by_sentiment[sentiment]:
                    selected.append(random.choice(examples_by_sentiment[sentiment]))
            remaining = [e for e in train_data if e not in selected]
            while len(selected) < n_examples and remaining:
                selected.append(random.choice(remaining))
                remaining.remove(selected[-1])
            template = PromptTemplate(instruction=self.instruction_candidates[0], examples=selected)
            test_model = SentimentModel(self.model, template)
            score = test_model.evaluate(val_data[:3])  # quick screen on a small validation subset
            if score > best_score:
                best_score = score
                best_examples = selected
        return best_examples

    def optimize_instruction(self, examples: List[Example], val_data: List[Example]) -> str:
        best_instruction = self.instruction_candidates[0]
        best_score = 0
        for instruction in self.instruction_candidates:
            template = PromptTemplate(instruction=instruction, examples=examples)
            test_model = SentimentModel(self.model, template)
            score = test_model.evaluate(val_data)
            if score > best_score:
                best_score = score
                best_instruction = instruction
        return best_instruction
We introduce the PromptOptimizer class and define a pool of candidate instructions to test. We implement select_best_examples to search for a small, diverse set of few-shot examples and optimize_instruction to score each instruction variant on validation data. We are effectively turning prompt design into a lightweight search problem over examples and instructions.
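The two search routines can also be run on their own. The sketch below reuses the model, train_data, and val_data variables from the earlier snippet and, like everything in this tutorial, issues live (and potentially slow or rate-limited) API calls:

# Sketch: run the example search and the instruction search separately
optimizer = PromptOptimizer(model)
shots = optimizer.select_best_examples(train_data, val_data, n_examples=3)
instruction = optimizer.optimize_instruction(shots, val_data)
print("Chosen instruction:", instruction)
print("Chosen few-shot examples:", [ex.to_dict() for ex in shots])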
    # compile continues the PromptOptimizer class defined above
    def compile(self, train_data: List[Example], val_data: List[Example], n_examples: int = 3) -> PromptTemplate:
        best_examples = self.select_best_examples(train_data, val_data, n_examples)
        best_instruction = self.optimize_instruction(best_examples, val_data)
        optimized_template = PromptTemplate(instruction=best_instruction, examples=best_examples)
        return optimized_template

def main():
    print("="*70)
    print("Prompt Optimization Tutorial")
    print("Stop Writing Prompts, Start Programming Them!")
    print("="*70)
    model = setup_gemini()
    train_data, val_data = create_dataset()
    print(f"✓ {len(train_data)} training examples, {len(val_data)} validation examples")
    baseline_template = PromptTemplate(
        instruction="Classify sentiment as positive, negative, or neutral.",
        examples=[]
    )
    baseline_model = SentimentModel(model, baseline_template)
    baseline_score = baseline_model.evaluate(val_data)
    manual_examples = train_data[:3]
    manual_template = PromptTemplate(
        instruction="Classify sentiment as positive, negative, or neutral.",
        examples=manual_examples
    )
    manual_model = SentimentModel(model, manual_template)
    manual_score = manual_model.evaluate(val_data)
    optimizer = PromptOptimizer(model)
    optimized_template = optimizer.compile(train_data, val_data, n_examples=4)
We add the compile method to combine the best examples and the best instruction into a final optimized PromptTemplate. Inside main, we configure Gemini, build the dataset, and evaluate both a zero-shot baseline and a simple manual few-shot prompt. We then call the optimizer to produce our compiled, optimized prompt for sentiment analysis.
    # main() continues from above
    optimized_model = SentimentModel(model, optimized_template)
    optimized_score = optimized_model.evaluate(val_data)
    print(f"Baseline (zero-shot): {baseline_score:.1f}%")
    print(f"Manual few-shot: {manual_score:.1f}%")
    print(f"Optimized (compiled): {optimized_score:.1f}%")
    print(f"\nInstruction: {optimized_template.instruction}")
    print(f"\nSelected Examples ({len(optimized_template.examples)}):")
    for i, ex in enumerate(optimized_template.examples, 1):
        print(f"\n{i}. Text: {ex.text}")
        print(f"   Sentiment: {ex.sentiment}")
    test_cases = [
        "This is absolutely amazing, I love it!",
        "Completely broken and unusable.",
        "It works as advertised, no complaints."
    ]
    for test_text in test_cases:
        print(f"\nInput: {test_text}")
        pred = optimized_model.predict(test_text)
        print(f"Predicted: {pred.sentiment}")
    print("✓ Tutorial Complete!")

if __name__ == "__main__":
    main()
We evaluate the optimized model and compare its accuracy against the baseline and manual few-shot setups. We print the chosen instruction and the selected examples so we can inspect what the optimizer discovers, and then we run a few live test sentences to see predictions in action. We finish by summarizing the improvements and reinforcing the idea that prompts can be tuned programmatically rather than written by hand.
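One practical follow-up, not part of the original script, is to persist the compiled template so it can be reloaded later without re-running the search. The JSON layout below is simply one reasonable choice built on the existing to_dict helper (json is already imported at the top of the script):

# Sketch: save/load a compiled PromptTemplate (JSON layout is our own choice, not from the original)
def save_template(template: PromptTemplate, path: str = "optimized_prompt.json"):
    payload = {
        "instruction": template.instruction,
        "examples": [ex.to_dict() for ex in template.examples],
    }
    with open(path, "w") as f:
        json.dump(payload, f, indent=2)

def load_template(path: str = "optimized_prompt.json") -> PromptTemplate:
    with open(path) as f:
        payload = json.load(f)
    return PromptTemplate(
        instruction=payload["instruction"],
        examples=[Example(**ex) for ex in payload["examples"]],
    )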
In conclusion, we demonstrated how programmatic prompt optimization provides a repeatable, evidence-driven workflow for designing high-performing prompts. We began with a fragile zero-shot baseline, then iteratively tested instructions, selected diverse few-shot examples, and compiled an optimized template that outperforms the manual attempts. This process shows that we no longer need to rely on trial-and-error prompting; instead, we orchestrate a controlled optimization cycle. We can also extend this pipeline to new tasks, richer datasets, and more advanced scoring methods, allowing us to engineer prompts with precision, confidence, and scalability.
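As one concrete example of the "more advanced scoring methods" mentioned above, a macro-averaged F1 evaluator could stand in for plain accuracy when scoring candidate prompts. This is a sketch of one possible drop-in, not part of the original code:

# Sketch: macro-F1 scoring as an alternative objective for the prompt search (hypothetical extension)
def evaluate_macro_f1(clf: SentimentModel, dataset: List[Example]) -> float:
    labels = ['positive', 'negative', 'neutral']
    preds = [clf.predict(ex.text).sentiment for ex in dataset]
    f1_scores = []
    for label in labels:
        tp = sum(1 for p, ex in zip(preds, dataset) if p == label and ex.sentiment == label)
        fp = sum(1 for p, ex in zip(preds, dataset) if p == label and ex.sentiment != label)
        fn = sum(1 for p, ex in zip(preds, dataset) if p != label and ex.sentiment == label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1_scores.append(2 * precision * recall / (precision + recall) if (precision + recall) else 0.0)
    return 100 * sum(f1_scores) / len(f1_scores)

Swapping a metric like this into the optimizer's evaluation step is all it takes to steer the search toward a different objective, which is exactly the kind of flexibility a programmatic prompt pipeline makes cheap.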
