How to Build and Evolve a Custom OpenAI Agent with A-Evolve Using Benchmarks, Skills, Memory, and Workspace Mutations

In this tutorial, we work directly with the A-Evolve framework in Colab and build a complete evolutionary agent pipeline from the ground up. We set up the repository, configure an OpenAI-powered agent, define a custom benchmark, and build our own evolution engine to see how A-Evolve actually improves an agent through iterative workspace mutations. Throughout the code, we use the framework’s core abstractions for prompts, skills, memory, benchmarking, and evolution, which help us understand not just how to run A-Evolve, but also how to extend it in a practical, Colab-friendly way.


import os
import sys
import json
import textwrap
import subprocess
import shutil
from pathlib import Path
from getpass import getpass
from collections import Counter, defaultdict


subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "openai>=1.30.0", "pyyaml>=6.0", "matplotlib>=3.8"])
REPO_DIR = Path("/content/a-evolve")
if REPO_DIR.exists():
   shutil.rmtree(REPO_DIR)
subprocess.check_call(["git", "clone", "--depth", "1", "https://github.com/A-EVO-Lab/a-evolve.git", str(REPO_DIR)])
sys.path.insert(0, str(REPO_DIR))


if not os.environ.get("OPENAI_API_KEY"):
   os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ").strip()


OPENAI_MODEL = "gpt-4o-mini"


import yaml
import matplotlib.pyplot as plt


import agent_evolve as ae
from agent_evolve.protocol.base_agent import BaseAgent
from agent_evolve.benchmarks.base import BenchmarkAdapter
from agent_evolve.engine.base import EvolutionEngine
from agent_evolve.types import Task, Trajectory, Feedback, StepResult
from agent_evolve.contract.workspace import AgentWorkspace
from openai import OpenAI


# Shared OpenAI client used by the agent defined later.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])


# Start from a clean workspace on every run.
WORKSPACE_ROOT = Path("/content/a_evolve_demo_workspace")
if WORKSPACE_ROOT.exists():
   shutil.rmtree(WORKSPACE_ROOT)


(WORKSPACE_ROOT / "prompts").mkdir(parents=True, exist_ok=True)
(WORKSPACE_ROOT / "skills").mkdir(parents=True, exist_ok=True)
(WORKSPACE_ROOT / "memory").mkdir(parents=True, exist_ok=True)
(WORKSPACE_ROOT / "tools").mkdir(parents=True, exist_ok=True)


manifest = {
   "name": "colab-aevolve-demo-agent",
   "version": "0.1.0",
   "contract_version": "1.0",
   "agent": {
       "type": "custom",
       "entrypoint": None
   },
   "evolvable_layers": ["prompts", "skills", "memory"],
   "reload_strategy": "hot"
}
with open(WORKSPACE_ROOT / "manifest.yaml", "w") as f:
   yaml.dump(manifest, f, sort_keys=False)


initial_system_prompt = textwrap.dedent("""
You are a precise text-transformation agent.


Solve the task exactly.
Be concise.
Return only the final answer with no explanation unless the task explicitly asks for JSON.
""").strip()


(WORKSPACE_ROOT / "prompts" / "system.md").write_text(initial_system_prompt)

We prepare the full Colab environment needed to run the tutorial from start to finish. We install the required packages, clone the A-Evolve repository, load the framework imports, and securely collect the OpenAI API key for model access. We also define the workspace structure and initialize the manifest and system prompt, giving our evolving agent a valid starting point within the A-Evolve framework.
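Before handing a workspace to the framework, it can help to confirm that the layout on disk matches what we intended. The sketch below is a stdlib-only check that runs outside Colab. The names `missing_workspace_entries`, `EXPECTED_DIRS`, and `EXPECTED_FILES` are hypothetical helpers introduced here, and the list of expected entries simply mirrors the prompts, skills, memory, and tools layout used in this tutorial; it is an assumption, not a documented A-Evolve requirement.

```python
from pathlib import Path
import tempfile

# Assumption: these entries mirror the tutorial's workspace layout.
# They are not taken from an official A-Evolve contract.
EXPECTED_DIRS = ("prompts", "skills", "memory", "tools")
EXPECTED_FILES = ("manifest.yaml", "prompts/system.md")


def missing_workspace_entries(root: Path) -> list[str]:
    """Return the expected directories and files that do not exist under root."""
    missing = [f"{d}/" for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


# Demo on a throwaway directory that is deliberately incomplete.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    for d in EXPECTED_DIRS[:2]:  # create only prompts/ and skills/
        (demo_root / d).mkdir(parents=True)
    (demo_root / "prompts" / "system.md").write_text("You are a precise agent.")
    demo_missing = missing_workspace_entries(demo_root)

print(demo_missing)
```

An empty list means every expected entry is present. Running the same function against the real workspace root right after setup is a cheap way to catch a mistyped directory name before the first evolution cycle.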


def build_dataset():
   train = [
       {
           "id": "train-01",
           "rule": "json_sum",
           "input": "Numbers: 7, 11, 4",
           "answer": '{"sum":22}'
       },
       {
           "id": "train-02",
           "rule": "json_sum",
           "input": "Numbers: 20, 5, 3, 2",
           "answer": '{"sum":30}'
       },
       {
           "id": "train-03",
           "rule": "acronym_upper",
           "input": "Create the acronym from: retrieval augmented generation",
           "answer": "RAG"
       },
       {
           "id": "train-04",
           "rule": "acronym_upper",
           "input": "Create the acronym from: large language model",
           "answer": "LLM"
       },
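       # train-05 and train-06 (presumably the "pipe_unique_sorted_lower" rule)
       # did not survive extraction; only the fragments "cherry" and "lion" remain.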
       {
           "id": "train-07",
           "rule": "vowel_parity",
           "input": "Word: equation",
           "answer": "ODD"  # "equation" has five vowels: e, u, a, i, o
       },
       {
           "id": "train-08",
           "rule": "vowel_parity",
           "input": "Word: education",
           "answer": "ODD"
       },
   ]


   holdout = [
       {
           "id": "holdout-01",
           "rule": "json_sum",
           "input": "Numbers: 100, 1, 9",
           "answer": '{"sum":110}'
       },
       {
           "id": "holdout-02",
           "rule": "acronym_upper",
           "input": "Create the acronym from: artificial general intelligence",
           "answer": "AGI"
       },
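       # holdout-03 (presumably the "pipe_unique_sorted_lower" rule) did not
       # survive extraction; only the fragment "mango" remains.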
       {
           "id": "holdout-04",
           "rule": "vowel_parity",
           "input": "Word: aeroplane",
           "answer": "ODD"
       },
   ]
   return train, holdout


TRAIN_DATA, HOLDOUT_DATA = build_dataset()


def normalize_text(x: str) -> str:
   # Strip the ends and drop spaces so '{"sum": 22}' matches '{"sum":22}'.
   return x.strip().replace(" ", "")


class MiniTextBenchmark(BenchmarkAdapter):
   def __init__(self):
       self.train = TRAIN_DATA
       self.holdout = HOLDOUT_DATA


   def get_tasks(self, split: str = "train", limit: int = 10):
       # Wrap each raw row in an A-Evolve Task; the gold answer travels in metadata.
       data = self.train if split == "train" else self.holdout
       tasks = []
       for row in data[:limit]:
           tasks.append(
               Task(
                   id=row["id"],
                   input=row["input"],
                   metadata={
                       "rule": row["rule"],
                       "answer": row["answer"]
                   }
               )
           )
       return tasks


   def evaluate(self, task: Task, trajectory: Trajectory):
       # Exact match after whitespace normalization.
       pred = trajectory.output.strip()
       gold = task.metadata["answer"].strip()
       success = normalize_text(pred) == normalize_text(gold)
       detail = {
           "rule": task.metadata["rule"],
           "gold": gold,
           "pred": pred,
           "input": task.input,
           "success": success
       }
       score = 1.0 if success else 0.0
       return Feedback(
           success=success,
           score=score,
           detail=json.dumps(detail, ensure_ascii=False),
           raw=detail
       )


SKILL_ROUTING = {
   "json_sum": ["json", "sum"],
   "acronym_upper": ["acronym", "uppercase"],
   "pipe_unique_sorted_lower": ["unique", "sorted", "lowercase", "pipe"],
   "vowel_parity": ["vowel", "odd", "even", "parity"]
}

We define the training and holdout datasets used to measure the agent before and after evolution. We build a custom benchmark class that packages each example into A-Evolve tasks and evaluates predictions against exact expected outputs. We also set up the routing hints for skills, which prepares the system to connect different task types with the right behavioral patterns later in the workflow.
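Because scoring is exact match, a single wrong gold label caps the achievable score no matter how well the agent evolves. One way to guard against that is a small deterministic reference solver for the four rule names, used only to cross-check labels. The sketch below is our own illustration, not part of A-Evolve: `reference_answer` and `exact_match` are hypothetical helpers, the rule semantics are inferred from the rule names and sample answers, and the `Tokens:` input prefix for the pipe rule is an assumed format.

```python
import json
import re


def normalize_text(x: str) -> str:
    """Same normalization as the benchmark: strip the ends and drop spaces."""
    return x.strip().replace(" ", "")


def exact_match(pred: str, gold: str) -> bool:
    """Exact comparison after whitespace normalization; case still matters."""
    return normalize_text(pred) == normalize_text(gold)


def reference_answer(rule: str, task_input: str) -> str:
    """Deterministic answer for one task (rule semantics are inferred)."""
    payload = task_input.split(":", 1)[1].strip()  # text after the first colon
    if rule == "json_sum":
        total = sum(int(n) for n in re.findall(r"-?\d+", payload))
        return json.dumps({"sum": total}, separators=(",", ":"))
    if rule == "acronym_upper":
        return "".join(word[0] for word in payload.split()).upper()
    if rule == "pipe_unique_sorted_lower":
        tokens = {t.strip().lower() for t in payload.split(",") if t.strip()}
        return "|".join(sorted(tokens))
    if rule == "vowel_parity":
        vowels = sum(ch in "aeiou" for ch in payload.lower())
        return "ODD" if vowels % 2 else "EVEN"
    raise ValueError(f"unknown rule: {rule}")


checks = [
    ("json_sum", "Numbers: 7, 11, 4"),
    ("acronym_upper", "Create the acronym from: large language model"),
    ("pipe_unique_sorted_lower", "Tokens: Banana, apple, banana"),  # assumed input format
    ("vowel_parity", "Word: education"),
]
for rule, text in checks:
    print(f"{rule}: {reference_answer(rule, text)}")
```

Comparing `reference_answer(row["rule"], row["input"])` with `row["answer"]` for every dataset row, via `exact_match`, flags any label that disagrees with the rule before a model call is spent on it.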


class ColabAEResolverAgent(BaseAgent):
   def __init__(self, workspace_dir: str | Path, model: str = OPENAI_MODEL):
       self.model = model
       super().__init__(workspace_dir)


   def _pick_relevant_skills(self, task: Task):
       # Match skills to the task rule by keywords in the skill name and description.
       rule = task.metadata.get("rule", "")
       chosen = []
       for skill in self.skills:
           hay = f"{skill.name} {skill.description}".lower()
           if rule == "json_sum" and ("json" in hay or "sum" in hay):
               chosen.append(skill)
           elif rule == "acronym_upper" and ("acronym" in hay or "uppercase" in hay):
               chosen.append(skill)
           elif rule == "pipe_unique_sorted_lower" and any(k in hay for k in ["unique", "sorted", "lowercase", "pipe"]):
               chosen.append(skill)
           elif rule == "vowel_parity" and any(k in hay for k in ["vowel", "odd", "even", "parity"]):
               chosen.append(skill)
       return chosen[:3]


   def solve(self, task: Task) -> Trajectory:
       relevant_skills = self._pick_relevant_skills(task)
       relevant_skill_texts = []
       for s in relevant_skills:
           relevant_skill_texts.append(self.get_skill_content(s.name))


       # Only the eight most recent memories are injected into the prompt.
       memory_text = "\n".join(
           [f"- {m.get('content', '')}" for m in self.memories[-8:]]
       ).strip()


       skill_block = "\n\n".join(relevant_skill_texts).strip()
       if not skill_block:
           skill_block = "(no skills loaded yet)"


       if not memory_text:
           memory_text = "(no memory yet)"


       user_prompt = textwrap.dedent(f"""
       TASK RULE: {task.metadata.get("rule")}
       TASK INPUT:
       {task.input}


       ACTIVE SYSTEM PROMPT:
       {self.system_prompt}


       RELEVANT SKILLS:
       {skill_block}


       RECENT MEMORIES:
       {memory_text}


       Solve the task exactly.
       Return only the final answer.
       """).strip()


       response = client.chat.completions.create(
           model=self.model,
           temperature=0,
           messages=[
               {"role": "system", "content": "You are a precise text-transformation agent."},
               {"role": "user", "content": user_prompt}
           ]
       )


       output = (response.choices[0].message.content or "").strip()


       # Record the attempt so later tasks and cycles can see it.
       self.remember(
           content=f"Task {task.id} under rule {task.metadata.get('rule')} produced output: {output}",
           category="episodic"
       )


       return Trajectory(
           task_id=task.id,
           output=output,
           steps=[
               {
                   "rule": task.metadata.get("rule"),
                   "used_skills": [s.name for s in relevant_skills],
                   "system_prompt_chars": len(self.system_prompt),
                   "memory_items_seen": len(self.memories)
               }
           ]
       )


SKILL_TEMPLATES = {
   "json_sum": textwrap.dedent("""
       ---
       name: json-sum-exact
       description: Add all integers and output strict compact JSON with the single key sum.
       ---
       # JSON Sum Exact


       Procedure:
       1. Extract all integers from the task input.
       2. Add them.
       3. Return exactly one compact JSON object in this format:
          {"sum":NUMBER}
       4. Do not add spaces, explanations, markdown, or extra keys.
   """).strip(),


   "acronym_upper": textwrap.dedent("""
       ---
       name: acronym-upper-exact
       description: Build an uppercase acronym by taking the first letter of each word.
       ---
       # Acronym Upper Exact


       Procedure:
       1. Identify the phrase after the colon.
       2. Take the first letter of each word.
       3. Convert every letter to uppercase.
       4. Return only the final acronym, with no punctuation or explanation.
   """).strip(),


   "pipe_unique_sorted_lower": textwrap.dedent("""
       ---
       name: pipe-unique-sorted-lower
       description: Normalize tokens to lowercase, deduplicate them, sort ascending, and join them with pipes.
       ---
       # Pipe Unique Sorted Lower


       Procedure:
       1. Read the token list after the colon.
       2. Split by commas.
       3. Trim spaces and lowercase every token.
       4. Remove duplicates.
       5. Sort alphabetically ascending.
       6. Join with "|" and return only the final string.
   """).strip(),


   "vowel_parity": textwrap.dedent("""
       ---
       name: vowel-parity-exact
       description: Count vowels in the word and output ODD or EVEN only.
       ---
       # Vowel Parity Exact


       Procedure:
       1. Read the target word after the colon.
       2. Count vowels using a, e, i, o, u.
       3. If the count is odd, output ODD.
       4. If the count is even, output EVEN.
       5. Return only ODD or EVEN with no extra text.
   """).strip(),
}


PROMPT_APPENDIX = textwrap.dedent("""
## STRICT OUTPUT CONTRACT
- Output only the final answer.
- Never explain your reasoning.
- If a task expects JSON, return compact JSON with exact keys only.
- When a relevant skill exists, follow it literally.
- Exact format is more important than being conversational.
""").strip()

We implement the custom A-Evolve agent that reads the active prompt, skills, and memory from the workspace and uses OpenAI to solve each task. We design the agent so it selects relevant skills, injects recent memory, and returns trajectories in the structure expected by the framework. We also define the skill templates and the strict output contract, which serve as the main ingredients that the evolution engine can add to improve performance over time.
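The skill-selection step can also be driven by a lookup table instead of a chain of rule-specific branches, which keeps the keywords in one place. The following is a minimal sketch of that idea under our own assumptions: `SkillMeta` is a hypothetical stand-in for whatever skill metadata objects the framework exposes (we only assume they carry a name and a description), and `pick_relevant_skills` is an illustrative helper, not an A-Evolve function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillMeta:
    """Hypothetical stand-in for the framework's skill metadata objects."""
    name: str
    description: str


# Keyword hints per rule, matching the routing table in the tutorial.
SKILL_ROUTING = {
    "json_sum": ["json", "sum"],
    "acronym_upper": ["acronym", "uppercase"],
    "pipe_unique_sorted_lower": ["unique", "sorted", "lowercase", "pipe"],
    "vowel_parity": ["vowel", "odd", "even", "parity"],
}


def pick_relevant_skills(rule, skills, routing=SKILL_ROUTING, limit=3):
    """Return up to `limit` skills whose name or description contains a keyword."""
    keywords = routing.get(rule, [])  # unknown rules select nothing
    chosen = []
    for skill in skills:
        haystack = f"{skill.name} {skill.description}".lower()
        if any(keyword in haystack for keyword in keywords):
            chosen.append(skill)
    return chosen[:limit]


skills = [
    SkillMeta("json-sum-exact", "Add all integers and output strict compact JSON."),
    SkillMeta("vowel-parity-exact", "Count vowels in the word and output ODD or EVEN only."),
]
print([s.name for s in pick_relevant_skills("json_sum", skills)])
print([s.name for s in pick_relevant_skills("acronym_upper", skills)])
```

Matching is by plain substring, so a short keyword such as "even" would also hit a description containing "event". For a larger skill library, matching on whole words or on an explicit tag in the skill frontmatter would likely be more reliable.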


class ColabMutationEngine(EvolutionEngine):
   def __init__(self):
       self.cycle_count = 0


   def step(self, workspace: AgentWorkspace, observations, history, trial):
       self.cycle_count += 1


       # Group failed observations by task rule.
       failed_by_rule = defaultdict(list)
       for obs in observations:
           if not obs.feedback.success:
               failed_by_rule[obs.task.metadata["rule"]].append({
                   "task_id": obs.task.id,
                   "input": obs.task.input,
                   "gold": obs.task.metadata["answer"],
                   "pred": obs.trajectory.output
               })


       mutated = False
       summaries = []


       # Mutation 1: append the strict output contract to the prompt, once.
       current_prompt = workspace.read_prompt()
       if "STRICT OUTPUT CONTRACT" not in current_prompt:
           workspace.write_prompt(current_prompt.rstrip() + "\n\n" + PROMPT_APPENDIX + "\n")
           mutated = True
           summaries.append("prompt hardened")


       existing_skill_names = {s.name for s in workspace.list_skills()}


       needed_rule_to_skill_name = {
           "json_sum": "json-sum-exact",
           "acronym_upper": "acronym-upper-exact",
           "pipe_unique_sorted_lower": "pipe-unique-sorted-lower",
           "vowel_parity": "vowel-parity-exact",
       }


       # Mutation 2: add a skill for every failing rule that lacks one,
       # and record the failure in episodic memory.
       for rule, fails in failed_by_rule.items():
           skill_name = needed_rule_to_skill_name[rule]
           if skill_name not in existing_skill_names:
               workspace.write_skill(skill_name, SKILL_TEMPLATES[rule])
               mutated = True
               summaries.append(f"added skill {skill_name}")


           workspace.add_memory({
               "content": f"Cycle {self.cycle_count}: rule={rule} failed {len(fails)} time(s). Common failure pattern: output formatting or task mismatch. Gold examples must be followed exactly.",
               "rule": rule,
               "examples": fails[:2]
           }, category="episodic")


       if not failed_by_rule:
           workspace.add_memory({
               "content": f"Cycle {self.cycle_count}: all current training tasks succeeded. Preserve exact formatting behavior."
           }, category="episodic")


       summary = " | ".join(summaries) if summaries else "no mutation needed"
       return StepResult(
           mutated=mutated,
           summary=summary,
           metadata={
               "failed_rules": list(failed_by_rule.keys()),
               "num_failed_rules": len(failed_by_rule),
               "cycle": self.cycle_count
           }
       )


def evaluate_split(agent, benchmark, split="train"):
   # Run the agent on every task in the split and collect per-task rows.
   tasks = benchmark.get_tasks(split=split, limit=100)
   rows = []
   total = 0
   correct = 0
   for task in tasks:
       traj = agent.solve(task)
       fb = benchmark.evaluate(task, traj)
       rows.append({
           "task_id": task.id,
           "rule": task.metadata["rule"],
           "input": task.input,
           "gold": task.metadata["answer"],
           "pred": traj.output,
           "score": fb.score,
           "success": fb.success
       })
       total += 1
       correct += int(fb.success)
   score = correct / max(total, 1)
   return score, rows


def print_table(rows, title, max_rows=20):
   print("\n" + "=" * 110)
   print(title)
   print("=" * 110)
   shown = rows[:max_rows]
   for r in shown:
       print(f"[{r['task_id']}] rule={r['rule']}")
       print(f"  input : {r['input']}")
       print(f"  gold  : {r['gold']}")
       print(f"  pred  : {r['pred']}")
       print(f"  score : {r['score']}  success={r['success']}")
       print("-" * 110)


def show_workspace(root: Path):
   print("\n" + "=" * 110)
   print("EVOLVED WORKSPACE SNAPSHOT")
   print("=" * 110)
   for path in sorted(root.rglob("*")):
       rel = path.relative_to(root)
       if path.is_dir():
           print(f"[DIR ] {rel}/")
       else:
           print(f"[FILE] {rel}")


def show_skill_contents(root: Path):
   skill_files = sorted((root / "skills").glob("*/SKILL.md"))
   print("\n" + "=" * 110)
   print("SKILL FILES")
   print("=" * 110)
   if not skill_files:
       print("No skill files yet.")
   for sf in skill_files:
       print(f"\n--- {sf.parent.name}/SKILL.md ---")
       print(sf.read_text())

We build a custom evolution engine that inspects failures and decides how to mutate the workspace. We use it to harden the prompt, add missing skills, and store episodic memory so that the agent gradually learns better formatting and task-specific behavior across cycles. We also define evaluation and reporting utilities that help us score the agent, inspect predictions, and view the evolved workspace clearly.
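The decision logic of such an engine can be separated from the workspace I/O, which makes it easy to test without an API key or the framework installed. The sketch below shows one way to do that under our own assumptions: `plan_mutations`, `RULE_TO_SKILL`, `CONTRACT_MARKER`, and the `(action, payload)` tuples are all hypothetical constructs for illustration, and failures are modeled as plain dicts with a `rule` key rather than real observation objects.

```python
from collections import defaultdict

# Assumed mapping from task rule to the skill that addresses it.
RULE_TO_SKILL = {
    "json_sum": "json-sum-exact",
    "acronym_upper": "acronym-upper-exact",
    "pipe_unique_sorted_lower": "pipe-unique-sorted-lower",
    "vowel_parity": "vowel-parity-exact",
}
CONTRACT_MARKER = "STRICT OUTPUT CONTRACT"


def plan_mutations(failures, existing_skills, current_prompt):
    """Return an ordered list of (action, payload) tuples; nothing is written."""
    failed_by_rule = defaultdict(list)
    for failure in failures:
        failed_by_rule[failure["rule"]].append(failure)

    actions = []
    if CONTRACT_MARKER not in current_prompt:
        actions.append(("harden_prompt", None))
    for rule, fails in failed_by_rule.items():
        skill_name = RULE_TO_SKILL.get(rule)  # unknown rules add no skill
        if skill_name and skill_name not in existing_skills:
            actions.append(("add_skill", skill_name))
        actions.append(("add_memory", f"rule={rule} failed {len(fails)} time(s)"))
    return actions


# Invented failure records, for illustration only.
failures = [
    {"rule": "json_sum", "task_id": "train-01"},
    {"rule": "json_sum", "task_id": "train-02"},
    {"rule": "vowel_parity", "task_id": "train-08"},
]
first_cycle = plan_mutations(failures, set(), "You are a precise agent.")
later_cycle = plan_mutations(
    failures[2:],
    {"json-sum-exact", "vowel-parity-exact"},
    "You are a precise agent.\n\n## STRICT OUTPUT CONTRACT",
)
print(first_cycle)
print(later_cycle)
```

In the first call the plan hardens the prompt and adds two skills. In the later call the prompt and skills already exist, so only a memory entry is planned. An engine's `step` method could then apply each planned action to the workspace, keeping the part that touches files as thin as possible.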


benchmark = MiniTextBenchmark()
agent = ColabAEResolverAgent(WORKSPACE_ROOT, model=OPENAI_MODEL)
engine = ColabMutationEngine()


# Score the untouched agent first so there is a baseline to compare against.
baseline_train_score, baseline_train_rows = evaluate_split(agent, benchmark, split="train")
baseline_holdout_score, baseline_holdout_rows = evaluate_split(agent, benchmark, split="holdout")


print(f"Baseline train score   : {baseline_train_score:.3f}")
print(f"Baseline holdout score : {baseline_holdout_score:.3f}")


print_table(baseline_train_rows, "BASELINE TRAIN RESULTS")
print_table(baseline_holdout_rows, "BASELINE HOLDOUT RESULTS")


config = ae.EvolveConfig(
   batch_size=8,
   max_cycles=4,
   egl_window=2
)


evolver = ae.Evolver(
   agent=agent,
   benchmark=benchmark,
   config=config,
   engine=engine
)


result = evolver.run(cycles=4)


print("\n" + "=" * 110)
print("A-EVOLVE RUN SUMMARY")
print("=" * 110)
print(f"Cycles completed : {result.cycles_completed}")
print(f"Final train score: {result.final_score:.3f}")
print(f"Score history    : {result.score_history}")
print(f"Converged        : {result.converged}")


# Reload the mutated prompt, skills, and memory before re-scoring.
agent.reload_from_fs()
final_train_score, final_train_rows = evaluate_split(agent, benchmark, split="train")
final_holdout_score, final_holdout_rows = evaluate_split(agent, benchmark, split="holdout")


print(f"\nFinal train score   : {final_train_score:.3f}")
print(f"Final holdout score : {final_holdout_score:.3f}")


print_table(final_train_rows, "FINAL TRAIN RESULTS")
print_table(final_holdout_rows, "FINAL HOLDOUT RESULTS")


show_workspace(WORKSPACE_ROOT)
show_skill_contents(WORKSPACE_ROOT)


print("\n" + "=" * 110)
print("FINAL SYSTEM PROMPT")
print("=" * 110)
print((WORKSPACE_ROOT / "prompts" / "system.md").read_text())


episodic_path = WORKSPACE_ROOT / "memory" / "episodic.jsonl"
if episodic_path.exists():
   print("\n" + "=" * 110)
   print("RECENT EPISODIC MEMORY")
   print("=" * 110)
   lines = episodic_path.read_text().strip().splitlines()
   for line in lines[-10:]:
       print(line)


# Plot the per-cycle train score reported by the evolver.
plt.figure(figsize=(8, 4))
plt.plot(range(1, len(result.score_history) + 1), result.score_history, marker="o")
plt.xlabel("Evolution cycle")
plt.ylabel("Train score")
plt.title("A-Evolve score history")
plt.grid(True)
plt.show()


print("\n" + "=" * 110)
print("COMPARISON")
print("=" * 110)
print(f"Train   : {baseline_train_score:.3f} -> {final_train_score:.3f}")
print(f"Holdout : {baseline_holdout_score:.3f} -> {final_holdout_score:.3f}")


improved_rules = []
for before, after in zip(sorted(baseline_train_rows, key=lambda x: x["task_id"]), sorted(final_train_rows, key=lambda x: x["task_id"])):
   if (not before["success"]) and after["success"]:
       improved_rules.append(after["rule"])


print(f"Improved train cases by rule: {dict(Counter(improved_rules))}")


print("\nDone. This notebook used the real A-Evolve framework and demonstrated:")
print("1) a valid agent workspace")
print("2) a BaseAgent subclass")
print("3) a BenchmarkAdapter subclass")
print("4) an EvolutionEngine subclass")
print("5) prompt / skill / memory mutations across A-Evolve cycles")

We put everything together and run the full A-Evolve loop from baseline evaluation to post-evolution analysis. We measure the agent before training, execute several evolution cycles, reload the workspace, and then compare the final train and holdout performance to see what improves. We also inspect the evolved prompt, skills, memory, and score history, which lets us clearly observe how the framework transforms the agent step by step.
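When comparing the two evaluation passes, it is worth tracking regressions as well as improvements, since a mutation that fixes one rule can break another. The sketch below joins the per-task rows by `task_id` rather than relying on both lists having the same order and length. `compare_runs` is a hypothetical helper, and the rows are invented placeholders that only reuse the `task_id`, `rule`, and `success` fields from the evaluation rows; they are not results from an actual run.

```python
from collections import Counter


def compare_runs(before_rows, after_rows):
    """Count per-rule improvements and regressions, matching rows by task_id."""
    before_by_id = {row["task_id"]: row for row in before_rows}
    improved, regressed = Counter(), Counter()
    for after in after_rows:
        before = before_by_id.get(after["task_id"])
        if before is None:
            continue  # task was not part of the baseline pass
        if not before["success"] and after["success"]:
            improved[after["rule"]] += 1
        elif before["success"] and not after["success"]:
            regressed[after["rule"]] += 1
    return dict(improved), dict(regressed)


# Invented rows for illustration; not measurements from the tutorial.
baseline_rows = [
    {"task_id": "t1", "rule": "json_sum", "success": False},
    {"task_id": "t2", "rule": "json_sum", "success": False},
    {"task_id": "t3", "rule": "acronym_upper", "success": True},
    {"task_id": "t4", "rule": "vowel_parity", "success": True},
]
final_rows = [
    {"task_id": "t1", "rule": "json_sum", "success": True},
    {"task_id": "t2", "rule": "json_sum", "success": True},
    {"task_id": "t3", "rule": "acronym_upper", "success": True},
    {"task_id": "t4", "rule": "vowel_parity", "success": False},
]
improved, regressed = compare_runs(baseline_rows, final_rows)
print("improved :", improved)
print("regressed:", regressed)
```

With only a handful of tasks per split, a single flipped case moves the score by a large fraction, so the per-rule counts are often more informative than the aggregate number.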

In conclusion, we successfully built and ran a full A-Evolve workflow rather than just inspecting the repository at a surface level. We created a valid workspace, plugged in a custom agent, benchmarked it on structured tasks, and then evolved its behavior by modifying prompts, adding skills, and storing memory across cycles. We also saw how A-Evolve’s design lets us treat agent improvement as a repeatable engineering process, in which we can measure baseline performance, apply controlled mutations, and observe how the system becomes more accurate over time.


The post How to Build and Evolve a Custom OpenAI Agent with A-Evolve Using Benchmarks, Skills, Memory, and Workspace Mutations appeared first on MarkTechPost.