Build a Reinforcement Learning Powered Agent that Learns to Retrieve Relevant Long-Term Memories for Accurate LLM Question Answering
In this tutorial, we construct a Reinforcement Learning-driven agent that learns how to retrieve relevant memories from a long-term memory bank. We begin by building a synthetic memory dataset and generating queries that require the agent to recall specific information. Using OpenAI embeddings, we convert both memories and queries into vector representations, enabling similarity signals to guide candidate retrieval. We then design a custom RL environment in which the agent observes features of candidate memories and learns a policy to select the most useful one. By training the agent with the PPO algorithm, we enable it to improve retrieval decisions beyond simple similarity search. Finally, we evaluate the system by comparing the RL-based retriever with a baseline approach and demonstrate how an LLM can use retrieved memories to generate accurate answers.
import sys
import subprocess
import pkgutil
import os
import json
import math
import random
import textwrap
import getpass
from dataclasses import dataclass
from typing import List, Dict, Any, Tuple
def _install_if_missing(packages):
    missing = []
    for package_name, import_name in packages:
        if pkgutil.find_loader(import_name) is None:
            missing.append(package_name)
    if missing:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-q"] + missing)
_install_if_missing([
    ("openai>=1.40.0", "openai"),
    ("gymnasium>=0.29.1", "gymnasium"),
    ("stable-baselines3>=2.3.2", "stable_baselines3"),
    ("numpy>=1.26.4", "numpy"),
    ("pandas>=2.2.2", "pandas"),
    ("scikit-learn>=1.5.1", "sklearn"),
    ("matplotlib>=3.9.0", "matplotlib"),
    ("tqdm>=4.66.4", "tqdm"),
])
import numpy as np
import pandas as pd
import gymnasium as gym
from gymnasium import spaces
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from openai import OpenAI
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
try:
    from google.colab import userdata
    OPENAI_API_KEY = userdata.get("OPENAI_API_KEY")
except Exception:
    OPENAI_API_KEY = None
if not OPENAI_API_KEY:
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    OPENAI_API_KEY = getpass.getpass("Enter OPENAI_API_KEY: ").strip()
client = OpenAI(api_key=OPENAI_API_KEY)
EMBED_MODEL = "text-embedding-3-small"
CHAT_MODEL = "gpt-4o-mini"
def chunked(xs, n):
    for i in range(0, len(xs), n):
        yield xs[i:i+n]
def embed_texts(texts: List[str], model: str = EMBED_MODEL, batch_size: int = 64) -> np.ndarray:
    outputs = []
    for batch in tqdm(list(chunked(texts, batch_size)), desc="Embedding"):
        resp = client.embeddings.create(model=model, input=batch)
        batch_vecs = [d.embedding for d in resp.data]
        outputs.extend(batch_vecs)
    arr = np.array(outputs, dtype=np.float32)
    norms = np.linalg.norm(arr, axis=1, keepdims=True) + 1e-12
    arr = arr / norms
    return arr
def chat_answer(question: str, retrieved_memories: List[Dict[str, Any]], model: str = CHAT_MODEL) -> str:
    memory_block = "\n".join([f"[Memory {i+1}] {m['text']}" for i, m in enumerate(retrieved_memories)])
    system = "You are a precise QA assistant. Answer the question using only the provided memories. If the memories do not contain the answer, say 'I do not know from the provided memories.'"
    user = f"Question: {question}\n\nRetrieved memories:\n{memory_block}\n\nAnswer:"
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return resp.choices[0].message.content.strip()
def llm_judge_exact(question: str, gold_answer: str, predicted_answer: str, model: str = CHAT_MODEL) -> float:
    system = "You are a strict evaluator. Return only JSON with a single key 'score'. Use 1.0 if the predicted answer is semantically correct, 0.0 otherwise."
    user = json.dumps({
        "question": question,
        "gold_answer": gold_answer,
        "predicted_answer": predicted_answer,
    }, ensure_ascii=False)
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    txt = resp.choices[0].message.content.strip()
    try:
        obj = json.loads(txt)
        score = float(obj["score"])
        return 1.0 if score >= 0.5 else 0.0
    except Exception:
        return 0.0
We set up the environment required for our reinforcement learning-based memory retrieval system. We install all required libraries, import the necessary modules, and securely load the OpenAI API key for embedding and language model interactions. We also define helper functions that generate embeddings, produce answers from retrieved memories, and evaluate answers using an LLM-based judging mechanism.
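Before building the dataset, a minimal, optional smoke test confirms the helpers behave as expected; the sample memory text below is made up for illustration, and each call consumes a small number of API tokens.
demo_vec = embed_texts(["Astra in robotics uses LiDAR for sensor."])
print(demo_vec.shape)  # (1, 1536) for text-embedding-3-small
demo_answer = chat_answer(
    "What sensor does Astra use?",
    [{"memory_id": 0, "text": "Astra in robotics uses LiDAR for sensor."}],
)
print(demo_answer, llm_judge_exact("What sensor does Astra use?", "LiDAR", demo_answer))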
@dataclass
class MemoryItem:
    memory_id: int
    topic: str
    entity: str
    slot: str
    value: str
    text: str
def build_memory_bank() -> List[MemoryItem]:
    entities = [
        {
            "entity": "Astra",
            "topic": "robotics",
            "facts": {
                "battery": "18 hours",
                "sensor": "LiDAR",
                "country": "Japan",
                "release_year": "2023",
                "specialty": "warehouse navigation",
            },
        },
        {
            "entity": "Orion",
            "topic": "astronomy",
            "facts": {
                "telescope": "infrared array",
                "country": "Chile",
                "discovery_year": "2019",
                "target": "exoplanet atmospheres",
                "aperture": "8 meters",
            },
        },
        {
            "entity": "Vita",
            "topic": "biomedicine",
            "facts": {
                "compound": "VX-17",
                "trial_phase": "Phase II",
                "country": "Canada",
                "target": "inflammatory markers",
                "delivery": "oral capsule",
            },
        },
        {
            "entity": "Nimbus",
            "topic": "climate",
            "facts": {
                "satellite": "polar orbiter",
                "country": "Norway",
                "launch_year": "2022",
                "instrument": "microwave radiometer",
                "mission": "sea ice monitoring",
            },
        },
        {
            "entity": "Atlas",
            "topic": "logistics",
            "facts": {
                "fleet_size": "240 trucks",
                "hub": "Muscat",
                "software": "predictive routing",
                "fuel_policy": "hybrid-first",
                "region": "GCC",
            },
        },
        {
            "entity": "Lumos",
            "topic": "materials",
            "facts": {
                "alloy": "Ti-6Al-4V",
                "process": "laser sintering",
                "density": "4.43 g/cm3",
                "country": "Germany",
                "use_case": "aerospace brackets",
            },
        },
        {
            "entity": "Cedar",
            "topic": "agriculture",
            "facts": {
                "crop": "wheat",
                "irrigation": "drip control",
                "country": "India",
                "yield_gain": "12 percent",
                "soil_sensor": "capacitive probe",
            },
        },
        {
            "entity": "Pulse",
            "topic": "healthcare",
            "facts": {
                "device": "ECG patch",
                "battery": "7 days",
                "country": "USA",
                "connectivity": "Bluetooth Low Energy",
                "use_case": "arrhythmia screening",
            },
        },
    ]
    phrasing_templates = [
        "{entity} in {topic} uses {value} for {slot}.",
        "The {slot} associated with {entity} is {value}.",
        "{entity} has {slot}: {value}.",
        "For {entity}, the recorded {slot} is {value}.",
        "Reference note: {entity} -> {slot} = {value}.",
    ]
    distractor_templates = [
        "{entity} was discussed in a briefing about cross-domain innovation.",
        "{entity} has been compared with several other projects in recent reports.",
        "A summary note mentions {entity} among notable initiatives.",
        "{entity} appears in a high-level update without technical details.",
        "Stakeholders reviewed {entity} in a strategic planning session.",
    ]
    memory_bank = []
    memory_id = 0
    for item in entities:
        entity = item["entity"]
        topic = item["topic"]
        for slot, value in item["facts"].items():
            for t in phrasing_templates:
                text = t.format(entity=entity, topic=topic, slot=slot, value=value)
                memory_bank.append(MemoryItem(
                    memory_id=memory_id,
                    topic=topic,
                    entity=entity,
                    slot=slot,
                    value=value,
                    text=text
                ))
                memory_id += 1
        for t in distractor_templates:
            text = t.format(entity=entity)
            memory_bank.append(MemoryItem(
                memory_id=memory_id,
                topic=topic,
                entity=entity,
                slot="distractor",
                value="n/a",
                text=text
            ))
            memory_id += 1
    extra_noise = [
        "General note: system maintenance occurred on Tuesday.",
        "A committee discussed budget timelines and operational readiness.",
        "The archive includes summaries of projects across multiple departments.",
        "No relevant technical value is stated in this memory.",
        "A status update mentioned partnerships and future opportunities.",
        "An unrelated note references shipping delays and staffing changes.",
        "Background memo: the team reviewed dashboards and reporting cadence.",
        "This memory contains no answer-bearing facts.",
    ]
    for text in extra_noise:
        memory_bank.append(MemoryItem(
            memory_id=memory_id,
            topic="noise",
            entity="none",
            slot="distractor",
            value="n/a",
            text=text
        ))
        memory_id += 1
    return memory_bank
memory_bank = build_memory_bank()
memory_texts = [m.text for m in memory_bank]
memory_embeddings = embed_texts(memory_texts)
def build_queries(memory_bank: List[MemoryItem]) -> List[Dict[str, Any]]:
    patterns = [
        "What is the {slot} of {entity}?",
        "Which {slot} does {entity} have?",
        "Tell me the {slot} for {entity}.",
        "Can you recall the {slot} associated with {entity}?",
        "What was recorded as the {slot} of {entity}?",
    ]
    queries = []
    qid = 0
    for m in memory_bank:
        if m.slot == "distractor":
            continue
        q = random.choice(patterns).format(slot=m.slot.replace("_", " "), entity=m.entity)
        queries.append({
            "query_id": qid,
            "query": q,
            "entity": m.entity,
            "slot": m.slot,
            "gold_value": m.value,
            "gold_memory_id": m.memory_id,
            "gold_text": m.text,
            "topic": m.topic,
        })
        qid += 1
    random.shuffle(queries)
    return queries
queries = build_queries(memory_bank)
query_texts = [q["query"] for q in queries]
query_embeddings = embed_texts(query_texts)
We construct a synthetic long-term memory bank that simulates stored knowledge across multiple domains. We generate structured memory items and convert them into textual memories that will later be embedded for semantic retrieval. We also create query datasets from these memories and embed them so the agent can compare queries with stored knowledge.
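As an optional sanity check, we can print a few items and confirm the embedding shapes line up with the bank; with the templates above, the counts should work out to 248 memories and 200 queries.
print(f"Memories: {len(memory_bank)} | Queries: {len(queries)}")
print("Sample memory:", memory_bank[0].text)
print("Sample query :", queries[0]["query"], "->", queries[0]["gold_value"])
print("Embedding shapes:", memory_embeddings.shape, query_embeddings.shape)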
MEM_BY_ID = {m.memory_id: m for m in memory_bank}
QUERY_BY_ID = {q["query_id"]: q for q in queries}
def keyword_overlap(a: str, b: str) -> float:
    ta = set(a.lower().replace("?", "").replace(".", "").split())
    tb = set(b.lower().replace("?", "").replace(".", "").split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / max(1, len(ta | tb))
def get_top_k_candidates(query_idx: int, k: int = 8) -> Dict[str, Any]:
    qv = query_embeddings[query_idx:query_idx+1]
    sims = cosine_similarity(qv, memory_embeddings)[0]
    top_idx = np.argsort(-sims)[:k]
    candidates = []
    q = queries[query_idx]
    for rank, midx in enumerate(top_idx):
        mem = memory_bank[midx]
        sim = float(sims[midx])
        overlap = keyword_overlap(q["query"], mem.text)
        entity_match = 1.0 if q["entity"].lower() in mem.text.lower() else 0.0
        slot_match = 1.0 if q["slot"].replace("_", " ").lower() in mem.text.lower() else 0.0
        is_gold = 1.0 if mem.memory_id == q["gold_memory_id"] else 0.0
        candidates.append({
            "rank": rank,
            "memory_index": midx,
            "memory_id": mem.memory_id,
            "text": mem.text,
            "sim": sim,
            "overlap": overlap,
            "entity_match": entity_match,
            "slot_match": slot_match,
            "is_gold": is_gold,
        })
    return {"query": q, "candidates": candidates}
ALL_CANDIDATES = [get_top_k_candidates(i, k=8) for i in range(len(queries))]
def build_state_features(item: Dict[str, Any]) -> np.ndarray:
    q = item["query"]
    feats = []
    for c in item["candidates"]:
        feats.extend([
            c["sim"],
            c["overlap"],
            c["entity_match"],
            c["slot_match"],
            1.0 / (1.0 + c["rank"]),
        ])
    unique_topic_bonus = 1.0 if q["topic"] in q["query"].lower() else 0.0
    query_len = min(len(q["query"].split()) / 20.0, 1.0)
    feats.extend([unique_topic_bonus, query_len])
    return np.array(feats, dtype=np.float32)
STATE_DIM = len(build_state_features(ALL_CANDIDATES[0]))
NUM_ACTIONS = len(ALL_CANDIDATES[0]["candidates"])
class MemoryRetrievalEnv(gym.Env):
    metadata = {"render_modes": ["human"]}
    def __init__(self, candidate_items: List[Dict[str, Any]], seed: int = 42):
        super().__init__()
        self.candidate_items = candidate_items
        self.rng = np.random.default_rng(seed)
        self.observation_space = spaces.Box(low=-10, high=10, shape=(STATE_DIM,), dtype=np.float32)
        self.action_space = spaces.Discrete(NUM_ACTIONS)
        self.current = None
    def reset(self, seed=None, options=None):
        if seed is not None:
            self.rng = np.random.default_rng(seed)
        idx = int(self.rng.integers(0, len(self.candidate_items)))
        self.current = self.candidate_items[idx]
        obs = build_state_features(self.current)
        info = {"query_id": self.current["query"]["query_id"]}
        return obs, info
    def step(self, action):
        chosen = self.current["candidates"][int(action)]
        q = self.current["query"]
        reward = 0.0
        reward += 2.0 * chosen["is_gold"]
        reward += 0.8 * chosen["entity_match"]
        reward += 0.6 * chosen["slot_match"]
        reward += 0.5 * chosen["sim"]
        reward += 0.3 * chosen["overlap"]
        reward -= 0.15 * chosen["rank"]
        done = True
        truncated = False
        info = {
            "query_id": q["query_id"],
            "chosen_memory_id": chosen["memory_id"],
            "gold_memory_id": q["gold_memory_id"],
            "chosen_text": chosen["text"],
            "gold_text": q["gold_text"],
            "is_correct": bool(chosen["memory_id"] == q["gold_memory_id"]),
            "gold_value": q["gold_value"],
            "query": q["query"],
        }
        next_obs = np.zeros(self.observation_space.shape, dtype=np.float32)
        return next_obs, float(reward), done, truncated, info
We prepare candidate memories for each query by computing similarity scores between query embeddings and memory embeddings. We then construct feature vectors that describe each candidate memory using similarity, keyword overlap, entity matching, and rank signals. Finally, we define a custom reinforcement learning environment in which the agent learns to pick the best memory to answer each query.
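Before training, a short, optional rollout sketch confirms the environment follows the Gymnasium API; episodes are single-step, bandit-style by design, so done should always come back True.
env_check = MemoryRetrievalEnv(ALL_CANDIDATES, seed=SEED)
obs, info = env_check.reset()
assert obs.shape == (STATE_DIM,)
next_obs, reward, done, truncated, step_info = env_check.step(env_check.action_space.sample())
print(f"random action -> reward={reward:.3f}, done={done}, correct={step_info['is_correct']}")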
split_1 = int(0.7 * len(ALL_CANDIDATES))
split_2 = int(0.85 * len(ALL_CANDIDATES))
train_items = ALL_CANDIDATES[:split_1]
val_items = ALL_CANDIDATES[split_1:split_2]
test_items = ALL_CANDIDATES[split_2:]
train_env = DummyVecEnv([lambda: MemoryRetrievalEnv(train_items, seed=SEED)])
model = PPO(
    "MlpPolicy",
    train_env,
    learning_rate=3e-4,
    n_steps=256,
    batch_size=64,
    gamma=0.99,
    gae_lambda=0.95,
    ent_coef=0.01,
    clip_range=0.2,
    verbose=0,
    seed=SEED,
)
model.learn(total_timesteps=12000)
def baseline_retrieve(item: Dict[str, Any]) -> Dict[str, Any]:
    best = max(item["candidates"], key=lambda x: x["sim"])
    return best
def rl_retrieve(item: Dict[str, Any]) -> Dict[str, Any]:
    obs = build_state_features(item)
    action, _ = model.predict(obs, deterministic=True)
    return item["candidates"][int(action)]
def evaluate_retriever(items: List[Dict[str, Any]], retriever_fn) -> Dict[str, Any]:
    rows = []
    for item in items:
        chosen = retriever_fn(item)
        q = item["query"]
        rows.append({
            "query_id": q["query_id"],
            "query": q["query"],
            "gold_value": q["gold_value"],
            "gold_memory_id": q["gold_memory_id"],
            "chosen_memory_id": chosen["memory_id"],
            "correct_retrieval": int(chosen["memory_id"] == q["gold_memory_id"]),
            "chosen_text": chosen["text"],
        })
    df = pd.DataFrame(rows)
    return {
        "df": df,
        "retrieval_accuracy": df["correct_retrieval"].mean(),
    }
baseline_val = evaluate_retriever(val_items, baseline_retrieve)
rl_val = evaluate_retriever(val_items, rl_retrieve)
baseline_test = evaluate_retriever(test_items, baseline_retrieve)
rl_test = evaluate_retriever(test_items, rl_retrieve)
print("Validation Retrieval Accuracy")
print("Baseline:", spherical(float(baseline_val["retrieval_accuracy"]), 4))
print("RL :", spherical(float(rl_val["retrieval_accuracy"]), 4))
print()
print("Test Retrieval Accuracy")
print("Baseline:", spherical(float(baseline_test["retrieval_accuracy"]), 4))
print("RL :", spherical(float(rl_test["retrieval_accuracy"]), 4))
results_df = pd.DataFrame([
    {"split": "validation", "method": "baseline_cosine", "retrieval_accuracy": float(baseline_val["retrieval_accuracy"])},
    {"split": "validation", "method": "rl_agent", "retrieval_accuracy": float(rl_val["retrieval_accuracy"])},
    {"split": "test", "method": "baseline_cosine", "retrieval_accuracy": float(baseline_test["retrieval_accuracy"])},
    {"split": "test", "method": "rl_agent", "retrieval_accuracy": float(rl_test["retrieval_accuracy"])},
])
display(results_df)
plot_df = results_df.copy()
for split_name in ["validation", "test"]:
    sub = plot_df[plot_df["split"] == split_name]
    plt.figure(figsize=(6, 4))
    plt.bar(sub["method"], sub["retrieval_accuracy"])
    plt.title(f"Retrieval Accuracy on {split_name.title()}")
    plt.ylim(0, 1)
    plt.ylabel("Accuracy")
    plt.show()
We split the datasets and initialize the reinforcement learning model. We train a PPO agent to learn a policy for selecting the most relevant memory from a set of candidates. After training, we evaluate the agent's retrieval performance and compare it with a baseline embedding-similarity approach.
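An optional diagnostic not shown above is to score both retrievers on the training split as well; a large train-validation gap would suggest the policy memorizes training queries rather than generalizing.
baseline_train = evaluate_retriever(train_items, baseline_retrieve)
rl_train = evaluate_retriever(train_items, rl_retrieve)
print("Train Retrieval Accuracy")
print("Baseline:", round(float(baseline_train["retrieval_accuracy"]), 4))
print("RL      :", round(float(rl_train["retrieval_accuracy"]), 4))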
def answer_with_retriever(item: Dict[str, Any], retriever_fn) -> Dict[str, Any]:
    q = item["query"]
    chosen = retriever_fn(item)
    retrieved_memories = [{
        "memory_id": chosen["memory_id"],
        "text": chosen["text"],
    }]
    answer = chat_answer(q["query"], retrieved_memories)
    judged = llm_judge_exact(q["query"], q["gold_value"], answer)
    return {
        "query": q["query"],
        "gold_value": q["gold_value"],
        "retrieved_text": chosen["text"],
        "predicted_answer": answer,
        "answer_score": judged,
        "retrieval_correct": int(chosen["memory_id"] == q["gold_memory_id"]),
    }
sample_test_items = random.sample(test_items, min(12, len(test_items)))
baseline_answers = [answer_with_retriever(item, baseline_retrieve) for item in tqdm(sample_test_items, desc="Baseline QA")]
rl_answers = [answer_with_retriever(item, rl_retrieve) for item in tqdm(sample_test_items, desc="RL QA")]
baseline_answer_df = pd.DataFrame(baseline_answers)
rl_answer_df = pd.DataFrame(rl_answers)
print("Sample Downstream QA Accuracy")
print("Baseline:", spherical(float(baseline_answer_df["answer_score"].imply()), 4))
print("RL :", spherical(float(rl_answer_df["answer_score"].imply()), 4))
comparability = pd.DataFrame([
{"method": "baseline_cosine", "qa_accuracy": float(baseline_answer_df["answer_score"].imply())},
{"methodology": "rl_agent", "qa_accuracy": float(rl_answer_df["answer_score"].imply())},
])
show(comparability)
plt.figure(figsize=(6, 4))
plt.bar(comparison["method"], comparison["qa_accuracy"])
plt.title("Downstream QA Accuracy from Retrieved Memories")
plt.ylim(0, 1)
plt.ylabel("Accuracy")
plt.show()
def inspect_examples(items: List[Dict[str, Any]], n: int = 5):
    chosen_items = random.sample(items, min(n, len(items)))
    rows = []
    for item in chosen_items:
        q = item["query"]
        base = baseline_retrieve(item)
        rlm = rl_retrieve(item)
        rows.append({
            "query": q["query"],
            "gold_value": q["gold_value"],
            "baseline_text": base["text"],
            "baseline_correct": int(base["memory_id"] == q["gold_memory_id"]),
            "rl_text": rlm["text"],
            "rl_correct": int(rlm["memory_id"] == q["gold_memory_id"]),
        })
    return pd.DataFrame(rows)
examples_df = inspect_examples(test_items, n=8)
pd.set_option("display.max_colwidth", 200)
display(examples_df)
We evaluate how well the retrieved memories support downstream question answering. We generate answers using the retrieved memory context and assess the answers with an LLM-based judge to determine correctness. We also inspect example queries to visually compare how the baseline retriever and the RL agent choose different memories.
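As an optional follow-up, we can condition QA accuracy on retrieval correctness; if accuracy stays low even when the gold memory is retrieved, the bottleneck is the answering or judging step rather than the retriever.
for name, df in [("baseline", baseline_answer_df), ("rl", rl_answer_df)]:
    hits = df[df["retrieval_correct"] == 1]
    misses = df[df["retrieval_correct"] == 0]
    if len(hits):
        print(f"{name}: QA accuracy given correct retrieval = {hits['answer_score'].mean():.2f}")
    if len(misses):
        print(f"{name}: QA accuracy given wrong retrieval   = {misses['answer_score'].mean():.2f}")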
def interactive_demo(question: str, top_k: int = 8):
    qv = embed_texts([question])
    sims = cosine_similarity(qv, memory_embeddings)[0]
    top_idx = np.argsort(-sims)[:top_k]
    candidates = []
    for rank, midx in enumerate(top_idx):
        mem = memory_bank[midx]
        candidates.append({
            "rank": rank,
            "memory_index": int(midx),
            "memory_id": int(mem.memory_id),
            "text": mem.text,
            "sim": float(sims[midx]),
            "overlap": keyword_overlap(question, mem.text),
            "entity_match": 0.0,
            "slot_match": 0.0,
            "is_gold": 0.0,
        })
    pseudo_item = {
        "query": {
            "query_id": -1,
            "query": question,
            "entity": "unknown",
            "slot": "unknown",
            "gold_value": "unknown",
            "gold_memory_id": -1,
            "gold_text": "",
            "topic": "unknown",
        },
        "candidates": candidates,
    }
    obs = build_state_features(pseudo_item)
    action, _ = model.predict(obs, deterministic=True)
    selected = pseudo_item["candidates"][int(action)]
    answer = chat_answer(question, [{"memory_id": selected["memory_id"], "text": selected["text"]}])
    print("=" * 100)
    print("QUESTION")
    print(question)
    print("=" * 100)
    print("TOP CANDIDATES")
    for c in candidates:
        print(f"[Rank {c['rank']}] sim={c['sim']:.4f} | {c['text']}")
    print("=" * 100)
    print("RL-SELECTED MEMORY")
    print(selected["text"])
    print("=" * 100)
    print("ANSWER")
    print(answer)
    print("=" * 100)
interactive_demo("What is the battery of Pulse?")
interactive_demo("Which hub does Atlas have?")
interactive_demo("Tell me the nation for Cedar.")
artifact_dir = "/content material/rl_agent_memory_retrieval_artifacts"
os.makedirs(artifact_dir, exist_ok=True)
results_df.to_csv(f"{artifact_dir}/retrieval_results.csv", index=False)
baseline_val["df"].to_csv(f"{artifact_dir}/baseline_val.csv", index=False)
rl_val["df"].to_csv(f"{artifact_dir}/rl_val.csv", index=False)
baseline_test["df"].to_csv(f"{artifact_dir}/baseline_test.csv", index=False)
rl_test["df"].to_csv(f"{artifact_dir}/rl_test.csv", index=False)
baseline_answer_df.to_csv(f"{artifact_dir}/baseline_qa_sample.csv", index=False)
rl_answer_df.to_csv(f"{artifact_dir}/rl_qa_sample.csv", index=False)
examples_df.to_csv(f"{artifact_dir}/example_comparisons.csv", index=False)
np.save(f"{artifact_dir}/memory_embeddings.npy", memory_embeddings)
np.save(f"{artifact_dir}/query_embeddings.npy", query_embeddings)
model.save(f"{artifact_dir}/ppo_memory_retriever")
with open(f"{artifact_dir}/memory_bank.json", "w") as f:
json.dump([m.__dict__ for m in memory_bank], f, indent=2)
with open(f"{artifact_dir}/queries.json", "w") as f:
json.dump(queries, f, indent=2)
print(f"Saved artifacts to: {artifact_dir}")
print("Tutorial full.")
We build an interactive demonstration that lets us test the trained retrieval agent on new questions. We show the candidate memories, highlight the memory selected by the RL agent, and generate an answer using the selected context. We also save all artifacts, including embeddings, results, datasets, and the trained RL model, so that the system can be reused or further analyzed.
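As a minimal sketch of that reuse, the saved artifacts can be reloaded in a fresh session (assuming the same artifact_dir); PPO.load restores the trained policy without needing the original environment.
reloaded_model = PPO.load(f"{artifact_dir}/ppo_memory_retriever")
reloaded_embeddings = np.load(f"{artifact_dir}/memory_embeddings.npy")
with open(f"{artifact_dir}/memory_bank.json") as f:
    reloaded_bank = [MemoryItem(**d) for d in json.load(f)]
print(len(reloaded_bank), reloaded_embeddings.shape)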
In conclusion, we demonstrated how reinforcement learning can enhance memory retrieval in agentic AI systems. We trained an RL agent to select relevant memories from a set of candidates using signals such as semantic similarity, keyword overlap, and entity matching. We then evaluated the retriever and observed how the learned policy compares with traditional embedding-based retrieval methods. By integrating the retriever with an LLM, we also showed how better memory selection improves downstream question-answering performance. Through experiments, visualizations, and interactive demos, we explored how RL can optimize long-term memory access in intelligent agents.
