
How to Build an Advanced AI Agent with Summarized Short-Term and Vector-Based Long-Term Memory


In this tutorial, we walk you through building an advanced AI agent that not only chats but also remembers. We start from scratch and demonstrate how to combine a lightweight LLM, FAISS vector search, and a summarization mechanism to create both short-term and long-term memory. By working with embeddings and auto-distilled facts, we craft an agent that adapts to our instructions, recalls important details in future conversations, and intelligently compresses context, keeping the interaction smooth and efficient. Check out the FULL CODES here.

!pip -q install transformers accelerate bitsandbytes sentence-transformers faiss-cpu


import os, json, time, uuid, math, re
from datetime import datetime
import torch, faiss
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig
from sentence_transformers import SentenceTransformer
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

We begin by installing the essential libraries and importing all of the required modules for our agent. We set up the environment to detect whether a GPU or a CPU is available, so we can run the model efficiently. Check out the FULL CODES here.
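If we want to confirm which hardware the run detected before loading anything, a quick optional check (not part of the original walkthrough) looks like this:

# Optional sanity check: report the resolved device (and the GPU name if one is present).
print("Running on:", DEVICE, torch.cuda.get_device_name(0) if DEVICE == "cuda" else "")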

def load_llm(model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0"):
    try:
        if DEVICE=="cuda":
            # On GPU, load the model in 4-bit (NF4) to fit comfortably in memory.
            bnb=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_quant_type="nf4")
            tok=AutoTokenizer.from_pretrained(model_name, use_fast=True)
            mdl=AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb, device_map="auto")
        else:
            # CPU fallback with reduced memory usage.
            tok=AutoTokenizer.from_pretrained(model_name, use_fast=True)
            mdl=AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32, low_cpu_mem_usage=True)
        return pipeline("text-generation", model=mdl, tokenizer=tok, device=0 if DEVICE=="cuda" else -1, do_sample=True)
    except Exception as e:
        raise RuntimeError(f"Failed to load LLM: {e}")

We define a function to load our language model. We set it up so that if a GPU is available, we use 4-bit quantization for efficiency; otherwise, we fall back to the CPU with optimized settings. This ensures we can generate text smoothly regardless of the hardware we are running on. Check out the FULL CODES here.
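Before wiring the model into the agent, we can give load_llm a quick optional smoke test; the prompt text below is just an illustrative placeholder, and on a small GPU you may prefer to skip this since the agent loads its own copy later.

# Illustrative smoke test for load_llm (not part of the original tutorial).
llm = load_llm()
sample = llm("Q: What is short-term memory in an AI agent?\nA:", max_new_tokens=40)
print(sample[0]["generated_text"])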

class VectorMemory:
    def __init__(self, path="/content/agent_memory.json", dim=384):
        self.path=path; self.dim=dim; self.items=[]
        self.embedder=SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device=DEVICE)
        self.index=faiss.IndexFlatIP(dim)
        if os.path.exists(path):
            # Reload previously persisted memories and rebuild the FAISS index.
            data=json.load(open(path))
            self.items=data.get("items",[])
            if self.items:
                X=torch.tensor([x["emb"] for x in self.items], dtype=torch.float32).numpy()
                self.index.add(X)
    def _emb(self, text):
        v=self.embedder.encode([text], normalize_embeddings=True)[0]
        return v.tolist()
    def add(self, text, meta=None):
        e=self._emb(text); self.index.add(torch.tensor([e]).numpy())
        rec={"id":str(uuid.uuid4()),"text":text,"meta":meta or {},"emb":e}
        self.items.append(rec); self._save(); return rec["id"]
    def search(self, query, k=5, thresh=0.25):
        if len(self.items)==0: return []
        q=self.embedder.encode([query], normalize_embeddings=True)
        D,I=self.index.search(q, min(k, len(self.items)))
        out=[]
        for d,i in zip(D[0],I[0]):
            if i==-1: continue
            if d>=thresh: out.append((d,self.items[i]))
        return out
    def _save(self):
        slim=[{k:v for k,v in it.items()} for it in self.items]
        json.dump({"items":slim}, open(self.path,"w"), indent=2)

We create a VectorMemory class that gives our agent long-term memory. We store past interactions as embeddings using MiniLM and index them with FAISS, allowing us to search for and recall relevant information later. Each memory is saved to disk, so the agent retains what it has learned across sessions. Check out the FULL CODES here.
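To see the long-term store working in isolation, here is a small, hypothetical sketch; the facts and the demo path below are invented purely for illustration:

# Illustrative only: exercise VectorMemory on its own (the path and facts are hypothetical).
vm = VectorMemory(path="/content/demo_memory.json")
vm.add("User prefers to be called Nik.", {"source": "demo"})
vm.add("User is preparing for an exam in 2027.", {"source": "demo"})
for score, item in vm.search("What should I call the user?", k=2):
    print(f"{score:.2f} -> {item['text']}")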

def now_iso(): return datetime.now().isoformat(timespec="seconds")
def clamp(txt, n=1600): return txt if len(txt)<=n else txt[:n]+" …"
def strip_json(s):
    # Extract the first {...} span from a model response so it can be parsed as JSON.
    m=re.search(r"\{.*\}", s, flags=re.S)
    return m.group(0) if m else None


SYS_GUIDE = (
"You are a helpful, concise assistant with memory. Use provided MEMORY when relevant. "
"Prefer facts from MEMORY over guesses. Answer directly; keep code blocks tight. If unsure, say so."
)


SUMMARIZE_PROMPT = lambda convo: f"Summarize the conversation below in 4-6 bullet points focusing on stable facts and tasks:\n\n{convo}\n\nSummary:"
DISTILL_PROMPT = lambda user: (
f"""Decide if the USER text contains durable info worth long-term memory (preferences, identity, projects, deadlines, facts).
Return compact JSON only: {{"save": true/false, "memory": "one-sentence memory"}}.
USER: {user}""")


class MemoryAgent:
    def __init__(self):
        self.llm=load_llm()
        self.mem=VectorMemory()
        self.turns=[]      # short-term memory: recent (role, text) turns
        self.summary=""    # rolling summary of older turns
        self.max_turns=10
    def _gen(self, prompt, max_new_tokens=256, temp=0.7):
        out=self.llm(prompt, max_new_tokens=max_new_tokens, temperature=temp, top_p=0.95, num_return_sequences=1, pad_token_id=self.llm.tokenizer.eos_token_id)[0]["generated_text"]
        return out[len(prompt):].strip() if out.startswith(prompt) else out.strip()
    def _chat_prompt(self, user, memory_context):
        convo="\n".join([f"{r.upper()}: {t}" for r,t in self.turns[-8:]])
        sys=f"System: {SYS_GUIDE}\nTime: {now_iso()}\n\n"
        mem=f"MEMORY (relevant excerpts):\n{memory_context}\n\n" if memory_context else ""
        summ=f"CONTEXT SUMMARY:\n{self.summary}\n\n" if self.summary else ""
        return sys+mem+summ+convo+f"\nUSER: {user}\nASSISTANT:"
    def _distill_and_store(self, user):
        try:
            # Ask the LLM whether the user's message contains a durable fact worth storing.
            raw=self._gen(DISTILL_PROMPT(user), max_new_tokens=120, temp=0.1)
            js=strip_json(raw)
            if js:
                obj=json.loads(js)
                if obj.get("save") and obj.get("memory"):
                    self.mem.add(obj["memory"], {"ts":now_iso(),"source":"distilled"})
                    return True, obj["memory"]
        except Exception: pass
        # Heuristic fallback: store messages that look like preferences, identity, or deadlines.
        if re.search(r"\b(my name is|call me|I like|deadline|due|email|phone|working on|prefer|timezone|birthday|goal|exam)\b", user, flags=re.I):
            m=f"User said: {clamp(user,120)}"
            self.mem.add(m, {"ts":now_iso(),"source":"heuristic"})
            return True, m
        return False, ""
    def _maybe_summarize(self):
        if len(self.turns)>self.max_turns:
            convo="\n".join([f"{r}: {t}" for r,t in self.turns])
            s=self._gen(SUMMARIZE_PROMPT(clamp(convo, 3500)), max_new_tokens=180, temp=0.2)
            self.summary=s; self.turns=self.turns[-4:]
    def recall(self, query, k=5):
        hits=self.mem.search(query, k=k)
        return "\n".join([f"- ({d:.2f}) {h['text']} [meta={h['meta']}]" for d,h in hits])
    def ask(self, user):
        self.turns.append(("user", user))
        saved, memline = self._distill_and_store(user)
        mem_ctx=self.recall(user, k=6)
        prompt=self._chat_prompt(user, mem_ctx)
        reply=self._gen(prompt)
        self.turns.append(("assistant", reply))
        self._maybe_summarize()
        status=f"💾 memory_saved: {saved}; " + (f"note: {memline}" if saved else "note: -")
        print(f"\nUSER: {user}\nASSISTANT: {reply}\n{status}")
        return reply

We bring everything together in the MemoryAgent class. We design the agent to generate responses with context, distill important facts into long-term memory, and periodically summarize conversations to manage short-term context. With this setup, we create an assistant that remembers, recalls, and adapts to our interactions. Check out the FULL CODES here.

agent=MemoryAgent()


print("✅ Agent prepared. Strive these:n")
agent.ask("Hello! My identify is Nicolaus, I favor being known as Nik. I am making ready for UPSC in 2027.")
agent.ask("Additionally, I work at  Visa in analytics and love concise solutions.")
agent.ask("What's my examination 12 months and the way do you have to tackle me subsequent time?")
agent.ask("Reminder: I like agentic RAG tutorials with single-file Colab code.")
agent.ask("Given my prefs, recommend a research focus for this week in a single paragraph.")

We instantiate our MemoryAgent and immediately exercise it with a few messages to seed long-term memories and verify recall. We confirm it remembers our preferred name and exam year, adapts replies to our concise style, and uses past preferences (agentic RAG, single-file Colab) to tailor study guidance in the present.
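Because every distilled fact is also written to disk, we can peek at the persisted store after the conversation; this snippet assumes the default /content/agent_memory.json path used above:

# Optional check (assumes the default memory path): list what was persisted across sessions.
with open("/content/agent_memory.json") as f:
    store = json.load(f)
for it in store["items"]:
    print(it["meta"].get("ts"), "-", it["text"])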

In conclusion, we see how powerful it is when we give our AI agent the ability to remember. We now have an agent that stores key details, recalls them when relevant, and summarizes conversations to stay efficient. This approach keeps our interactions contextual and evolving, making the agent feel more personal and intelligent with each exchange. With this foundation, we are ready to extend memory further, explore richer schemas, and experiment with more advanced memory-augmented agent designs.


