
How to Build a Meta-Cognitive AI Agent That Dynamically Adjusts Its Own Reasoning Depth for Efficient Problem Solving

In this tutorial, we build an advanced meta-cognitive control agent that learns how to regulate its own depth of thinking. We treat reasoning as a spectrum, ranging from fast heuristics to deep chain-of-thought to precise tool-based solving, and we train a neural meta-controller to decide which mode to use for each task. By optimizing the trade-off between accuracy, computation cost, and a limited reasoning budget, we explore how an agent can monitor its internal state and adapt its reasoning strategy in real time. Through each snippet, we experiment, observe patterns, and understand how meta-cognition emerges when an agent learns to think about its own thinking. Check out the FULL CODE NOTEBOOK.

import random
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

# Use a GPU if one is available; the rest of the code references `device`
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


OPS = ['+', '*']


def make_task():
    op = random.choice(OPS)
    if op == '+':
        a, b = random.randint(1, 99), random.randint(1, 99)
    else:
        a, b = random.randint(2, 19), random.randint(2, 19)
    return a, b, op


def true_answer(a, b, op):
    return a + b if op == '+' else a * b


def true_difficulty(a, b, op):
    if op == '+' and a <= 30 and b <= 30:
        return 0
    if op == '*' and a <= 10 and b <= 10:
        return 1
    return 2


def heuristic_difficulty(a, b, op):
    score = 0.0
    if op == '*':
        score += 0.6
    score += max(a, b) / 100.0
    return min(score, 1.0)


def fast_heuristic(a, b, op):
    if op == '+':
        base = a + b
        noise = random.choice([-2, -1, 0, 0, 0, 1, 2, 3])
    else:
        base = int(0.8 * a * b)
        noise = random.choice([-5, -3, 0, 0, 2, 5, 8])
    return base + noise, 0.5


def deep_chain_of_thought(a, b, op, verbose=False):
    if op == '+':
        x, y = a, b
        carry = 0
        pos = 1
        result = 0
        step = 0
        while x > 0 or y > 0 or carry:
            dx, dy = x % 10, y % 10
            s = dx + dy + carry
            carry, digit = divmod(s, 10)
            result += digit * pos
            x //= 10; y //= 10; pos *= 10
            step += 1
    else:
        result = 0
        step = 0
        for i, d in enumerate(reversed(str(b))):
            row = a * int(d) * (10 ** i)
            result += row
            step += 1
    return result, max(2.0, 0.4 * step)


def tool_solver(a, b, op):
    return eval(f"{a}{op}{b}"), 1.2


ACTION_NAMES = ["fast", "deep", "tool"]

We set up the world our meta-agent operates in. We generate arithmetic tasks, define ground-truth answers, estimate difficulty, and implement three different reasoning modes. As we run it, we observe how each solver behaves differently in terms of accuracy and computational cost, which forms the foundation of the agent's decision space. Check out the FULL CODE NOTEBOOK.
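Before training anything, it helps to sanity-check the three reasoning modes side by side. The short sketch below is our addition rather than part of the original notebook: it samples a few random tasks and prints each solver's prediction and cost against the ground truth.

# Illustrative sanity check (our addition, not from the original notebook):
# compare the three solvers on a handful of random tasks.
for _ in range(5):
    a, b, op = make_task()
    truth = true_answer(a, b, op)
    for name, solver in [("fast", fast_heuristic),
                         ("deep", deep_chain_of_thought),
                         ("tool", tool_solver)]:
        pred, cost = solver(a, b, op)
        print(f"{a} {op} {b} = {truth} | {name}: pred={pred}, cost={cost}")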

def encode_state(a, b, op, rem_budget, error_ema, last_action):
    a_n = a / 100.0
    b_n = b / 100.0
    op_plus = 1.0 if op == '+' else 0.0
    op_mul = 1.0 - op_plus
    diff_hat = heuristic_difficulty(a, b, op)
    rem_n = rem_budget / MAX_BUDGET
    last_onehot = [0.0, 0.0, 0.0]
    if last_action is not None:
        last_onehot[last_action] = 1.0
    feats = [
        a_n, b_n, op_plus, op_mul,
        diff_hat, rem_n, error_ema
    ] + last_onehot
    return torch.tensor(feats, dtype=torch.float32, device=device)


STATE_DIM = 10
N_ACTIONS = 3


class PolicyNet(nn.Module):
    def __init__(self, state_dim, hidden=48, n_actions=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_actions)
        )

    def forward(self, x):
        return self.net(x)


policy = PolicyNet(STATE_DIM, hidden=48, n_actions=N_ACTIONS).to(device)
optimizer = optim.Adam(policy.parameters(), lr=3e-3)

We encode each task into a structured state that captures the operands, operation type, predicted difficulty, remaining budget, and recent performance. We then define a neural policy network that maps this state to a probability distribution over actions. As we work through it, we see how the policy becomes the core mechanism through which the agent learns to regulate its thinking. Check out the FULL CODE NOTEBOOK.
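To make the state-to-distribution mapping concrete, here is a minimal sketch (our addition, illustrative only): it pushes a hand-built zero state through the untrained policy and prints its softmax probabilities over the three modes. A real state would come from encode_state once the budget constant below is defined.

# Illustrative check (our addition): feed a dummy 10-dim state through the
# untrained policy and inspect its softmax over the three reasoning modes.
dummy_state = torch.zeros(STATE_DIM, device=device)
with torch.no_grad():
    probs = torch.softmax(policy(dummy_state), dim=-1)
print({name: round(float(p), 3) for name, p in zip(ACTION_NAMES, probs)})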

GAMMA = 0.98
COST_PENALTY = 0.25
MAX_BUDGET = 25.0
EPISODES = 600
STEPS_PER_EP = 20
ERROR_EMA_DECAY = 0.9


def run_episode(train=True):
    log_probs = []
    rewards = []
    infos = []
    rem_budget = MAX_BUDGET
    error_ema = 0.0
    last_action = None

    for _ in range(STEPS_PER_EP):
        a, b, op = make_task()
        state = encode_state(a, b, op, rem_budget, error_ema, last_action)
        logits = policy(state)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample() if train else torch.argmax(logits)
        act_idx = int(action.item())

        if act_idx == 0:
            pred, cost = fast_heuristic(a, b, op)
        elif act_idx == 1:
            pred, cost = deep_chain_of_thought(a, b, op, verbose=False)
        else:
            pred, cost = tool_solver(a, b, op)

        correct = (pred == true_answer(a, b, op))
        acc_reward = 1.0 if correct else 0.0
        budget_penalty = 0.0

        rem_budget -= cost
        if rem_budget < 0:
            budget_penalty = -1.5 * (abs(rem_budget) / MAX_BUDGET)

        step_reward = acc_reward - COST_PENALTY * cost + budget_penalty
        rewards.append(step_reward)

        if train:
            log_probs.append(dist.log_prob(action))

        err = 0.0 if correct else 1.0
        error_ema = ERROR_EMA_DECAY * error_ema + (1 - ERROR_EMA_DECAY) * err
        last_action = act_idx

        infos.append({
            "correct": correct,
            "cost": cost,
            "difficulty": true_difficulty(a, b, op),
            "action": act_idx
        })

    if train:
        returns = []
        G = 0.0
        for r in reversed(rewards):
            G = r + GAMMA * G
            returns.append(G)
        returns = list(reversed(returns))
        returns_t = torch.tensor(returns, dtype=torch.float32, device=device)
        baseline = returns_t.mean()
        adv = returns_t - baseline
        loss = -(torch.stack(log_probs) * adv).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return rewards, infos

We implement the heart of learning using the REINFORCE policy gradient algorithm. We run multi-step episodes, collect log-probabilities, accumulate rewards, and compute discounted returns. As we execute this part, we watch the meta-controller adjust its strategy by reinforcing decisions that balance accuracy with cost. Check out the FULL CODE NOTEBOOK.
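To see the return computation in isolation, the tiny check below (our addition, purely illustrative) computes discounted returns for a toy reward sequence and the baseline-subtracted advantages that weight the log-probabilities in the loss, mirroring the loop inside run_episode.

# Illustrative only (our addition): discounted returns and advantages
# for a toy reward sequence, using the same GAMMA as training.
toy_rewards = [1.0, 0.0, 0.5]
G, toy_returns = 0.0, []
for r in reversed(toy_rewards):
    G = r + GAMMA * G
    toy_returns.append(G)
toy_returns = list(reversed(toy_returns))
print("returns:", [round(g, 3) for g in toy_returns])
print("advantages:", [round(g - np.mean(toy_returns), 3) for g in toy_returns])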

print("Training meta-cognitive controller...")
for ep in vary(EPISODES):
   rewards, _ = run_episode(prepare=True)
   if (ep + 1) % 100 == 0:
       print(f" episode {ep+1:4d} | avg reward {np.imply(rewards):.3f}")


def evaluate(n_episodes=50):
    all_actions = {0: [0, 0, 0], 1: [0, 0, 0], 2: [0, 0, 0]}
    stats = {0: {"n": 0, "acc": 0, "cost": 0},
             1: {"n": 0, "acc": 0, "cost": 0},
             2: {"n": 0, "acc": 0, "cost": 0}}

    for _ in range(n_episodes):
        _, infos = run_episode(train=False)
        for step in infos:
            d = step["difficulty"]
            a_idx = step["action"]
            all_actions[d][a_idx] += 1
            stats[d]["n"] += 1
            stats[d]["acc"] += 1 if step["correct"] else 0
            stats[d]["cost"] += step["cost"]

    for d in [0, 1, 2]:
        if stats[d]["n"] == 0:
            continue
        n = stats[d]["n"]
        print(f"Difficulty {d}:")
        print("  action counts [fast, deep, tool]:", all_actions[d])
        print("  accuracy:", stats[d]["acc"] / n)
        print("  avg cost:", stats[d]["cost"] / n)
        print()


print("Policy behavior by difficulty:")
evaluate()

We train the meta-cognitive agent over hundreds of episodes and evaluate its behavior across difficulty levels. We observe how the policy evolves, using fast heuristics for easy tasks while resorting to deeper reasoning for harder ones. As we analyze the outputs, we understand how training shapes the agent's reasoning choices. Check out the FULL CODE NOTEBOOK.
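If you want a single summary number alongside the per-difficulty breakdown, the optional sketch below (our addition, not part of the original notebook) aggregates accuracy and cost over a fresh batch of greedy-policy episodes.

# Optional overall summary (our addition): aggregate accuracy and cost
# over greedy-policy episodes, ignoring the difficulty buckets.
total, hits, spend = 0, 0, 0.0
for _ in range(20):
    _, infos = run_episode(train=False)
    for step in infos:
        total += 1
        hits += 1 if step["correct"] else 0
        spend += step["cost"]
print(f"overall accuracy: {hits/total:.3f} | avg cost per task: {spend/total:.2f}")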

print("nExample exhausting job with meta-selected pondering mode:")
a, b, op = 47, 18, '*'
state = encode_state(a, b, op, MAX_BUDGET, 0.3, None)
with torch.no_grad():
   logits = coverage(state)
   act = int(torch.argmax(logits).merchandise())


print(f"Task: {a} {op} {b}")
print("Chosen mode:", ACTION_NAMES[act])


if act == 1:
   pred, price = deep_chain_of_thought(a, b, op, verbose=True)
elif act == 0:
   pred, price = fast_heuristic(a, b, op)
   print("Fast heuristic:", pred)
else:
   pred, price = tool_solver(a, b, op)
   print("Tool solver:", pred)


print("True:", true_answer(a,b,op), "| price:", price)

We inspect a detailed reasoning trace for a hard example chosen by the trained policy. We see the agent confidently select a mode and walk through the reasoning steps, allowing us to witness its meta-cognitive behavior in action. As we test different tasks, we appreciate how the model adapts its thinking based on context.
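As a final probe (our addition, not part of the original notebook), the sketch below sweeps a handful of random tasks and prints which reasoning mode the trained policy would pick for each under greedy action selection, so you can watch the choice track task difficulty.

# Our addition (illustrative): greedily pick a mode for a few random tasks.
for _ in range(5):
    a, b, op = make_task()
    state = encode_state(a, b, op, MAX_BUDGET, 0.0, None)
    with torch.no_grad():
        act = int(torch.argmax(policy(state)).item())
    print(f"{a} {op} {b} -> {ACTION_NAMES[act]} (difficulty {true_difficulty(a, b, op)})")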

In conclusion, we’ve seen how a neural controller can learn to dynamically choose the right reasoning pathway based on the task’s difficulty and the constraints of the moment. We observe how the agent gradually discovers when quick heuristics are sufficient, when deeper reasoning is necessary, and when calling a precise solver is worth the cost. Through this process, we experience how meta-cognitive control transforms decision-making, leading to more efficient and adaptable reasoning systems.

