
How Exploration Agents like Q-Learning, UCB, and MCTS Collaboratively Learn Intelligent Problem-Solving Strategies in Dynamic Grid Environments

In this tutorial, we explore how exploration strategies shape intelligent decision-making in agent-based problem solving. We build and train three agents, Q-Learning with epsilon-greedy exploration, Upper Confidence Bound (UCB), and Monte Carlo Tree Search (MCTS), to navigate a grid world and reach a goal efficiently while avoiding obstacles. We also experiment with different ways of balancing exploration and exploitation, visualize learning curves, and compare how each agent adapts and performs under uncertainty.

import numpy as np
import random
from collections import defaultdict, deque
import math
import matplotlib.pyplot as plt
from typing import List, Tuple, Dict


class GridWorld:
    def __init__(self, size=10, n_obstacles=15):
        self.size = size
        self.grid = np.zeros((size, size))
        self.start = (0, 0)
        self.goal = (size - 1, size - 1)
        # Four moves: up, down, left, right
        self.moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
        obstacles = set()
        while len(obstacles) < n_obstacles:
            obs = (random.randint(0, size - 1), random.randint(0, size - 1))
            if obs not in [self.start, self.goal]:
                obstacles.add(obs)
                self.grid[obs] = 1
        self.reset()

    def reset(self):
        self.agent_pos = self.start
        return self.agent_pos

    def step(self, action):
        # Apply the move only if it stays on the grid and does not hit an obstacle
        move = self.moves[action]
        new_pos = (self.agent_pos[0] + move[0], self.agent_pos[1] + move[1])
        if (0 <= new_pos[0] < self.size and 0 <= new_pos[1] < self.size
                and self.grid[new_pos] == 0):
            self.agent_pos = new_pos
        if self.agent_pos == self.goal:
            reward, done = 100, True
        else:
            reward, done = -1, False
        return self.agent_pos, reward, done

    def get_valid_actions(self, state):
        valid = []
        for i, move in enumerate(self.moves):
            new_pos = (state[0] + move[0], state[1] + move[1])
            if (0 <= new_pos[0] < self.size and 0 <= new_pos[1] < self.size
                    and self.grid[new_pos] == 0):
                valid.append(i)
        return valid

We begin by creating a grid world environment that challenges our agent to reach a goal while avoiding obstacles. We design its structure, define movement rules, and enforce realistic navigation boundaries to simulate an interactive problem-solving space. This forms the foundation on which our exploration agents will operate and learn.
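A minimal sketch like the one below, assuming the GridWorld class and the imports defined above, lets us probe the environment with a few random valid actions before any learning happens:

# Minimal sketch: probe the environment with random valid actions.
env_demo = GridWorld(size=6, n_obstacles=5)
state = env_demo.reset()
for t in range(5):
    valid = env_demo.get_valid_actions(state)
    if not valid:
        break
    action = random.choice(valid)                 # pick any legal move
    state, reward, done = env_demo.step(action)   # returns (next_state, reward, done)
    print(f"step {t}: pos={state}, reward={reward}, done={done}")
    if done:
        break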

class QLearningAgent:
    def __init__(self, n_actions=4, alpha=0.1, gamma=0.95, epsilon=1.0):
        self.n_actions = n_actions
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.q_table = defaultdict(lambda: np.zeros(n_actions))

    def get_action(self, state, valid_actions):
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit
        if random.random() < self.epsilon:
            return random.choice(valid_actions)
        else:
            q_values = self.q_table[state]
            valid_q = [(a, q_values[a]) for a in valid_actions]
            return max(valid_q, key=lambda x: x[1])[0]

    def update(self, state, action, reward, next_state, valid_next_actions):
        current_q = self.q_table[state][action]
        if valid_next_actions:
            max_next_q = max([self.q_table[next_state][a] for a in valid_next_actions])
        else:
            max_next_q = 0
        new_q = current_q + self.alpha * (reward + self.gamma * max_next_q - current_q)
        self.q_table[state][action] = new_q

    def decay_epsilon(self, decay_rate=0.995):
        self.epsilon = max(0.01, self.epsilon * decay_rate)

We implement the Q-Learning agent, which learns from experience under an epsilon-greedy policy. We observe how it explores random actions early on and gradually focuses on the most rewarding paths. Through iterative temporal-difference updates, it learns to balance exploration and exploitation effectively.
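To make the update rule concrete, here is a small hand-worked sketch of the temporal-difference step the agent performs, Q(s, a) ← Q(s, a) + α·(r + γ·max Q(s', a') − Q(s, a)), using purely illustrative numbers rather than values from a real run:

# Illustrative numbers only: one Q-Learning temporal-difference update.
alpha, gamma = 0.1, 0.95
current_q = 0.0      # Q(s, a) before the update
reward = -1          # per-step penalty from the grid world
max_next_q = 2.0     # best Q-value available in the next state
new_q = current_q + alpha * (reward + gamma * max_next_q - current_q)
print(round(new_q, 4))   # 0.1 * (-1 + 0.95 * 2.0) = 0.09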

class UCBAgent:
    def __init__(self, n_actions=4, c=2.0, gamma=0.95):
        self.n_actions = n_actions
        self.c = c
        self.gamma = gamma
        self.q_values = defaultdict(lambda: np.zeros(n_actions))
        self.action_counts = defaultdict(lambda: np.zeros(n_actions))
        self.total_counts = defaultdict(int)

    def get_action(self, state, valid_actions):
        self.total_counts[state] += 1
        ucb_values = []
        for action in valid_actions:
            q = self.q_values[state][action]
            count = self.action_counts[state][action]
            if count == 0:
                # Try every action at least once before trusting the bounds
                return action
            exploration_bonus = self.c * math.sqrt(math.log(self.total_counts[state]) / count)
            ucb_values.append((action, q + exploration_bonus))
        return max(ucb_values, key=lambda x: x[1])[0]

    def update(self, state, action, reward, next_state, valid_next_actions):
        self.action_counts[state][action] += 1
        count = self.action_counts[state][action]
        current_q = self.q_values[state][action]
        if valid_next_actions:
            max_next_q = max([self.q_values[next_state][a] for a in valid_next_actions])
        else:
            max_next_q = 0
        target = reward + self.gamma * max_next_q
        self.q_values[state][action] += (target - current_q) / count

We develop the UCB agent, which uses confidence bounds to guide its exploration decisions. We watch how it strategically tries less-visited actions while prioritizing those that yield higher rewards. This approach helps us understand a more mathematically grounded exploration strategy.
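The exploration bonus the agent adds to each Q-value follows the standard UCB1 form, c·sqrt(ln N / n_a), where N counts visits to the state and n_a counts how often action a was tried there. A tiny sketch with made-up counts, reusing the math import from the top of the script, shows how rarely tried actions receive a larger bonus:

# Illustrative counts only: UCB scores for two actions in a single state.
c = 2.0
total_visits = 100                     # N: times this state has been visited
for q, n in [(0.5, 50), (0.2, 4)]:     # (estimated value, times the action was tried)
    bonus = c * math.sqrt(math.log(total_visits) / n)
    print(f"q={q}, tried {n}x -> UCB score = {q + bonus:.2f}")
# The rarely tried action (n=4) gets the larger score despite its lower value estimate.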

class MCTSNode:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}
        self.visits = 0
        self.value = 0.0

    def is_fully_expanded(self, valid_actions):
        return len(self.children) == len(valid_actions)

    def best_child(self, c=1.4):
        choices = [(action, child.value / child.visits +
                    c * math.sqrt(2 * math.log(self.visits) / child.visits))
                   for action, child in self.children.items()]
        return max(choices, key=lambda x: x[1])


class MCTSAgent:
    def __init__(self, env, n_simulations=50):
        self.env = env
        self.n_simulations = n_simulations

    def search(self, state):
        root = MCTSNode(state)
        for _ in range(self.n_simulations):
            node = root
            sim_env = GridWorld(size=self.env.size)
            sim_env.grid = self.env.grid.copy()
            sim_env.agent_pos = state
            # Selection: follow the best child while the node is fully expanded
            while node.is_fully_expanded(sim_env.get_valid_actions(node.state)) and node.children:
                action, _ = node.best_child()
                node = node.children[action]
                sim_env.agent_pos = node.state
            # Expansion: add one untried action as a new child
            valid_actions = sim_env.get_valid_actions(node.state)
            if valid_actions and not node.is_fully_expanded(valid_actions):
                untried = [a for a in valid_actions if a not in node.children]
                action = random.choice(untried)
                next_state, _, _ = sim_env.step(action)
                child = MCTSNode(next_state, parent=node)
                node.children[action] = child
                node = child
            # Simulation: random rollout up to a fixed depth
            total_reward = 0
            depth = 0
            while depth < 20:
                valid = sim_env.get_valid_actions(sim_env.agent_pos)
                if not valid:
                    break
                action = random.choice(valid)
                _, reward, done = sim_env.step(action)
                total_reward += reward
                depth += 1
                if done:
                    break
            # Backpropagation: push the rollout reward up the tree
            while node:
                node.visits += 1
                node.value += total_reward
                node = node.parent
        if root.children:
            return max(root.children.items(), key=lambda x: x[1].visits)[0]
        return random.choice(self.env.get_valid_actions(state))

We build the Monte Carlo Tree Search (MCTS) agent to simulate and plan across several possible futures. We see how it builds a search tree, expands promising branches, and backpropagates rollout results to refine its decisions. This allows the agent to plan intelligently before acting.
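To see the planner in isolation, a minimal sketch like the one below, assuming the GridWorld and MCTSAgent classes defined above, asks for a single action from the start state; raising n_simulations makes the choice more reliable at the cost of extra compute:

# Minimal sketch: one planning call from the start state.
planning_env = GridWorld(size=6, n_obstacles=5)
mcts_demo = MCTSAgent(planning_env, n_simulations=30)
state = planning_env.reset()
action = mcts_demo.search(state)   # builds a fresh tree, runs 30 simulations, returns the most-visited root action
print("MCTS chose action index:", action)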

def train_agent(agent, env, episodes=500, max_steps=100, agent_type="standard"):
    rewards_history = []
    for episode in range(episodes):
        state = env.reset()
        total_reward = 0
        for step in range(max_steps):
            valid_actions = env.get_valid_actions(state)
            if agent_type == "mcts":
                action = agent.search(state)
            else:
                action = agent.get_action(state, valid_actions)
            next_state, reward, done = env.step(action)
            total_reward += reward
            if agent_type != "mcts":
                valid_next = env.get_valid_actions(next_state)
                agent.update(state, action, reward, next_state, valid_next)
            state = next_state
            if done:
                break
        rewards_history.append(total_reward)
        if hasattr(agent, 'decay_epsilon'):
            agent.decay_epsilon()
        if (episode + 1) % 100 == 0:
            avg_reward = np.mean(rewards_history[-100:])
            print(f"Episode {episode+1}/{episodes}, Avg Reward: {avg_reward:.2f}")
    return rewards_history


if __name__ == "__main__":
    print("=" * 70)
    print("Problem Solving via Exploration Agents Tutorial")
    print("=" * 70)
    env = GridWorld(size=8, n_obstacles=10)
    agents_config = {
        'Q-Learning (ε-greedy)': (QLearningAgent(), 'standard'),
        'UCB Agent': (UCBAgent(), 'standard'),
        'MCTS Agent': (MCTSAgent(env, n_simulations=30), 'mcts')
    }
    results = {}
    for name, (agent, agent_type) in agents_config.items():
        print(f"\nTraining {name}...")
        # The MCTS agent plans on the grid stored in its self.env, so it trains on that
        # same environment; the other agents each get a fresh random grid.
        train_env = env if agent_type == 'mcts' else GridWorld(size=8, n_obstacles=10)
        rewards = train_agent(agent, train_env, episodes=300, agent_type=agent_type)
        results[name] = rewards
    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    for name, rewards in results.items():
        smoothed = np.convolve(rewards, np.ones(20)/20, mode='valid')
        plt.plot(smoothed, label=name, linewidth=2)
    plt.xlabel('Episode')
    plt.ylabel('Reward (smoothed)')
    plt.title('Agent Performance Comparison')
    plt.legend()
    plt.grid(alpha=0.3)
    plt.subplot(1, 2, 2)
    for name, rewards in results.items():
        avg_last_100 = np.mean(rewards[-100:])
        plt.bar(name, avg_last_100, alpha=0.7)
    plt.ylabel('Average Reward (Last 100 Episodes)')
    plt.title('Final Performance')
    plt.xticks(rotation=15, ha='right')
    plt.grid(axis='y', alpha=0.3)
    plt.tight_layout()
    plt.show()
    print("=" * 70)
    print("Tutorial Complete!")
    print("Key Concepts Demonstrated:")
    print("1. Epsilon-Greedy exploration")
    print("2. UCB strategy")
    print("3. MCTS-based planning")
    print("=" * 70)

We train all three agents in our grid world and visualize their learning progress and performance. We analyze how each strategy, Q-Learning, UCB, and MCTS, adapts to the environment over time. Finally, we compare the results and gain insight into which exploration approach leads to faster, more reliable problem-solving.
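Beyond the reward curves, one optional follow-up is to roll a trained agent out greedily and check whether it actually reaches the goal. The helper below is a sketch under the assumption that it receives the epsilon-greedy QLearningAgent trained above; q_agent is a hypothetical placeholder name for that trained agent:

# Sketch: greedy rollout of a trained Q-Learning agent (epsilon temporarily set to 0).
def greedy_rollout(agent, env, max_steps=100):
    saved_epsilon, agent.epsilon = agent.epsilon, 0.0   # act purely greedily
    state, done, steps = env.reset(), False, 0
    while not done and steps < max_steps:
        valid = env.get_valid_actions(state)
        if not valid:
            break
        state, _, done = env.step(agent.get_action(state, valid))
        steps += 1
    agent.epsilon = saved_epsilon                        # restore the exploration setting
    return done, steps

# Hypothetical usage: reached, n_steps = greedy_rollout(q_agent, GridWorld(size=8, n_obstacles=10))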

In conclusion, we implemented and compared three exploration-driven agents, each demonstrating a distinct strategy for solving the same navigation challenge. We observe how epsilon-greedy enables gradual learning through randomness, UCB balances confidence with curiosity, and MCTS leverages simulated rollouts for foresight and planning. This exercise helps us appreciate how different exploration mechanisms affect convergence, adaptability, and efficiency in reinforcement learning.



