
How to Design a Mini Reinforcement Learning Environment-Acting Agent with Intelligent Local Feedback, Adaptive Decision-Making, and Multi-Agent Coordination


In this tutorial, we code a mini reinforcement learning setup in which a multi-agent system learns to navigate a grid world through interaction, feedback, and layered decision-making. We build everything from scratch and bring together three agent roles: an Action Agent, a Tool Agent, and a Supervisor, so we can observe how simple heuristics, analysis, and oversight combine to produce more intelligent behavior. We also watch the agents collaborate, refine their strategies, and gradually learn to reach the goal while overcoming obstacles and uncertainty. Check out the FULL CODES here.

import numpy as np
import matplotlib.pyplot as plt
from IPython.display import clear_output
import time
from collections import defaultdict


class GridWorld:
    def __init__(self, size=8):
        self.size = size
        self.agent_pos = [0, 0]
        self.goal_pos = [size - 1, size - 1]
        self.obstacles = self._generate_obstacles()
        self.visited = set()
        self.step_count = 0
        self.max_steps = size * size * 2

    def _generate_obstacles(self):
        # Place `size` obstacles, never on the start or goal corners.
        obstacles = set()
        n_obstacles = self.size
        while len(obstacles) < n_obstacles:
            pos = (np.random.randint(1, self.size - 1),
                   np.random.randint(1, self.size - 1))
            if pos != (0, 0) and pos != (self.size - 1, self.size - 1):
                obstacles.add(pos)
        return obstacles

    def reset(self):
        self.agent_pos = [0, 0]
        self.visited = {tuple(self.agent_pos)}
        self.step_count = 0
        return self._get_state()

    def _get_state(self):
        return {
            'position': tuple(self.agent_pos),
            'goal': self.goal_pos,
            'distance_to_goal': abs(self.agent_pos[0] - self.goal_pos[0]) +
                                abs(self.agent_pos[1] - self.goal_pos[1]),
            'visited_count': len(self.visited),
            'steps': self.step_count,
            'can_move': self._get_valid_actions()
        }

    def _get_valid_actions(self):
        valid = []
        moves = {'up': [-1, 0], 'down': [1, 0], 'left': [0, -1], 'right': [0, 1]}
        for action, delta in moves.items():
            new_pos = [self.agent_pos[0] + delta[0], self.agent_pos[1] + delta[1]]
            if (0 <= new_pos[0] < self.size and 0 <= new_pos[1] < self.size and
                    tuple(new_pos) not in self.obstacles):
                valid.append(action)
        return valid

We set up the complete GridWorld environment and define how the agent, goal, and obstacles exist in it. We establish the structure for state representation and valid actions, and we prepare the environment so we can interact with it dynamically. As we run this part, we see the world taking shape and becoming ready for the agents to explore. Check out the FULL CODES here.
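As a quick optional sanity check (not part of the original tutorial), we can instantiate the environment and inspect the state dictionary it returns; this sketch assumes the GridWorld cell above has already run, and the random seed is only an assumption to make the obstacle layout repeatable.

# Optional sanity check: create the environment and inspect its initial state.
np.random.seed(0)  # assumption: seeded only so the obstacle layout is reproducible

env = GridWorld(size=8)
state = env.reset()

print("Start position:", state['position'])            # (0, 0)
print("Goal position:", state['goal'])                 # [7, 7]
print("Manhattan distance to goal:", state['distance_to_goal'])
print("Valid first moves:", state['can_move'])          # e.g. ['down', 'right']
print("Number of obstacles:", len(env.obstacles))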

class GridWorld(GridWorld):
    def step(self, action):
        self.step_count += 1
        moves = {'up': [-1, 0], 'down': [1, 0], 'left': [0, -1], 'right': [0, 1]}

        if action not in moves:
            return self._get_state(), -1, False, "Invalid action"

        delta = moves[action]
        new_pos = [self.agent_pos[0] + delta[0], self.agent_pos[1] + delta[1]]

        if not (0 <= new_pos[0] < self.size and 0 <= new_pos[1] < self.size):
            return self._get_state(), -1, False, "Hit wall"

        if tuple(new_pos) in self.obstacles:
            return self._get_state(), -1, False, "Hit obstacle"

        self.agent_pos = new_pos
        pos_tuple = tuple(self.agent_pos)
        reward = -0.1  # small per-step penalty
        if pos_tuple not in self.visited:
            reward += 0.5  # bonus for visiting a new cell
            self.visited.add(pos_tuple)

        done = False
        info = "Moved"
        if self.agent_pos == self.goal_pos:
            reward += 10
            done = True
            info = "Goal reached!"
        elif self.step_count >= self.max_steps:
            done = True
            info = "Max steps reached"

        return self._get_state(), reward, done, info

    def render(self, agent_thoughts=None):
        grid = np.zeros((self.size, self.size, 3))
        for pos in self.visited:
            grid[pos[0], pos[1]] = [0.7, 0.9, 1.0]
        for obs in self.obstacles:
            grid[obs[0], obs[1]] = [0.2, 0.2, 0.2]
        grid[self.goal_pos[0], self.goal_pos[1]] = [0, 1, 0]
        grid[self.agent_pos[0], self.agent_pos[1]] = [1, 0, 0]

        plt.figure(figsize=(10, 8))
        plt.imshow(grid, interpolation='nearest')
        plt.title(f"Step: {self.step_count} | Visited: {len(self.visited)}/{self.size * self.size}")
        for i in range(self.size + 1):
            plt.axhline(i - 0.5, color='gray', linewidth=0.5)
            plt.axvline(i - 0.5, color='gray', linewidth=0.5)
        if agent_thoughts:
            plt.text(0.5, -1.5, agent_thoughts, ha='center', fontsize=9,
                     bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8),
                     wrap=True, transform=plt.gca().transData)
        plt.axis('off')
        plt.tight_layout()
        plt.show()

We define how each step in the environment works and how the world is visually rendered. We calculate rewards, detect collisions, track progress, and display everything through a clean grid visualization. As we execute this logic, we watch the agent's journey unfold in real time with clear feedback. Check out the FULL CODES here.
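Before wiring in the agents, it can help to drive the environment by hand and watch the (state, reward, done, info) tuples that step() returns. The short sketch below takes a few random valid moves without rendering; the seed is an assumption added only for a repeatable demo.

# Optional: drive the environment with random valid actions to see step() outputs.
np.random.seed(1)  # assumption: seeded only for a repeatable demo

env = GridWorld(size=8)
state = env.reset()

for _ in range(5):
    valid = state['can_move']
    if not valid:
        break
    action = np.random.choice(valid)
    state, reward, done, info = env.step(action)
    print(f"action={action:<5} reward={reward:+.1f} done={done} info={info}")
    if done:
        break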

class ActionAgent:
    def __init__(self):
        self.q_values = defaultdict(lambda: defaultdict(float))
        self.epsilon = 0.3        # exploration probability
        self.learning_rate = 0.1
        self.discount = 0.95

    def choose_action(self, state):
        valid_actions = state['can_move']
        if not valid_actions:
            return None
        pos = state['position']
        if np.random.random() < self.epsilon:
            action = np.random.choice(valid_actions)
            reasoning = f"Exploring randomly: chose '{action}'"
        else:
            action_values = {a: self.q_values[pos][a] for a in valid_actions}
            action = max(action_values, key=action_values.get)
            reasoning = f"Exploiting: chose '{action}' (Q={self.q_values[pos][action]:.2f})"
        return action, reasoning

    def learn(self, state, action, reward, next_state):
        # Standard tabular Q-learning update.
        pos = state['position']
        next_pos = next_state['position']
        current_q = self.q_values[pos][action]
        next_max_q = max([self.q_values[next_pos][a] for a in next_state['can_move']], default=0)
        new_q = current_q + self.learning_rate * (
            reward + self.discount * next_max_q - current_q)
        self.q_values[pos][action] = new_q


class ToolAgent:
    def analyze(self, state, action_taken, reward, history):
        suggestions = []
        distance = state['distance_to_goal']
        if distance <= 3:
            suggestions.append("🎯 Very close to goal! Prioritize direct path.")
        exploration_rate = state['visited_count'] / (state['steps'] + 1)
        if exploration_rate < 0.5 and distance > 5:
            suggestions.append("🔍 Low exploration rate. Consider exploring more.")
        if len(history) >= 5:
            recent_rewards = [h[2] for h in history[-5:]]
            avg_reward = np.mean(recent_rewards)
            if avg_reward < -0.5:
                suggestions.append("⚠ Negative reward trend. Try a different strategy.")
            elif avg_reward > 0.3:
                suggestions.append("✅ Good progress! Current strategy is working.")
        if len(state['can_move']) <= 2:
            suggestions.append("🚧 Limited movement options. Be careful.")
        return suggestions

We implement the Action Agent and the Tool Agent, giving the system both learning capability and analytical feedback. We observe how the Action Agent chooses actions through a balance of exploration and exploitation, while the Tool Agent evaluates performance and suggests improvements. Together, they create a learning loop that evolves with experience. Check out the FULL CODES here.
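To make the update rule concrete: the Action Agent applies the tabular Q-learning rule Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)). Below is a minimal hand-driven sketch of one choose/learn cycle outside the main training loop, assuming the classes above are defined; it is illustrative only.

# Minimal sketch: one decision and one Q-update, outside the training loop.
env = GridWorld(size=8)
agent = ActionAgent()
tool = ToolAgent()

state = env.reset()
action, reasoning = agent.choose_action(state)   # valid moves exist at the start cell
print("Proposed:", action, "|", reasoning)

next_state, reward, done, info = env.step(action)
agent.learn(state, action, reward, next_state)

# The updated Q-value for the start cell and the chosen action:
print("Q[(0, 0)][%s] = %.3f" % (action, agent.q_values[(0, 0)][action]))

# The Tool Agent's suggestions for this single transition (history is empty here):
print(tool.analyze(next_state, action, reward, []))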

class SupervisorAgent:
    def decide(self, state, proposed_action, tool_suggestions):
        if not proposed_action:
            return None, "No valid actions available"

        decision = proposed_action
        reasoning = f"Approved action '{proposed_action}'"

        for suggestion in tool_suggestions:
            if "goal" in suggestion.lower() and "close" in suggestion.lower():
                goal_direction = self._get_goal_direction(state)
                if goal_direction in state['can_move']:
                    decision = goal_direction
                    reasoning = f"Override: Moving '{goal_direction}' toward goal"
                    break

        return decision, reasoning

    def _get_goal_direction(self, state):
        pos = state['position']
        goal = state['goal']
        if goal[0] > pos[0]:
            return 'down'
        elif goal[0] < pos[0]:
            return 'up'
        elif goal[1] > pos[1]:
            return 'right'
        else:
            return 'left'

We introduce the Supervisor Agent, which acts as the final decision-maker in the system. We see how it interprets suggestions, overrides risky choices, and ensures that actions stay aligned with the overall goal. As we use this component, we experience a coordinated multi-agent decision flow. Check out the FULL CODES here.
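To see the override in action, we can feed the Supervisor a hand-crafted state that sits close to the goal, together with a proximity suggestion of the kind the Tool Agent emits. This is a small illustrative sketch; the state dictionary below is fabricated for the example and is not produced by the environment.

# Sketch: the Supervisor overriding a proposed action when the goal is close.
supervisor = SupervisorAgent()

# A fabricated state two cells away from the goal of an 8x8 grid (assumption).
near_goal_state = {
    'position': (7, 5),
    'goal': [7, 7],
    'distance_to_goal': 2,
    'visited_count': 20,
    'steps': 30,
    'can_move': ['up', 'left', 'right'],
}
suggestions = ["🎯 Very close to goal! Prioritize direct path."]

decision, reasoning = supervisor.decide(near_goal_state, 'up', suggestions)
print(decision, "|", reasoning)   # expected: right | Override: Moving 'right' toward goal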

def train_multi_agent(episodes=5, visualize=True):
    env = GridWorld(size=8)
    action_agent = ActionAgent()
    tool_agent = ToolAgent()
    supervisor = SupervisorAgent()

    episode_rewards = []
    episode_steps = []

    for episode in range(episodes):
        state = env.reset()
        total_reward = 0
        done = False
        history = []

        print(f"\n{'='*60}")
        print(f"EPISODE {episode + 1}/{episodes}")
        print(f"{'='*60}")

        while not done:
            action_result = action_agent.choose_action(state)
            if action_result is None:
                break
            proposed_action, action_reasoning = action_result

            suggestions = tool_agent.analyze(state, proposed_action, total_reward, history)
            final_action, supervisor_reasoning = supervisor.decide(state, proposed_action, suggestions)

            if final_action is None:
                break

            next_state, reward, done, info = env.step(final_action)
            total_reward += reward
            action_agent.learn(state, final_action, reward, next_state)
            history.append((state, final_action, reward, next_state))

            if visualize:
                clear_output(wait=True)
                thoughts = (f"Action Agent: {action_reasoning}\n"
                            f"Supervisor: {supervisor_reasoning}\n"
                            f"Tool Agent: {', '.join(suggestions[:2]) if suggestions else 'No suggestions'}\n"
                            f"Reward: {reward:.2f} | Total: {total_reward:.2f}")
                env.render(thoughts)
                time.sleep(0.3)

            state = next_state

        episode_rewards.append(total_reward)
        episode_steps.append(env.step_count)

        print(f"\nEpisode {episode+1} Complete!")
        print(f"Total Reward: {total_reward:.2f}")
        print(f"Steps Taken: {env.step_count}")
        print(f"Cells Visited: {len(env.visited)}/{env.size**2}")

    plt.figure(figsize=(12, 4))
    plt.subplot(1, 2, 1)
    plt.plot(episode_rewards, marker='o')
    plt.title('Episode Rewards')
    plt.xlabel('Episode')
    plt.ylabel('Total Reward')
    plt.grid(True, alpha=0.3)

    plt.subplot(1, 2, 2)
    plt.plot(episode_steps, marker='s', color='orange')
    plt.title('Episode Steps')
    plt.xlabel('Episode')
    plt.ylabel('Steps to Complete')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

    return action_agent, tool_agent, supervisor


if __name__ == "__main__":
    print("🤖 Multi-Agent RL System: Grid World Navigation")
    print("=" * 60)
    print("Components:")
    print("  • Action Agent: Proposes actions using Q-learning")
    print("  • Tool Agent: Analyzes performance and suggests improvements")
    print("  • Supervisor Agent: Makes final decisions")
    print("=" * 60)

    trained_agents = train_multi_agent(episodes=5, visualize=True)

We run the complete training loop in which all agents collaborate inside the environment across multiple episodes. We track rewards, observe movement patterns, and visualize learning progress with each trial. As we complete this loop, we see the multi-agent system improving and becoming more efficient at navigating the grid world.
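For longer experiments, we can skip the per-step rendering and run more episodes headlessly, then peek at what the Action Agent has learned. This is an optional variation on the call above; the episode count and the Q-table inspection are assumptions for illustration.

# Optional: train for more episodes without per-step rendering, then inspect the Q-table.
action_agent, tool_agent, supervisor = train_multi_agent(episodes=20, visualize=False)

# Greedy action learned for the start cell, based on the current Q-values.
start_q = action_agent.q_values[(0, 0)]
if start_q:
    best = max(start_q, key=start_q.get)
    print(f"Learned preference at (0, 0): '{best}' (Q={start_q[best]:.2f})")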

In conclusion, we see how a multi-agent RL system emerges from simple components and how each layer contributes to smarter navigation: the Action Agent learns through Q-updates, the Tool Agent guides improvements, and the Supervisor ensures safe, goal-oriented action selection. We appreciate how this simple yet dynamic grid world helps us visualize learning, exploration, and decision-making in real time.



