An Implementation of a Comprehensive Empirical Framework for Benchmarking Reasoning Strategies in Modern Agentic AI Systems
In this tutorial, we take a deep dive into how we systematically benchmark agentic components by evaluating multiple reasoning strategies across diverse tasks. We explore how different architectures, such as Direct, Chain-of-Thought, ReAct, and Reflexion, behave when faced with problems of increasing difficulty, and we quantify their accuracy, efficiency, latency, and tool-usage patterns. By conducting controlled empirical studies, we gain a clearer understanding of why certain agentic strategies succeed, where they fail, and how they trade off speed for depth of reasoning.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Callable, Tuple
from dataclasses import dataclass
from enum import Enum
import time
from collections import defaultdict


class ReasoningStrategy(Enum):
    DIRECT = "direct"
    CHAIN_OF_THOUGHT = "chain_of_thought"
    REACT = "react"
    REFLEXION = "reflexion"


@dataclass
class AgentResponse:
    answer: str
    steps: int
    time_taken: float
    tool_calls: int
    confidence: float


class BaseAgent:
    def __init__(self, strategy: ReasoningStrategy):
        self.strategy = strategy
        self.tool_count = 0

    def solve(self, problem: str) -> AgentResponse:
        # perf_counter is a high-resolution clock, so very short calls
        # do not get recorded as a latency of exactly zero
        start_time = time.perf_counter()
        if self.strategy == ReasoningStrategy.DIRECT:
            answer, steps, tools = self._direct_solve(problem)
        elif self.strategy == ReasoningStrategy.CHAIN_OF_THOUGHT:
            answer, steps, tools = self._cot_solve(problem)
        elif self.strategy == ReasoningStrategy.REACT:
            answer, steps, tools = self._react_solve(problem)
        else:
            answer, steps, tools = self._reflexion_solve(problem)
        time_taken = time.perf_counter() - start_time
        confidence = self._calculate_confidence(problem, answer)
        return AgentResponse(answer, steps, time_taken, tools, confidence)
We set up the foundation of our benchmarking framework by importing essential libraries and defining the core agent architectures. We define the different reasoning strategies and construct the BaseAgent class, giving ourselves a flexible structure to simulate diverse agentic behaviors. Through this setup, we establish a unified interface that all agents follow during evaluation.
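One way to picture the unified interface is a table that maps each strategy to a solver function, with a single wrapper that times the call and packages the result. The sketch below is a simplified stand-in rather than the tutorial's BaseAgent: the names `Strategy`, `Response`, `SOLVERS`, and `run` are hypothetical, and the solvers return placeholder answers.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Tuple


class Strategy(Enum):  # hypothetical, trimmed-down stand-in for ReasoningStrategy
    DIRECT = "direct"
    REACT = "react"


@dataclass
class Response:  # hypothetical, trimmed-down stand-in for AgentResponse
    answer: str
    steps: int
    time_taken: float


def direct_solver(problem: str) -> Tuple[str, int]:
    # Placeholder: answers in a single step
    return f"answer to {problem}", 1


def react_solver(problem: str) -> Tuple[str, int]:
    # Placeholder: pretends to take four reason/act steps
    return f"answer to {problem}", 4


# Every solver shares one signature, so callers never branch on the strategy
SOLVERS: Dict[Strategy, Callable[[str], Tuple[str, int]]] = {
    Strategy.DIRECT: direct_solver,
    Strategy.REACT: react_solver,
}


def run(strategy: Strategy, problem: str) -> Response:
    # Time the solver call and wrap its output in a uniform response object
    start = time.perf_counter()
    answer, steps = SOLVERS[strategy](problem)
    return Response(answer, steps, time.perf_counter() - start)


for strategy in Strategy:
    result = run(strategy, "Math_Problem_1")
    print(strategy.value, result.steps)
```

Whether the dispatch is an if/elif chain or a lookup table, the point is the same: every strategy returns the same response shape, so the benchmark can treat all agents identically.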
    def _direct_solve(self, problem: str) -> Tuple[str, int, int]:
        # One step, no tools
        answer = self._compute_answer(problem)
        return answer, 1, 0

    def _cot_solve(self, problem: str) -> Tuple[str, int, int]:
        # Longer problem statements get more reasoning steps
        steps = 3 + len(problem.split()) // 5
        for i in range(steps):
            _ = self._reason_step(problem, i)
        answer = self._compute_answer(problem)
        return answer, steps, 0

    def _react_solve(self, problem: str) -> Tuple[str, int, int]:
        steps = 4
        tool_calls = 2
        for i in range(steps):
            _ = self._reason_step(problem, i)
            # Interleave a tool call on every other step (i = 0 and i = 2)
            if i % 2 == 0:
                self._use_tool(problem)
        answer = self._compute_answer(problem)
        return answer, steps, tool_calls

    def _reflexion_solve(self, problem: str) -> Tuple[str, int, int]:
        steps = 6
        tool_calls = 1
        # Answer once, reflect on it, then refine
        initial_answer = self._compute_answer(problem)
        reflection = self._reflect(problem, initial_answer)
        answer = self._refine(problem, initial_answer, reflection)
        return answer, steps, tool_calls

    def _reason_step(self, problem: str, step: int) -> str:
        return f"Analyzing aspect {step+1}"

    def _use_tool(self, problem: str):
        self.tool_count += 1
        time.sleep(0.001)  # simulated tool latency

    def _compute_answer(self, problem: str) -> str:
        return f"Solution_{hash(problem) % 100}"

    def _reflect(self, problem: str, answer: str) -> str:
        return "Reflection on approach"

    def _refine(self, problem: str, answer: str, reflection: str) -> str:
        return f"Refined_{answer}"

    def _calculate_confidence(self, problem: str, answer: str) -> float:
        base_confidence = 0.7
        strategy_bonus = {
            ReasoningStrategy.DIRECT: 0.0,
            ReasoningStrategy.CHAIN_OF_THOUGHT: 0.1,
            ReasoningStrategy.REACT: 0.15,
            ReasoningStrategy.REFLEXION: 0.2
        }
        # Base confidence plus a per-strategy bonus and uniform noise, capped at 1.0
        return min(1.0, base_confidence + strategy_bonus[self.strategy] + np.random.uniform(-0.1, 0.1))
We implement how each reasoning strategy behaves internally, including direct answering, chain-of-thought reasoning, ReAct-style interleaving, and Reflexion-based refinement. We simulate reasoning steps, tool usage, and confidence estimation to capture realistic agent behavior patterns. Here, we shape the dynamic character of each agentic strategy we benchmark.
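To make the simulated confidence concrete, the sketch below works out the range each strategy can land in. The constants are copied from `_calculate_confidence` above; the helper name `confidence_range` and the rounding to two decimals are assumptions added for illustration.

```python
# Constants copied from _calculate_confidence in the snippet above
BASE_CONFIDENCE = 0.7
NOISE = 0.1  # half-width of the uniform noise term
STRATEGY_BONUS = {
    "direct": 0.0,
    "chain_of_thought": 0.1,
    "react": 0.15,
    "reflexion": 0.2,
}


def confidence_range(strategy: str) -> tuple:
    """Return the (lowest, highest) confidence a strategy can be assigned."""
    center = BASE_CONFIDENCE + STRATEGY_BONUS[strategy]
    low = round(center - NOISE, 2)
    high = round(min(1.0, center + NOISE), 2)  # same 1.0 cap as the agent
    return low, high


for name in STRATEGY_BONUS:
    print(name, confidence_range(name))
```

The ranges overlap (Direct spans roughly 0.6 to 0.8, Reflexion roughly 0.8 to 1.0), so on any single task a simpler strategy can still score close to a more elaborate one. The ordering between strategies comes entirely from the fixed bonus, not from anything the simulated reasoning steps compute.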
class BenchmarkTask:
    def __init__(self, name: str, difficulty: float, ground_truth: str):
        self.name = name
        self.difficulty = difficulty
        self.ground_truth = ground_truth

    def evaluate(self, response: AgentResponse) -> Dict[str, float]:
        # Simulated accuracy: confidence discounted by task difficulty
        accuracy = response.confidence * (1 - self.difficulty * 0.3)
        return {
            'accuracy': accuracy,
            'efficiency': 1.0 / (response.steps + 1),
            'latency': response.time_taken,
            'tool_efficiency': 1.0 / (response.tool_calls + 1)
        }


class BenchmarkSuite:
    def __init__(self):
        self.tasks = self._create_tasks()

    def _create_tasks(self) -> List[BenchmarkTask]:
        tasks = []
        # (task type, base difficulty)
        task_types = [
            ("Math_Problem", 0.3),
            ("Logic_Puzzle", 0.5),
            ("Code_Debug", 0.6),
            ("Complex_Reasoning", 0.8),
            ("Multi_Step_Planning", 0.7)
        ]
        for i, (task_type, difficulty) in enumerate(task_types):
            # Three variants per task type, each with a slightly jittered difficulty
            for j in range(3):
                task = BenchmarkTask(
                    name=f"{task_type}_{j+1}",
                    difficulty=difficulty + np.random.uniform(-0.1, 0.1),
                    ground_truth=f"GT_{i}_{j}"
                )
                tasks.append(task)
        return tasks

    def run_benchmark(self, agents: List[BaseAgent]) -> pd.DataFrame:
        results = []
        # Every agent solves every task; one row per (agent, task) pair
        for agent in agents:
            for task in self.tasks:
                response = agent.solve(task.name)
                metrics = task.evaluate(response)
                results.append({
                    'strategy': agent.strategy.value,
                    'task': task.name,
                    'difficulty': task.difficulty,
                    'accuracy': metrics['accuracy'],
                    'efficiency': metrics['efficiency'],
                    'latency': metrics['latency'],
                    'tool_efficiency': metrics['tool_efficiency'],
                    'steps': response.steps,
                    'tool_calls': response.tool_calls
                })
        return pd.DataFrame(results)
We build the complete benchmark suite that generates tasks, executes them across multiple agents, and collects standardized results. We design varied task types and difficulty levels to observe how each reasoning strategy adapts under pressure. This snippet gives us a systematic evaluation pipeline; because task difficulties and confidence scores are drawn with np.random, seed NumPy beforehand if runs need to be reproducible.
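The scoring formulas in BenchmarkTask are easier to judge with numbers plugged in. The sketch below restates them as a standalone function and evaluates one response; the function name `score_response` and the input values are hypothetical, chosen only to illustrate the arithmetic.

```python
def score_response(confidence: float, difficulty: float, steps: int,
                   tool_calls: int, time_taken: float) -> dict:
    """Standalone restatement of the metric formulas used by BenchmarkTask."""
    return {
        # Confidence is discounted by up to 30% as difficulty approaches 1.0
        "accuracy": confidence * (1 - difficulty * 0.3),
        # Fewer steps and fewer tool calls both push these toward 1.0
        "efficiency": 1.0 / (steps + 1),
        "latency": time_taken,
        "tool_efficiency": 1.0 / (tool_calls + 1),
    }


# Hypothetical ReAct-like response on a medium-difficulty task
metrics = score_response(confidence=0.9, difficulty=0.5, steps=4,
                         tool_calls=2, time_taken=0.002)
for key, value in metrics.items():
    print(f"{key}: {value:.3f}")
```

With these inputs the accuracy works out to 0.9 × 0.85 = 0.765, efficiency to 1/5, and tool efficiency to 1/3. Note that in the code above the task's ground_truth is stored but never compared with the agent's answer, so "accuracy" here is a proxy derived from simulated confidence and difficulty rather than a measured hit rate.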
def analyze_results(df: pd.DataFrame):
    # Per-strategy summary statistics
    agg_metrics = df.groupby('strategy').agg({
        'accuracy': ['mean', 'std'],
        'efficiency': ['mean', 'std'],
        'latency': ['mean', 'std'],
        'steps': 'mean',
        'tool_calls': 'mean'
    }).round(3)
    print(agg_metrics)

    # Mean accuracy per strategy within three equal-width difficulty bins
    diff_bins = pd.cut(df['difficulty'], bins=3, labels=['Easy', 'Medium', 'Hard'])
    diff_analysis = df.groupby(['strategy', diff_bins])['accuracy'].mean().unstack()
    print(diff_analysis.round(3))

    # Composite score: accuracy per unit of (steps x latency)
    tradeoff = df.groupby('strategy').agg({
        'accuracy': 'mean',
        'steps': 'mean',
        'latency': 'mean'
    })
    tradeoff['score'] = (tradeoff['accuracy'] / (tradeoff['steps'] * tradeoff['latency'])).round(3)
    print(tradeoff.round(3))


def visualize_results(df: pd.DataFrame):
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # Top left: mean accuracy per strategy with standard-deviation error bars
    sns.barplot(data=df, x='strategy', y='accuracy', ax=axes[0, 0], errorbar='sd')
    axes[0, 0].set_title('Accuracy by Strategy')
    axes[0, 0].tick_params(axis='x', rotation=45)

    # Top right: steps vs accuracy, one point per task
    for strategy in df['strategy'].unique():
        strategy_df = df[df['strategy'] == strategy]
        axes[0, 1].scatter(strategy_df['steps'], strategy_df['accuracy'], label=strategy, alpha=0.6, s=50)
    axes[0, 1].set_title('Steps vs Accuracy')
    axes[0, 1].legend()

    # Bottom left: accuracy distribution per difficulty bin
    difficulty_bins = pd.cut(df['difficulty'], bins=3, labels=['Easy', 'Medium', 'Hard'])
    df_plot = df.copy()
    df_plot['difficulty_bin'] = difficulty_bins
    sns.boxplot(data=df_plot, x='difficulty_bin', y='accuracy', hue='strategy', ax=axes[1, 0])
    axes[1, 0].set_title('Performance vs Difficulty')

    # Bottom right: composite efficiency score per strategy
    scores = df.groupby('strategy').apply(
        lambda x: x['accuracy'].mean() / (x['steps'].mean() * x['latency'].mean())
    ).sort_values()
    axes[1, 1].barh(range(len(scores)), scores.values)
    axes[1, 1].set_yticks(range(len(scores)))
    axes[1, 1].set_yticklabels(scores.index)
    axes[1, 1].set_title('Overall Efficiency Score')

    plt.tight_layout()
    plt.show()
We perform detailed analysis and visualization to understand how strategies differ across metrics like accuracy, efficiency, and latency. We aggregate results, compare performance across difficulty levels, and visualize trade-offs to uncover deeper insights. This step lets us interpret the results rather than just compute them.
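The composite score used in the analysis divides mean accuracy by the product of mean steps and mean latency. The sketch below reproduces that ratio on made-up numbers to show how it behaves; the rows are hypothetical rather than benchmark output, and the `min_latency` floor is an added assumption that guards against a measured latency of zero.

```python
def tradeoff_score(accuracy: float, steps: float, latency: float,
                   min_latency: float = 1e-9) -> float:
    """Accuracy per unit of (steps x latency), the same ratio as the analysis.

    min_latency is a hypothetical floor, not part of the tutorial code,
    that keeps the score finite when latency is measured as zero.
    """
    return accuracy / (steps * max(latency, min_latency))


# Hypothetical per-strategy means (latency in seconds)
rows = {
    "direct": {"accuracy": 0.60, "steps": 1, "latency": 0.001},
    "react": {"accuracy": 0.72, "steps": 4, "latency": 0.004},
}

scores = {name: tradeoff_score(**row) for name, row in rows.items()}
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(name, round(score, 1))
```

In this example the direct row scores roughly 600 against roughly 45 for the ReAct-like row, even though the latter is more accurate. Because latency is measured in seconds and sits in the denominator, small absolute timing differences can dominate the ranking, which is worth keeping in mind when reading the "Overall Efficiency Score" panel.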
if __name__ == "__main__":
    # One agent per reasoning strategy
    agents = [
        BaseAgent(ReasoningStrategy.DIRECT),
        BaseAgent(ReasoningStrategy.CHAIN_OF_THOUGHT),
        BaseAgent(ReasoningStrategy.REACT),
        BaseAgent(ReasoningStrategy.REFLEXION)
    ]
    suite = BenchmarkSuite()
    results_df = suite.run_benchmark(agents)
    analyze_results(results_df)
    visualize_results(results_df)
    print("1. Advanced strategies achieve higher accuracy but require more steps")
    print("2. Chain-of-thought balances accuracy and efficiency")
    print("3. Direct is fastest but less reliable on hard tasks")
    print("4. All strategies degrade on harder tasks but advanced ones degrade slowly")
We bring everything together by running the benchmark suite on all agents and printing the key findings. We execute the analysis pipeline, visualize comparative results, and interpret how strategies behave under identical conditions. This snippet completes the loop, allowing us to observe empirical patterns and draw meaningful conclusions.
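Since both the task difficulties and the confidence scores come from `np.random.uniform`, two runs of the script will generally produce different numbers. A minimal sketch of making the draws repeatable by seeding NumPy's global random generator is shown below; the helper `sample_difficulties` and the seed value 42 are hypothetical and only illustrate the effect.

```python
import numpy as np


def sample_difficulties(seed: int, base: float = 0.5, count: int = 3) -> list:
    """Draw jittered difficulties the same way the suite does, after seeding."""
    # Seeds NumPy's global generator, which np.random.uniform draws from
    np.random.seed(seed)
    return [base + np.random.uniform(-0.1, 0.1) for _ in range(count)]


first = sample_difficulties(42)
second = sample_difficulties(42)
print(first == second)  # True: same seed, same draws
```

In the tutorial script, the equivalent would be a single `np.random.seed(...)` call before `BenchmarkSuite()` is constructed. Measured latency would still vary between runs, since it depends on wall-clock timing rather than on the random generator.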
In conclusion, we observe how different agentic reasoning paradigms perform when subjected to identical benchmark conditions, and we gain practical insight into how these strategies scale with increasing complexity. As we analyze patterns in accuracy, step count, latency, and tool efficiency, we recognize how advanced strategies succeed through deeper reasoning while incurring computational overhead. We now have a structured empirical framework that helps us compare, debug, and optimize agentic behaviors, allowing us to build more capable, data-driven agentic systems.
The post An Implementation of a Comprehensive Empirical Framework for Benchmarking Reasoning Strategies in Modern Agentic AI Systems appeared first on MarkTechPost.
