
A Coding Implementation of a Comprehensive Enterprise AI Benchmarking Framework to Evaluate Rule-Based, LLM, and Hybrid Agentic AI Systems Across Real-World Tasks

In this tutorial, we develop a complete benchmarking framework for evaluating different kinds of agentic AI systems on real-world enterprise software tasks. We design a suite of diverse challenges, from data transformation and API integration to workflow automation and performance optimization, and assess how various agents, including rule-based, LLM-powered, and hybrid ones, perform across these domains. By running structured benchmarks and visualizing key performance metrics, such as accuracy, execution time, and success rate, we gain a deeper understanding of each agent's strengths and trade-offs in enterprise environments.

import json
import time
import random
from typing import Dict, List, Any, Callable
from dataclasses import dataclass, asdict
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


@dataclass
class Task:
   id: str
   name: str
   description: str
   category: str
   complexity: int
   expected_output: Any


@dataclass
class BenchmarkResult:
   task_id: str
   agent_name: str
   success: bool
   execution_time: float
   accuracy: float
   error_message: str = ""


class EnterpriseTaskSuite:
   def __init__(self):
       self.tasks = self._create_tasks()


   def _create_tasks(self) -> List[Task]:
       return [
           Task("data_transform", "CSV Data Transformation",
                "Transform customer data by aggregating sales", "data_processing", 3,
                {"total_sales": 15000, "avg_order": 750}),
           Task("api_integration", "REST API Integration",
                "Parse API response and extract key metrics", "integration", 2,
                {"status": "success", "active_users": 1250}),
           Task("workflow_automation", "Multi-Step Workflow",
                "Execute data validation -> processing -> reporting", "automation", 4,
                {"validated": True, "processed": 100, "report_generated": True}),
           Task("error_handling", "Error Recovery",
                "Handle malformed data gracefully", "reliability", 3,
                {"errors_caught": 5, "recovery_success": True}),
           Task("optimization", "Query Optimization",
                "Optimize database query performance", "performance", 5,
                {"execution_time_ms": 45, "rows_scanned": 1000}),
           Task("data_validation", "Schema Validation",
                "Validate data against business rules", "validation", 2,
                {"valid_records": 95, "invalid_records": 5}),
           Task("reporting", "Executive Dashboard",
                "Generate KPI summary report", "analytics", 3,
                {"revenue": 125000, "growth": 0.15, "customer_count": 450}),
           Task("integration_test", "System Integration",
                "Test end-to-end integration flow", "testing", 4,
                {"all_systems_connected": True, "latency_ms": 120}),
       ]


   def get_task(self, task_id: str) -> Task:
       return next((t for t in self.tasks if t.id == task_id), None)

We define the core data structures for our benchmarking system. We create the Task and BenchmarkResult data classes and initialize the EnterpriseTaskSuite, which holds several enterprise-relevant tasks such as data transformation, reporting, and integration. This lays the foundation for consistently evaluating different kinds of agents across the same tasks.
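As a quick sanity check (our addition, not part of the original listing), we can instantiate the suite and inspect a single task:

# A minimal sketch assuming the classes above are already defined.
suite = EnterpriseTaskSuite()
task = suite.get_task("data_transform")
print(task.name, "| complexity:", task.complexity)
print("Expected output:", task.expected_output)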

class BaseAgent:
   def __init__(self, name: str):
       self.name = name


   def execute(self, task: Task) -> Dict[str, Any]:
       raise NotImplementedError


class RuleBasedAgent(BaseAgent):
   def execute(self, task: Task) -> Dict[str, Any]:
       time.sleep(random.uniform(0.1, 0.3))
       if task.category == "data_processing":
           return {"total_sales": 15000 + random.randint(-500, 500),
                   "avg_order": 750 + random.randint(-50, 50)}
       elif task.category == "integration":
           return {"status": "success", "active_users": 1250}
       elif task.category == "automation":
           return {"validated": True, "processed": 98, "report_generated": True}
       else:
           return task.expected_output

We introduce the base agent interface and implement the RuleBasedAgent, which mimics traditional automation logic using predefined rules. We simulate how such agents execute tasks deterministically while maintaining speed and reliability, giving us a baseline for comparison with more advanced agents.
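As a hedged illustration (our addition), we can run the rule-based agent on one task, reusing the suite instance from the earlier snippet:

# The output keys mirror the task's expected_output for this category.
agent = RuleBasedAgent("Rule-Based Agent")
print(agent.execute(suite.get_task("api_integration")))  # e.g. {'status': 'success', 'active_users': 1250}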

class LLMAgent(BaseAgent):
   def execute(self, task: Task) -> Dict[str, Any]:
       time.sleep(random.uniform(0.2, 0.5))
       accuracy_boost = 0.95 if task.complexity >= 4 else 0.90
       result = {}
       for key, value in task.expected_output.items():
           if isinstance(value, (int, float)):
               variation = value * (1 - accuracy_boost)
               result[key] = value + random.uniform(-variation, variation)
           else:
               result[key] = value
       return result


class HybridAgent(BaseAgent):
   def execute(self, task: Task) -> Dict[str, Any]:
       time.sleep(random.uniform(0.15, 0.35))
       if task.complexity <= 2:
           return task.expected_output
       else:
           result = {}
           for key, value in task.expected_output.items():
               if isinstance(value, (int, float)):
                   variation = value * 0.03
                   result[key] = value + random.uniform(-variation, variation)
               else:
                   result[key] = value
           return result

We develop two more capable agent types: the LLMAgent, representing reasoning-based AI systems, and the HybridAgent, which combines rule-based precision with LLM adaptability. We design these agents to show how learning-based approaches improve task accuracy, especially on complex enterprise workflows.
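To see the difference in behavior, a small comparison sketch (our addition) runs both simulated agents on the same high-complexity task, so we can eyeball how far each output drifts from the expected values:

# Reuses the `suite` instance from the earlier snippet.
complex_task = suite.get_task("optimization")
for a in (LLMAgent("LLM Agent"), HybridAgent("Hybrid Agent")):
    print(a.name, "->", a.execute(complex_task))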

class BenchmarkEngine:
   def __init__(self, task_suite: EnterpriseTaskSuite):
       self.task_suite = task_suite
       self.results: List[BenchmarkResult] = []


   def run_benchmark(self, agent: BaseAgent, iterations: int = 3):
       print(f"\n{'='*60}")
       print(f"Benchmarking Agent: {agent.name}")
       print(f"{'='*60}")
       for task in self.task_suite.tasks:
           print(f"\nTask: {task.name} (Complexity: {task.complexity}/5)")
           for i in range(iterations):
               result = self._execute_task(agent, task, i + 1)
               self.results.append(result)
               status = "✓ PASS" if result.success else "✗ FAIL"
               print(f"  Run {i+1}: {status} | Time: {result.execution_time:.3f}s | Accuracy: {result.accuracy:.2%}")

Here, we build the core of our benchmarking engine, which manages agent evaluation across the defined task suite. We implement methods to run each agent multiple times per task, log the outcomes, and record key measurements such as execution time and accuracy, creating a systematic and repeatable benchmarking loop.
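Because the engine depends only on the BaseAgent interface, plugging in a custom agent requires no engine changes. A hypothetical sketch (our addition, not in the original post):

# A naive baseline agent (hypothetical): it simply echoes the expected
# output, and the same engine can benchmark it unchanged.
class EchoAgent(BaseAgent):
    def execute(self, task: Task) -> Dict[str, Any]:
        return dict(task.expected_output)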

   def _execute_task(self, agent: BaseAgent, task: Task, run_num: int) -> BenchmarkResult:
       start_time = time.time()
       try:
           output = agent.execute(task)
           execution_time = time.time() - start_time
           accuracy = self._calculate_accuracy(output, task.expected_output)
           success = accuracy >= 0.85
           return BenchmarkResult(task_id=task.id, agent_name=agent.name, success=success,
                                  execution_time=execution_time, accuracy=accuracy)
       except Exception as e:
           execution_time = time.time() - start_time
           return BenchmarkResult(task_id=task.id, agent_name=agent.name, success=False,
                                  execution_time=execution_time, accuracy=0.0, error_message=str(e))


   def _calculate_accuracy(self, output: Dict, expected: Dict) -> float:
       if not output:
           return 0.0
       scores = []
       for key, expected_val in expected.items():
           if key not in output:
               scores.append(0.0)
               continue
           actual_val = output[key]
           if isinstance(expected_val, bool):
               scores.append(1.0 if actual_val == expected_val else 0.0)
           elif isinstance(expected_val, (int, float)):
               diff = abs(actual_val - expected_val)
               tolerance = abs(expected_val * 0.1)
               score = max(0, 1 - (diff / (tolerance + 1e-9)))
               scores.append(score)
           else:
               scores.append(1.0 if actual_val == expected_val else 0.0)
       return np.mean(scores) if scores else 0.0

We define the task execution logic and the accuracy computation. We measure each agent's performance by comparing its output against the expected result: numeric fields earn partial credit within a 10% tolerance, while boolean and other fields must match exactly. This keeps the benchmarking process quantitative and fair, offering insight into how closely each agent aligns with enterprise expectations.
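To make the scoring rule concrete, here is a small worked example with hypothetical values (our addition): a numeric field off by half of the 10% tolerance earns 0.5, a matching boolean earns 1.0, and the mean gives 0.75.

# Hypothetical values, not from the original post.
engine = BenchmarkEngine(EnterpriseTaskSuite())
acc = engine._calculate_accuracy(
    output={"x": 105, "ok": True},    # numeric field 5% off, boolean matches
    expected={"x": 100, "ok": True},
)
print(f"{acc:.2f}")  # 0.75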

   def generate_report(self):
       df = pd.DataFrame([asdict(r) for r in self.results])
       print(f"\n{'='*60}")
       print("BENCHMARK REPORT")
       print(f"{'='*60}\n")
       for agent_name in df['agent_name'].unique():
           agent_df = df[df['agent_name'] == agent_name]
           print(f"{agent_name}:")
           print(f"  Success Rate: {agent_df['success'].mean():.1%}")
           print(f"  Avg Execution Time: {agent_df['execution_time'].mean():.3f}s")
           print(f"  Avg Accuracy: {agent_df['accuracy'].mean():.2%}\n")
       return df


   def visualize_results(self, df: pd.DataFrame):
       fig, axes = plt.subplots(2, 2, figsize=(14, 10))
       fig.suptitle('Enterprise Agent Benchmarking Results', fontsize=16, fontweight='bold')
       success_rate = df.groupby('agent_name')['success'].mean()
       axes[0, 0].bar(success_rate.index, success_rate.values, color=['#3498db', '#e74c3c', '#2ecc71'])
       axes[0, 0].set_title('Success Rate by Agent', fontweight='bold')
       axes[0, 0].set_ylabel('Success Rate')
       axes[0, 0].set_ylim(0, 1.1)
       for i, v in enumerate(success_rate.values):
           axes[0, 0].text(i, v + 0.02, f'{v:.1%}', ha='center', fontweight='bold')
       time_data = df.groupby('agent_name')['execution_time'].mean()
       axes[0, 1].bar(time_data.index, time_data.values, color=['#3498db', '#e74c3c', '#2ecc71'])
       axes[0, 1].set_title('Average Execution Time', fontweight='bold')
       axes[0, 1].set_ylabel('Time (seconds)')
       for i, v in enumerate(time_data.values):
           axes[0, 1].text(i, v + 0.01, f'{v:.3f}s', ha='center', fontweight='bold')
       df.boxplot(column='accuracy', by='agent_name', ax=axes[1, 0])
       axes[1, 0].set_title('Accuracy Distribution', fontweight='bold')
       axes[1, 0].set_xlabel('Agent')
       axes[1, 0].set_ylabel('Accuracy')
       plt.sca(axes[1, 0])
       plt.xticks(rotation=15)
       task_complexity = {t.id: t.complexity for t in self.task_suite.tasks}
       df['complexity'] = df['task_id'].map(task_complexity)
       complexity_perf = df.groupby(['agent_name', 'complexity'])['accuracy'].mean().unstack()
       complexity_perf.plot(kind='line', ax=axes[1, 1], marker='o', linewidth=2)
       axes[1, 1].set_title('Accuracy by Task Complexity', fontweight='bold')
       axes[1, 1].set_xlabel('Task Complexity')
       axes[1, 1].set_ylabel('Accuracy')
       axes[1, 1].legend(title='Agent', loc='best')
       axes[1, 1].grid(True, alpha=0.3)
       plt.tight_layout()
       plt.show()


if __name__ == "__main__":
   print("Enterprise Software Benchmarking for Agentic Agents")
   print("="*60)
   task_suite = EnterpriseTaskSuite()
   benchmark = BenchmarkEngine(task_suite)
   agents = [RuleBasedAgent("Rule-Based Agent"), LLMAgent("LLM Agent"), HybridAgent("Hybrid Agent")]
   for agent in agents:
       benchmark.run_benchmark(agent, iterations=3)
   results_df = benchmark.generate_report()
   benchmark.visualize_results(results_df)
   results_df.to_csv('agent_benchmark_results.csv', index=False)
   print("nResults exported to: agent_benchmark_results.csv")

We generate detailed reports and visual analytics for performance comparison, analyzing metrics such as success rate, execution time, and accuracy across agents and task complexities. Finally, we export the results to a CSV file, completing a full enterprise-grade evaluation workflow.
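As a follow-up sketch (our addition), the exported CSV can be reloaded and summarized per agent for downstream reporting:

# Assumes the benchmark above has already written agent_benchmark_results.csv.
import pandas as pd

df = pd.read_csv('agent_benchmark_results.csv')
summary = df.groupby('agent_name')[['success', 'accuracy', 'execution_time']].mean()
print(summary.round(3))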

In conclusion, we implemented a robust, extensible benchmarking system that lets us measure and compare the efficiency, adaptability, and accuracy of multiple agentic AI approaches. We observed how different architectures excel at different levels of task complexity, and how visual analytics highlight performance trends. This process enables us to evaluate existing agents and provides a solid foundation for building next-generation enterprise AI agents optimized for reliability and intelligence.



