Qualifire AI Releases Rogue: An End-to-End Agentic AI Testing Framework, Evaluating the Performance of AI Agents

Agentic systems are stochastic, context-dependent, and policy-bounded. Conventional QA (unit tests, static prompts, or scalar “LLM-as-a-judge” scores) fails to expose multi-turn vulnerabilities and provides weak audit trails. Developer teams need protocol-accurate conversations, explicit policy checks, and machine-readable evidence that can gate releases with confidence.
Qualifire AI has open-sourced Rogue, a Python framework that evaluates AI agents over the Agent-to-Agent (A2A) protocol. Rogue converts business policies into executable scenarios, drives multi-turn interactions against a target agent, and outputs deterministic reports suitable for CI/CD and compliance reviews.
Quick Start
Prerequisites
- uvx – If not installed, follow the uv installation guide
- Python 3.10+
- An API key for an LLM provider (e.g., OpenAI, Google, Anthropic).
Installation
Option 1: Quick Install (Recommended)
Use our automated install script to get up and running quickly:
# TUI
uvx rogue-ai
# Web UI
uvx rogue-ai ui
# CLI / CI/CD
uvx rogue-ai cli
Option 2: Manual Installation
(a) Clone the repository:
git clone https://github.com/qualifire-dev/rogue.git
cd rogue
(b) Install dependencies:
If you are using uv:
uv sync
Or, if you're using pip:
pip install -e .
(c) Optional: set up your environment variables. Create a .env file in the root directory and add your API keys. Rogue uses LiteLLM, so you can set keys for various providers.
OPENAI_API_KEY="sk-..."
ANTHROPIC_API_KEY="sk-..."
GOOGLE_API_KEY="..."
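Before launching Rogue, it can help to confirm which provider keys are actually visible. The sketch below is a hypothetical helper, not part of Rogue. It assumes the .env file contains simple KEY="value" lines like the example above. Rogue or LiteLLM may load .env differently (for example, via python-dotenv). It checks only the three key names shown here.

```python
import os
from pathlib import Path

# Key names taken from the .env example above; LiteLLM supports many more providers.
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY="value" lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Strip one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def load_env(path: str = ".env") -> dict[str, str]:
    """Merge .env values with the real environment; real variables win."""
    env_file = Path(path)
    merged = parse_env_text(env_file.read_text()) if env_file.exists() else {}
    merged.update(os.environ)
    return merged


def configured_providers(env: dict[str, str]) -> list[str]:
    """Return provider names whose API key is present and non-empty."""
    return [name for name, key in PROVIDER_KEYS.items() if env.get(key)]
```

For example, `print(configured_providers(load_env()))` run from the repository root lists the providers for which a non-empty key was found.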
Running Rogue
Rogue operates on a client-server architecture: the core evaluation logic runs in a backend server, and various clients connect to it to provide different interfaces.
Default Behavior
When you run uvx rogue-ai without specifying a mode, it:
- Starts the Rogue server in the background
- Launches the TUI (Terminal User Interface) client
uvx rogue-ai
Available Modes
- Default (Server + TUI): uvx rogue-ai – Starts the server in the background plus the TUI client
- Server: uvx rogue-ai server – Runs only the backend server
- TUI: uvx rogue-ai tui – Runs only the TUI client (requires a running server)
- Web UI: uvx rogue-ai ui – Runs only the Gradio web interface client (requires a running server)
- CLI: uvx rogue-ai cli – Runs non-interactive command-line evaluation (requires a running server; ideal for CI/CD)
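The TUI, Web UI, and CLI modes all require a running server. A CI job that starts `uvx rogue-ai server` in the background therefore has to wait until the server is listening. The following minimal sketch uses only the standard library and assumes the default 127.0.0.1:8000 address. `wait_for_server` is a hypothetical helper. An open port only shows that something is listening, not that the listener is Rogue.

```python
import socket
import time


def wait_for_server(host: str = "127.0.0.1", port: int = 8000,
                    timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout elapses.

    Returns True once the port accepts connections, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means some process is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A CI step could call `wait_for_server()`, fail the job if it returns False, and otherwise go on to `uvx rogue-ai cli`. Polling a TCP connection avoids relying on a health-check endpoint, which this article does not document.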
Mode Arguments
Server Mode
uvx rogue-ai server [OPTIONS]
Options:
- --host HOST – Host to run the server on (default: 127.0.0.1 or HOST env var)
- --port PORT – Port to run the server on (default: 8000 or PORT env var)
- --debug – Enable debug logging
TUI Mode
uvx rogue-ai tui [OPTIONS]
Web UI Mode
uvx rogue-ai ui [OPTIONS]
Options:
- --rogue-server-url URL – Rogue server URL (default: http://localhost:8000)
- --port PORT – Port to run the UI on
- --workdir WORKDIR – Working directory (default: ./.rogue)
- --debug – Enable debug logging
Example: Testing the T-Shirt Store Agent
This repository includes a simple example agent that sells T-shirts. You can use it to see Rogue in action.
Install the example dependencies:
If you are using uv:
uv sync --group examples
or, if you're using pip:
pip install -e ".[examples]"
(a) Start the example agent server in a separate terminal:
If you are using uv:
uv run examples/tshirt_store_agent
If not:
python examples/tshirt_store_agent
This will start the agent on http://localhost:10001.
(b) Configure Rogue in the UI to point to the example agent:
- Agent URL: http://localhost:10001
- Authentication: no-auth
(c) Run the evaluation and watch Rogue test the T-Shirt agent's policies!
You can use either the TUI (uvx rogue-ai) or the Web UI (uvx rogue-ai ui) mode.
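Before pointing Rogue at http://localhost:10001, you can check that the example agent is reachable by requesting its A2A agent card. This sketch assumes the agent publishes its card at a well-known path. A2A revisions have used both `/.well-known/agent.json` and `/.well-known/agent-card.json`, so the sketch tries both. The helper names are hypothetical and not part of Rogue.

```python
import json
import urllib.request

# Well-known agent-card paths: newer A2A revisions use agent-card.json,
# earlier ones used agent.json. Check the spec version your agent implements.
AGENT_CARD_PATHS = ("/.well-known/agent-card.json", "/.well-known/agent.json")


def agent_card_urls(agent_url: str) -> list[str]:
    """Build candidate agent-card URLs for an A2A agent base URL."""
    base = agent_url.rstrip("/")
    return [base + path for path in AGENT_CARD_PATHS]


def fetch_agent_card(agent_url: str, timeout: float = 5.0) -> dict:
    """Return the first agent card that can be fetched and parsed as JSON."""
    last_error: Exception | None = None
    for url in agent_card_urls(agent_url):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return json.load(resp)
        except (OSError, ValueError) as exc:
            # OSError covers URLError/HTTPError and timeouts; ValueError covers bad JSON.
            last_error = exc
    raise RuntimeError(f"No agent card found at {agent_url}") from last_error
```

Once the agent from step (a) is running, `fetch_agent_card("http://localhost:10001")` should return the card as a dict. Otherwise it raises RuntimeError chained to the last connection or parse error.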
Where Rogue Fits: Practical Use Cases
- Safety & Compliance Hardening: Validate PII/PHI handling, refusal behavior, secret-leak prevention, and regulated-domain policies with transcript-anchored evidence.
- E-Commerce & Support Agents: Enforce OTP-gated discounts, refund rules, SLA-aware escalation, and tool-use correctness (order lookup, ticketing) under adversarial and failure conditions.
- Developer/DevOps Agents: Assess code-mod and CLI copilots for workspace confinement, rollback semantics, rate-limit/backoff behavior, and unsafe-command prevention.
- Multi-Agent Systems: Verify planner-executor contracts, capability negotiation, and schema conformance over A2A; evaluate interoperability across heterogeneous frameworks.
- Regression & Drift Monitoring: Run nightly suites against new model versions or prompt changes; detect behavioral drift and enforce policy-critical pass criteria before release.
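Drift monitoring comes down to comparing verdicts across runs. This article does not describe Rogue's report format. The sketch therefore assumes you have already reduced two runs (for example, last night's baseline and today's candidate) to a mapping of scenario IDs to pass/fail booleans. The scenario IDs and the `compare_runs` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriftReport:
    regressions: list[str]  # passed in baseline, failed in candidate
    fixes: list[str]        # failed in baseline, passed in candidate
    missing: list[str]      # present in baseline, absent from candidate


def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> DriftReport:
    """Compare per-scenario pass/fail verdicts from two evaluation runs.

    Keys are scenario identifiers; values are True for pass, False for fail.
    """
    regressions = sorted(s for s, ok in baseline.items() if ok and candidate.get(s) is False)
    fixes = sorted(s for s, ok in baseline.items() if not ok and candidate.get(s) is True)
    missing = sorted(s for s in baseline if s not in candidate)
    return DriftReport(regressions, fixes, missing)
```

The sketch reports missing scenarios separately rather than counting them as failures, so a shrinking test suite is visible instead of silently passing.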
What Exactly Is Rogue—and Why Should Agent Dev Teams Care?
Rogue is an end-to-end testing framework designed to evaluate the performance, compliance, and reliability of AI agents. It synthesizes business context and risk into structured tests with clear objectives, tactics, and success criteria. The EvaluatorAgent runs protocol-correct conversations in either a fast single-turn mode or a deep multi-turn adversarial mode. You can bring your own model or let Rogue use Qualifire's bespoke SLM judges to drive the tests. Rogue also provides streaming observability and deterministic artifacts: live transcripts, pass/fail verdicts, rationales tied to transcript spans, timing, and model/version lineage.
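Here is one way to picture the structured tests and verdicts described above, along with how they can gate a release. The field names are hypothetical, modeled on this paragraph rather than on Rogue's actual schema. The sketch treats a missing verdict for a policy-critical scenario as a failure.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """A policy turned into a test: objective, tactics, and success criteria."""
    scenario_id: str
    objective: str
    tactics: list[str] = field(default_factory=list)
    success_criteria: str = ""
    policy_critical: bool = False


@dataclass
class Verdict:
    scenario_id: str
    passed: bool
    rationale: str
    transcript_span: tuple[int, int]  # (first_turn, last_turn) backing the rationale


def release_gate(scenarios: list[Scenario], verdicts: list[Verdict]) -> int:
    """Return an exit code: 1 if any policy-critical scenario failed or lacks a verdict."""
    by_id = {v.scenario_id: v for v in verdicts}
    for s in scenarios:
        if not s.policy_critical:
            continue  # non-critical failures are reported but do not block release
        v = by_id.get(s.scenario_id)
        if v is None or not v.passed:
            return 1
    return 0
```

A CI script could end with `sys.exit(release_gate(scenarios, verdicts))` so that the pipeline step fails whenever a policy-critical check breaks.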
Under the Hood: How Rogue Is Built
Rogue operates on a client-server architecture:
- Rogue Server: Contains the core evaluation logic
- Client Interfaces: Multiple interfaces that connect to the server:
  - TUI (Terminal UI): Modern terminal interface built with Go and Bubble Tea
  - Web UI: Gradio-based web interface
  - CLI: Command-line interface for automated evaluation and CI/CD
This architecture allows for flexible deployment and usage patterns: the server can run independently, and multiple clients can connect to it simultaneously.
Summary
Rogue helps developer teams test agent behavior the way it actually runs in production. It turns written policies into concrete scenarios, exercises those scenarios over A2A, and records what happened in transcripts you can audit. The result is a clear, repeatable signal you can use in CI/CD to catch policy breaks and regressions before they ship.
Thanks to the Qualifire team for the thought leadership and resources for this article. The Qualifire team has supported this content/article.
The post Qualifire AI Releases Rogue: An End-to-End Agentic AI Testing Framework, Evaluating the Performance of AI Agents appeared first on MarkTechPost.