Securing the Voice Channel with Real‑Time Audio‑Native AI
This article is sponsored by Modulate and was written, edited, and published in alignment with our Emerj sponsored content guidelines. Learn more about our thought leadership and content creation services on our Emerj Media Services page.
Live voice interactions in contact centers have become a critical operational blind spot, where fraud, identity risk, and agent attrition emerge in real time without corresponding visibility from enterprise systems.
Financial services contact centers are hemorrhaging money from two directions simultaneously, and most enterprises are only measuring one of them. The FBI’s Internet Crime Complaint Center reported that AI-driven fraud, including voice cloning and deepfake impersonation, generated nearly $893 million in verified losses in 2025, the first year the FBI formally tracked it as a crime category. That figure represents only the fraction of attacks that victims actually reported.
The consequences compound on the operational side. The Society for Human Resource Management found that the average cost to recruit and hire a single employee is nearly $4,700, before training, ramp-up, or lost productivity are factored in. In contact centers, where the Quality Assurance & Training Connection benchmarks annual agent turnover at 30 to 45%, that cost repeats at scale, every year, across every seat on the floor. A 500-agent center turning over at the industry average is not an HR problem. It is a capital problem.
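To make the scale concrete, the figures above can be multiplied out. The sketch below is illustrative arithmetic only: it assumes the cited cost per hire applies uniformly to every replaced agent and, like the SHRM figure itself, excludes training, ramp-up, and lost productivity.

```python
# Illustrative arithmetic only; all inputs are the figures cited in the text.
COST_PER_HIRE = 4_700          # average recruit-and-hire cost, USD
SEATS = 500                    # the 500-agent center used as the example
TURNOVER_RANGE = (0.30, 0.45)  # benchmark annual agent turnover


def annual_replacement_cost(seats: int, turnover: float, cost_per_hire: int) -> int:
    """Yearly hiring cost: agents replaced per year times cost per hire."""
    return round(seats * turnover * cost_per_hire)


low, high = (annual_replacement_cost(SEATS, t, COST_PER_HIRE) for t in TURNOVER_RANGE)
print(f"${low:,} to ${high:,} per year")  # $705,000 to $1,057,500 per year
```

Under those assumptions, hiring alone costs the example center between roughly $0.7 million and $1.1 million every year.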
The underlying issue is that contact centers are running real-time, voice-based operations with no real-time intelligence about what is actually happening on those calls, whether a synthetic voice is bypassing identity verification or an abusive caller is pushing a trained agent toward the exit. Both losses are measurable. Neither is inevitable.
Emerj recently hosted a three‑part series on securing the voice channel against real‑time risk, featuring Mike Pappas, CEO and Co‑Founder at Modulate; Ken Morino, Director of Market and Behavioral Research at Modulate; and Jon‑Rav Shende, Global CTO for Data and AI at Thales Group. The conversations examined how enterprises can detect fraud in‑call, deploy voice‑intelligence architectures that support high‑stakes decisions, and build workflow‑level governance that stands up to regulators and insurers.
This article examines three critical insights on how enterprises can secure the voice channel as it becomes a frontline surface for fraud and high‑stakes decisions:
- Voice channel as a real‑time risk surface: Detecting fraud and manipulation during the call prevents financial loss, regulatory exposure, and agent churn before they escalate.
- Specialized voice‑intelligence architecture for high‑stakes decisions: Models built for live audio provide the accuracy and speed required for authentication, account changes, and payment approvals that generic AI cannot support.
- Workflow‑level governance and shared ownership for voice‑AI outcomes: Clear escalation paths and audit‑ready evidence enable Security, Operations, and CX to act on risk signals in ways regulators and insurers can trust.
Voice Channel as a Real‑Time Risk Surface
Episode: Why Ensemble Architectures Win Against Real-Time Voice Risk – with Mike Pappas of Modulate
Guest: Mike Pappas, CEO & Co-Founder at Modulate
Expertise: AI, Conversational AI, AI Safety & Trust, Systems Architecture
Brief Recognition: Mike Pappas co-founded Modulate, where he has led the development and deployment of AI-powered conversational analytics used by Fortune 500 companies and major gaming studios to address harassment, fraud, and user safety at scale. His prior experience includes technical and infrastructure roles at Lola and Bridgewater Associates, spanning machine learning, cloud systems, and software architecture. He also serves as a board member of the Family Online Safety Institute and holds a degree in Physics and Applied Mathematics from MIT.
Mike Pappas describes a shift in how organizations need to understand the voice channel. What was once treated as a routine service interaction has become a setting where fraud, impersonation, and manipulation occur in real time, often faster than existing controls can detect.
The operational gap, in his view, is not in detection capability but in timing: what happens during the call versus what systems can observe afterward.
Pappas explains the gap directly:
“The biggest harms don’t show up in the logs. They happen while the call is still unfolding. By the time anyone reviews a transcript, the attacker has already succeeded. The real risk is the gap between what’s happening live and what the organization can actually see.”
— Mike Pappas, CEO & Co‑Founder, Modulate
Fraud attempts increasingly rely on urgency, emotional pressure, and impersonation, which surface in the live interaction itself. Because humans respond to emotion before policy, these signals influence decisions before traditional controls can intervene.
Pappas’ position is that detection must operate on these behavioral cues as they occur, which requires models built to interpret the audio stream itself rather than the transcript.
Agents are not trained to recognize adversarial conversational patterns, especially when those patterns are scripted to bypass verification steps. Pappas argues that expecting agents to identify these signals on their own is unrealistic; the solution is to give them real‑time visibility into risk signals so they are not relying on instinct in high‑pressure moments.
In his framing, AI’s role is to consistently surface these signals, even under time pressure or when facing a convincing impersonation.
In his episode, Ken Morino notes that behavioral and emotional cues disappear when reduced to text, limiting the usefulness of transcript‑based systems for detecting manipulation. The signals that indicate something is off (hesitation, tonal mismatch, conversational steering) are lost once the interaction is flattened into words.
Morino’s view is that AI systems built for real‑time audio can recover these signals and present them in a form that fits into existing workflows without requiring agents to interpret raw audio patterns themselves.
High‑stakes workflows such as authentication, account changes, and payment approvals are exposed because decisions must be made quickly, and attackers exploit that time pressure.
Jon‑Rav Shende adds that deepfake fraud often succeeds by exploiting workflow gaps and that most security teams have limited visibility into the live interaction where the compromise actually occurs. His emphasis is on using AI to surface in‑call signals tied to identity risk, giving security teams a view into the interaction while it is still happening rather than after the fact.
Across the three conversations, several solution patterns emerge:
- Surface risk signals during the call, giving agents real‑time context rather than relying on instinct or memory.
- Use audio‑native models that capture tone, hesitation, and emotional mismatch, signals that do not survive transcription.
- Expose workflow‑level vulnerabilities in identity and approval processes where attackers exploit speed and ambiguity.
- Provide agents with structured prompts or cues when risk signals appear, thereby reducing cognitive load during high‑pressure interactions.
- Integrate security visibility into live interactions so teams do not discover compromises after the fact.
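As a rough illustration of the "structured prompts" pattern, the sketch below turns a stream of per-window risk scores into short cues for the agent. Everything here is an assumption made for the example: the `WindowScore` shape, the signal names, the prompt wording, and the 0.8 threshold are hypothetical and do not describe any vendor's product or API. The scores themselves would come from an upstream audio-native model that is out of scope here.

```python
from typing import Iterable, Iterator, NamedTuple


class WindowScore(NamedTuple):
    """One scored slice of live audio (hypothetical output of an upstream model)."""
    offset_s: float  # seconds since the call started
    signal: str      # name of the risk signal
    score: float     # 0.0 (benign) to 1.0 (high risk)


# Hypothetical cue text; a real deployment would define its own wording.
PROMPTS = {
    "synthetic_voice": "Possible synthetic voice: ask an additional verification question.",
    "urgency_pressure": "Caller is pressing for speed: follow the full verification script.",
}


def agent_cues(scores: Iterable[WindowScore], threshold: float = 0.8) -> Iterator[str]:
    """Yield one cue per signal, the first time it reaches the threshold."""
    already_shown = set()
    for window in scores:
        if (window.score >= threshold and window.signal in PROMPTS
                and window.signal not in already_shown):
            already_shown.add(window.signal)
            yield f"[{window.offset_s:.1f}s] {PROMPTS[window.signal]}"


stream = [
    WindowScore(2.0, "urgency_pressure", 0.55),
    WindowScore(6.5, "synthetic_voice", 0.92),
    WindowScore(9.0, "synthetic_voice", 0.95),   # repeat: suppressed
    WindowScore(12.0, "urgency_pressure", 0.85),
]
cues = list(agent_cues(stream))
for cue in cues:
    print(cue)
```

Showing each cue only once is a deliberate choice in this sketch: the goal stated above is to reduce cognitive load, and repeating the same alert every few seconds would work against it.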
Specialized Voice‑Intelligence Architecture for High‑Stakes Decisions
Episode: Operationalizing Real-Time Voice Intelligence for FinServ and CX – with Ken Morino of Modulate
Expertise: Product Management, Behavioral Research, User Experience Design, Enterprise Software & Integrations
Brief Recognition: Ken Morino has led product and market research initiatives at Modulate, helping shape AI-driven conversational technology and user-focused product strategy. Prior to Modulate, he spent nearly a decade at LiveShopper Sassie leading enterprise product management, API integrations, and large-scale client implementations, working with major corporate clients and cross-functional technical teams. Earlier in his career, he held product, technical sales, and security solutions leadership roles at Demarc Security, and he holds both a BS in Computer Science and an MA in Economics from UC Santa Barbara.
Ken Morino argues that most organizations try to solve identity‑critical problems with systems that were never designed for identity.
The dominant tools in the market (ASR pipelines, transcript analytics, and generic LLMs) were built for summarization, sentiment scoring, and compliance review. They operate on text, not audio, and they assume that accuracy requirements are flexible. In authentication and account‑change workflows, those assumptions break immediately.
The technical constraints are non‑negotiable:
- Identity workflows have fixed latency budgets. A model that takes 1.5 seconds to respond is unusable in a system that must approve or deny an action in under 300 milliseconds.
- Transcript‑based systems discard the acoustic features (pitch, timbre, micro‑pauses, harmonic structure) that identity systems depend on.
- Generic LLMs cannot meet identity‑grade accuracy thresholds. A 95% accurate model is catastrophic when the remaining 5% is fraud.
- Single‑model approaches fail because no individual signal (voiceprint, phrasing, metadata) is reliable enough to detect synthetic audio.
- CX analytics systems lack multi‑signal fusion, which is required to combine acoustic, behavioral, and contextual signals into a defensible identity decision.
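Two of these constraints reduce to simple arithmetic. The sketch below checks the 1.5-second model against the 300-millisecond budget and shows what a 95% accuracy rate means in absolute terms. The daily volume of 10,000 identity decisions is a hypothetical number chosen only for illustration, and the calculation assumes errors are spread evenly across decisions.

```python
LATENCY_BUDGET_MS = 300   # approve-or-deny budget cited in the text
MODEL_LATENCY_MS = 1_500  # the 1.5-second model cited in the text


def fits_budget(latency_ms: float, budget_ms: float = LATENCY_BUDGET_MS) -> bool:
    """A decision is usable only if it arrives inside the budget."""
    return latency_ms < budget_ms


def expected_errors(decisions: int, accuracy: float) -> int:
    """Number of wrong decisions at a given accuracy rate."""
    return round(decisions * (1 - accuracy))


overrun = MODEL_LATENCY_MS / LATENCY_BUDGET_MS
daily_decisions = 10_000  # hypothetical volume, for illustration only
daily_errors = expected_errors(daily_decisions, 0.95)

print(f"Fits budget: {fits_budget(MODEL_LATENCY_MS)} ({overrun:.0f}x over)")
print(f"Wrong decisions per day at 95% accuracy: {daily_errors}")
```

At that assumed volume, a 95% accurate model still makes 500 wrong identity decisions a day, which is why the text treats the threshold as non-negotiable.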
Morino summarizes the core limitation:
“Once you flatten a conversation into text, you lose the hesitation, the tone, the emotional mismatch: all the things that tell you something isn’t right.”
— Ken Morino, Director of Market and Behavioral Research, Modulate
Mike Pappas adds that identity‑critical decisions require ensemble architectures: multiple specialized models operating on different aspects of the audio signal and converging on a single risk assessment.
Jon‑Rav Shende notes that insurers and regulators increasingly expect audit‑ready evidence that shows how each signal contributed to the decision. Together, they view authentication, account changes, and payment approvals as requiring a purpose‑built architecture, not a repurposed analytics stack.
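One simple way to picture an ensemble that converges on a single assessment while preserving each signal's contribution is a weighted average. This is a deliberately minimal stand-in, not the architecture the speakers describe: the detector names, scores, and weights below are hypothetical, and a production system would likely use a learned fusion model rather than fixed weights.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """Output of one specialized detector (all values hypothetical)."""
    name: str
    score: float   # 0.0 (benign) to 1.0 (high risk)
    weight: float  # relative trust placed in this detector


def fuse(signals: list[Signal]) -> tuple[float, dict[str, float]]:
    """Weighted-average fusion that also returns each signal's share of the result."""
    total_weight = sum(s.weight for s in signals)
    contributions = {s.name: s.score * s.weight / total_weight for s in signals}
    return sum(contributions.values()), contributions


risk, contributions = fuse([
    Signal("acoustic", score=0.9, weight=0.5),
    Signal("behavioral", score=0.6, weight=0.3),
    Signal("contextual", score=0.2, weight=0.2),
])
print(f"risk={risk:.2f}")
for name, share in contributions.items():
    print(f"  {name}: {share:.2f}")
```

Because the contributions sum to the final score, the same structure that produces the decision also explains it, which is the property the audit-ready evidence requirement calls for.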
Workflow‑Level Governance and Shared Ownership for Voice‑AI Outcomes
Episode: Why Deepfake Fraud Beats Your Workflows, Not Your Technology – with Jon-Rav Shende of Thales Group
Guest: Jon-Rav Shende, Global CTO for Data and AI at Thales Group
Expertise: AI Security, Cloud & Enterprise Transformation, Cybersecurity & Risk Management, Data Governance & Trusted AI
Brief Recognition: Jon G. Shende has held senior technology and security leadership roles spanning CTO, CISO, and executive advisory positions focused on AI, cybersecurity, and enterprise transformation. His experience includes leadership roles at Thales, Sutherland, and ForenSec Global, where he led large-scale cloud, security, and AI modernization initiatives for global enterprises, including Fortune 500 organizations and multi-billion-dollar transformation programs. He also brings experience with major technology and consulting ecosystems, including Ernst & Young and Cognizant, as well as with cloud platforms such as AWS, Azure, and Google, alongside active involvement with InfraGard and extensive work in AI governance, cyber resilience, and trusted AI adoption.
Jon‑Rav Shende’s contribution across the conversations is that the technical capability to detect risk is only half the problem. The other half is organizational: once a system can surface identity‑relevant signals, the enterprise must decide who owns the response, how evidence is captured, and how decisions become defensible to regulators, auditors, and insurers.
In his view, the failure mode is not just missed detection; it is unclear ownership, inconsistent escalation, and the absence of audit‑ready records that explain why an action was taken:
“Organizations don’t fail because the signal wasn’t there. They fail because no one knows who is supposed to act on it. If Security sees something but Operations owns the workflow, the alert dies in the middle. And when something goes wrong, there’s no record that shows what was known, when it was known, and who made the decision. That’s what regulators look for, and that’s what insurers look for.”
— Jon‑Rav Shende, Global CTO for Data and AI at Thales Group
Ken Morino adds that governance also depends on interpretability. A model can detect a signal, but if the output is ambiguous or requires a specialist to decode, the organization has not solved the problem.
Morino’s view is that the system must present signals in a form that fits into existing workflows, because the moment an agent or analyst has to “figure out” what the model meant, accountability becomes unclear and decisions become inconsistent.
Mike Pappas reinforces this from a defensibility perspective. High‑stakes decisions (authentication approvals, account changes, payment authorizations) must be explainable to regulators and insurers. That requires a shared operational model: Security, Operations, and CX must agree on what constitutes risk, who owns the moment when a signal appears, and how the evidence is captured. Without that alignment, organizations end up with fragmented visibility and no unified record of what happened.
Across the episodes, three governance patterns emerge:
- Clear escalation paths that specify who owns the decision when a risk signal appears, and what authority they have to pause, deny, or verify an action.
- Audit‑ready evidence trails that capture the signals, the decision, and the rationale in a form regulators and insurers can evaluate.
- Cross‑functional alignment between Security, Operations, and CX so that risk signals do not get trapped inside a single team’s workflow.
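The first two patterns can be sketched as a routing table plus a record that is written at decision time. The sketch is illustrative only: the ownership map, the 0.7 threshold, the field names, and the choice of a single highest-scoring signal as the trigger are all assumptions, not a description of any organization's actual policy.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

# Hypothetical ownership map: which team owns each workflow and what it may do.
ESCALATION_PATHS = {
    "authentication": ("Security", "deny"),
    "account_change": ("Operations", "pause"),
    "payment_approval": ("Security", "verify"),
}


@dataclass(frozen=True)
class AuditRecord:
    """What was known, when it was known, who owned it, and what was decided."""
    call_id: str
    workflow: str
    signals: dict
    owner: str
    action: str
    rationale: str
    recorded_at: str


def escalate(call_id: str, workflow: str, signals: dict, threshold: float = 0.7) -> AuditRecord:
    """Route a risk signal to its owner and capture the decision as evidence."""
    owner, risk_action = ESCALATION_PATHS[workflow]
    top_signal, top_score = max(signals.items(), key=lambda item: item[1])
    if top_score >= threshold:
        action = risk_action
        rationale = f"{top_signal} scored {top_score:.2f}, at or above threshold {threshold:.2f}"
    else:
        action = "allow"
        rationale = f"no signal reached threshold {threshold:.2f}"
    return AuditRecord(call_id, workflow, dict(signals), owner, action, rationale,
                       datetime.now(timezone.utc).isoformat())


record = escalate("call-001", "payment_approval",
                  {"synthetic_voice": 0.91, "urgency_pressure": 0.40})
print(json.dumps(asdict(record), indent=2))
```

Writing the record in the same function that makes the decision means there is never an action without a rationale, owner, and timestamp attached to it.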
Shende’s view is that once AI begins influencing identity‑critical decisions, the organization must treat those decisions as shared assets rather than departmental responsibilities. The governance model becomes as important as the model architecture, because without it, even the most accurate system cannot produce outcomes that stand up to scrutiny.
