
Salesforce AI Research Releases VoiceAgentRAG: A Dual-Agent Memory Router that Cuts Voice RAG Retrieval Latency by 316x

In the world of voice AI, the difference between a useful assistant and an awkward exchange is measured in milliseconds. While text-based Retrieval-Augmented Generation (RAG) pipelines can afford several seconds of 'thinking' time, voice agents must respond within a roughly 200 ms budget to maintain a natural conversational flow. Typical production vector database queries add 50–300 ms of network latency, effectively consuming the entire budget before an LLM even begins generating a response.

The Salesforce AI Research team has released VoiceAgentRAG, an open-source dual-agent architecture designed to sidestep this retrieval bottleneck by decoupling document fetching from response generation.

https://arxiv.org/pdf/2603.02206

The Dual-Agent Architecture: Fast Talker vs. Slow Thinker

VoiceAgentRAG operates as a memory router that orchestrates two concurrent agents over an asynchronous event bus:

  • The Fast Talker (Foreground Agent): This agent handles the latency-critical path. For each user query, it first checks a local, in-memory semantic cache. If the required context is present, the lookup takes roughly 0.35 ms. On a cache miss, it falls back to the remote vector database and immediately caches the results for future turns.
  • The Slow Thinker (Background Agent): Running as a background process, this agent continuously monitors the conversation stream. It uses a sliding window of the last six conversation turns to predict 3–5 likely follow-up topics, then pre-fetches relevant document chunks from the remote vector store into the local cache before the user even asks their next question.

To improve search accuracy, the Slow Thinker is instructed to generate document-style descriptions rather than questions. This ensures the resulting embeddings align more closely with the actual prose found in the knowledge base.
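The dual-agent loop above can be sketched in a few lines of asyncio. This is a minimal illustration under stated assumptions, not the VoiceAgentRAG implementation: the dict-backed cache, the stubbed remote store, and helper names like `remote_fetch` and `predict_topics` are invented for the sketch.

```python
import asyncio

# Stand-in for the remote vector database (the real system uses Qdrant).
REMOTE = {
    "pricing": "Doc: pricing tiers and discounts.",
    "upgrade": "Doc: upgrade paths for existing customers.",
}

local_cache: dict[str, str] = {}

async def remote_fetch(topic: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for ~110 ms of network latency
    return REMOTE.get(topic, "Doc: nothing found.")

async def fast_talker(query_topic: str) -> str:
    # Critical path: local lookup first, remote fallback on a miss.
    if query_topic in local_cache:
        return local_cache[query_topic]        # ~sub-millisecond path
    doc = await remote_fetch(query_topic)      # slow path on a miss
    local_cache[query_topic] = doc             # cache for future turns
    return doc

def predict_topics(history: list[str]) -> list[str]:
    # Placeholder for the Slow Thinker's LLM-based prediction over the
    # last six turns; here we simply echo the recent topics.
    return history[-6:]

async def slow_thinker(history: list[str]) -> None:
    # Background prefetch: warm the cache before the user's next turn.
    for topic in predict_topics(history):
        if topic not in local_cache:
            local_cache[topic] = await remote_fetch(topic)

async def demo() -> None:
    await slow_thinker(["pricing"])       # prefetch during a pause
    print(await fast_talker("pricing"))   # served from the local cache

asyncio.run(demo())  # prints "Doc: pricing tiers and discounts."
```

In the real system both agents run concurrently and coordinate over an event bus; the sequential `demo` here only shows the cache hand-off between them.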

The Technical Backbone: Semantic Caching

The system’s efficiency hinges on a specialized semantic cache implemented with an in-memory FAISS IndexFlatIP (inner product) index.

  • Document-Embedding Indexing: Unlike passive caches keyed by query meaning, VoiceAgentRAG indexes entries by their own document embeddings. This lets the cache perform a proper semantic search over its contents, preserving relevance even when the user’s phrasing differs from the system’s predictions.
  • Threshold Management: Because query-to-document cosine similarity is systematically lower than query-to-query similarity, the system uses a default threshold of τ = 0.40 to balance precision and recall.
  • Maintenance: The cache detects near-duplicates using a 0.95 cosine similarity threshold and employs a Least Recently Used (LRU) eviction policy with a 300-second Time-To-Live (TTL).
  • Priority Retrieval: On a Fast Talker cache miss, a PriorityRetrieval event triggers the Slow Thinker to perform an immediate retrieval with an expanded top-k (2× the default) to quickly populate the cache around the new topic area.
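These cache semantics can be sketched compactly. The sketch below uses NumPy inner products on unit-normalized vectors in place of a FAISS IndexFlatIP index (equivalent to cosine similarity); the class name, the `CAPACITY` value, and the toy 3-dimensional vectors are assumptions for illustration, not values from the repository.

```python
import time
from collections import OrderedDict

import numpy as np

TAU = 0.40      # query-to-document acceptance threshold
DEDUP = 0.95    # near-duplicate cosine threshold
TTL = 300.0     # entry time-to-live, seconds
CAPACITY = 128  # illustrative cache size (not from the paper)

class SemanticCache:
    """Sketch of the cache logic: document-embedding indexing,
    τ = 0.40 acceptance, 0.95 dedup, LRU eviction, 300 s TTL."""

    def __init__(self):
        self._entries = OrderedDict()  # id -> (unit vector, doc, timestamp)
        self._next_id = 0

    def add(self, doc_vec, doc):
        v = doc_vec / np.linalg.norm(doc_vec)
        # Skip near-duplicates of anything already cached.
        for vec, _, _ in self._entries.values():
            if float(v @ vec) >= DEDUP:
                return
        if len(self._entries) >= CAPACITY:
            self._entries.popitem(last=False)  # evict least recently used
        self._entries[self._next_id] = (v, doc, time.monotonic())
        self._next_id += 1

    def lookup(self, query_vec):
        q = query_vec / np.linalg.norm(query_vec)
        now = time.monotonic()
        best_id, best_sim = None, TAU
        for key, (vec, _, stamp) in list(self._entries.items()):
            if now - stamp > TTL:              # expire stale entries
                del self._entries[key]
                continue
            sim = float(q @ vec)               # inner product == cosine
            if sim >= best_sim:
                best_id, best_sim = key, sim
        if best_id is None:
            return None                        # cache miss below τ
        self._entries.move_to_end(best_id)     # refresh LRU order
        return self._entries[best_id][1]

cache = SemanticCache()
cache.add(np.array([1.0, 0.2, 0.0]), "pricing doc")
print(cache.lookup(np.array([0.9, 0.3, 0.1])))  # similar query -> hit
print(cache.lookup(np.array([0.0, 0.0, 1.0])))  # orthogonal -> None
```

In production the linear scan would be a FAISS index search over 1536-dimensional embeddings, but the threshold, dedup, and eviction behavior is the same shape.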

Benchmarks and Performance

The research team evaluated the system using Qdrant Cloud as the remote vector database across 200 queries spanning 10 conversation scenarios.

Metric                       Performance
Overall Cache Hit Rate       75% (79% on warm turns)
Retrieval Speedup            316× (110 ms → 0.35 ms)
Total Retrieval Time Saved   16.5 seconds over 200 turns
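A back-of-envelope check shows how these figures fit together: 200 queries at a 75% hit rate yields 150 cache hits, each saving roughly the remote latency. (A naive 110/0.35 ratio lands near 314×, so the published 316× presumably reflects measured timings rather than these rounded inputs.)

```python
# Sanity-check the reported savings from the paper's rounded figures:
# 200 queries, 75% hit rate, 110 ms remote vs 0.35 ms cached lookups.
queries = 200
hit_rate = 0.75
remote_ms = 110.0
cache_ms = 0.35

hits = int(queries * hit_rate)                  # 150 cache hits
saved_s = hits * (remote_ms - cache_ms) / 1000  # seconds saved vs all-remote
speedup = remote_ms / cache_ms                  # per-lookup speedup

print(f"hits={hits}, saved={saved_s:.1f}s, speedup={speedup:.0f}x")
# prints "hits=150, saved=16.4s, speedup=314x"
```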

The architecture is most effective in topically coherent, sustained-topic scenarios. For example, ‘Feature comparison’ (S8) achieved a 95% hit rate. Conversely, performance dipped in more volatile scenarios: the lowest-performing scenario was ‘Existing customer upgrade’ (S9) at a 45% hit rate, while ‘Mixed rapid-fire’ (S10) maintained 55%.


Integration and Support

The VoiceAgentRAG repository is designed for broad compatibility across the AI stack:

  • LLM Providers: Supports OpenAI, Anthropic, Gemini/Vertex AI, and Ollama. The paper’s default evaluation model was GPT-4o-mini.
  • Embeddings: The evaluation used OpenAI text-embedding-3-small (1536 dimensions), but the repository supports both OpenAI and Ollama embeddings.
  • STT/TTS: Supports Whisper (local or OpenAI) for speech-to-text and Edge TTS or OpenAI for text-to-speech.
  • Vector Stores: Built-in support for FAISS and Qdrant.

Key Takeaways

  • Dual-Agent Architecture: The system solves the RAG latency bottleneck by pairing a foreground ‘Fast Talker’ for sub-millisecond cache lookups with a background ‘Slow Thinker’ for predictive pre-fetching.
  • Significant Speedup: It achieves a 316× retrieval speedup (110 ms → 0.35 ms) on cache hits, which is critical for staying within the natural 200 ms voice response budget.
  • High Cache Efficiency: Across varied scenarios, the system maintains a 75% overall cache hit rate, peaking at 95% in topically coherent conversations such as feature comparisons.
  • Document-Indexed Caching: To preserve accuracy regardless of user phrasing, the semantic cache indexes entries by document embeddings rather than by the predicted query’s embedding.
  • Anticipatory Prefetching: The background agent uses a sliding window of the last six conversation turns to predict likely follow-up topics and populate the cache during natural inter-turn pauses.

Check out the Paper and Repo.

The post Salesforce AI Research Releases VoiceAgentRAG: A Dual-Agent Memory Router that Cuts Voice RAG Retrieval Latency by 316x appeared first on MarkTechPost.
