
Google Introduces Speech-to-Retrieval (S2R) Approach that Maps a Spoken Query Directly to an Embedding and Retrieves Information without First Converting Speech to Text

Google's AI research team has introduced a production shift in Voice Search with Speech-to-Retrieval (S2R). S2R maps a spoken query directly to an embedding and retrieves information without first converting speech to text. The team positions S2R as an architectural and philosophical change that targets error propagation in the traditional cascade modeling approach and focuses the system on retrieval intent rather than transcript fidelity. The research team states that Voice Search is now powered by S2R.

https://research.google/blog/speech-to-retrieval-s2r-a-new-approach-to-voice-search/

From cascade modeling to intent-aligned retrieval

In the traditional cascade modeling approach, automatic speech recognition (ASR) first produces a single text string, which is then passed to retrieval. Small transcription errors can change the meaning of the query and yield incorrect results. S2R reframes the problem around the question "What information is being sought?" and bypasses the fragile intermediate transcript.
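A toy sketch (not Google's system; the documents and queries are invented for illustration) shows why the cascade is fragile: when retrieval is keyed on the exact transcript, a single misrecognized word sends the query to the wrong result, even though the speaker's intent was unchanged.

```python
# Hypothetical mini-index: exact-text retrieval keyed on the transcript.
documents = {
    "the scream painting": "Edvard Munch's 1893 painting 'The Scream'",
    "the screen painting": "Decorative painted folding screens",
}

def cascade_retrieve(transcript: str) -> str:
    # The cascade's retrieval step sees only the ASR text string,
    # so any transcription error propagates directly into retrieval.
    return documents.get(transcript, "no result")

spoken_intent = "the scream painting"
asr_output = "the screen painting"  # one-word ASR error

# The same spoken intent now retrieves a different document.
print(cascade_retrieve(spoken_intent))
print(cascade_retrieve(asr_output))
```

An embedding-based system can instead place both acoustic renderings of the same intent near the same documents in vector space, which is exactly the dependency S2R removes.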

Evaluating the potential of S2R

Google's research team analyzed the disconnect between word error rate (WER), a measure of ASR quality, and mean reciprocal rank (MRR), a measure of retrieval quality. Using human-verified transcripts to simulate a "perfect ASR" cascade groundtruth condition, the team compared (i) Cascade ASR (real-world baseline) against (ii) Cascade groundtruth (upper bound) and observed that lower WER does not reliably predict higher MRR across languages. The persistent MRR gap between the baseline and groundtruth indicates room for models that optimize retrieval intent directly from audio.
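For reference, MRR is the average over queries of the reciprocal rank of the first relevant result. A minimal sketch of the metric (generic definition, not Google's evaluation code):

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """MRR over a set of queries.

    ranked_results: list of ranked result lists, one per query.
    relevant: the relevant item for each query.
    A query contributes 1/rank of its first relevant hit, or 0 if absent.
    """
    total = 0.0
    for results, rel in zip(ranked_results, relevant):
        for rank, doc in enumerate(results, start=1):
            if doc == rel:
                total += 1.0 / rank
                break
    return total / len(ranked_results)

# Two queries: relevant doc ranked 1st and 2nd -> MRR = (1 + 0.5) / 2 = 0.75
print(mean_reciprocal_rank([["a", "b"], ["c", "a"]], ["a", "a"]))
```

The WER/MRR disconnect follows from this definition: a transcript can be mostly correct by WER yet misrank the one result that matters, and vice versa.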


Architecture: dual-encoder with joint training

At the core of S2R is a dual-encoder architecture. An audio encoder converts the spoken query into a rich audio embedding that captures its semantic meaning, while a document encoder generates a corresponding vector representation for documents. The system is trained on paired (audio query, relevant document) data so that the vector for an audio query is geometrically close to the vectors of its corresponding documents in the representation space. This training objective directly aligns speech with retrieval targets and removes the brittle dependency on exact word sequences.
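The post does not publish the loss, but dual-encoder retrieval systems of this kind are commonly trained with an in-batch softmax contrastive objective. A NumPy sketch of that standard objective, under the assumption that each batch row pairs an audio query with its relevant document:

```python
import numpy as np

def contrastive_loss(audio_emb, doc_emb, temperature=0.05):
    """In-batch softmax contrastive loss for a dual encoder.

    audio_emb, doc_emb: (batch, dim) arrays where row i of each is a
    paired (audio query, relevant document). The loss pulls matched
    pairs together and pushes each query away from every other
    document in the batch.
    """
    # L2-normalize so similarity is cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    logits = a @ d.T / temperature               # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The matched document for query i sits on the diagonal.
    return -np.mean(np.diag(log_probs))
```

When matched pairs are already aligned (e.g. identical unit vectors), the loss approaches zero; mismatched batches produce a large loss, which is the gradient signal that shapes the shared representation space.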

Serving path: streaming audio, similarity search, and ranking

At inference time, the audio is streamed to the pre-trained audio encoder to produce a query vector. This vector is used to efficiently identify a highly relevant set of candidate results from Google's index; the search ranking system, which integrates hundreds of signals, then computes the final order. The implementation preserves the mature ranking stack while replacing the query representation with a speech-semantic embedding.
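The candidate-generation step reduces to nearest-neighbor search in the shared embedding space. A minimal exact-search sketch (a production system would use approximate nearest-neighbor search over a web-scale index, with the ranking stack ordering the candidates afterward):

```python
import numpy as np

def retrieve_candidates(query_vec, doc_matrix, doc_ids, k=3):
    """Return the top-k document ids by cosine similarity to the query.

    query_vec: (dim,) embedding produced by the audio encoder.
    doc_matrix: (num_docs, dim) precomputed document embeddings.
    doc_ids: identifier for each row of doc_matrix.
    """
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity per document
    top = np.argsort(-scores)[:k]       # indices of the k best matches
    return [doc_ids[i] for i in top]

# Hypothetical 2-d embeddings for three documents.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
print(retrieve_candidates(np.array([1.0, 0.0]), docs, ["a", "b", "c"], k=2))
```

Keeping ranking separate from candidate generation is what lets S2R slot into the existing stack: only the query-vector source changes, not the hundreds of downstream ranking signals.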

Evaluating S2R on SVQ

On the Simple Voice Questions (SVQ) evaluation, the post presents a comparison of three systems: Cascade ASR (blue), Cascade groundtruth (green), and S2R (orange). The S2R bar significantly outperforms the Cascade ASR baseline and approaches the upper bound set by Cascade groundtruth on MRR, with a remaining gap that the authors note as headroom for future research.

Open sources: SVQ and the Massive Sound Embedding Benchmark (MSEB)

To support community progress, Google open-sourced Simple Voice Questions (SVQ) on Hugging Face: short audio questions recorded in 26 locales across 17 languages and under several audio conditions (clean, background speech noise, traffic noise, media noise). The dataset is released as an undivided evaluation set and is licensed CC-BY-4.0. SVQ is part of the Massive Sound Embedding Benchmark (MSEB), an open framework for assessing sound embedding methods across tasks.

Key Takeaways

  • Google has moved Voice Search to Speech-to-Retrieval (S2R), mapping spoken queries to embeddings and skipping transcription.
  • Dual-encoder design (audio encoder + document encoder) aligns audio-query vectors with document embeddings for direct semantic retrieval.
  • In evaluations, S2R outperforms the production ASR→retrieval cascade and approaches the ground-truth-transcript upper bound on MRR.
  • S2R is live in production and serving multiple languages, integrated with Google's existing ranking stack.
  • Google released Simple Voice Questions (SVQ) (17 languages, 26 locales) under MSEB to standardize speech-retrieval benchmarking.

Editorial Comments

Speech-to-Retrieval (S2R) is a significant architectural correction rather than a cosmetic upgrade: by replacing the ASR→text hinge with a speech-native embedding interface, Google aligns the optimization objective with retrieval quality and removes a primary source of cascade error. The production rollout and multilingual coverage matter, but the interesting work now is operational: calibrating audio-derived relevance scores, stress-testing code-switching and noisy conditions, and quantifying privacy trade-offs as voice embeddings become query keys.



The post Google Introduces Speech-to-Retrieval (S2R) Approach that Maps a Spoken Query Directly to an Embedding and Retrieves Information without First Converting Speech to Text appeared first on MarkTechPost.
