Meet VoXtream: An Open-Sourced Full-Stream Zero-Shot TTS Model for Real-Time Use that Begins Speaking from the First Word
Real-time agents, live dubbing, and simultaneous translation die by a thousand milliseconds. Most “streaming” TTS (text-to-speech) stacks still wait for a chunk of text before they emit sound, so the listener hears a beat of silence before the voice starts. VoXtream, released by KTH’s Speech, Music and Hearing group, attacks this head-on:…