If you haven’t checked out Microsoft’s VibeVoice, you should. It’s an open-source, frontier voice model that supports both speech to text and text to speech. Because it’s open source, you can run the model locally (or anywhere), which is super compelling: it removes the need to call a hosted model that could be slower and is often costlier.

VibeVoice-Realtime offers 300 ms latency and is small enough to run on some edge devices, making it ideal for live assistants, demos, etc.

VibeVoice can process 60 minutes of audio for speech recognition and up to 90 minutes for text to speech, each in a single pass, which is pretty neat.
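
To make those single-pass limits concrete, here is a minimal sketch that checks whether a recording fits in one pass and, if not, how many passes it would take. It only does the arithmetic on the durations quoted above; the constant and function names are made up for illustration and are not part of any VibeVoice API.

```python
import math

# Limits quoted above, expressed in seconds. These names are illustrative
# assumptions, not identifiers from the VibeVoice codebase.
ASR_MAX_SECONDS = 60 * 60   # speech recognition: 60 minutes per pass
TTS_MAX_SECONDS = 90 * 60   # text to speech: up to 90 minutes per pass


def passes_needed(duration_seconds: float, limit_seconds: int) -> int:
    """Return how many single-pass runs cover audio of the given length."""
    if duration_seconds <= 0:
        return 0
    return math.ceil(duration_seconds / limit_seconds)


def fits_in_one_pass(duration_seconds: float, limit_seconds: int) -> bool:
    """True when the whole recording can be handled without splitting."""
    return passes_needed(duration_seconds, limit_seconds) <= 1
```

For example, a 45-minute meeting fits in a single recognition pass, while a 2-hour podcast would need to be split into two.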

It also does a good job when there are multiple speakers (think interviews, podcasts, meetings) and provides a rich transcript.