Despite lots of Internet talk about text to speech, there's still no really amazing TTS that you can pay money for and use. They all sound like text to speech.
They play 4 clips, 2 of them human, 2 of them AI generated. Can you tell which ones are which?
And the kicker: this works in real time (AFAIR), runs entirely on the CPU without touching the GPU, and generates pitch-correct speech (for Japanese). It's not even funny how far ahead they are. And you can buy it right now.
AFAIK they use some sort of hybrid method: a bunch of custom modeling/DSP code (they've been doing speech synthesis for over a decade) wrapped around a neural network. One mistake essentially all of the Western TTS models seem to make is relying on a neural network alone, without augmenting it with non-neural-network code, which (from what I can see) is the secret sauce for making a TTS that is both fast and good-sounding.
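To make the hybrid idea concrete, here's a toy sketch (my own illustration, not their actual architecture): a small neural model would predict cheap per-frame parameters like pitch and gain, and classic CPU-friendly DSP code would turn those into a waveform. The function name and parameters here are all hypothetical.

```python
import numpy as np

def dsp_vocoder(f0_hz, gains, sr=16000, frame_len=160):
    """Toy source-filter synthesis: a pitch-driven oscillator shaped by
    per-frame gain. In a hybrid TTS, f0_hz and gains would come from a
    small neural acoustic model; waveform generation stays in cheap,
    CPU-only DSP code like this loop."""
    out = np.zeros(len(f0_hz) * frame_len)
    phase = 0.0
    for i, (f0, g) in enumerate(zip(f0_hz, gains)):
        t = np.arange(frame_len)
        if f0 > 0:   # voiced frame: sine oscillator at the target pitch
            frame = np.sin(phase + 2 * np.pi * f0 * t / sr)
            phase += 2 * np.pi * f0 * frame_len / sr  # keep phase continuous
        else:        # unvoiced frame: white-noise excitation
            frame = np.random.randn(frame_len) * 0.1
        out[i * frame_len:(i + 1) * frame_len] = g * frame
    return out

# Hypothetical neural-model output: a flat 120 Hz pitch contour, fading out.
f0 = np.full(50, 120.0)
gains = np.linspace(1.0, 0.0, 50)
wave = dsp_vocoder(f0, gains)
```

The point of the split is that the neural part only has to run once per 10 ms frame, while the sample-rate inner loop is plain DSP, which is why this style of pipeline can be real-time on a CPU.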
Yes, but can it be used to express emotions? Can it derive emotions from the text alone, without painstaking guidance? That seems to be the main weakness of existing TTS engines; a neutral tone can be generated relatively well.
Humans can't derive emotions from text alone without external contextual clues.
Such has been the source of much miscommunication online.
Amateur fiction writing, which tends to overemphasize how things are said ("I guess I can go rescue your cat", the exasperated detective said wearily) might be easier for AI!
To a limited extent, sure, but kids books are also written to be very emotive.
The linked page actually has examples of the same text being read with different emotions, demonstrating that for even a single sentence a lot of variance is possible.
Natural-sounding TTS models that require additional work (i.e., not entirely automatic) have existed for quite a while. Obsidian used Sonantic for The Outer Worlds (an AA game) in 2019, and the dialogue sounded as if it were voiced by real actors.
I think many heavy TTS users (myself included) gradually train themselves to listen at higher speeds, at which point nothing sounds particularly natural anyway. What I want is trained speech models that remain coherent at high speeds (over 3x). Even better if there are bi/multilingual models that can seamlessly switch between languages.