
Despite lots of Internet talk about text to speech, there's still no really amazing TTS that you can pay money for and use. They all sound like text to speech.


There is, just not for English.

Here, take a look at this snippet:

https://youtu.be/eEXvMOJ9ps0?t=66

They play 4 clips, 2 of them human, 2 of them AI generated. Can you tell which ones are which?

And the kicker is, this works in real-time (AFAIR), and it doesn't even use the GPU (it's CPU-only), and generates pitch-correct speech (for Japanese). It's not even funny how far ahead they are. And you can buy it right now.

AFAIK they use some sort of hybrid method: a neural network plus a bunch of custom DSP modeling code around it (they've been doing speech synthesis for over a decade). One mistake that essentially all of the Western TTS models seem to make is that they use only a neural network, without augmenting it with non-neural-network code, which (from what I can see) is the secret sauce that makes a fast and good-sounding TTS work.


I can't, but I don't speak Japanese. Can a fluent Japanese speaker really not tell?


Yes, but can it be used to express emotions? Can it derive emotions from the text alone, without painstaking manual guidance? That seems to be the main weakness of existing TTS engines; a neutral tone can be generated relatively well.


Honestly I suspect some kind of "emotional markdown" would be more useful if it has a light and intuitive syntax
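To make the idea concrete, here is a minimal sketch of what such a syntax could look like, assuming a made-up inline form `{tone: text}` that gets translated to standard SSML `<prosody>` tags. The brace syntax, the tone names, and the `to_ssml` helper are all hypothetical; only `<speak>` and `<prosody rate=... pitch=...>` come from SSML itself, and how well a given engine honors them varies.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical tone names mapped to standard SSML <prosody> attribute values.
TONES = {
    "excited": {"rate": "fast", "pitch": "high"},
    "weary": {"rate": "slow", "pitch": "low"},
}

# Hypothetical inline syntax: {tone: text}
TAG = re.compile(r"\{(\w+):\s*([^{}]*)\}")


def to_ssml(text: str) -> str:
    """Translate the made-up {tone: text} markup into an SSML document."""
    out = []
    pos = 0
    for match in TAG.finditer(text):
        # Plain text between tags is XML-escaped and passed through.
        out.append(escape(text[pos:match.start()]))
        tone, body = match.group(1), match.group(2)
        attrs = TONES.get(tone)
        if attrs is None:
            # Unknown tone: keep the words, drop the markup.
            out.append(escape(body))
        else:
            attr_str = " ".join(f'{k}="{v}"' for k, v in attrs.items())
            out.append(f"<prosody {attr_str}>{escape(body)}</prosody>")
        pos = match.end()
    out.append(escape(text[pos:]))
    return "<speak>" + "".join(out) + "</speak>"


if __name__ == "__main__":
    print(to_ssml("{weary: Fine, I'll do it}, she said."))
```

The point of the mapping table is that the author writes an intent ("weary") and the engine-specific prosody settings stay in one place.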


Humans can't derive emotions from text alone without external contextual clues.

Such has been the source of much miscommunication online.

Amateur fiction writing, which tends to overemphasize how things are said ("I guess I can go rescue your cat", the exasperated detective said wearily) might be easier for AI!


Sure they can; that's the whole point of acting. Also, anyone who has ever read a story to a kid can infer emotions from the text itself.


To a limited extent, sure, but kids' books are also written to be very emotive.

The linked page actually has examples of the same text being read with different emotions, demonstrating that for even a single sentence a lot of variance is possible.


This looks pretty good: https://play.ht/

It was used to make the fake Joe Rogan/Steve Jobs podcast: https://podcast.ai/


Natural-sounding TTS models that require additional manual work (not entirely automatic) have existed for quite a while. Obsidian used Sonantic for The Outer Worlds (an AA game) in 2019, and the dialogue sounded like it was voiced by real actors.


I think many heavy TTS users (including myself) slowly train themselves to use higher speeds, after which point nothing sounds particularly natural. What I want is trained speech models that remain coherent at high speeds (over 3x). Even better if there were bilingual or multilingual models that can seamlessly switch between languages.


At the moment the best is probably Google WaveNet/Neural2; you can try it here: https://cloud.google.com/text-to-speech

You can use the API to read books/articles aloud in real-time, but it is quite expensive after the free trial.
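For anyone who wants to try the book/article use case, here is a rough sketch of preparing request bodies for the REST `text:synthesize` endpoint. It only builds the JSON payloads and sends nothing. The voice name `en-US-Wavenet-D`, the 4500-byte chunk size, and the helper names are assumptions for illustration; check the current docs for the real per-request input limit and the available voices.

```python
import json

# REST endpoint of Google Cloud Text-to-Speech (v1); auth is not shown here.
ENDPOINT = "https://texttospeech.googleapis.com/v1/text:synthesize"


def chunk_text(text, max_bytes=4500):
    """Split text on whitespace into chunks of roughly max_bytes UTF-8 bytes.

    max_bytes is an assumed safety margin, not an official limit.
    A single word longer than max_bytes is kept whole in its own chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if current and len(candidate.encode("utf-8")) > max_bytes:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def build_request(text, voice_name="en-US-Wavenet-D", speaking_rate=1.0):
    """Build one request body; the default voice name is an assumption."""
    # Voice names start with the language code, e.g. "en-US-...".
    language_code = "-".join(voice_name.split("-")[:2])
    return {
        "input": {"text": text},
        "voice": {"languageCode": language_code, "name": voice_name},
        "audioConfig": {"audioEncoding": "MP3", "speakingRate": speaking_rate},
    }


if __name__ == "__main__":
    article = "First sentence of the article. Second sentence of the article."
    for chunk in chunk_text(article, max_bytes=40):
        print(json.dumps(build_request(chunk, speaking_rate=1.5)))
```

Each payload would be POSTed to `ENDPOINT` with your credentials; the response carries base64-encoded audio in `audioContent`, so the chunks can be decoded and played back in order.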


You should try murf dot ai. It is pretty realistic. It completely blows Amazon Polly and Google's TTS out of the water.


I tried it and it sounded like TTS.

If I tried the wrong thing, can you provide a link? I’d like to be amazed.


Check out Descript's Overdub - it is pretty amazing: https://www.descript.com/overdub


It's free, but I find the "Read aloud" feature in Microsoft Edge to be extremely natural sounding. Try using it to read this comment!


Check out https://resemble.ai

I used to work there; great team behind the product!


Same feeling here. I would love to listen to some of my bookmarked articles in a better, well punctuated/stressed voice.


Have you looked at the demos for Tortoise TTS? It's even free. It's not real-time, however.


NaturalReaders - the premium voices



