
>The most impressive part is that the voice uses the right feelings and tonal language during the presentation.

Consequences of audio-to-audio (rather than audio > text, then text > audio). Being able to manipulate speech nearly as well as it manipulates text is something else. This will be a revelation for language learning, amongst other things. And you can interrupt it freely now!
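To make the distinction concrete, here is a minimal Python sketch of the two designs. Every function in it is a placeholder of my own, not any real OpenAI or ElevenLabs API:

    # All of these are stand-in stubs, not real APIs.
    def speech_to_text(audio: bytes) -> str:
        return "transcribed words"       # tone, pauses and emphasis are discarded here

    def llm_reply(text: str) -> str:
        return "reply text"              # the language model only ever sees flat text

    def text_to_speech(text: str) -> bytes:
        return b"synthesized audio"      # output inflection has to be invented from scratch

    def cascaded_reply(audio_in: bytes) -> bytes:
        # audio > text > text > audio: prosody is lost at the first hop
        return text_to_speech(llm_reply(speech_to_text(audio_in)))

    def speech_model(audio: bytes) -> bytes:
        return b"reply audio"            # stand-in for a single end-to-end audio model

    def audio_to_audio_reply(audio_in: bytes) -> bytes:
        # one model maps input audio directly to output audio,
        # so it can both hear and reproduce inflection
        return speech_model(audio_in)

The point is just that the cascaded version can never recover what speech_to_text throws away, which is why voice-to-voice feels so much more expressive.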



Anyone who has used elevenlabs for voice generation has found this to be the case. Voice to voice seems like magic.


ElevenLabs isn’t remotely close to how good this voice sounds. I’ve tried to use it extensively before, and it just isn’t natural. This voice from OpenAI, and even the one ChatGPT has been using, is natural.


When did you last use it? I used it a few weeks ago to create a fake podcast as a side project, and it sounded pretty good with their highest-end model and the tunings cranked up.


About 3 months ago for that exact use case.


My point isn’t necessarily about ElevenLabs being good or bad; it’s the difference between its text-to-voice and voice-to-voice generations. The latter is incredibly expressive, and it just shows how much is lacking in our ability to encode inflection in text.


However, this looks like it only works with speech - i.e. you can't ask it, "What's the tune I'm humming?" or "Why is my car making this noise?"

I could be wrong but I haven't seen any non-speech demos.


Fwiw, the live demo[0] included different kinds of breathing, and getting feedback on it.

[0]: https://youtu.be/DQacCB9tDaw?t=557


What about the breath analysis?


I did see that, though my interpretation is that breathing is included in its voice tokenizer, which helps it understand emotions in speech (the AI can generate breath sounds, after all). Other sounds, like bird songs or engine noises, may not work - but I could be wrong.


I suspect that, like images and video, their audio system is (or will become) more general-purpose. For example, it can generate the sound of coins falling onto a table.


Allegedly Google Assistant can do the "humming" one, but I have never gotten it to work. I wish it would, because sometimes I have a song stuck in my head that I know is sampled from another song.


I asked it to make a bird noise; instead, it told me in words what a bird sounds like. True audio-to-audio should be able to produce any noise: a trombone, traffic, a crashing sea, anything. Maybe there is a better prompt for it, but it did not seem like it.


The new voice mode has not rolled out yet. It's rolling out to Plus users in the next couple of weeks.

Also, it's possible this is trained mostly on speech.



