Hacker News | java_beyb's comments

i haven't used them recently, but in the past I found smaller companies are much better at customization.

try assemblyai, deepgram, picovoice or speechmatics. picovoice is on-device, you gotta fine-tune the model, but it's pretty easy as it gives you pronunciation recommendations, and you can run them serverless. https://picovoice.ai/docs/leopard/#add-custom-vocabulary the others do it through an API call and you gotta find your own pronunciation: https://docs.speechmatics.com/features/custom-dictionary
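to make the API-side approach concrete, here's a rough sketch of a Speechmatics-style job config with a custom dictionary (the "additional_vocab" field is from their custom-dictionary docs linked above; the example words and "sounds_like" hints are made up):

```python
# Sketch of a Speechmatics-style transcription config with a custom
# dictionary. Each "additional_vocab" entry is a term to bias the model
# toward, optionally with "sounds_like" pronunciation hints.
config = {
    "type": "transcription",
    "transcription_config": {
        "language": "en",
        "additional_vocab": [
            # plain entry: just bias the model toward this spelling
            {"content": "AssemblyAI"},
            # entry with pronunciation hints for an unusual word
            {"content": "gnocchi", "sounds_like": ["nyohki", "nokey"]},
        ],
    },
}
```

you'd POST this (plus the audio) to the jobs endpoint; check the linked docs for the exact request shape, since it changes between API versions.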

if you wanna go with whisper you can use Picovoice Falcon or pyannote for speaker diarization: https://picovoice.ai/blog/falcon-whisper-integration/ https://github.com/yinruiqing/pyannote-whisper
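both tools timestamp their output, so combining them mostly comes down to overlap matching: label each transcript segment with the speaker whose diarization turn overlaps it the most. a minimal sketch of that merge step (the function name and tuple shapes are mine, not the pyannote-whisper API):

```python
def assign_speakers(transcript_segments, speaker_turns):
    """Label each transcript segment with the speaker whose diarization
    turn overlaps it the most.

    transcript_segments: list of (start_s, end_s, text)  -- e.g. from whisper
    speaker_turns:       list of (start_s, end_s, speaker_id)  -- diarization
    Returns a list of (text, speaker_id) pairs.
    """
    labeled = []
    for seg_start, seg_end, text in transcript_segments:
        best_speaker, best_overlap = None, 0.0
        for turn_start, turn_end, speaker in speaker_turns:
            # overlap of the two intervals (negative means disjoint)
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((text, best_speaker))
    return labeled
```

usage with toy timestamps:

```python
segments = [(0.0, 2.5, "hi there"), (2.5, 5.0, "hello")]
turns = [(0.0, 2.4, "SPEAKER_00"), (2.4, 5.0, "SPEAKER_01")]
assign_speakers(segments, turns)
# [("hi there", "SPEAKER_00"), ("hello", "SPEAKER_01")]
```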


they have been around since the pre-ChatGPT era, and they're not relevant to ChatGPT. ChatGPT understands text, Coqui reads text aloud. they're a text-to-speech / voice cloning / voice generation company.

their founders were working at Mozilla when Mozilla sunset its speech products. They founded Coqui and released text-to-speech models based on their previous work. They tried to mix open-source and closed-source offerings and introduced a weird licensing concept that I never understood. there are many mediocre-to-good open-source TTS alternatives, so they might have struggled to differentiate.

it's sad that people complain about big tech dominance, yet don't pay startups. i don't know the founders, but i suspect they will join a big tech company, as most of the Mozilla team did, and we'll keep feeding Amazon, Google and Microsoft.


what's advanced speech recognition?


well, deepgram might be the fastest among cloud-dependent APIs, like Speechmatics and AssemblyAI mentioned above. -but- it cannot be faster than local or smaller models, as you mentioned.

Among local solutions, the Whisper SDK doesn't support streaming; I haven't seen any good workarounds, nor have I successfully implemented one. VOSK, DeepSpeech, Kaldi, et al. were good once upon a time... Picovoice seems to be doing well.

I was planning to work on this: https://picovoice.ai/blog/chatgpt-ai-virtual-assistant-in-py... using Eleven Labs and Cheetah. Hope I can carve out some time


unless i'm misunderstanding, `whisper.cpp` seems to support streaming; the repository includes a native example[0] and a WASM example[1] with a demo site[2].

[0]: https://github.com/ggerganov/whisper.cpp/tree/master/example...

[1]: https://github.com/ggerganov/whisper.cpp/blob/master/example...

[2]: https://whisper.ggerganov.com/stream/


have you tried it? i mean for fun, it wouldn't hurt for sure and ggerganov is doing amazing stuff. kudos to him.

but whisper is designed to process audio in 30-second windows, if I'm not mistaken. it's been a while since whisper was released, lol. These workarounds make the window smaller, but it doesn't change the fact that they're workarounds. you can adjust, modify, or manipulate the model; you can't rewrite or retrain it from scratch. check out the issues about real-time transcription in the repo.
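for context, the usual workaround is a sliding window: re-run the 30-second model every few seconds on the latest audio and keep only the new tail of the transcript. a toy sketch of just the window bookkeeping (parameter values are illustrative, not what whisper.cpp uses):

```python
def sliding_windows(total_s, window_s=30.0, step_s=5.0):
    """Yield (start_s, end_s) windows over an audio stream.

    Every step_s seconds, re-transcribe up to the last window_s seconds
    of audio. A real implementation would then diff the new transcript
    against the previous one and emit only the fresh tail.
    """
    end = step_s
    while end < total_s + step_s:
        yield (max(0.0, end - window_s), min(end, total_s))
        end += step_s
```

so for a 12-second clip with a 5-second step you'd transcribe (0, 5), then (0, 10), then (0, 12) — each pass redoes earlier audio, which is why this is a workaround rather than true streaming.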

can you use it? yes. would it perform better than Deepgram? -although it's an API and probably not the best API- I am not sure. would i use it in my money-generating application? absolutely not.


how is it 4x cheaper than amazon or google?

your basic plan costs $16 per 1M characters, and so do Google and Amazon.
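the per-character math is easy to sanity-check (using the $16 / 1M characters rate quoted above; real pricing pages have tiers and free quotas, so treat this as back-of-the-envelope):

```python
def tts_cost_usd(chars, rate_per_million=16.0):
    """Cost of synthesizing `chars` characters at a flat per-million rate."""
    return chars / 1_000_000 * rate_per_million

tts_cost_usd(250_000)    # quarter-million characters -> $4.00
tts_cost_usd(1_000_000)  # one million characters -> $16.00
```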


what you're looking for is called diarization. almost all enterprise STTs do it, and you can find individual libraries on GitHub too.

fine-tuning whisper is a nightmare. I don't know what the interviews are for, but again, most enterprise STTs offer customization, so you can add medical terminology.

Google, Amazon, and Nuance have medical models, but they're either expensive or not available for personal projects.


Thanks for that! Searching for diarization really helped me narrow down what I was looking for.


edge brings compute close to where data is generated; cloud brings data to compute.

even processing something in a web browser is called edge. i guess because the term is that broad, the industry is moving towards "on-device"


first, good initiative! thanks for sharing. i think you gotta be more diligent and careful with the problem statement.

checking the weather in Sofia, Bulgaria requires the cloud because it needs current information. it's not "random speech". ESP SR capability limits don't mean that you cannot process speech locally.

the comment was about "voice processing", i.e. sending speech to the cloud, not sending an API call to get the weather information.

besides, beyond local intent detection with its ~400 commands, there are great local STT options that work better than most cloud STTs on "random speech"

https://github.com/alphacep/vosk-api https://picovoice.ai/platform/cheetah/


Thanks!

There are at least two things here:

1) The ability to do speech to text on random speech. I'm going to stick by that description :). If you've ever watched a little kid play with Alexa it's definitely what you would call "random speech" haha!

2) The ability to satisfy the request (intent) of the text output. Up to and including current information via API, etc.

Our soon-to-be-released, highly optimized open-source inference server uses Whisper and is ridiculously fast and accurate. Based on our testing with nieces and nephews we have "random speech" covered :). Our inference server also supports LLaMA, Vicuna, etc and can chain together STT -> LLM/API/etc -> TTS - with the output simply played over the Willow speaker and/or displayed on the LCD.

Our goal is to make a Willow Home Assistant component that assists with #2. There are plenty of HA integrations and components to do things like get weather in real time, in addition to satisfying user intent recognition. They have an entire platform for it[0]. Additionally, we will make our inference server implementation (that does truly unique things for Willow) available as just another TTS/STT integration option on top of the implementations they already support so you can use whatever you want, or send the audio output after wake to whatever you want like Vosk, Cheetah, etc, etc.

[0] - https://developers.home-assistant.io/docs/intent_index/
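the STT -> LLM -> TTS chain described above can be sketched with stub functions standing in for the real engines (this is not the Willow implementation — the stubs just show the shape of the pipeline; each function name here is a placeholder for Whisper, an LLM/intent handler, and a TTS voice respectively):

```python
def stt(audio: bytes) -> str:
    """Stub speech-to-text: a real system would run Whisper here."""
    return "what's the weather in sofia"

def llm(prompt: str) -> str:
    """Stub LLM / intent handler: a real system would call LLaMA,
    an HA intent pipeline, or a weather API here."""
    return f"Answering: {prompt}"

def tts(text: str) -> bytes:
    """Stub text-to-speech: a real system would synthesize audio here."""
    return text.encode("utf-8")

def pipeline(audio: bytes) -> bytes:
    """Chain STT -> LLM -> TTS; output goes to the speaker/LCD."""
    return tts(llm(stt(audio)))
```

the nice property of this shape is that each stage is swappable — exactly the "use whatever STT/TTS you want" design the comment describes.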


if your decision is cost-oriented, then the Whisper API is the cheapest - at least based on what other API companies promote on their websites.

however, depending on what you're building, you may consider local speech-to-text, i.e. running speech-to-text on users' devices, so you basically do not pay for the cloud.

you should also figure out whether you'll need model adaptation - like adding custom industry jargon. whisper might be challenging there.
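to put numbers on the cloud-vs-local trade-off: OpenAI's published Whisper API price is $0.006 per minute of audio at the time of writing, and the usage figures below are purely illustrative.

```python
def monthly_cloud_cost(minutes_per_user, users, rate_per_min=0.006):
    """Rough monthly cloud STT bill at a flat per-minute rate."""
    return minutes_per_user * users * rate_per_min

monthly_cloud_cost(60, 10_000)  # 10k users x 1 hour each -> ~$3,600/month
# on-device STT: ~$0 marginal cloud cost, paid for in device CPU/battery instead
```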

