They have been around since the pre-ChatGPT era, and they're not really related to ChatGPT: ChatGPT understands text, Coqui reads text aloud. They're a text-to-speech / voice-cloning / voice-generation company.
Their founders were working at Mozilla when Mozilla sunset its speech products. They founded Coqui and released text-to-speech models based on their previous work. They tried to mix open-source and closed-source offerings and introduced a weird licensing concept that I never understood. There are many mediocre-to-good open-source TTS alternatives, so they might have struggled to differentiate.
It's sad that people complain about big-tech dominance yet don't pay startups. I don't know the founders, but I suspect they'll join a big tech company, as most of the Mozilla team did, and we'll keep feeding Amazon, Google, and Microsoft.
Well, Deepgram might be the fastest among cloud-dependent APIs, like Speechmatics and AssemblyAI mentioned above. But it cannot be faster than local or smaller models, as you mentioned.
Among local solutions,
The Whisper SDK doesn't support streaming; I haven't seen any good workarounds, nor have I implemented one successfully myself.
VOSK, DeepSpeech, Kaldi, et al. were good once upon a time...
Picovoice seems to be doing well.
Unless I'm misunderstanding, `whisper.cpp` seems to support streaming, and the repository includes a native example[0] and a WASM example[1] with a demo site[2].
Have you tried it?
I mean, for fun it wouldn't hurt, for sure, and ggerganov is doing amazing stuff. Kudos to him.
But Whisper is designed to process audio in 30-second windows, if I'm not mistaken; it's been a while since Whisper was released, lol. These workarounds make the window smaller, but that doesn't change the fact that they're workarounds: you can adjust, modify, or manipulate the model, but you can't rewrite or retrain it from scratch. Check out the issues about real-time transcription in the repo.
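To make the "workaround" point concrete, here's a minimal sketch of how such pseudo-streaming wrappers typically operate: slice incoming audio into overlapping fixed-size windows, re-run the batch model on each slice, and stitch the partial transcripts together. The function name and window/step values are illustrative, not any particular project's API.

```python
# Hypothetical sketch of the "workaround" approach: pseudo-streaming by
# slicing incoming audio into overlapping fixed-size windows and feeding
# each slice to a batch model like Whisper.

def pseudo_stream_windows(total_seconds, window=30.0, step=5.0):
    """Yield (start, end) offsets in seconds for transcription windows.

    Whisper-style models consume fixed ~30 s chunks, so a "streaming"
    wrapper re-runs the model on a sliding window and stitches the
    partial transcripts together afterwards.
    """
    start = 0.0
    while start < total_seconds:
        yield (start, min(start + window, total_seconds))
        start += step

# 45 s of audio, 30 s window, 15 s step -> three overlapping windows
for begin, end in pseudo_stream_windows(45.0, window=30.0, step=15.0):
    print(f"transcribe audio[{begin:.0f}s:{end:.0f}s]")
```

Shrinking `window` and `step` lowers latency but means more redundant model calls and more transcript-stitching errors, which is exactly why it stays a workaround rather than true streaming.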
Can you use it? Yes.
Would it perform better than Deepgram (which is an API, and probably not the best one)? I'm not sure.
Would I use it in my money-generating application? Absolutely not.
What you're looking for is called speaker diarization. Almost all enterprise STTs do it, and you can find standalone libraries on GitHub too.
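As a rough illustration of how a transcript gets combined with diarization output, here's a toy post-processing step; the tuple formats and the midpoint-assignment rule are my own simplifications, not any specific library's API.

```python
# Toy post-processing step: assign transcribed words to speakers by
# overlapping word timestamps with diarization segments. The data
# structures are illustrative; real tools (pyannote, cloud APIs, etc.)
# each have their own formats.

def assign_speakers(words, speaker_segments):
    """words: list of (word, start_s, end_s) from the STT engine.
    speaker_segments: list of (label, start_s, end_s) from diarization.
    Assigns each word to the segment containing its midpoint."""
    labeled = []
    for word, w_start, w_end in words:
        mid = (w_start + w_end) / 2
        speaker = next(
            (label for label, s, e in speaker_segments if s <= mid < e),
            "unknown",
        )
        labeled.append((speaker, word))
    return labeled

words = [("hello", 0.0, 0.4), ("there", 0.5, 0.9), ("hi", 1.2, 1.5)]
segments = [("SPEAKER_00", 0.0, 1.0), ("SPEAKER_01", 1.0, 2.0)]
print(assign_speakers(words, segments))
```

The hard part in practice is the diarization model itself (who spoke when); the merge step above is the easy glue on top.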
Fine-tuning Whisper is a nightmare. I don't know what the interviews are for, but again, most enterprise STTs offer customization, so you can add medical terminology.
Google, Amazon, and Nuance have medical models, but they're either expensive or not available for personal projects.
First off, good initiative, and thanks for sharing! I think you gotta be more diligent and careful with the problem statement, though.
Checking the weather in Sofia, Bulgaria requires the cloud because it needs current information; it's not "random speech". ESP-SR's capability issues don't mean that you cannot process the speech itself locally.
The comment was about "voice processing", i.e. sending speech to the cloud, not sending an API request to get the weather information.
Besides, for local intent detection beyond 400 commands, there are great local STT options that work better than most cloud STTs on "random speech".
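For what "local intent detection" can look like in its simplest form, here's a sketch that matches a transcript against a small command grammar entirely on-device. The intent names and phrases are made up; real engines (ESP-SR, Picovoice Rhino, etc.) use far more robust grammars and models.

```python
# Minimal sketch of on-device intent detection: match an STT transcript
# against a fixed command grammar, no cloud round-trip required.
# Intents and trigger phrases below are invented for illustration.

INTENTS = {
    "get_weather": ["weather", "forecast", "temperature"],
    "lights_on": ["turn on the lights", "lights on"],
}

def detect_intent(transcript):
    """Return the first matching intent label, or None to fall through
    to cloud / LLM handling for genuinely open-ended speech."""
    text = transcript.lower()
    for intent, phrases in INTENTS.items():
        if any(phrase in text for phrase in phrases):
            return intent
    return None

print(detect_intent("what's the weather in Sofia"))  # get_weather
```

The point is that the command-style 80% of requests never needs to leave the device; only the `None` fall-through cases would touch the cloud.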
1) The ability to do speech-to-text on random speech. I'm going to stick by that description :). If you've ever watched a little kid play with Alexa, it's definitely what you would call "random speech" haha!
2) The ability to satisfy the request (intent) of the text output. Up to and including current information via API, etc.
Our soon-to-be-released, highly optimized open-source inference server uses Whisper and is ridiculously fast and accurate. Based on our testing with nieces and nephews, we have "random speech" covered :). Our inference server also supports LLaMA, Vicuna, etc., and can chain together STT -> LLM/API/etc -> TTS, with the output simply played over the Willow speaker and/or displayed on the LCD.
Our goal is to make a Willow Home Assistant component that assists with #2. There are plenty of HA integrations and components to do things like get weather in real time, in addition to handling user intent recognition; they have an entire platform for it[0]. Additionally, we will make our inference server implementation (which does truly unique things for Willow) available as just another TTS/STT integration option on top of the implementations they already support, so you can use whatever you want, or send the audio after wake to whatever you want, like Vosk, Cheetah, etc.
If your decision is cost-driven, then the Whisper API is the cheapest, at least based on what other API companies promote on their websites.
However, depending on what you're building, you may consider local speech-to-text, i.e. running the model on users' devices, so you basically don't pay for the cloud.
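The cost argument is just arithmetic, sketched below; the per-minute rate is a placeholder I made up, not any vendor's actual pricing.

```python
# Back-of-the-envelope comparison of cloud vs. on-device STT costs.
# The per-minute rate below is a placeholder, NOT any vendor's pricing.

def monthly_cloud_cost(minutes_per_user, users, price_per_minute):
    """Marginal cloud bill; the on-device equivalent is $0 per request."""
    return minutes_per_user * users * price_per_minute

# e.g. 100 users, 300 audio minutes each, at a hypothetical $0.006/min
print(f"${monthly_cloud_cost(300, 100, 0.006):.2f}/month")
```

On-device inference trades that recurring bill for a fixed engineering cost (model size, battery, per-platform builds), which is the real decision.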
You should understand whether you'll need model adaptation, like adding custom industry jargon. Whisper might be challenging there.
Try AssemblyAI, Deepgram, Picovoice, or Speechmatics. Picovoice is on-device; you gotta fine-tune the model, but it's pretty easy since it gives you pronunciation recommendations, and you can run it serverless: https://picovoice.ai/docs/leopard/#add-custom-vocabulary The others do it through an API call, and you gotta supply your own pronunciations: https://docs.speechmatics.com/features/custom-dictionary
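To give a sense of the API-call route, here's a sketch of a custom-dictionary payload in the style the Speechmatics docs linked above describe; I'm writing the field names from memory, so verify them against the documentation before relying on this.

```python
import json

# Sketch of a custom-dictionary request body in the style of the
# Speechmatics docs linked above (field names from memory - verify
# against the documentation). Each "additional_vocab" entry pairs a
# word with optional hand-written pronunciation hints.
config = {
    "type": "transcription",
    "transcription_config": {
        "language": "en",
        "additional_vocab": [
            {"content": "metoprolol", "sounds_like": ["met oh pro lol"]},
            {"content": "Speechmatics"},  # spelling alone can suffice
        ],
    },
}
print(json.dumps(config, indent=2))
```

This is the "find your own pronunciation" part: you decide how the jargon sounds, whereas Picovoice's tooling suggests pronunciations for you.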
If you wanna go with Whisper, you can use Picovoice Falcon or pyannote for speaker diarization: https://picovoice.ai/blog/falcon-whisper-integration/ https://github.com/yinruiqing/pyannote-whisper