I built something similar for Linux (yapyap — push-to-talk with whisper.cpp). The "local is too slow" argument doesn't hold up anymore if you have any GPU at all. whisper large-v3-turbo with CUDA on an RTX card transcribes a full paragraph in under a second. Even on CPU, parakeet is near-instant for short utterances.
The "deep context" feature is clever, but screenshotting and sending to a cloud LLM feels like massive overkill for fixing name spelling. The accessibility API approach someone mentioned upthread is the right call — grab the focused field's content, nearby labels, window title. That's a tiny text prompt a 3B local model handles in milliseconds. No screenshots, no cloud, no latency.
The real question with Groq-dependent tools: what happens when the free tier goes away? We've seen this movie before. Building on local models is slower today but doesn't have a rug-pull failure mode.
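A rough sketch of what that "tiny text prompt" could look like. Everything here is made up for illustration (the FocusContext shape, the field values, the buildCorrectionPrompt helper); in practice the context would come from AT-SPI on Linux or the macOS accessibility API:

```ts
// Hypothetical sketch: build a small correction prompt from accessibility context.
// None of these names come from a real library; they only illustrate the idea.
interface FocusContext {
  windowTitle: string; // title of the focused window
  fieldLabel: string;  // label of the focused text field
  fieldText: string;   // text already typed into the field
}

function buildCorrectionPrompt(ctx: FocusContext, transcript: string): string {
  // A few hundred tokens at most; well within what a small local model
  // can process in milliseconds.
  return [
    `Window: ${ctx.windowTitle}`,
    `Field: ${ctx.fieldLabel}`,
    `Existing text: ${ctx.fieldText}`,
    ``,
    `Fix the spelling of names and terms in this dictated text, using the context above:`,
    transcript,
  ].join("\n");
}

// Example with made-up context values:
console.log(
  buildCorrectionPrompt(
    { windowTitle: "Re: Q3 sync - Mail", fieldLabel: "Message body", fieldText: "Hi Siobhan," },
    "hi shevon thanks for the update",
  ),
);
```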
Yeah, local works really well. I tried this other tool: https://github.com/KoljaB/RealtimeVoiceChat which lets you have a live voice chat with a (local) LLM. With local Whisper and a local LLM (an 8B Llama in my case) it works phenomenally, and it responds so quickly that it feels like it's interrupting me.
Too bad that tool no longer seems to be actively developed. I'm looking for something similar. But it's really nice to see what's possible with local models.
My M1 16GB Mini and M2 16GB Air both deliver insane local transcription performance without eating up much memory. I think the M line + Parakeet is a great combination, and you get privacy for free.
Yeah, that model is amazing. It even runs reasonably well on my mid-range Android phone with this quite simple but very useful app, as long as you don't speak for too long, or pause every once in a while so it can transcribe. I do have handy.computer on my Mac too.
I find the model works surprisingly well and, in my opinion, surpasses all the other models I've tried. Finally a model that can mostly understand my not-so-perfect English and handle language switching mid-sentence (compare that to Gemini's voice input, which is literally THE WORST: it always tries to transcribe in the wrong language, and even when the language is correct it produces the most utter crap imaginable).
Agreed for dictation, but Gemini voice is fun for interactive voice experiments -> https://hud.arach.dev/. I'm honestly blown away by how much Gemini could assist with, with basically no dev effort.
FWIW, whisper.cpp with the default model transcribes at 6x realtime on my four-core ~2.4GHz laptop, and doesn't really stress the CPU or memory. This is for batch-transcribing podcasts.
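For anyone curious what that kind of batch setup looks like, here's a rough sketch of a loop over whisper.cpp's example CLI. The binary name and flags vary by build (older builds ship `main`, newer ones `whisper-cli`), and all paths here are made up, so check `--help` against your own version:

```ts
// Batch-transcribe a directory of WAV files by shelling out to whisper.cpp.
// Binary name, model path, and directory are assumptions for this sketch.
import { execFileSync } from "node:child_process";
import { readdirSync } from "node:fs";
import { join } from "node:path";

const WHISPER_BIN = "./whisper-cli";         // or ./main on older whisper.cpp builds
const MODEL = "models/ggml-base.en.bin";     // the default model
const PODCAST_DIR = "podcasts";              // directory of 16 kHz WAV files

for (const file of readdirSync(PODCAST_DIR).filter((f) => f.endsWith(".wav"))) {
  const input = join(PODCAST_DIR, file);
  // -otxt writes a plain .txt transcript; -of sets the output base name.
  execFileSync(WHISPER_BIN, ["-m", MODEL, "-f", input, "-otxt", "-of", input], {
    stdio: "inherit",
  });
}
```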
The downside is that I couldn't get it to segment by speaker. The consensus seemed to be to use a separate tool for that.
Thanks for surfacing this. If you click the "tools" button to the left of "compile", you'll see a list of comments, and you can resolve them from there. We'll keep improving and fixing things that might be rough around the edges.
Eh. This is yet another "I tried AI to do a thing, it didn't do it the way I wanted, therefore I'm convinced that's just how it is... here's a blog post about it" article.
"Claude tries to write React, and fails"... how many times? what's the rate of failure? What have you tried to guide it to perform better.
These articles are similar to HN 15 years ago, when people wrote "Node.js is slow and bad" posts.
It's crazy that the etymology of "Kentucky" cannot be traced with certainty. It goes to show how much Native American culture and language is now untraceable, and how fragile our record-keeping is, even in "modern times".
The etymology I’ve heard isn’t even listed in the article.
One theory traces “Kentucky” to early forms like Cantucky or Cane-tucky, referring to the region’s vast brakes of river cane (Kentucky river cane, North America’s only native bamboo), which early inhabitants associated with fertile, game-rich land.
I always wondered how something like the AWS or GCP Cloud Console admin UIs gets shipped. How could someone deliver a product like these and be satisfied, rewarded, promoted, etc.? How can Google leadership look at this stuff and go "yup, people love this"?
In defense of the AWS console, it is derivative of the AWS APIs; as such, it's really just a convenience layer that only occasionally strings two or more AWS APIs together into something that could be considered a distinct feature of the console.
That is wholly unlike the problem here, where the console and the API somehow behave completely differently.
Along with the public APIs, an AWS service can also have Console APIs that exist specifically for the console. These APIs do not have the same constraints as the public APIs.
Object.defineProperty on every request to set params / query / body is probably slower than regular property assignment.
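If anyone wants to check that claim, here is a throwaway micro-benchmark comparing the two approaches. Numbers vary a lot by engine, and the dead objects may get optimized away, so treat the result as a rough signal rather than a verdict:

```ts
// Compare Object.defineProperty against plain property assignment for
// attaching params/query to a fresh object, N times each.
const N = 1_000_000;

let t = performance.now();
for (let i = 0; i < N; i++) {
  const req: Record<string, unknown> = {};
  Object.defineProperty(req, "params", { value: { id: i }, writable: true });
  Object.defineProperty(req, "query", { value: {}, writable: true });
}
console.log("defineProperty:  ", (performance.now() - t).toFixed(1), "ms");

t = performance.now();
for (let i = 0; i < N; i++) {
  const req: Record<string, unknown> = {};
  req.params = { id: i };
  req.query = {};
}
console.log("plain assignment:", (performance.now() - t).toFixed(1), "ms");
```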
Also, parsing the body on every request, with no way to opt out, could hurt performance (if performance is a primary goal, that is).
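For what it's worth, a common way around that is to parse lazily. Here's a minimal sketch of the idea; `withLazyBody` is a made-up helper for illustration, not part of this framework or of Bun:

```ts
// Parse the request body only if a handler actually reads it, and cache the
// resulting promise so repeated reads don't parse twice.
function withLazyBody(req: Request) {
  let cached: Promise<unknown> | undefined;
  return {
    raw: req,
    get body(): Promise<unknown> {
      // First access triggers the parse; later accesses reuse the promise.
      cached ??= req.json();
      return cached;
    },
  };
}

// A handler that never touches `body` pays nothing for parsing.
async function handler(req: Request): Promise<Response> {
  const ctx = withLazyBody(req);
  if (req.method === "POST") {
    const data = await ctx.body; // parsed here, on demand
    return new Response(JSON.stringify({ received: data }), {
      headers: { "content-type": "application/json" },
    });
  }
  return new Response("ok"); // body never parsed
}
```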
I wonder if the trie-based routing is actually faster than Elysia with precompile mode enabled?
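For readers who haven't seen one, here's a toy illustration of what trie-based routing means: one node per path segment, so lookup cost is proportional to the number of segments rather than to the number of registered patterns. This is not the library's actual code, and whether it beats Elysia's precompiled handlers is an empirical question:

```ts
// Minimal path trie: static segments in a map, one optional :param child.
type Handler = (params: Record<string, string>) => Response;

interface TrieNode {
  children: Map<string, TrieNode>;
  paramChild?: { name: string; node: TrieNode };
  handler?: Handler;
}

const newNode = (): TrieNode => ({ children: new Map() });
const root = newNode();

function add(path: string, handler: Handler): void {
  let node = root;
  for (const seg of path.split("/").filter(Boolean)) {
    if (seg.startsWith(":")) {
      node.paramChild ??= { name: seg.slice(1), node: newNode() };
      node = node.paramChild.node;
    } else {
      if (!node.children.has(seg)) node.children.set(seg, newNode());
      node = node.children.get(seg)!;
    }
  }
  node.handler = handler;
}

function lookup(path: string): Response | undefined {
  let node = root;
  const params: Record<string, string> = {};
  for (const seg of path.split("/").filter(Boolean)) {
    const next = node.children.get(seg);
    if (next) {
      node = next;
    } else if (node.paramChild) {
      params[node.paramChild.name] = seg; // capture dynamic segment
      node = node.paramChild.node;
    } else {
      return undefined; // no route matches
    }
  }
  return node.handler?.(params);
}

// Usage:
add("/users/:id", (p) => new Response(`user ${p.id}`));
console.log(lookup("/users/42"));
```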
Overall, this is a nice wrapper on top of bun.serve, structured really well. Code is easy to read and understand. All the necessary little things taken care of.
The dev experience of maintaining this is probably a better selling point than performance.
In my opinion, attempting to perform live dictation is a solution that is looking for a problem. For example, the way I'm writing this comment is: I hold down a keyboard shortcut on my keyboard, and then I just say stuff. And I can say a really long thing. I don't need to see what it's typing out. I don't need to stream the speech-to-text transcription. When the full thing is ingested, I can then release my keys, and within a second it's going to just paste the entire thing into this comment box. And also, technical terms are going to be just fine with Whisper. For example, Here's a JSON file.
(this was transcribed using whisper.cpp with no edits. took less than a second on a 5090)
Yeah, Whisper has more features and is awesome if you have the hardware to run the big models that are accurate enough. The constraint here is finding the best CPU-only implementation. By no means am I wedded to or affiliated with Parakeet; it's just the best/fastest within the CPU-only space.
I've done something similar for Linux and Mac. I originally used Whisper and then switched to Parakeet. I much prefer Whisper after playing with both. Maybe I'm not configuring Parakeet correctly, but the transcription that comes out of Whisper is usually pretty much spot on. It automatically removes all the "umms" and all the "ahs" and it's just way more natural, in my opinion. I'm using whisper.cpp with CUDA acceleration. This whole comment was written just by dictating to Whisper, and it's probably going to automatically add quotes correctly, there's going to be no ums, there's going to be no ahs, and everything's just going to be great.
If you don't mind a closed-source paid app, I can recommend MacWhisper. You can select different Whisper and Parakeet models for dictation and transcription. My favorite feature is that it lets you send the transcription output to an LLM for clean-up, or basically anything you want, e.g. professional polish, translation, writing poems, etc.
I have enough RAM on my Mac that I can run smaller LLMs locally, so for me the whole thing stays local.