I've used Talon (and previously VoiceCode) for nearly all interactions with my computer for the past few years. Talon+Dragon is definitely the state of the art and is quickly improving. Controlling the computer by voice has completely changed my understanding of user interfaces.
One difference between voice and keyboard/mouse input is that the number of 'commands' that can be in scope at any moment is much, much larger. With a mouse you must sacrifice screen space for each command; with keyboard shortcuts the user must memorize often somewhat arbitrary key combinations for every new app/command.
With voice, there are still discoverability and some memorization issues, but they feel a lot different. I can remember the command "rename variable" much more easily than "command-alt-r", or was it "command-shift-r"? These commands can also be common across editors, reducing the friction of trying different editors.
One other observation is how I now think of apps more as (often poorly defined) sets of APIs whose inputs are keyboard/mouse events. I really wish more apps had a cleaner method of exposing this API. Some apps have a command palette or something similar where a keyboard shortcut brings up an autocompleted list of all possible actions available, which is still clumsy, but workable. Atom and Jupyter notebooks both have this. Something more direct would be very nice.
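To sketch what I mean (purely illustrative, not any real app's API): an app could expose its actions as named commands, and a palette listing or a voice grammar could be generated straight from that registry instead of going through synthesized keyboard/mouse events.

```python
# Purely illustrative: one way an app could expose its actions as a
# named-command API that a palette, voice engine, or script could drive.
from typing import Callable, Dict, List


class CommandRegistry:
    def __init__(self) -> None:
        self._commands: Dict[str, Callable[[], None]] = {}

    def register(self, name: str, action: Callable[[], None]) -> None:
        self._commands[name] = action

    def names(self) -> List[str]:
        # A palette listing or a voice grammar could be built from this.
        return sorted(self._commands)

    def run(self, name: str) -> None:
        self._commands[name]()


registry = CommandRegistry()
registry.register("rename variable", lambda: print("rename variable invoked"))
registry.register("toggle sidebar", lambda: print("toggle sidebar invoked"))
registry.run("rename variable")
```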
Talon is looking extremely promising, with its zoom mouse that works via gaze-tracking. Generally coding by voice is a graveyard of abandoned projects. I wrote an overview of the field last year, if anyone's interested in alternatives:
I think of Talon more as a tool for easily building interactions like the zoom mouse than as the sum of its current features. The zoom mouse took me about a day at a user's request (someone who had trouble using their neck for the head movement needed to fine-tune the cursor).
Check out Precision Gaze Mouse https://precisiongazemouse.com/. It allows you to move the mouse hands-free, which can save time versus moving your arm back and forth. It's also very accurate because it tracks both eye and head movement.
Hi, I’m the sole developer behind Talon. It’s not open source for several reasons, one of which is that I work on it full time, live on my savings, and give it away for free (because the best thing I can get out of releasing Talon is reducing hand pain in others at scale, not money). It has eye/head tracking, noise recognition, and an extremely advanced scripting engine, so it’s not just a speech recognition project. It also might come to Linux sooner than you think.
It doesn’t depend on Dragon at all, and Nuance dropping support for Dragon 6 isn’t a huge problem yet as long as you can still buy it (Talon has a builtin speech engine, and also fixes/works around most of the long-standing problems in Dragon Mac, and I’m probably more able to help with problems than their support). Integrating more engines is on the roadmap, but slightly after multi platform support (since there’s already a good free engine on Mac).
Let me know if you have any questions or if there’s anything I can do to make Talon work better for your use case.
It’s also worth reading back through my HN comments as I’ve talked in detail about Talon there.
Well, even when using it with Dragon, Talon only uses the speech recognition part. So you get Dragon’s recognition accuracy but don’t need to worry about most of the bugs.
The builtin engine is a much nicer experience than Dragon (very fast startup, no config), but simply isn’t as good at recognizing English, and there’s also a weird behavior in it I haven’t worked around yet that makes it harder to mix English with commands (it greedily prioritizes commands sometimes). The accuracy is quite good for American accents at least. I found it to be consistently more accurate than wav2letter (haven’t tried ++) and DeepSpeech with some simple test audio.
I use Caster every day. In fact I'm writing this using Dragon & Caster. There are open source scripts for most programs like Atom and Visual Studio code. Customizing them is really easy since it's open source and the scripts are just Python.
The best I know of for Linux is Aenea, which requires you to run Dragon in a virtual machine or on another computer. I will have Talon ported to Linux at some point, and that will include coming up with speech engine options for Linux. I think one of the better options is to make Windows Dragon work well in WINE. (Despite the impressive numbers on DeepSpeech and wav2letter++, they're not yet optimized for continuous recognition, which means you need to feed them chunks of finished recordings, and their letter-based recognition requires weird fuzzy matching (which I have yet to solve) if you want command grammars. I've actually considered using the fastest continuous engine I can find, then feeding the sound into a better open-source non-continuous engine after each recognition.)
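To make that last idea concrete, here's a rough sketch of the two-pass approach. The engine interfaces here are made up placeholders, not real libraries; they stand in for a fast continuous recognizer and a slower, more accurate offline one.

```python
# Hypothetical sketch only: a fast continuous engine segments the live audio,
# and each finished utterance is re-run through a more accurate offline engine.
def recognize_stream(mic_chunks, fast_engine, accurate_engine, on_result):
    buffered = []
    for chunk in mic_chunks:
        buffered.append(chunk)
        partial = fast_engine.feed(chunk)        # low latency, runs continuously
        if partial.utterance_finished:
            audio = b"".join(buffered)
            buffered = []
            # Re-run the completed utterance through the slower engine and
            # prefer its transcript if it produced one.
            better = accurate_engine.transcribe(audio)
            on_result(better or partial.text)
```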
I've got something close to Talon's control scheme, without eye tracking, that I work on now and then. It uses the Google API for speech recognition, because CMUSphinx was awful (50% accuracy per word, whereas Google was closer to 90%).
I'm hoping that Mozilla Voice, when it comes, will finally solve this or make it easy to build a decent control system.
Mozilla DeepSpeech has had a release [1] that comes with a pre-trained model achieving 11% WER on clean audio in the LibriSpeech test corpus. That's close to the WER you're getting with Google, but I'd guess your audio isn't as clean as LibriSpeech's, so DeepSpeech would perform worse in practice.
Mozilla Common Voice [2] is the project to collect more training data so that DeepSpeech (and other projects) can achieve the accuracy that is known to be possible with the same architecture trained on larger private datasets.
Then there's Facebook's newly released wav2letter++ [3], which claims to achieve better accuracy with the same training data. However, some people have been unable to exactly reproduce those results, getting "only" 5.15% WER [4]. Still better than what Mozilla DeepSpeech can deliver, though.
It’s not locked to macOS at all. I plan to support win/lin/mac equally; Talon has its own grammar compiler and engine-independent word parser. The main hold-up for porting is all of the interaction APIs like key simulation, drawing overlays on the screen, that sort of thing.
The “core” is not open source, which is mostly low-level platform integration code. Talon is basically a pile of APIs built around a Python scripting engine. My goal is to make all the user facing features either completely defined in open source user scripts/plugins or fully scriptable/configurable. You can already contribute to the user scripts. Someone could even have a project like Caster that supports using the Talon APIs but is itself fully open source.
I have open-sourced over 100 of my projects. My current decision is to not open-source Talon, at least as long as it is my full-time job for no real salary. I’m putting a lot of work into it and giving it away for free; this is what you get.
Another real user here: I love Talon. I just started using the community repository, which gives you a lot of functionality. I'm currently finding that code snippets plus Talon is the optimal combination for people who cannot type.
I think "spinal" was the command used by VoiceCode, and the community repo was put together by ex-VoiceCode users. You can change a command easily by editing the script that defines it anyway (changes take effect immediately, without a restart).
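For a concrete sense of that, a command definition in a user script is just a few lines of Python, roughly along these lines (the exact API details here are simplified and may differ between Talon versions):

```python
# Rough sketch of a Talon user script; exact API details may differ by version.
from talon.voice import Context, Key

ctx = Context("my_editor")
ctx.keymap({
    # spoken phrase on the left, action on the right
    "rename variable": Key("cmd-alt-r"),
    "save it": Key("cmd-s"),
})
```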
I was wondering about testimonials myself. It sounds wonderful for people with physical impairments who can't use a keyboard or mouse. But in terms of productivity or ease of use for most developers, are there any benefits to using voice over a keyboard and mouse?
This is an important question for me to be able to answer, because if I can’t convince you it’s worth splitting your time between alt input (like voice and eye tracking) and keyboard/mouse, I can’t reduce your chance of developing RSI.
There’s the fear approach (I personally believe >50% of people who use computers full-time will develop at least minor hand injuries), but I don’t think fear is enough on its own.
The most promising thing I think is workflow improvements. For voice, the ability to issue commands like “next song” while typing feels amazing. You can also be more specific about many commands, like “focus chrome” is nicer than mashing cmd-tab a bunch or binding a key to each app. For eye tracking, I think autoscrolling text as you read and jumping your mouse to the right spot when you look at your second monitor are two big ones.
Similarly to "next song", I enjoy using it for things I dont use often enough to justify memorizing a shortcut or short phrase. "Connect to vpn", "move window desk 5", "rename variable", "activate virtual environment", etc. Similar to creating aliases in the terminal, but for more than just the terminal and with easy to remember spoken phrases instead of terse abbreviations: `lsa -> ls -al`.
That said, once you are proficient, writing code by voice can definitely be as fast as, or faster than, skilled keyboard use.
Real user here: I've dabbled with other voice recognition and code-by-voice efforts in the past. Talon is the most impressive and promising by far, and is improving significantly month over month.