Thanks, I found it after clicking through to the actual Nature paper, where it's a detail buried deep in the text. They really should have mentioned it up front.
I think some of the distinction here is that the more recent "bare LLMs" have been more purpose-built: augmented with agent-specific RL and generally fine-tuned for the requirements of "agents", such as specific reasoning capabilities, tool calling, etc.
These all make the "bare LLMs" better suited to be used within the "agent" harness.
I think the more accurate term would be "agentic LLMs" instead of calling them "agents" outright. As to why it's the case now, probably just human laziness and colloquialisms.
GPT 5.2 in a simple while loop runs circles around most things right now. It was released barely a month ago and many developers have been on vacation/hibernating/etc. during this time.
I give it 3-4 more weeks before we start to hear about the death of agentic frameworks. Pointing GPT5+ at a PowerShell or C#/Python REPL is looking way more capable than wiring up a bunch of domain-specific tools. A code-based REPL is the ultimate tool. You only need one and you can force the model to always call it (100% chance of picking the right tool). The amount of integration work around Process.Start is approximately 10-15 minutes, even if you don't use AI assistance.
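To make that concrete, here's roughly what such a harness looks like. This is only a sketch: Python's subprocess stands in for Process.Start, and the model name, task prompt, and limits are placeholders, not a definitive implementation.

    # Rough sketch of "an LLM in a loop with one code tool".
    # subprocess stands in for Process.Start; model name and limits are placeholders.
    import json
    import subprocess
    from openai import OpenAI

    client = OpenAI()

    RUN_SHELL = {
        "type": "function",
        "function": {
            "name": "run_shell",
            "description": "Run a shell command and return stdout+stderr.",
            "parameters": {
                "type": "object",
                "properties": {"command": {"type": "string"}},
                "required": ["command"],
            },
        },
    }

    def run_shell(command: str) -> str:
        # The single "effectful" tool: hand the command to the OS, return the output.
        proc = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=120)
        return (proc.stdout + proc.stderr)[-8000:]  # crude truncation to protect the context window

    messages = [{"role": "user", "content": "Find the largest file under ./data and gzip it."}]

    for _ in range(50):  # hard cap on turns
        resp = client.chat.completions.create(model="gpt-5", messages=messages, tools=[RUN_SHELL])
        msg = resp.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:  # the model decided the task is done
            print(msg.content)
            break
        for call in msg.tool_calls:  # only one tool exists, so there's no routing to get wrong
            output = run_shell(json.loads(call.function.arguments)["command"])
            messages.append({"role": "tool", "tool_call_id": call.id, "content": output})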
My definition of agent has always been an LLM with "effectful" tools, run in a loop where the LLM gets to decide when the task is complete. In other words, an LLM with "agency".
Parakeet V3 gives near-instant transcription, and the slight accuracy drop relative to the slower/bigger Whisper models is immaterial when you're talking to AIs that can “read between the lines”.
This is not strictly speech-to-speech, but I quite like it when working with Claude Code or other CLI Agents:
STT: Handy [1] (open-source), with Parakeet V3 - stunningly fast, near-instant transcription. The slight accuracy drop relative to bigger models is immaterial when you're talking to an AI. I always ask it to restate back to me what it understood, and it gives back a nicely structured version -- this helps confirm understanding as well as likely helps the CLI agent stay on track.
TTS: Pocket-TTS [2], just 100M params, and amazing speech quality (English only).
I made a voice plugin [3] based on this, for Claude Code, so it can speak out short updates whenever CC stops. It uses a non-blocking stop hook that calls a headless agent to create a one- or two-sentence summary (there's a rough sketch of the hook below, after the commands). It turns out to be surprisingly useful. It's also fun, as you can customize the speaking style, mirror your vibe, etc.
The voice plugin comes with commands to control it:
/voice:speak stop
/voice:speak azelma (change the voice)
/voice:speak <your arbitrary prompt to control the style or other aspects>
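For the curious, the stop hook is roughly this shape. It's a simplified sketch, not the actual plugin code: it assumes Claude Code hands the hook a JSON payload on stdin that includes a transcript path, uses `claude -p` as the headless agent, and calls a stand-in `speak` command where the real thing plays audio through Pocket-TTS.

    #!/usr/bin/env python3
    # Simplified sketch of the non-blocking stop hook (not the actual plugin code).
    # Assumes the hook receives JSON on stdin that includes "transcript_path",
    # and that a local `speak` command exists as a stand-in for Pocket-TTS playback.
    import json
    import subprocess
    import sys

    hook_input = json.load(sys.stdin)
    transcript_path = hook_input.get("transcript_path", "")

    # Headless agent: `claude -p` prints one response and exits.
    summary = subprocess.run(
        ["claude", "-p",
         f"Read {transcript_path} and summarize the last assistant turn "
         "in one or two spoken-style sentences."],
        capture_output=True, text=True, timeout=60,
    ).stdout.strip()

    # Fire-and-forget so Claude Code isn't blocked while the audio plays.
    if summary:
        subprocess.Popen(["speak", summary])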
Nice, I’ll have to try it out. They should really make a uv-installable CLI tool like Pocket-TTS did. People underestimate just how much more immediately usable something becomes when you can get it with “uv tool install …”.
Hi, I'm looking for an STT setup that can run on a server from cron, using a small local model (CPU-only: 4 vCPUs on a Threadripper and 20G of RAM on the server), and that can transcribe from remote audio URLs. Ideally it would fetch the URL itself, but I know local models probably don't have that feature, so I'd do something like curl the audio down to memory or /tmp, transcribe it, and then remove the file.
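Roughly the shape I have in mind, as a sketch (faster-whisper is just one CPU-friendly example here, not a requirement, and the URL is a placeholder):

    # Sketch: fetch a remote audio file to /tmp, transcribe locally on CPU, clean up.
    # faster-whisper is one example of a small CPU-only model; swap in whatever fits.
    import os
    import tempfile
    import urllib.request
    from faster_whisper import WhisperModel

    model = WhisperModel("small", device="cpu", compute_type="int8")  # fits comfortably in 20G RAM

    def transcribe_url(url: str) -> str:
        fd, path = tempfile.mkstemp(suffix=".mp3", dir="/tmp")
        os.close(fd)
        try:
            urllib.request.urlretrieve(url, path)          # "curl the audio down"
            segments, _info = model.transcribe(path)
            return " ".join(seg.text.strip() for seg in segments)
        finally:
            os.remove(path)                                 # remove the file afterwards

    if __name__ == "__main__":
        print(transcribe_url("https://example.com/audio.mp3"))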
As others have said, this has been possible for months already with llama.cpp's support for the Anthropic Messages API. You just need to set ANTHROPIC_BASE_URL. The specific llama-server settings/flags were a pain to figure out and required some hunting, so I collected them in this guide to using CC with local models:
One tricky thing that took me a whole day to figure out is that using Claude Code in this setup was causing total network failures due to telemetry pings, so I had to set this env var to 1: CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC
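The rough shape is just the following; the model path and port are placeholders, and the model-specific llama-server flags are what the guide is really for:

    llama-server -m /path/to/model.gguf --port 8080
    export ANTHROPIC_BASE_URL=http://localhost:8080
    export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
    claude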
Curious how it compares to last week’s release of Kyutai’s Pocket-TTS [1] which is just 100M params, and excellent in both speed and quality (English only). I use it in my voice plugin [2] for quick voice updates in Claude Code.
You are absolutely right — most internet users don't know the specific keyboard combination to make an em dash and substitute it with two hyphens. On some websites it is automatically converted into an em dash. If you would like to know more about this important punctuation symbol and its significance in identifying AI writing, please let me know.
Thanks for that. I had no idea either. I'm genuinely surprised Windows buries a crucial thing like this, and it makes me wonder why they even bothered adding it in the first place when it's so complicated.
The Windows version is an escape hatch for keying in any arbitrary character code, which is why it's so convoluted: you need to know which code you're after.
To be fair, the Alt input is a generalized system for inputting Unicode characters outside the current keyboard layout, so it's not like they added this input specifically for the em dash. Still, the em dash really should have an easier input method given how crucial a symbol it is.
It's a generalized system for entering code page glyphs that was extended to support Unicode. Alt+0150 and Alt+0151 only work if you are on CP1252, as those aren't the Unicode code points (en dash is U+2013, em dash is U+2014).
And the em dash is trivially easy on iOS: you simply long-press the regular dash key. I've been using it for years and am not going to stop because people might suddenly accuse me of being an AI.
Context filling up is sort of the Achilles' heel of CLI agents. The main remedy is to have the agent output some kind of handoff document and then run /compact, which leaves you with a summary of the latest task. It sort of works, but by definition it loses information, and you often find yourself having to re-explain or re-generate details to continue the work.
I made a tool [1] that lets you just start a new session and injects the path of the original session file, so you can extract arbitrary details of prior work from it using sub-agents.
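As an illustration of what a sub-agent (or a quick script) can do with that path: Claude Code sessions are stored as JSONL on disk, but the exact schema is undocumented and may change, so this sketch just searches the raw lines for a keyword rather than assuming field names.

    # Sketch: pull lines mentioning a keyword out of an old session file so the
    # relevant prior turns can be read into a fresh session.
    # The on-disk schema is undocumented, so this only filters raw JSONL lines.
    import json
    import sys

    def extract(session_path: str, keyword: str):
        with open(session_path) as f:
            for line in f:
                if keyword.lower() not in line.lower():
                    continue
                try:
                    yield json.loads(line)
                except json.JSONDecodeError:
                    continue

    if __name__ == "__main__":
        # usage: python extract_turns.py <session.jsonl> <keyword>
        for entry in extract(sys.argv[1], sys.argv[2]):
            print(json.dumps(entry)[:300])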