I haven't been following the claws but I built something like this recently. Self hosted, runs through signal, supports group chat (with whitelisted accounts).
I just finished setting up grocery automation to run on it: the agent drafts a starter list based on past orders (stored locally or pulled from the store site), all group members can weigh in to add or remove items, then the agent uses a bespoke browser tool to log in to the store, build the cart from the finalized list (optionally searching for additionally requested items), validate the cart, and (maybe later) place the order for delivery. I haven't implemented the full checkout process yet; I'm not sure I want the agent to have spending power. As is, I just log in and finish the last two clicks of checkout manually.
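The non-browser parts of that flow are simple enough to sketch. This is an illustrative toy, not my actual code, and every function name here is made up; the real version drives a browser to the store site for the cart step.

```python
# Hypothetical sketch of the grocery flow: starter list from past orders,
# group edits merged in, then the built cart validated against the final list.

def build_starter_list(past_orders):
    """Items ordered more than once are treated as likely staples."""
    counts = {}
    for order in past_orders:
        for item in order:
            counts[item] = counts.get(item, 0) + 1
    return sorted(item for item, n in counts.items() if n > 1)

def apply_group_edits(starter, edits):
    """Each edit is ('add', item) or ('remove', item) from a group member."""
    items = set(starter)
    for action, item in edits:
        if action == "add":
            items.add(item)
        elif action == "remove":
            items.discard(item)
    return sorted(items)

def validate_cart(cart_items, finalized):
    """Compare the cart the browser tool built against the finalized list."""
    missing = sorted(set(finalized) - set(cart_items))
    extra = sorted(set(cart_items) - set(finalized))
    return {"ok": not missing and not extra, "missing": missing, "extra": extra}
```

The validation step is the important one: it's the cheap, deterministic check that catches the agent silently dropping or substituting items before a human does the final clicks.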
Crazy times. It was easy enough to build that if someone hasn't already open sourced something like it, they will shortly.
You can't expand that beyond novel applications, though. The models aren't good enough for autonomous coding without a human in the loop, period.
They can one shot basic changes and refactors, or even many full prototypes, but for pretty much everything else they're going to start making mistakes at some point. Usually very quickly. It's just where the technology is right now.
The thing that frustrates me is that this is really easy to demonstrate. Articles like this are essentially hallucinations that many people, mystifyingly, take seriously.
I assume the reason they get any traction is that a lot of people don't have enough experience with LLM agents yet to be confident that their personal experience generalizes. So they think maybe there are magical context tricks to get the current generation of agents to not make the kinds of mistakes they're seeing.
There aren't. It doesn't matter if it's Opus 4.6 in Claude Code or Codex 5.3 xhigh, they still hallucinate, fail to comprehend context and otherwise drift.
Anyone who can read code can fire up an instance and see this for themselves. Or you can prove it for free by looking at the code of any app that the author says was vibecoded without human review. You won't have to look very hard.
Agents can accomplish impressive things but also, often enough, they make incomprehensibly bad decisions or make things up. It's baked into the technology. We might figure out how to solve that problem eventually, but we haven't yet.
You can iterate: add more context to AGENTS.md or CLAUDE.md, add skills, set up hooks, and no matter how many times you do it the agents will still make mistakes. You can make specialized code review agents and run them in parallel, have competing models do audits, do dozens of passes and spend all the tokens you want; if it's a non-trivial amount of code doing non-trivial things and there's no human in the loop, there will still be critical mistakes.
No one has demonstrated different behavior; articles and posts claiming otherwise never attempt to prove that what they claim is actually possible. Because it isn't.
Just to be clear, I think coding agents are incredibly useful tools and I use them extensively. But you can't currently use them to write production code without a human in the loop. If you're not reading and understanding the code, you're going to be shipping vulnerabilities and tech debt.
Articles like this are just hype. But as long as they keep making front pages they'll keep distorting the conversation. And it's an otherwise interesting conversation! We're living through an unprecedented paradigm shift, the field of possibilities is vast and there's a lot to figure out. The idea of autonomous coding agents is just a distraction from that, at least for now.
Finding patterns in large datasets is one of the things LLMs are really good at. Genetics is an area where scientists have already done impressive things with LLMs.
However you feel about LLMs (and I'm guessing you're not a fan, since you don't have to use them for very long before you witness how useful they can be with large datasets), they are undeniably incredible tools in some areas of science.
In reference to the second article: who cares? What we care about is experimental verification. I could see maybe accurate prediction being helpful in focusing funding, but you still gotta do the experimentation.
Not disagreeing with your initial statement about LLMs being good and finding patterns in datasets btw.
This is also true of lots of human research, there's always a theory side of research that guides the experimental side. Even if just informal, experimental researchers have priors for what experimental verification they should attempt.
Yeah, there’s an infinite number of experiments you could run but obviously infinite resources don’t exist, so you need theory to guide where to look. For example, using computational methods in bioinformatics to guess a protein's function so that experimental researchers can verify it (which takes weeks to months for a given protein function hypothesis) is an entire field.
You need to search in both likely and unlikely places. This is pretty common in high dimensional search spaces. Searching only in the most likely places gets you stuck in local minima.
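A toy one-dimensional example of that point, with a made-up objective: greedy search that only probes the neighborhood of the "likely" starting region settles in a shallow local minimum, while also sampling unlikely starting points reaches the deeper one.

```python
# Made-up objective with two basins: a shallow local minimum near x=1
# (value 1) and the global minimum near x=6 (value 0).
def f(x):
    return min((x - 1) ** 2 + 1, (x - 6) ** 2)

def greedy_descent(x, step=0.1, iters=200):
    """Hill-climb downward: only ever move to the best nearby point."""
    for _ in range(iters):
        x = min([x - step, x, x + step], key=f)
    return x

# Starting from the "likely" region near 0 gets stuck at x ~ 1 (f = 1).
local = greedy_descent(0.0)
# Restarting from spread-out points, including "unlikely" ones, finds x ~ 6.
best = min((greedy_descent(s) for s in [0.0, 3.0, 7.0]), key=f)
```

Random restarts are the crudest fix; the same intuition is why simulated annealing and evolutionary methods deliberately spend budget on low-probability regions.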
As a scientist, the two links you provided are severely lacking in utility.
The first developed a model to calculate protein function based on DNA sequence - yet provides no results of testing of the model. Until it does, it’s no better than the hundreds of predictive models thrown on the trash heap of science.
The second tested a model's "ability to predict neuroscience results" (which reads really oddly). How did they test it? They pitted humans against LLMs in determining which published abstracts were correct.
Well yeah? That’s exactly what LLMs are good at - predicting language. But science is not advanced by predicting which abstracts of known science are correct.
It reminds me of my days in working with computational chemists - we had an x-ray structure of the molecule bound to the target. You can’t get much better than that at hard, objective data.
“Oh yeah, if you just add a methyl group here you’ll improve binding by an order of magnitude”.
So we went back to the lab, spent a week synthesizing the molecule, sent it to the biologists for a binding study. And the new molecule was 50% worse at binding.
And that’s not to blame the computational chemist. Biology is really damn hard. Scientists are constantly being surprised by results that contradict current knowledge.
Could LLMs be used in the future to help come up with broad hypotheses in new areas? Sure! Are the hypotheses going to prove fruitless most of the time? Yes! But that’s science.
But any claim of a massive leap in scientific productivity (whether LLMs or something else) should be taken with a grain of salt.
I don't follow the logic that "it hallucinates so it's useless". In the context of codebases I know for sure that they can be useful. Large datasets too. Are they also really bad at some aspects of dealing with both? Absolutely. Dangerously, humorously bad sometimes.
> I don't follow the logic that "it hallucinates so it's useless".
I... don't even know how to respond to that.
Also. I didn't say they were useless. Please re-read the claim I responded to.
> Are they also really bad at some aspects of dealing with both? Absolutely. Dangerously, humorously bad sometimes.
Indeed.
Now combine "Finding patterns in large datasets is one of the things LLMs are really good at." with "they hallucinate even on small datasets" and "Are they also really bad at some aspects of dealing with both? Absolutely. Dangerously, humorously bad sometimes"
Translation, in case logic somehow eludes you: if an LLM that often hallucinates, sometimes dangerously or humorously badly, finds a pattern in a large dataset, what are the chances that the pattern it found isn't a hallucination (often a subtle one)?
Especially given the undeniable, verifiable fact that LLMs are shit at working with large datasets (unless they are explicitly trained on them, but even then that doesn't remove the problem of hallucinations).
And those weren't the only tells. Right now it's cringey but I have a sinking feeling that it's in the process of becoming normal. The post is on the front page after all.
Which means people either can't tell, or don't mind.
The problem with this, unless I'm misunderstanding what you're saying, is that the model's responses go into the context. So if it has to reinvent the wheel every session by writing bash scripts (or similar), you're clogging up the context and lowering the quality and focus of the session while also making it more expensive. You could instead offload to a tool whose code never enters the context, so the model only has to handle the tool's output rather than its codebase.
I do agree that better tools, rather than more tools, is the way to go. But any situation where the model has to write its own tools is unlikely to be better.
Because there are so many popular poker variants, basically everything varies, but yes, there's often an ante (a forced bet every player makes each hand), and that's even present in some Hold 'Em structures.
PHP 5 is as close to phased out as it gets at this point. No doubt it's still in a lot of legacy enterprise codebases (lots of breaking changes going from 5 to 7 or 8), but outside of that no one is using it.
- Many plastics contain known endocrine disrupters like BPA that have been shown to get into people's bodies and correlate with negative health impacts such as certain cancers, sex organ abnormalities and infertility
- Microplastics can damage cardiovascular tissue and correlate with heart attack, stroke and premature death.
- Microplastics correlate with infertility in men.
- Microplastics absorb heavy metals in the environment and can then transport them into the food chain. Heavy metals of course having many established negative health impacts.
These are just some examples, a quick search should get you many more (along with the sources on the above).
The grocery chain has to spend money to buy the products.
A better comparison might be a flea market or fair where the organizers take 27% of gross receipts from each vendor, even if the customer went to the vendor's store outside of the fair to buy. Which sounds egregious to me, more so if it were the only fair that existed for a large demographic.