Pretty nice. I've been using LLMs to generate different Python and JS tools for wrangling data for ontology engineering purposes.
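For instance, the kind of small utility I mean looks something like this (a rough sketch; the file name is hypothetical, and it assumes rdflib is installed):

    # Sketch: list the classes declared in a local OWL/Turtle file.
    from rdflib import Graph
    from rdflib.namespace import OWL, RDF, RDFS

    g = Graph()
    g.parse("my_ontology.ttl", format="turtle")  # hypothetical file name

    for cls in g.subjects(RDF.type, OWL.Class):
        label = g.value(cls, RDFS.label)
        print(cls, "-", label if label else "(no label)")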
More recently, I've found a lot of benefit from using the extended thinking mode in GPT-5 and -5.1. It tends to provide a fully functional and complete result from a zero-shot prompt. It's as close as I've gotten to pair programming with a (significantly) more experienced coder.
One functional example of that (with 30-50% of my own coding, reprompting and reviews) is my OntoGSN [1] research prototype. After a couple of weeks of work, it can handle different integration, reasoning and extension needs of people working in assurance, at least based on how I understood them. It's an example of a human-AI collab that I'm particularly proud of.
[1] Playground at w3id.org/OntoGSN/
I looked at the original study [1], and it seems to be a very well-supported piece of research. All the necessary pieces are there, as you would expect from a Nature publication. And overall, I am convinced there's an effect.
However, I'm still skeptical about the size of that effect. First, a point that applies to the Massachusetts ballot measure on psychedelics in particular: translating views into percentages and getting accurate results from political polls are notoriously difficult tasks [2]. The size of any measured effect therefore inherits whatever confounding variables make those tasks difficult.
Second, there could be some level of Hawthorne effect [3] at play here, such that participants may report being (more) convinced because that's what (they think) is expected of them. I'm not familiar with the recruiting platforms they used, but if they specialize in paid or otherwise professional surveys, I wonder if participants feel an obligation to perform well.
Third, and somewhat related to the above, participants could state they'd vote Y after initially reporting a preference for X, because they know it's a low-cost, no-commitment claim. In other words, they can claim they'd now vote for Y without fear of judgement, because it's a lab environment and an anonymous activity, but they can always go back to their original position once the actual vote happens. To isolate the size of this effect, researchers would have to raise the stakes, or follow up with participants after the vote and find out if/why they changed their mind (again).
Fourth, if a single conversation with a chatbot, averaging six minutes, could convince an average voter, I wonder how much they knew about the issue/candidate being voted on. More cynically for the study, there may be much more at play in an actual vote preference than a single dialectical presentation of facts. For example: salient events in the run-up to the election; emotional connection with the issue/candidate; personal experiences.
Still, this does not make the study flawed for not covering everything. We can learn a lot from this work, and kudos to the authors for publishing it.
I guess vibe coding is fun as a meme, but it hides the power of what someone else on HN called language user interfaces (LUIs).
The author's point is correct IMO. If you have direct mappings between assembly and natural language, there's no functional need for these intermediate abstractions to act as pseudo-LUIs. If you could implement it, you would just need two layers above assembly: an LLM OS [1], and a LUI-GUI combo.
However, I think there's a non-functional, quality-related need for intermediate abstractions - particularly to make the mappings auditable, maintainable [2], understandable, etc. For most mappings, there won't be a 1:1 representation between a word and an assembly string.
It's already difficult for software devs to balance technical constraints and possibilities with vague user requirements. I wonder how an LLM OS would handle this, and why we would trust that its mappings are correct without wanting to dig deeper.
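To make the auditability point concrete, here's a toy sketch (entirely hypothetical, not anyone's actual design) of an intermediate layer whose artifacts a human could still diff and review before anything gets lowered to machine code:

    # Hypothetical sketch: instead of mapping natural language straight to assembly,
    # the model first emits a small, reviewable intermediate representation (IR).
    from dataclasses import dataclass

    @dataclass
    class IRInstruction:
        op: str       # e.g. "load", "add", "store"
        args: tuple   # operands kept symbolic, so a human can audit them

    def render_ir(instructions: list[IRInstruction]) -> str:
        """Render the IR as text; this artifact can be diffed, reviewed and versioned."""
        return "\n".join(f"{ins.op} {' '.join(map(str, ins.args))}" for ins in instructions)

    # Imagined model output for "add the user's two inputs and keep the result":
    ir = [
        IRInstruction("load", ("r1", "input_a")),
        IRInstruction("load", ("r2", "input_b")),
        IRInstruction("add", ("r3", "r1", "r2")),
        IRInstruction("store", ("result", "r3")),
    ]
    print(render_ir(ir))

The rendered IR is something you can put in version control and argue about in a review, unlike an opaque word-to-assembly mapping.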
[1] Coincidentally, just like "vibe coding", this term was apparently also coined by Andrej Karpathy.
[2] For example, good luck trying to version control vectors.
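To illustrate [2] with a toy example (synthetic numbers, not real embeddings), here's what a "diff" between two versions of a vector looks like:

    # Toy illustration: diffing raw vectors tells a reviewer nothing.
    import difflib
    import random

    random.seed(0)
    v_old = [round(random.uniform(-1.0, 1.0), 4) for _ in range(8)]
    v_new = [round(x + random.uniform(-0.01, 0.01), 4) for x in v_old]  # a "small" change

    diff = difflib.unified_diff(
        [f"{x:+.4f}" for x in v_old],
        [f"{x:+.4f}" for x in v_new],
        fromfile="vector@rev1",
        tofile="vector@rev2",
        lineterm="",
    )
    print("\n".join(diff))  # nearly every line changes, and none of it is human-reviewable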
The pre-LLM equivalent would be saying "I googled this, and here's what the first result says," and then pasting the text without any additional commentary.
Everyone should be free to read, interpret and formulate their comments however they'd like.
But if a person outsources their entire thinking to an LLM/AI, they don't have anything to contribute to the conversation themselves.
And if the HN community wanted pure LLM/AI comments, they'd introduce such bots in the threads.
I'm wondering if it might be impossible to write a law that prevents the spirit of what we want it to prevent, while also not preventing the spirit of what we don't want it to prevent. :)
The abuse of claims and citations is a legitimate and common problem.
However, I think hallucinated citations pose a bigger problem, because they're fundamentally a lie of commission rather than of omission, misinterpretation or misrepresentation of facts.
At the same time, it may be an accidental lie, insofar as authors mistakenly used LLMs as search engines, just to support a claim that's commonly known, or that they remember well but can't find the origin of.
So unless we reduce the pressure for publication speed and increase the pressure for quality, we'll need to introduce more robust quality checks into peer review.
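One example of the kind of check I mean (a sketch, not a finished tool; it assumes network access, the requests package, and that cited works have DOIs) is verifying that each cited DOI actually resolves to a record in a registry such as Crossref:

    # Minimal citation sanity check: does a cited DOI exist at all?
    import requests

    def doi_exists(doi: str) -> bool:
        """Return True if the public Crossref API knows this DOI."""
        resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
        return resp.status_code == 200

    # A real checker would parse DOIs out of the manuscript's reference list.
    cited_dois = ["10.1038/s41586-020-2649-2"]  # e.g. the NumPy paper (Harris et al., Nature 2020)
    for doi in cited_dois:
        print(doi, "->", "found" if doi_exists(doi) else "NOT FOUND (possible hallucination)")

This obviously doesn't catch real-but-irrelevant citations, but it would flag the fully fabricated ones cheaply.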
I haven't come across any reviews that I could recognize as having been blatantly LLM-generated.
However, almost every peer review I've been a part of, pre- and post-LLM, had one reviewer who provided a questionable review. Sometimes I'd wonder if they'd even read the submission; sometimes there were borderline unethical practices, like trying to farm citations through my submission. Luckily, at least one other diligent reviewer would provide a counterweight.
Safe to say I don't find it surprising, and hearing/reading others' experiences tells me it's yet another symptom of the barely functioning mechanism that peer review is today.
Sadly, it's the best mechanism that institutions are willing to support.