
The whole “chat with an AI” paradigm is the culprit here. It primes people to think they are actually having a conversation with something that has a mind.

It’s just a text generator that generates plausible text for this role play. But the chat paradigm is pretty useful in helping the human. It’s like chat is a natural I/O interface for us.


I disagree that it’s “just a text generator” but you are so right about how primed people are to think they’re talking to a person. One of my clients has gone all-in on openclaw: my god, the misunderstanding is profound. When I pointed out a particularly serious risk he’d opened up, he said, “it won’t do that, because I programmed it not to”. No, you tried to persuade it not to with a single instruction buried in a swamp of markdown files that the agent is itself changing!


I insist on the text generator nature of the thing. It’s just that we built harnesses to activate on certain sequences of text.

Think of it as three people in a room. One of them (the director) says: you, with the red shirt, are now a plane copilot. You, with the blue shirt, are now the captain. You are about to take off from New York to Honolulu. Action.

Red: Fuel checked, captain. Want me to start the engines?

Blue: yes please, let’s follow the procedure. Engines at 80%.

Red: I’m executing: raise the levers to 80%

Director: levers raised.

Red: I’m executing: read engine stats meters.

Director: Stats read engine ok, thrust ok, accelerating to V0.

Now imagine that the director, on hearing “I’m executing: raise the levers to 80%”, instead of role playing, actually issues a command to raise the engine levers of a plane to 80%. When she hears “I’m executing: read engine stats”, she actually reads data from the plane and provides it to the actor.

See how text generation for a role play can actually be used to act on the world?

In this thought experiment, the human is the blue shirt, Opus 4-6 is the red shirt, and Claude Code is the director.
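The director’s job in this sketch, turning “I’m executing: …” lines into real actions, is essentially what an agent harness does. Here is a minimal Python sketch of that dispatch loop; the TOOLS registry and function names are invented for illustration and are not how Claude Code actually works:

```python
import re

# Hypothetical tool registry (names invented for this sketch): maps an
# action phrase the model can "say" to a real function that acts on the world.
def raise_levers(arg: str) -> str:
    return f"levers raised {arg}".strip()

def read_engine_stats(arg: str) -> str:
    return "engines ok, thrust ok, accelerating to V0"

TOOLS = {
    "raise the levers": raise_levers,
    "read engine stats": read_engine_stats,
}

def director(model_output: str) -> str:
    """If the generated text matches a known action, actually perform it
    and return the real result; otherwise the text is ordinary dialogue."""
    match = re.match(r"I'm executing: (.+)", model_output)
    if match is None:
        return model_output  # plain role play, nothing to execute
    action = match.group(1)
    for phrase, tool in TOOLS.items():
        if action.startswith(phrase):
            return tool(action[len(phrase):].strip())
    return f"unknown action: {action}"
```

Nothing about the text generator changes; the only difference is whether the director role-plays the result or actually performs the action and feeds the real result back into the conversation.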


For context I've been an AI skeptic and am trying as hard as I can to continue to be.

I honestly think we've moved the goalposts. I'm saying this because, for the longest time, I thought that the chasm that AI couldn't cross was generality. By which I mean that you'd train a system, and it would work in that specific setting, and then you'd tweak just about anything at all, and it would fall over. Basically no AI technique truly generalized for the longest time. The new LLM techniques fall over in their own particular ways too, but it's increasingly difficult for even skeptics like me to deny that they provide meaningful value at least some of the time. And largely that's because they generalize so much better than previous systems (though not perfectly).

I've been playing with various models, as well as watching other team members do so. And I've seen Claude identify data races that have sat in our code base for nearly a decade, given a combination of a stack trace, access to the code, and a handful of human-written paragraphs about what the code is doing overall.

This isn't just a matter of adding harnesses. The fields of program analysis and program synthesis are old as dirt, and probably thousands of CS PhDs have cut their teeth trying to solve them. All of those systems had harnesses, but they weren't nearly as effective, as general, or as broad as what current frontier LLMs can do. And on top of it all, we're driving LLMs with inherently fuzzy natural language, which by definition requires high generality to avoid falling over simply due to the stochastic way humans write prompts.

Now, I agree vehemently with the superficial point that LLMs are "just" text generators. But I think that point increasingly misses what matters, given the empirical capabilities the models clearly have. The real lesson of LLMs is not that they're somehow not text generators; it's that we as a species have somehow encoded intelligence into human language, and with the new training regimes we've only just discovered how to unlock it.


> I thought that the chasm that AI couldn't cross was generality. By which I mean that you'd train a system, and it would work in that specific setting, and then you'd tweak just about anything at all, and it would fall over. Basically no AI technique truly generalized for the longest time.

That is still true, though: transformers didn't cross into generality; they just let the problem you can train the AI on be much bigger.

So, instead of making a general AI, you make an AI that has trained on basically everything. If you move far enough away from everything that is on the internet, or get close enough to something it's overtrained on, like memes, it fails spectacularly. But of course most things exist in some form on the internet, so it can do quite a lot.

The difference between this and a general intelligence like humans is that humans are trained primarily in jungles and woodlands thousands of years ago, yet we still can navigate modern society with those genes using our general ability to adapt to and understand new systems. An AI trained on jungles and woodlands survival wouldn't generalize to modern society like the human model does.

And this still makes LLMs fundamentally different from how human intelligence works.


> And I've seen Claude identify data races that have sat in our code base for nearly a decade

How do you know that Claude isn't just a very fast monkey with a very fast typewriter, throwing things at you until one of them is true?


Iteration is inherent to how computers work. There's nothing new or interesting about this.

The question is who prunes the space of possible answers. If the LLM spews things at you until it gets one right, then sure, you're in the scenario you outlined (and much less interesting). If it ultimately presents one option to the human, and that option is correct, then that's much more interesting. Even if the process is "monkeys on keyboards", does it matter?

There are plenty of optimization and verification algorithms that rely on "try things at random until you find one that works", but before modern LLMs no one accused these things of being monkeys on keyboards, despite it being literally what these things are.
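That generate-and-verify shape can be sketched in a few lines; everything below is invented for illustration. The point is that the verifier, not the generator, is what makes the final answer trustworthy:

```python
import random

def random_search(is_valid, propose, max_tries=10_000, seed=0):
    """'Monkeys on typewriters': propose candidates at random until the
    verifier accepts one. The caller only ever sees the verified answer,
    never the failed attempts."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        candidate = propose(rng)
        if is_valid(candidate):
            return candidate
    return None  # gave up

# Toy problem: find a nontrivial square root of 1 modulo 143 (= 11 * 13).
answer = random_search(
    is_valid=lambda x: x * x % 143 == 1 and x not in (1, 142),
    propose=lambda rng: rng.randrange(2, 142),
)
```

Whether the answer came from thousands of rejected candidates is invisible from the outside; all that matters is that the returned one passes the check.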


Indeed, it doesn't matter. What I was hinting at is that if you forget all the times the LLM was wrong and remember just the one time it was right, it seems much more magical than it actually might be.

Also, how were the data races significant if nobody noticed them for a decade? Were you all just coming to work and saying "jeez, I don't know why this keeps happening" until the LLM found them for you?


I agree with your points. Answering your one question for posterity:

> Also how were the data races significant if nobody noticed them for a decade ?

They only replicated in our CI, so it was mainly an annoyance for those of us doing release engineering (because when you run ~150 jobs you'll inevitably get ~2-4 failures). So it's not that no one noticed, but it was always a matter of prioritization vs other things we were working on at the time.

But that doesn't mean they got zero effort put into them. We tried multiple times to replicate, perhaps a total of 10-20 human hours over a decade or so (spread out between maybe 3 people, all CS PhDs), and never got close enough to a smoking gun to develop a theory of the bug (and therefore, not able to develop a fix).

To be clear, I don't think this "proves" anything one way or another, as it's only one data point. But given that this is a team of CS PhDs intimately familiar with tools for race detection and debugging, it's notable that the LLM meaningfully helped us debug it.


For someone claiming to be an AI skeptic, your post here, and posts in your profile certainly seem to be at least partially AI written.

For someone claiming to be an AI skeptic, you certainly seem to post a lot of pro-AI comments.

Makes me wonder if this is an AI agent prompted to claim to be against AIs but then push AI agenda, much like the fake "walk away" movement.


I have an old account, you can read my history of comments and see if my style has changed. No need to take my word for it.


Tangential off topic, but reminds me of seeing so many defenses for Brexit that started with “I voted Remain but…”

Nowadays when I read “I am an AI skeptic but”, I already know the comment is coming from someone who has drunk the Kool-Aid.


> No, you tried to persuade it not to with a single instruction

Even "persuade" is too strong a word. These things don't have the motivation needed for persuasion to be a thing. What your client did was put one data point in the context that the model uses to generate the next tokens. If that one data point doesn't shift the context enough to make it produce an output that corresponds to it, then it won't. That's it; no sentience involved.


> It’s just a text generator that generates plausible text for this role play.

Often enough, that text is extremely plausible.


I pin just as much responsibility on people not taking the time to understand these tools before using them. RTFM basically.


I think the mindset you have to have is "it understands words, but has no concept of physics".


Working on TRIPA, an internal pack format for saving huge numbers of small images to cloud storage without dying from the write costs.
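For readers unfamiliar with the idea: batching many small blobs into one object means one billed write instead of thousands, with ranged reads for random access. Here is a toy sketch of such a format in Python; this layout is invented for illustration and is not the actual TRIPA format:

```python
import struct

MAGIC = b"PACK"  # illustrative magic number, not TRIPA's

def pack(images: list[bytes]) -> bytes:
    """Concatenate blobs behind a simple index: magic, count,
    then (offset, length) pairs, then the raw image bytes."""
    header_size = len(MAGIC) + 4 + 8 * len(images)
    index, body, offset = [], b"", header_size
    for img in images:
        index.append(struct.pack("<II", offset, len(img)))
        body += img
        offset += len(img)
    return MAGIC + struct.pack("<I", len(images)) + b"".join(index) + body

def unpack(blob: bytes, i: int) -> bytes:
    """Random access to the i-th image without scanning the rest;
    against cloud storage this maps to a small ranged read."""
    assert blob[:4] == MAGIC
    (count,) = struct.unpack_from("<I", blob, 4)
    assert 0 <= i < count
    offset, length = struct.unpack_from("<II", blob, 8 + 8 * i)
    return blob[offset : offset + length]
```

Against a real object store, `unpack` would become one ranged GET for the index entry and one for the blob (or a single GET if the index is cached client-side).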


I'm planning a change that will save 20k a month in storage costs.

I absolutely could come up with the details and implementation by myself, but that would certainly take a lot of back and forth, probably a month or two.

I’m an API user of Claude Code, burning through 2k a month. Just this evening I planned the whole thing with its help, and actually had to stop it from implementing it right away. I'll do that tomorrow, probably in an hour or two, with better code than I could ever write alone.

Having that level of intelligence at that price is just bollocks. I’m running out of problems to solve. It’s been six months.


In 2020 I became a full-time Java developer, coming from an infrastructure role where I kind of dealt with Java code, but always as artifacts I managed in application servers and whatnot.

So when I first started dealing with the actual code, it scared me that the standard json library was basically in maintenance mode for some years back then. The standard unit test framework and lot of other key pieces too.

I interpreted that as “Java is dying”. But 6 years later I understand: they were feature complete. Fast as hell, with god knows how many corner cases covered. They were in a problem-solved, 1-in-a-billion-edge-cases-covered, feature-complete state.

Not abandoned or neglected: patches are incorporated in days or hours. Just… stable.

All is quiet now; they are used by millions and remain stable. Not perfect, but their defects are known and depended on by many. Their known bugs are now features.

But it seems that no one truly wants that. We want the shiny things. We wrote the same frameworks in Java, then Python, then Go, then Node, then TypeScript.

There must be something inherently human about changing and rewriting things.

There is indeed change in the Java ecosystem, but people just choose another name and move on. JUnit, the battle-tested unit testing framework, had a lot to learn from newer approaches like pytest. Instead of disturbing that stability, they just chose another name, JUnit 5, and moved on.


> But it seems that no one truly wants that. We want the shiny things. We wrote the same frameworks in Java, then Python, then Go, then Node, then TypeScript.

I think that people are just afraid that if they use a library in maintenance, they will run into a bug and it'll never get fixed. So they figure it's safer to adopt something undergoing further development, because then if there are issues they will get fixed. And of course, some people have to deal with compliance requirements which force them to only use software which is still updated.


If you need a bug to get fixed, hoping someone else is going to do it is not a good strategy anyway. Just fix the bug yourself.

I think people are mostly just cargo culting tbh.


I remember we switched to Redis because Java's memcached library was unmaintained. I made a joke that it was just feature-complete and couldn't be improved upon; people chuckled, but we still did the switch.


Quite a bit of risk telescoping there...because you had the source code to the memcached library so in the theoretical case you found a bug in mature code (how many times have you seen that?), you weren't SOL. So instead you switched to an entirely new system? If you were trying to minimize risk and cost, you did the opposite unless memcached was doing something else that was a problem.


It wasn't entirely just that; we had to switch to Redis for another sub-system, and IIRC there were some positive implications for the cache layer as well. It's been a while, but it wasn't just because the memcached library was unmaintained.


Honestly having looked at the memcached clients available for Java recently, I don't think any of the options could be considered feature-complete. None of the main ones support the meta protocol at all, meaning most of the advanced features aren't possible (and these are things that can't be emulated on the client side).

Hell, the main feature I needed (bulk CAS get) didn't even require the meta protocol or recent memcached features - spymemcached just never bothered to implement it. I ended up abandoning the change I was working on, because the upstream never looked at my PR and it wasn't worth forking over (bigco bureaucracy etc).

There are also quite a few legitimate bugs open for years that haven't had so much as a comment from maintainers.


I think there's a huge "it depends" caveat. In the JS world I remember browserify, it did what it was meant to do and it was extendable. A really nice Unix-like minimal software.

The reality is that it was just a small piece in a larger ideal build chain. So for the past 10+ years, we've seen an explosion of more complete build tools that do everything.

Browserify now sits there "finished" and receiving bugfixes. Nobody uses it anymore, even if it popularized npm for the frontend.


Brazil’s free software initiative in the 2000s was all about technological dependency.

Brazil was hoping to leverage governmental spending to kickstart a national software development industry. Some sort of leap into the future, jumping over both the industrial era and the service-based economy we missed.

It was killed with fire by huge Microsoft (and, I suppose, American) lobbying in congress, but back then America had a very favorable public image as a nurturing and democratic partner. Some sort of older brother guiding you into adulthood.

Currently, at least in my bubble, the public view of America is more like a predator with Trump as a protodictator. Not necessarily true, understand me, just as that older brother view wasn’t. But it’s public perception.

A good part of what disabled the Brazilian initiative was simply free Google Workspace for public universities (which were in the government plan).

I suppose the existential level of anxiety caused by current developments will probably make European governments immune to American lobbying (at least in the short term), so I suppose this can actually happen.

Let’s see how it develops when they try to ban Microsoft from the universities. That would be the acid test.


> It was killed with fire by huge Microsoft (and American, I suppose) lobbying in congress

Well... the bad quality of the decree itself helped at least as much as Microsoft.

Government organizations often discover it's easier to publish their software on GitHub than to make the publishing agency accept it.

There was no migration plan, and the option that was actually pushed from the central organizations required constant contracts that were about as expensive and hard to manage as the ones with Microsoft, but hiring the government.

At the same time, the same organization that others were supposed to contract was getting delisted worldwide for bad security practices.


By typography alone I can now tell turbopuffer is written in Zig.


It is by the juice of Zig that binaries acquire speed, the allocators acquire ownership, the ownership becomes a warning. It is by typography alone I can now tell turbopuffer is written in Zig.


thanks for that!


This is not AI slop; it's advertising in the LLM era.



Let’s be honest: the whole thing is just to prevent Claude from running “rm -rf /”.

It’s not that someone is trying to keep the thing from talking to the internet or reading your emails; it’s just that it sometimes has the strange itch to change files outside the project.


There are thousands like you now. How many does it take to run the economy? What would the rest do?

Think of it like what the tractor did to agricultural work. The first guy that used a tractor probably thought: this isn't replacing me, I'm just much more productive. Well, it turns out you only need one guy per farm now.


But now many suburban homeowners also have a little lawn tractor, and lots of people on small acreage have a utility tractor. None of them are farmers, but they get value out of the technology as well. Plus, we're feeding a lot more people for a lot less money than we did before tractors.


Yeah, but we used to employ hundreds of people per farm, or per plantation to be exact. Thousands, maybe, for the sugar cane work. Now replaced by 5 high-tech, GPS-driven tractors, with a human on board to supervise, not even to drive.

So a human mowing the lawn with mechanized tools: efficiency goes through the roof. Still one per home.

A human doing a high-volume manual labor job, where there was far more work than a single human could handle: the number of humans doing the job is now the amount of work divided by the amount one human can handle.

Of course we get ambitious, Panama Canal-building ambitious. But even that can't absorb the previous numbers of people doing that kind of work.


The market for iOS todo-applications seems to be infinite, so everyone can just become a todo app developer.


gemini-cli being such crap tells me that Google is not dogfooding it, because how else would they not have the RL trajectories to get a decent agent?

One thousand people using an agent over a month will generate like 30-60k good examples of tool use and nudge the model into good editing.

The only explanation I have is that Google is actually using something else internally.


Claude probably

