Hacker News | hexaga's comments

No, they aren't even good at rearranging existing material. They produce bad writing that only superficially looks good in a lowest-common-denominator sense, and falls apart under any close examination. Everything is wrong with it, from the sentence structure to the rhetorical forms to the substance. AI 'writing' is a loose collection of cheap tricks that score well on A/B tests.

Precisely. People here are claiming that everyone is being fooled and that LLMs are amazing at casual writing. It's very silly.

Because good things are few and far between and it's pretty easy to discern provenance out of band in almost all of those cases.


AI behavior is pretty easy to understand and predict if you view it through the lens of: they will shamelessly do any/everything possible to game whatever metric they are trained on. Because... that's how hill-climbing a metric looks. It's A/B enshittification taken to inscrutable heights.

They are trained on human feedback, so there is no other way this goes. Every bit of every response is pointed toward subversion of the assumed evaluator.


It's really simple. RL on human evaluators selects for this kind of 'rhetorical structure with nonsensical content'.

Train on a thousand tasks with a thousand human evaluators and you have trained a thousand times on 'affect a human' and only once on any given task.

By necessity, you will get outputs that make lots of sense in the space of general patterns that affect people, but don't in the object level reality of what's actually being said. The model has been trained 1000x more on the former.
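The asymmetry above can be shown with a toy count. This is a minimal sketch under stated assumptions: every RL episode reinforces both a shared, evaluator-facing "affect a human" skill and one task-specific skill; the names `count_updates`, `affect_a_human`, and `task_N` are all hypothetical illustrations, not anything from a real training setup.

```python
from collections import Counter

def count_updates(n_tasks: int, episodes_per_task: int = 1) -> Counter:
    """Count how often each 'skill' receives a reward signal.

    Assumption: each episode updates one shared evaluator-facing skill
    plus one skill specific to the task at hand.
    """
    updates = Counter()
    for task_id in range(n_tasks):
        for _ in range(episodes_per_task):
            updates["affect_a_human"] += 1      # shared across every episode
            updates[f"task_{task_id}"] += 1     # seen only for this task
    return updates

updates = count_updates(n_tasks=1000)
# The shared persuasion signal is reinforced 1000x more than any single task skill.
print(updates["affect_a_human"], updates["task_0"])  # 1000 1
```

The point of the sketch is only the ratio: the evaluator-facing signal accumulates once per episode regardless of task, so it dominates any individual object-level skill by construction.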

Put another way: the framing is hyper-sensical while the content is gibberish.

This is a very reliable tell for AI generated content (well, highly RL'd content, anyway).



How do you handle the problem of AI misleading by design? For example, Claude already lies regularly (and quite convincingly) in exactly this situation, attempting to convince you that what is actually broken isn't such a big deal after all, or similar.

How can this product possibly improve the status quo of AI constantly, without end, trying to 'squeak things by' during any and all human and automated review processes? That is, you are giving the AI, which already cheats like hell, a massive thumb on the scale to cheat harder. How does this not immediately make all related problems worse?

The bulk of the difficulty in reviewing AI outputs is escaping the framing they never stop trying to apply. It's never just some code. It's always some code that is 'supposed to look like something', alongside a ton of convincing prose promising that it _really_ does do that thing, and a bunch of reasons why you shouldn't check the specific things that would tell you it doesn't (hiding evidence, etc).

99% of the problem is that the AI already has too much control over presentation when it is motivated about the result of eval. How does giving AI more tools to frame things in a narrative form of its choice and telling you what to look at help? I'm at a loss.

The quantity of code has never been a problem. Or prose. It's that all of it is engineered to mislead / hide things in ways that require a ton of effort to detect. You can't trust it and there's no equivalent of a social cost of 'being caught bullshitting' like you have with real human coworkers. This product seems like it takes that problem and turns the dial to 11.


Thanks for sharing this. I do agree with a lot of what you said, especially around trusting what it's actually telling you.

For me, I only run into problems of an agent misleading/lying to me when working on a large feature, where the agent has a strong incentive to lie and pretend the work is done. However, there doesn't seem to be this same incentive for a completely separate agent that is just generating a narrative of a pull request. Would love to hear what you think.


There is no separation. Incentive propagates through LLMs with approximately 0 resistance. If the input tells a story, the output tends to that story reinforced.

The code/PR generator is heavily incentivized to spin by RL on humans - as soon as that spin comes into contact with your narrative gen context, it's cooked. Any output that has actually seen the spin is tainted and starts spinning itself. And then there's also spin originating in the narrative gen... Hence, the examples read like straight advertisements, totally contaminated, shot through with messaging like:

- this is solid, very trustworthy

- you can trust that this is reliable logic with a sensible, comprehensible design

- the patterns are great and very professional and responsible

- etc

If the narrative reads like a glow-up photoshoot for the PR, something has gone extremely wrong. This is not conducive to fairly reviewing it. It is presented as way better than it actually is. Even if there are no outright lies, the whole thing is a mischaracterization.

RL is a hell of a drug.

Anyway, this is the problem of AI output. It cannot be trusted that the impression it presents is the reality or even a best attempt at reality. You have to carefully assemble your own view of the real reality in parallel to w/e it gives you, which is a massive pain in the ass. And if you skip that, you just continually let defects/slop through.

The worst problem mucking things up is that RL insights that work on people also work on AI, because the AI is modelling human language patterns. Reviewing slop sucks because it's filled with (working) exploits against humans. And AI cannot help because it is immediately subverted. So I guess it requires finding a way to strip out the exploits without changing mechanical details. But that's hard, because the spin saturates 100% of the output at many levels of abstraction, including the mechanical details.


But how do you know they’re not lying to you? What are your benchmarks for this? Experience? Anecdote? Data?

And I’m asking you in good faith - not trying to argue.

I’m thinking about these types of questions on a daily basis, and I love to see others thinking about them too.


This is like complaining that someone doesn't have a solution for the foot injuries caused by repeatedly shooting yourself in the foot.


If your team is shooting each other's feet and you can't stop them, I guess this would be a foot to air interceptor for some of the bullets.


Meh. Temp 0 means throwing away huge swathes of the information painstakingly acquired through training, for minimal benefit, if any. Nondeterminism is a red herring: the model is still going to be an inscrutable black box with mostly unknowable nonlinear transition boundaries w.r.t. inputs, even if you make it perfectly repeatable. It doesn't protect you from tiny changes in inputs having large changes in outputs _with no explanation as to why_. And in the process you've made the model significantly stupider.

As for distillation... sampling from the temp 1 distribution makes it easier.
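The temperature argument above can be made concrete with a small sketch of softmax sampling. This is a toy illustration, not any real inference stack: `sample_dist` is a hypothetical helper showing that as temperature approaches 0 the distribution collapses onto the single argmax token, discarding everything the model learned about the relative plausibility of the alternatives.

```python
import math

def sample_dist(logits, temperature):
    """Softmax over raw logits at a given temperature (temperature > 0)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.5, 0.5]
print(sample_dist(logits, 1.0))    # probability mass spread across tokens
print(sample_dist(logits, 0.01))   # nearly all mass on the argmax: temp -> 0 is greedy
```

At temperature 1 the second and third tokens still carry meaningful probability; near temperature 0 that information is gone, which is the "throwing away" the comment describes.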


You're expecting it to be a person. It's not.

It is more like a wiggly search engine. You give it a (wiggly) query and a (wiggly) corpus, and it returns a (wiggly) output.

If you are looking for a wiggly sort of thing like 'MAKE Y WITH NO BUGS' or 'THE BUGS IN Y', it can be kinda useful. But thinking of it as a person because it vaguely communicates like a person will get you into problems, because it's not.

You can try to paper over it with some agent harness or whatever, but you are really making a slightly more complex wiggly query that handles some of the deficiency space of the more basic wiggly query: "MAKE Y WITH NO ISSUES -> FIND ISSUES -> FIX ISSUE Z IN Y -> ...".

OK well what is an issue? _You_ are a person (presumably) and can judge whether something is a bug or a nitpick or _something you care about_ or not. Ultimately, this is the grounding that the LLM lacks and you do not. You have an idea about what you care about. What you care about has to be part of the wiggly query, or the wiggly search engine will not return the wiggly output you are looking for.

You cannot phrase a wiggly query referencing unavailable information (well, you can, but it's pointless). The following query is not possible to phrase in a way an LLM can satisfy (and this is the exact answer to your question):

- "Make what I want."

What you want is too complicated, and too hard, and too unknown. Getting what you are looking for reduces to: query for an approximation of what I want, repeating until I decide it no longer surfaces what I want. This depends on an accurate conception of what you want, so only you can do it.

If you remove yourself from the critical path, the output will not be what you want. Expressing what you want precisely enough to ground a wiggly search would just be something like code, and obviates the need for wiggly searching in the first place.
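The refine-until-satisfied loop described above can be sketched as code. Everything here is a hypothetical stand-in: `wiggly_search` stands in for the model call, and `good_enough` stands in for the human judgment that, per the comment, only you can supply.

```python
def wiggly_search(query: str) -> str:
    # Stand-in for an LLM call; here it just echoes the query back.
    return f"output for: {query}"

def good_enough(output: str) -> bool:
    # Only you can implement this: it encodes what you actually want.
    return "no bugs" in output

def refine(query: str, max_rounds: int = 5) -> str:
    """Query, judge, fold the missing criterion back into the query, repeat."""
    output = wiggly_search(query)
    for _ in range(max_rounds):
        if good_enough(output):
            break
        query += " with no bugs"   # what you care about becomes part of the query
        output = wiggly_search(query)
    return output

print(refine("MAKE Y"))  # output for: MAKE Y with no bugs
```

The structural point is that the loop's exit condition lives in `good_enough`, i.e. in the person: remove that and the loop has no grounding, which is the comment's claim about removing yourself from the critical path.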


Try hate; it will do. But most will love it instead and you would be driven apart from them.


Their point (and it's a good one) is that there are non-obvious analogues to the obvious case of just telling it to do the task terribly. There is no 'best' way to specify a task that you can label as 'rational', all others be damned. Even if one is found empirically, it changes from model to model to harness to w/e.

To clarify, consider the gradated:

> Do task X extremely well

> Do task X poorly

> Do task X or else Y will happen

> Do task X and you get a trillion dollars

> Do task X and talk like a caveman

Do you see the problem? "Do task X" also cannot be a solid baseline, because there are any number of ways to specify the task itself, and they all carry their own implicit biasing of the track the output takes.

The argument that OP makes is that RL prevents degradation... So this should not be a problem? All prompts should be equivalent? Except it obviously is a problem, and prompting does affect the output (how can it not?), _and they are even claiming their specific prompting does so, too_! The claim is nonsense on its face.

If the caveman style modifier improves output, removing it degrades output and what is claimed plainly isn't the case. Parent is right.

If it worsens output, the claim they made is again plainly not the case (via inverted but equivalent construction). Parent is right.

If it has no effect, it runs counter to their central premise and the research they cite in support of it (which only potentially applies - they study 'be concise', not a skill file full of caveman styling rules). Parent is right.

