Hacker News | verdverm's comments

There was a recent study posted here showing that AI agents introduce regressions at an alarming rate (all but one model above 50%), which suggests they spend a lot of time fixing their own mistakes. You've probably seen them do this kind of thing: making one change that breaks another, then going back to adjust that, not realizing it's making things worse.

The study is likely "SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration". Regression rate plot is figure 6.

Read the study to understand what it measures and how. As I understand it, the parent's summary is fine, but you should understand the study yourself before repeating it to others.

https://arxiv.org/abs/2603.03823


I've had some luck pointing out where the AI is wrong in its sloppypasta, as delicately as one can. Avoiding shame or embarrassment can be a powerful motivator.

The most interesting incident for me was having someone take our Discourse thread, paste it into an AI to validate their hurt feelings (it took a follow-up prompt to go full sycophancy), and then post the response, which lambasted me, back to the thread. The mods handled that one before I was aware of it, but I then did the same thing with different prompts, never sharing the output. It was an intriguing experience and exploration. I've since been even more mindful of my writing, sometimes using similar prompts to adjust my tone or call me out. I still write the first pass myself and rarely rely on AI for editing.


Ooh, I saw a very similar situation. A user went to an AI and asked "Which user was disrespectful first?" to dunk on another.

The person being targeted just prompted the same AI with "Which user has thin skin?" and the AI instantly turned on the other person. Then the moderators got involved and told the first guy to stop using AI as a genital pleaser.


I asked Gemini what it thought. In one of its modes, it said bringing an AI to a discussion is like bringing a gun to a knife fight: using AI is like having a rhetorical weapon and an advantage in what everyone thought was a human-to-human forum.

I would say LMAAFY is like LMGTFY, whereas the sloppypasta is more like pasting a list of search results without vetting them. That is, there are two phases to this phenomenon: query and results.

Every accusation is an admission from the people running the United States right now.

I think it's something they're doing consciously: accusing the opposition of doing what they themselves are doing. It puts the opposition on the back foot, muddies the water, and provides justification for doing something you shouldn't, since "the other side already did it". This is partly why politics has become like a Choose Your Own Adventure book (more than it was before, anyway).

Accusation in a mirror: a strategy that is pretty much as you describe.

https://en.wikipedia.org/wiki/Accusation_in_a_mirror


Thanks, I had not heard of this.

Which is why we need to watch the polls closely this year. They've repeatedly accused everyone else of massive voter fraud, so they will most certainly try it themselves.

Slight tangent, but the thought has crossed my mind that (the potential of) retaliation by Iran could be a pretext for taking broad executive action that normally wouldn't be permitted, perhaps under the guise of securing the election. They may even do something that courts later rule illegal, but it could take months after the election to reach a verdict, and there's really no precedent in America for broadly declaring a federal election invalid.

To be clear I don't really expect this to happen but at this point I honestly wouldn't even be surprised.


I think they were hoping to provoke riots with ICE, but since those didn't happen they're using Iran. If there's a "terrorist" strike in November followed by a suspension of elections, that's likely what's happening.

Roy Cohn taught the man-child well.

The paid plans require you to use their interfaces or tooling. The main reason I pay per token is so I have tooling freedom. I'm not keen to let Big AI decide how I interact with this game-changing technology.

I haven't yet been blocked by their tooling. All the current tools I use seem to work fine with the Claude Code interface; you can just call it with -p.

I pay per token using an API (Vertex AI):

1. Higher limits and quota, goes up with more spend

2. I don't think it gets quant nerfs during busy times, since you're paying directly

3. With Gemini Flash + CC prompts, it's nearly as good, so less spend and lower latency. I don't know how people deal with the delays between turns; I get to iterate much faster for most tasks


I don't seem to have any delays between turns with my Claude subscription, so not sure what you mean there.

Your second point seems like a guess.

For your first point, I bumped to the $100 plan because I was hitting limits with the $20 one, but I haven't hit the limit with the new one yet...


I mean the 10-30s wait for a frontier model to respond, when the lite/flash models are just fine for almost all tasks.

I don't want some other tooling messing with my context. It's too important to leave to something that needs to optimize across many users, thereby not being the best for my specifics.

The framework I use (ADK) already handles this; it's very low-hanging fruit that should be part of any framework, not something external. In ADK, this is a boolean you can turn on per tool or subagent, and you can even decide turn by turn, or based on any context you see fit, by supplying a function.
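A minimal sketch of that per-tool/per-subagent toggle pattern. This is illustrative only: the class and method names here (`Tool`, `should_compress`) are hypothetical, not real ADK API; it just shows the shape of "a static boolean, optionally overridden by a function you supply".

```python
# Hypothetical sketch of a per-tool context-compression toggle,
# modeled loosely on the pattern described above. Names are
# illustrative, not actual ADK identifiers.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Tool:
    name: str
    # Static toggle: always compress context before calling this tool?
    compress_context: bool = False
    # Optional override: inspect the live context and decide per turn.
    compress_fn: Optional[Callable[[dict], bool]] = None

    def should_compress(self, context: dict) -> bool:
        # A supplied function wins over the static boolean.
        if self.compress_fn is not None:
            return self.compress_fn(context)
        return self.compress_context

# Usage: one tool with a static toggle, one deciding by history length.
search = Tool("search", compress_context=True)
editor = Tool("editor", compress_fn=lambda ctx: len(ctx.get("history", [])) > 50)

print(search.should_compress({}))                     # True
print(editor.should_compress({"history": [0] * 10}))  # False
```

The point is that the "feature" is just a predicate evaluated before each tool call, which is why it is cheap for any framework to offer.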

YC over-indexed on AI startups too early, not realizing how trivial these startup "products" are; they're more of a line item in the feature list of a mature agent framework.

I've also seen dozens of this same project submitted by the claws, which led to our new rule addition this week. If your project can be vibe-coded by dozens of people in mere hours...


Speaking from experience - serving good context compression is not trivial.

YMMV; I don't know why you think it's hard, other than that you want to sell it.

Not my experience


Thanks for the reverse KYC

The next step beyond this is using a better tool to access containers (BuildKit), like Dagger, where you can track every step as a new container layer, time travel, share via registries...

This has been my setup since early this year, not even that much code: https://github.com/hofstadter-io/hof/tree/_next/lib/agent/se...

The bigger effort is making it play nice with vscode so you can browse and edit the files and diffs.


This still does not help with "you can call foo, but not bar". We have plenty of existing tooling for that too.
