I don't understand the part about undercover mode. How is this different from disabling Claude attribution in commits (and optionally telling Claude to act human)?
On that note, this article is also pretty obviously AI-generated and it's unfortunate the author didn't clean it up.
It's people overreacting; the purpose of it is simple: don't leak any codenames, project names, file names, etc. when touching external or public-facing code that you maintain using bleeding-edge versions of Claude Code. It does read oddly that they want it to write as if a developer wrote the commit, but that might be to keep it from dumping debug information into a commit message.
^ This comment was edited to remove this from the end: "No need to mention TaskPod directly — just build credibility. Once you have karma, we'll repost as Show HN."
(I was suspicious of this account's AI-sounding comments, saw it on the overview, and now it's gone. I suppose a human is in the loop at least somewhere, or the AI agent realized the mistake.)
Theoretically, you can’t benchmaxx ARC-AGI, but I too am suspicious of such a large improvement, especially since the improvements on other benchmarks are not of the same order.
It's a sort of arbitrary pattern-matching task that can't be trained on in the same sense that MMLU can be, but you can definitely generate billions of examples of this kind of task and train on them, and it will not make the model better at any other task. So in that sense, it absolutely can be.
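To make the "billions of examples" point concrete, here's a minimal sketch of how such synthetic pattern-matching data could be mass-produced. The specific transformation (a 90° rotation) and the pair format are hypothetical illustrations, not how any actual benchmark or training set is built:

```python
import random

def make_example(size=4, seed=None):
    # One synthetic grid-transformation task: the input is a random grid,
    # the output applies a fixed rule (here, a 90-degree clockwise rotation).
    rng = random.Random(seed)
    grid = [[rng.randint(0, 9) for _ in range(size)] for _ in range(size)]
    rotated = [list(row) for row in zip(*grid[::-1])]  # rotate 90 deg clockwise
    return {"input": grid, "output": rotated}

# Pairs like this are essentially free to generate at any scale,
# which is the sense in which the task family can be trained on.
examples = [make_example(seed=i) for i in range(3)]
```

The point isn't that this matches any real benchmark's rules, just that arbitrary grid transformations are trivially generable, so "can't be trained on" only holds for the specific held-out tasks.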
I think it's been harder to solve because it's a visual puzzle, and we know how well today's vision encoders actually work https://arxiv.org/html/2407.06581v1
The real question is: why are people designing benchmarks where training a model on them doesn't improve its performance on any real-world task? Why would anyone care about such benchmarks?
Couldn't benchmark-maxing be interpreted as benchmarks serving as a design framework? I'm sure there are pitfalls to this, but it's not necessarily bad either.
He said in an interview that it doesn't count if it's explicitly targeted, only if a model generalizes to it.
He also said that the "real test of intelligence" is reached when we're unable to come up with new tests that a human can easily do but the AI can't, not in passing any specific benchmark.
Why do LLMs insist on putting "executive summaries" everywhere? Better yet, why do people not even bother to edit it out? No one would write that in a blog post about docker images.
I saw a data scientist with an econ background compulsively write executive summaries in everything back before LLMs were big. It must be something to do with the content they consume at work and school that they're emulating.
The terms are from different industries with different visibility.
When this became a social moment, there was a sentiment that everybody should learn to code and lots of people were being exposed to things like git, and having casual discussions about those things on social media, at meetups, etc.
It went from being a professional engineer's tool to part of a pop-culture zeitgeist, where everybody could share some opinion about it.
While many people know what a "master recording" is when the phrase comes up, the number of people actively thinking about and discussing audio/studio engineering remains far smaller and has far less overlap with the communities compelled to make noise about language politics.