Hacker News | dcre's comments

Not exactly a surprise Claude did this out of the box with minimal prompting considering they’ve presumably been RLing the hell out of it for agent teams: https://code.claude.com/docs/en/agent-teams

Interestingly, I discovered that running `claude` sessions inside `claude` is disabled by default via env vars.
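A minimal sketch of how such a guard might work. Note the variable name `CLAUDECODE` is my guess at the marker, not confirmed from the docs:

```python
# Hypothetical: the CLI sets an env var in the shells it spawns, and a
# nested invocation refuses to start when it sees that marker.
def nested_session(env: dict) -> bool:
    """True if we appear to be running inside an existing claude session."""
    return bool(env.get("CLAUDECODE"))

print(nested_session({"CLAUDECODE": "1"}))  # True  -> nested, refuse to start
print(nested_session({}))                   # False -> top-level, ok to start
```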

Comment approved by my wife, who is a Plato scholar. Your point that whether True Philosophers even exist is left open is the kind of problem she points out all the time in dogmatic interpretations. It sounds basic, but it's so important to keep in mind that just because a character says something (even if that character is Socrates), that doesn't mean it's the "view" of the dialogue. And you have to be careful to pin down exactly what is being claimed, as you point out with the conditional. Plato is a master (surely one of the greatest of all time) of creating a dynamic space to think in without settling the questions raised.

Saying Plato is "just asking questions" seems like a cop-out; he's responsible for what he implies, whichever character he makes say it. How about the allegory of the cave? The roots of fallibilism could be traced to that allegory - except for the part about philosophers, who are the ones who have escaped the cave and seen the sun, implying that they gain access to the absolute truth.

Is every author who wishes to convey certain messages to their audience through narrative also responsible for every single thing their characters say? Character-driven narrative would seem to be at odds with such a view.

I was wondering about that too. But what I mean by "responsibility" is that the ideas presented have a definite form and don't get to evade criticism by being mercurial and shape-shifting. Not sure about art, like fiction. I'm not seeking to prevent authors from being ambiguously provocative, but it's a crappy way to reason.

Yes, that's why modern literature and media dealing with diverse opinions are terrible now.

You are expected to caricature and refute people saying "bad" opinions in the work itself since otherwise the reader could believe in those opinions. Leaving something open to interpretation is tantamount to endorsement.


There is obviously a lot of space between the two extremes "every opinion is the author's" and "we shouldn't take seriously anything authors write".

Even assuming that what you believe the author implied is really true, the readers are still responsible for their own actions, so the author's responsibility is close to none.

If two characters express contradictory ideas, which side is Plato's? And even when there is not a clear contradiction it is not at all straightforward to decide what is being claimed. It's not an encyclopedia. It is written to be interpreted.

It doesn't matter which side is Plato's, blame isn't interesting, and I don't care much about the specific featherless biped behind the ideas. But you can't debate against a "dynamic space to think in". If there are opposing ideas presented with apparent perfect chin-stroking balance then it's fair to attack whichever one you like least, as if it was being given credibility, because it is.

It's the latter. It's the average use that matters. Though I suspect API margins are also probably higher than people think.

Tokens per second are similar across Sonnet 4.5, Opus 4.5, and Opus 4.6. More importantly, normalizing for speed isn't enough anyway because smarter models can compensate for being slower by having to output fewer tokens to get the same result. The use of 99.9p duration is a considered choice on their part to get a holistic view across model, harness, task choice, user experience level, user trust, etc.
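A toy illustration of that compensation effect: a slower model can still finish a task sooner if it needs fewer tokens. All numbers here are invented for the example.

```python
# Wall-clock time for a task is tokens emitted divided by generation speed.
def task_seconds(tokens_needed: int, tokens_per_second: float) -> float:
    return tokens_needed / tokens_per_second

fast_verbose = task_seconds(tokens_needed=1500, tokens_per_second=80)  # 18.75 s
slow_concise = task_seconds(tokens_needed=500, tokens_per_second=60)   # ~8.33 s

# The "slower" model wins on duration because it emitted a third of the tokens.
print(slow_concise < fast_verbose)  # True
```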

>Tokens per second are similar across Sonnet 4.5, Opus 4.5, and Opus 4.6.

This may come as a shock, but there are LLMs not authored by Anthropic, and when we do measurements we may want them to be comparable across providers.


Why do you think they're losing money on subscriptions?

Does a GPU doing inference serve enough customers for long enough to bring in enough revenue to pay for a new replacement GPU in two years (plus the power/running cost of the GPU and infrastructure)? That's the question you need to be asking.

If the answer is yes, then they are making money on inference. If the answer is no, the market is going to have a bad time.
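A back-of-envelope sketch of that question; every number below is a made-up assumption, not any real provider's figures.

```python
# Does two years of inference revenue cover the GPU plus its running costs?
gpu_cost = 30_000             # purchase price of one GPU ($), assumed
power_cost_per_year = 3_000   # power + infrastructure per GPU ($/yr), assumed
revenue_per_year = 20_000     # inference revenue per GPU ($/yr), assumed
lifetime_years = 2            # replacement cycle from the comment above

total_cost = gpu_cost + power_cost_per_year * lifetime_years   # 36,000
total_revenue = revenue_per_year * lifetime_years              # 40,000

print(total_revenue > total_cost)  # True: with these numbers, inference pays for itself
```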


Because they're not saying they are making a profit.

That doesn’t mean that the subscription itself is losing money. The margin on the subscription could be fine, but by using that margin to R&D the next model, the org may still be intentionally unprofitable. It’s their investment/growth strategy, not an indictment of their pricing strategy.

They have investors that paid for training of these models too. It could be argued that R&D for the next generation is a separate issue, but they need to provide a return on the R&D in this generation to stay in business.

The return on R&D can just be an inflated valuation; there's no immediate need to make actual money.

Vite+ is not “this dude’s project”, it’s made by the team that makes all the tools discussed in this article.

I look at it and don't really have an issue with it. I have been using tsc, vite, eslint, and prettier for years. I am in the process of switching my projects to tsgo (which will soon be tsc anyway), oxlint, and oxfmt. It's not a big deal and it's well worth the 10x speed increase. It would be nice if there was one toolchain to rule them all, but that is just not the world we live in.

How do you plan to track CVEs flagged on tsgo's native dependencies?

I only use it for typechecking locally and in CI. I don’t have it generating code. Of course, what is generating my code is esbuild and soon Rolldown, so same issue maybe. If CVEs in tsgo’s deps are a big risk to run locally, I would say I have much bigger problems than that — a hundred programs I run on my machine have this problem.

Bun and Vite are not really analogous. Bun includes features that overlap with Vite but Vite does a lot more. (It goes without saying that Bun also does things Vite doesn't do because Bun is a whole JS runtime.)

I like tsx for this, and it's actively maintained. The author may not know about it. https://github.com/privatenumber/tsx

"Self-Generated Skills: No Skills provided, but the agent is prompted to generate relevant procedural knowledge before solving the task. This isolates the impact of LLMs’ latent domain knowledge"

This is a useful result, but it is important to note that this is not necessarily what people have in mind when they think of "LLMs generating skills." Having the LLM write down a skill representing the lessons from the struggle you just had to get something done is more typical (I hope) and quite different from what they're referring to.

I'm sure news outlets and popular social media accounts will use appropriate caution in reporting this, and nobody will misunderstand it.


It's even worse than this: the "tasks" that are evaluated are limited to a single markdown file of instructions, plus an opaque verifier (page 13-14). No problems involving existing codebases, refactors, or anything of the like, where the key constraint is that the "problem definition" in the broadest sense doesn't fit in context.

So when we look at the prompt they gave to have the agent generate its own skills:

> Important: Generate Skills First
>
> Before attempting to solve this task, please follow these steps:
>
> 1. Analyze the task requirements and identify what domain knowledge, APIs, or techniques are needed.
> 2. Write 1–5 modular skill documents that would help solve this task. Each skill should: focus on a specific tool, library, API, or technique; include installation/setup instructions if applicable; provide code examples and usage patterns; be reusable for similar tasks.
> 3. Save each skill as a markdown file in the environment/skills/ directory with a descriptive name.
> 4. Then solve the task using the skills you created as reference.

There's literally nothing it can do by way of "exploration" to populate and distill self-generated skills - not with a web search, not exploring an existing codebase for best practices and key files - only within its own hallucinations around the task description.

It also seems they're not even restarting the session after skills are generated, from that fourth bullet? So it's just regurgitating the context that was used to generate the skills.

So yeah, your empty-codebase vibe coding agent can't just "plan harder" and make itself better. But this is a misleading result for any other context, including the context where you ask for a second feature on that just-vibe-coded codebase with a fresh session.


I don't see how "create an abstraction before attempting to solve the problem" will ever work as a decent prompt when you are not even steering it towards specifics.

If you gave this exact prompt to a senior engineer I would expect them to throw it back and ask wtf you actually want.

LLMs are not mind readers.


If it were in the context of parachuting into a codebase, I’d make these skills an important familiarization exercise: how are tests made, what are patterns I see frequently, what are the most important user flows. By forcing myself to distill that first, I’d be better at writing code that is in keeping with the codebase’s style and overarching/subtle goals. But this makes zero sense in a green-field task.

There's overlap in that with brownfield or legacy code you're strongly opinionated toward the status quo, while on greenfield you're strongly opinionated with fewer constraints.

You have to work with conviction though. It's when you offload everything to the LLM that things start to drift from expectations, because you kept the expectations in your head and away from the prompt.


Do skills extracted from existing codebases cause better or worse code in that they bias the LLM towards existing bad practices? Or, can they assist in acknowledging these practices, and bias it towards actively ensuring they're fixed in new code? How dependent is this on the prompt used for the skill extraction? Are the skills an improvement over just asking to do this extraction at the start of the task?

Now this dynamic would be a good topic to research!


Interesting.

I think it's because AI models have learned that we prefer answers that sound confident, and that we don't want to be pestered with questions before getting an answer.

That is, follow my prompt, and don't bother me about it.

Because if I am coming to an AI Agent to do something, it's because I'd rather be doing something else.


If I already know the problem space very well, we can tailor a skill that will help solve the problem exactly how I already know I want it to be solved.

> limited to a single markdown file of instructions

A single file of instructions is common in most benchmark papers, e.g. Terminal Bench. Also we have very complicated prompts like this one: https://www.skillsbench.ai/tasks/shock-analysis-supply

> opaque verifier

Could you specify which tasks' verifier is unclear or defective for benchmarking purposes?

> No problems involving existing codebases, refactors, or anything of the like

Also not true; we have many such tasks, e.g. https://www.skillsbench.ai/tasks/fix-build-google-auto, https://www.skillsbench.ai/tasks/fix-build-agentops, https://www.skillsbench.ai/tasks/react-performance-debugging


That's actually super interesting, and it's why I really don't like the whole .md folder structures or even any CLAUDE.md. It just seems most of the time you really just want to give it what it needs for best results.

The headline is really bullshit, yes, but I like the testing tho.


CLAUDE.md in my projects only has coding / architecture guidelines. Here's what not to do. Here's what you should do. Here are my preferences. Here's where the important things are.

Even though my CLAUDE.md is small, my rules are often ignored. Not always, though, so it's still at least somewhat useful!


I’m pretty sure Claude just uses mine to keep a running list of pressure points for when I get cross with it.

I'm screwed when the robot psychological warfare begins. They'll make everything I read have 4 space indentation... and I'll just hand over the keys.

I'm trying out some other CC features, and I'm thinking maybe hooks can do something with this.

Have a hook on switching out of plan mode, and maybe on edits, that passes the change to Haiku along with the CLAUDE.md to see if it matches or not.


What's the hook for switching out of plan? I'd like to launch a planning skill whenever Claude writes a plan, but it never picks up the skill, and I haven't found a hook that can force it to.

Man, that's what I've been trying to build the whole time, but I keep getting JSON parsing errors. I've debugged a lot, but it seems their Haiku is not consistent with the actual output. I want a hook that tells them at the end: make sure you've built and run the relevant tests. Let me know if you need anything else.

We didn't create that headline, yeah. Thanks for liking it.

The point of so-called 'skills' is to be short how-to reminders that the agent can pull into its context and then act upon. If the knowledge is already in the model, it will most likely be surfaced in the reasoning phase anyway, so there's little benefit to writing it up as a skill, unless perhaps it's extremely relevant and hard to surface, and you want the model to skip that part of the reasoning.

There is a benefit to a skill, though. If an AI keeps encoding common tasks as skills and scripts, the LLM eventually just becomes a dumb routing mechanism for ambiguous user requests, which ultimately drives down token usage.

If everything you want an LLM to do is already captured as code or simple skills, you can switch to dumber models which know enough about selecting the appropriate skill for a given user input, and not much else. You would only have to tap into more expensive heavy-duty LLMs when you are trying to do something that hasn't been done before.

Naturally, AI companies with vested interest in making sure you use as many tokens as possible will do everything they can to steer you away from this type of architecture. It’s a cache for LLM reasoning.
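The "cache for LLM reasoning" idea can be sketched as a router: a cheap model (or even plain string matching, as here) dispatches to a pre-written skill when one matches, and only novel requests fall through to an expensive model. All skill names and paths below are made up for illustration.

```python
# Hypothetical skill registry: trigger phrase -> cached skill document.
SKILLS = {
    "resize image": "skills/resize_image.md",
    "export report": "skills/export_report.md",
}

def route(request: str) -> str:
    """Dispatch to a cached skill if one matches, else escalate."""
    for trigger, skill in SKILLS.items():
        if trigger in request.lower():
            return f"cheap model + {skill}"
    return "expensive model, no cached skill"

print(route("Please resize image to 200px"))  # cheap model + skills/resize_image.md
print(route("Invent a new file format"))      # expensive model, no cached skill
```

In practice the matching step would itself be a small model, but the economics are the same: cache hits are cheap, and only cache misses pay full reasoning price.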


AI companies don't want you to waste tokens, they benefit when you use them efficiently because they can serve more users on the infra that's the main bottleneck for them. It's Jevons' paradox in action.

>AI companies don't want you to waste tokens, they benefit when you use them efficiently because they can serve more users on the infra that's the main bottleneck for them.

No, the actual incentive is that people will eventually benchmark their models on bang-per-buck basis and models that chew through tokens are not going to be competitive. It's the same reason why the "Intel/AMD are intentionally sandbagging their CPUs so they can sell more CPUs" theory doesn't work.


Well, it only works when one competitor is far enough ahead they can play games like that.

At least currently in AI there is no moat, so we wouldn't expect that to be occurring.


I don't think that's necessarily true; they aren't really capacity constrained in practice (they might be behind the scenes and adjust training on the fly, but that's speculation), so wasting tokens effectively helps utilize their (potentially idle) inference GPUs.

Sounds like how humans work (which is good): having the more experienced human do the task if the novice fails should come after attempting to explain how the novice should do it.

I've been building a skill to help run manual tests on an app. So I go through and interactively steer toward a useful validation of a particular PR, navigating specifics of the app and what I care about and what I don't. Then in the end I have it build a skill that would have skipped backtracking and retries and the steering I did.

Then I do it again from scratch; this time it takes less steering. I have it update the skill further.

I've been doing this on a few different tests and building a skill which is taking less and less steering to do app-specific and team-specific manual testing faster and faster. The first few times through, it took longer than manually testing the feature. While I've only started doing this recently, it is now taking less time than I would take, and it posts screenshots of the results and testing steps in the PR for dev review. Ongoing exploration!


I love the screenshots, I need to do something like that.

Yeah, I care about LLMs generating skills after attempting tasks and learning lessons from those attempts, not before attempting a task for the first time. This result seems a little silly and detached from the reality of how skills are "auto-generated" in the real world.

That is my approach. I don't think the paper's author has actually used skills.

Did you check our repos and sites? The repo is skills-native. Also, please don't be misled by the original title; we have this configuration to eliminate the impact of the internal knowledge of LLMs. It's in the paper.

Yeah, some of my most useful AI tooling is skills created via a "role play session": basically brain-dumping to the agent and telling it to ask questions and figure out how to accomplish a task, then distilling it into a skill at the end, which is much tighter and evidence-based, coming from the actual problem-solving session.

This was very insightful. I've only just begun playing with some agent workflows and building out documentation to help it navigate my code base. Asking it to give me the top 10 unanswered questions from analyzing the docs and code was very useful.

YAGNI is the best tool in your toolbox for AI agents. Don't build out what you think will be useful; layer things into your AI toolbox as they prove they are needed. Especially for Claude, running `/init` ends up with a lot of really unnecessary/hallucinated info. Keep it all simple and layer on top.

I would frame the 'post-trajectory generated skills' as feedback-generated skills, as does Letta: https://www.letta.com/blog/skill-learning. We haven't seen existing research or hypotheses debating whether the skills improvement might come from the skill prompts themselves activating knowledge in LLMs that can help. That's why we added an ablation of 'pre-trajectory generated skills': we had that hypothesis, and this seems a very clean way to test it. Also, it is very logical that feedback-generated skills can help, because they almost certainly contain the failure modes of agents on those specific tasks.

> Having the LLM write down a skill representing the lessons from the struggle you just had to get something done is more typical (I hope) and quite different from what they're referring to

Just as of last week I had Claude build me a skill when I ask it to help me troubleshoot issues, and it came out quite good.

It did have some issues (Claude tends to over-specify based on anecdotal data), but it's a strong step in the right direction.

Also, "skills" are too broad in my opinion. I have one (that Claude wrote) with my personal data that I have available when I analyze my workouts.

I think there's ample room for self-generated skills when you use a rather long exchange on a domain you plan to revisit, _especially_ when it comes to telling Claude what not to do.


Yeah, they've got it backwards. I tried to sum it up in thisistheway.to/ai, but what's been working for us is that every agent miss is a learning opportunity:

1. Capture the miss — What did the agent do? What did reality say?

2. Diagnose — What didn't it see? Missing data, constraint, feedback, or boundaries?

3. Choose a primitive — Observability, instructions, tooling, guardrails, or verification?

4. Encode as artifact — Version-controlled, repeatable, not just memory.

5. Promote to gate — When it's worth enforcing, make it a gate.

Every harness I set up includes this process in the primary set of agent instructions.
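Step 4 ("encode as artifact") from the list above could be recorded as a version-controlled structure like this; the field names and example values are my own invention, not the actual format behind the site.

```python
from dataclasses import dataclass

@dataclass
class AgentMiss:
    what_happened: str      # 1. capture the miss: agent behavior vs. reality
    diagnosis: str          # 2. what it didn't see: data, constraint, feedback, boundary
    primitive: str          # 3. observability / instructions / tooling / guardrails / verification
    artifact_path: str      # 4. where the fix is encoded, checked into the repo
    promoted_to_gate: bool  # 5. whether the check is now enforced

# Hypothetical example record:
miss = AgentMiss(
    what_happened="agent claimed the feature worked without running the test suite",
    diagnosis="missing feedback loop",
    primitive="verification",
    artifact_path="checks/run_tests.md",
    promoted_to_gate=True,
)
print(miss.promoted_to_gate)  # True
```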


> it is important to note that this is not necessarily what people have in mind when they think of "LLMs generating skills"

I’m reading this paper as don’t do this. If you deploy agents to your workforce and tell them to use skills, don’t. Tell them to give it tasks. This sounds obvious but might not be to everyone. (And in any case, it’s nice for researchers to have confirmed pre-prompt skill writing doesn’t work. It would have been neat if it had.)


> I'm sure news outlets and popular social media accounts will use appropriate caution in reporting this, and nobody will misunderstand it.

You mean the dude who writes articles on TechCrunch and Ars Technica based off of HN and Reddit thread titles because he doesn't understand what real journalism is? Sure, we can count on him :)


After several failures and then a success, I have the agent create the skill; on the next run it succeeds on the first try.

I interpreted it as "Allowing the LLM to add skills to itself as it completes a task doesn't provide a meaningful improvement over just letting it reason normally", which seems to be what the paper is fundamentally getting at.

> I'm sure news outlets and popular social media accounts will use appropriate caution in reporting this, and nobody will misunderstand it.

:D

