They all seem to come from quite old HN profiles, though. So if someone managed to take over old HN accounts for manipulation... I would assume they'd pick a more lucrative target?
This one is quite good. The author is known, here fielding questions, and the project is like ten years old. If there are bots, I really don’t think they’re coordinated in any way with the OP, only coordinated in the usual “spam HN to get karma” sort of way.
If there’s a scale of “making HN worse,” I’m not sure genuine human skepticism is on it. LLM-generated walls of text and garbage Show HN posts sure are at the top, though!
I tried it once; it was incredibly verbose, generating an insane number of files. I stopped using it because I was worried it would not be possible to rapidly, cheaply, and robustly update things as interaction with users generated new requirements.
The best way I have today is to start with a project requirements document, ask for a step-by-step implementation plan, and then have it do the work at each step, but only after I green-light the strategy for the current step. I also specify minimal, modular, functional, stateless code.
I don't know. A skill plus an HTTP endpoint feels way safer, more powerful, and more robust. The problem is usually that the entity offering the endpoint, if the endpoint is AI-powered, incurs the LLM costs. Via MCP, the coding agent eats that cost, unless you are also the one running the API and can therefore use your coding-plan endpoint to do the AI part.
If I haven't misunderstood you, it doesn't really matter whether it's an endpoint or a (remote) MCP: either someone else wants to run LLMs to provide a service for you, or they don't.
A local MCP doesn't come into play, because it just couldn't offer the same features in this case.
The MCP server usually provides some functions you can run, possibly with some database interaction.
So when you run it, your coding agent is using AI to drive that code (deciding what to call, what parameters to pass, and so on). Via MCP, the provider doesn't pay any LLM cost; they just offer the code and the endpoint.
But this is usually messy for the coding agent, since it fills up the context. If you use skill + API instead, it's easier for the agent: there's no code in the context, just how to call the API and what to pass.
With something like this, you can have very complex things happening in the endpoint without the agent worrying about context rot or whether it can handle that functionality at all.
But to have that difficult functionality, you also need to call an LLM inside the endpoint, which is problematic if the person offering the MCP service does not want to cover LLM costs.
So it does matter whether it's an endpoint or an MCP, because the agent can do more complex and robust things with a skill and HTTP.
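A minimal sketch of the skill-plus-endpoint shape being argued for, assuming a hypothetical `/run` route and payload (neither is a real convention, just an illustration of what a skill file would describe):

```typescript
// Sketch of the "skill + HTTP endpoint" pattern. The URL path and
// payload shape are hypothetical; a real skill would document
// whatever its service actually expects.

interface TaskRequest {
  task: string;
  params: Record<string, string>;
}

// Building the request body is kept pure so it can be tested
// without a live server.
function buildTaskRequest(task: string, params: Record<string, string>): TaskRequest {
  return { task, params };
}

// The agent only needs to know "POST this JSON here"; none of the
// service's internal code ever enters the model's context window.
async function callSkillEndpoint(baseUrl: string, req: TaskRequest): Promise<unknown> {
  const res = await fetch(`${baseUrl}/run`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });
  if (!res.ok) throw new Error(`endpoint returned ${res.status}`);
  return res.json();
}
```

The point of the split: the skill text tells the agent how to call the endpoint, while any expensive LLM work happens server-side, paid for by whoever runs the service.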
Opus 4.6 is nuts. Everything I throw at it works. Frontend, backend, algorithms—it does not matter.
I start with a PRD, ask for a step-by-step plan, and just execute one step at a time. Sometimes its ideas are dumb, but checking and guiding step by step helps it ship working things in hours.
It was also the first AI that made me think, "Damn, this thing is smarter than me."
The other crazy thing is that with today's tech, these things can be made to work at 1k tokens/sec with multiple agents working at the same time, each at that speed.
I wish I had this kind of experience. I threw a tedious but straightforward task at Claude Code using Opus 4.6 late last week: find the places in a React code base where we were using useState and useEffect to calculate a value that was purely dependent on the inputs to useEffect, and replace them with useMemo. I told it to be careful to only replace cases where the change did not introduce any behavior changes, and I put it in plan mode first.
It gave me an impressive plan of attack, including a reasonable way to determine which code it could safely modify. I told it to start with just a few files and let me review; its changes looked good. So I told it to proceed with the rest of the code.
It made hundreds of changes, as expected (big code base). And most of them were correct! Except the places where it decided to do things like put its "const x = useMemo(...)" call after some piece of code that used the value of "x", meaning I now had a bunch of undefined variable references. There were some other missteps too.
I tried to convince it to fix the places where it had messed up, but it quickly started wanting to make larger structural changes (extracting code into helper functions, etc.) rather than just moving the offending code a few lines higher in the source file. Eventually I gave up trying to steer it and, with the help of another dev on my team, fixed up all the broken code by hand.
It probably still saved time compared to making all the changes myself. But it was way more frustrating.
One tip I have is that once you have the diff you want to fix, start a new session and have it work on the diff fresh. They’ve improved this, but it’s still the case that the farther you get into context window, the dumber and less focused the model gets. I learned this from the Claude Code team themselves, who have long advised starting over rather than trying to steer a conversation that has started down a wrong path.
I have heard from people who regularly push a session through multiple compactions. I don’t think this is a good idea. I virtually never do it: when I see context getting up to even 100k, I make sure I have enough written to disk, type /new, feed it the diff so far, and just say “keep going.” I learned recently that even essentials like the CLAUDE.md part of the prompt get diluted through compactions. You can write a hook to re-insert it, but that's not done by default.
This fresh context thing is a big reason subagents might work where a single agent fails. It’s not just about parallelism: each subagent starts with a fresh context, and the parent agent only sees the result of whatever the subagent does — its own context also remains clean.
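The fresh-context point can be illustrated with a toy orchestrator (names and shapes here are illustrative, not any real agent API):

```typescript
// Toy model of the subagent pattern: each worker starts with a
// fresh, empty transcript, and the orchestrator's context grows only
// by the short result summary, never by the worker's full transcript.

type Message = { role: "user" | "assistant"; text: string };

function runSubagent(task: string): { transcript: Message[]; summary: string } {
  const transcript: Message[] = [{ role: "user", text: task }];
  // ...imagine many tool calls and intermediate messages here...
  transcript.push({ role: "assistant", text: `done: ${task}` });
  return { transcript, summary: `done: ${task}` };
}

function orchestrate(tasks: string[]): Message[] {
  const parentContext: Message[] = [];
  for (const task of tasks) {
    const { summary } = runSubagent(task); // full transcript is discarded
    parentContext.push({ role: "assistant", text: summary });
  }
  return parentContext; // stays small: one line per subtask
}
```

However large each worker's transcript gets, the parent only ever holds one summary line per task, which is why the orchestrator can finish a huge plan with most of its own window still free.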
Slight tangent: you want to read the diff between your branch and its merge-base with origin/main (i.e., `git diff origin/main...HEAD`, with three dots). Otherwise you get lots of spurious noise in your diff if main has moved since you branched off.
One thing that seems important is to have the agent write its plan and any useful memory down in markdown files, so that later invocations can just read from them.
>"This fresh context thing is a big reason subagents might work where a single agent fails. It’s not just about parallelism: each subagent starts with a fresh context, and the parent agent only sees the result of whatever the subagent does — its own context also remains clean."
You maintain very low context usage in the main thread (just orchestration and planning details), while each individual team member remains responsible for its own. This lets you churn through millions of output tokens in a fraction of the time.
Subagents are huge: I could execute a massive plan that would easily fill up a 200k context window and be done at around 60k for the orchestration agent.
As a cheapass, being able to pass off the simple work to cheaper per-token agents is also just great. I've got a handful of tasks I can happily delegate to a Haiku agent, and anything requiring a bit of reasoning goes to Sonnet.
Opus feels almost like a cheat code when I do get stuck: I just bust out a full Opus workflow instead, and it usually destroys whatever I was struggling with. Like playing on easy mode.
As cool as this stuff is, I kinda still wish I were grandfathered into the plan with no weekly limit and only the 5-hour window limits; I'd just be happily hammering Opus blissfully.
Same here. I don't understand how people leave it running on an "autopilot" for long periods of time. I still use it interactively as an assistant, going back and forth and stepping in when it makes mistakes or questionable architectural decisions. Maybe that workflow makes more sense if you're not a developer and don't have a good way to judge code quality in the first place.
There's probably a parallel with the CMSes and frameworks of the 2000s (e.g. WordPress or Ruby on Rails). They massively improved productivity, but as a junior developer you could get pretty stuck if something broke or you needed to implement an unconventional feature. I guess it must feel a bit similar for non-developers using tools like Claude Code today.
>Same here. I don't understand how people leave it running on an "autopilot" for long periods of time.
Things have changed. The models have reached a level of coherence that they can be left to make the right decisions autonomously. Opus 4.6 is in a class of its own now.
A non-technical client of mine has built an entire app with a very large feature set with Opus. I declined to take it on and clean it up; I was afraid it would be impossible, and too much risk. I think we are at a level where it can build and auto-correct its mistakes, but the code is still slop and kind of dangerous to put in production, if you care about even the most basic security.
Branch first so you can just undo. I think this would have worked with subagents and maybe /loop? Write all the items to change to a todo.md. Have it split up the work, with Haiku subagents doing 5-10 changes at a time and marking the todos done, and /loop until all are done. You’ll succeed, I suspect. If the main Claude instance compacts its context, stop and restart from where you left off.
It actually did automatically break the work up into chunks and launched a bunch of parallel workers to each handle a smaller amount of work. It wasn't doing everything in a single instance.
The problem wasn't that it lost track of which changes it needed to make, so I don't think checking items off a todo list would have helped. I believe it did actually change all the places in the code it should have. It just made the wrong changes sometimes.
But also, the claim I was responding to was, "I start with a PRD, ask for a step-by-step plan, and just execute on each step at a time." If I have to tell it how to organize its work and how to keep track of its progress and how to execute all the smaller chunks of work, then I may get good results, but the tool isn't as magical (for me, anyway) as it seems to be for some other people.
One of the more subtle points that seems to be crucial is that it works a lot better when it can use the context as part of its own work, rather than having it polluted by unrelated details. Even better than restarting when it's off the rails is to avoid that as much as possible by proactively starting a new conversation as soon as anything in the history of the existing one stops being relevant. I've found it more effective to manually carry most of the current context into a fresh session myself, skipping the irrelevant bits even if they're fairly small, than to rely on the model to figure out that something is no longer relevant (or to give it instructions saying so, which feels like a crapshoot: it might actually prune, or it might just bloat things further, with that instruction simply added into the mix).
To echo what the parent comment said, it's almost frustrating how effective it can be at certain tasks that I wouldn't ever have the patience for. At my job recently I needed to prototype calling some Python code via WASM using the Rust wasmtime engine. After I set up the code structure with the bytes for the WASM component, the arguments I wanted to pass to the function, and the WIT describing the function's interface, it filled in all of the boilerplate needed so that the function calls worked properly, within a minute or two, on the first try. Reading through all the documentation and figuring out exactly which half-dozen assorted things I had to import and hook up in the correct order would have taken me an hour at minimum.
I don't have any particular insight into whether these tools will become even more powerful over time, and I still have fairly strong concerns about how AI tools will affect society (both in terms of how they're used and the amount of energy used to produce them in the first place). But given how much the tech industry tends to prioritize productivity over social concerns, I have to assume that my future employment is going to be heavily affected by my willingness to adopt and use these tools. I can't deny at this point that having them as an option makes me more productive than refusing to use them, regardless of my personal opinions.
What kinds of things are you building? This is not my experience at all.
Just today I asked Claude, using Opus 4.6, to build out a test harness for a new dynamic database diff tool. Everything seemed fine, but it built a test suite for an existing diff tool: it set everything up in the new directory, yet it was actually testing code and logic from a preexisting directory, despite the plan being correct before I told it to execute.
I started over and wrote out a few skeleton functions myself, then asked it to write tests for those to cover some new functionality. My plan was then to ask it to add that functionality, using the tests as guardrails.
Well the tests didn’t actually call any of the functions under test. They just directly implemented the logic I asked for in the tests.
After $50 and 2 hours I finally got something working only to realize that instead of creating a new pg database to test against, it found a dev database I had lying around and started adding tables to it.
When I managed to fix that, it decided that it needed to rebuild multiple Docker components before each test and tear them down after each one.
After about 4 hours and $75, I managed to get something working that was probably more code than I would have written in 4 hours, but I think it was probably worse than what I would have come up with on my own. And I really have no idea if it works because the day was over and I didn’t have the energy left to review it all.
We’ve recently been tasked at work with spending more money on Claude (the metric is literally spending more money, not being more productive), and everyone is struggling to do anything like what the posts on HN say they are doing. So far no one in my org, at a very large tech company, has managed to do anything very impressive with Claude, other than bringing down prod two days ago.
Yes I’m using planning mode and clearing context and being specific with requirements and starting new sessions, and every other piece of advice I’ve read.
I’ve had much more luck using Opus 4.6 in VS Studio to make more targeted changes, explain things, debug, etc. Claude seems too hard to wrangle, and it isn’t good enough for you to be operating that far removed from the code.
Similar experience. I use these AI tools on a daily basis. I have tons of examples like yours. In one recent instance I explicitly told it in the prompt to not use memcpy, and it used memcpy anyway, and generated a 30-line diff after thinking for 20 minutes. In that amount of time I created a 10-line diff that didn't use memcpy.
I think it's the big investors' extremely powerful incentives manifesting in the form of internet comments. The pace of improvement peaked at GPT-4. There is value in autocomplete-as-a-service, and the "harnesses" like Codex take it a lot farther. But the people who are blown away by these new releases either don't spend a lot of time writing code, or are being paid to be blown away. This is not a hockey stick curve. It's a log curve.
Bigger context windows are a welcome addition. And stuff like JSON inputs is nice too. But these things aren't gonna like, take your SWE job, if you're any good. It's just like, a nice substitute for the Google -> Stack Overflow -> Copy/Paste workflow.
Most devs aren't very good. That's the reality, it's what we've all known for a long time. AI is trained on their code, and so these "subpar" devs are blown away when they see the AI generate boring, subpar code.
The second you throw a novel constraint into the mix things fall apart. But most devs don't even know about novel constraints let alone work with them. So they don't see these limitations.
Ask an LLM to not allocate? To not acquire locks? To ensure reentrancy safety? It'll fail - it isn't trained on how to do that. Ask it to "rank" software by some metric? It ends up just spitting out "community consensus" because domain expertise won't be highly represented in its training set.
I love having an LLM to automate the boring work, to do the "subpar" stuff, but they have routinely failed at doing anything I consider to be within my core competency. Just yesterday I used Opus 4.6 to test it out. I checked out an old version of a codebase that was built in a way that is totally inappropriate for security. I asked it to evaluate the system. It did far better than older models but it still completely failed in this task, radically underestimating the severity of its findings, and giving false justifications. Why? For the very obvious reason that it can't be trained to do that work.
The people glazing these tools can't design systems. I have this founder friend who I've known for decades, he knows how to code but he isn't really interested in it; he's more interested in the business side and mostly sees programming as a way to make money. Before ChatGPT he would raise money and hire engineers ASAP. When not a founder he would try to get into management roles etc etc. About a year ago he told me he doesn't really write code anymore, and he showed me part of his codebase for this new company he's building. To my horror I saw a 500-line bash script that he claimed he did not understand and just used prompts to edit it.
It didn't need to be a bash script; it could have been written in any scripting language. I presume it started off as bash when he was exploring the idea, and since it was already bash, I guess he decided to just keep going with it. But it was one of those things where I thought: these autocomplete services will never stop and tell you "maybe this 500-line script should be rewritten in Python"; they will just continue to affirm you and pile onto the tech debt.
I used to freak out and think my days were numbered when people claimed they stopped writing code. But now I realize that they don't like writing code, don't care about getting better at it, don't know what good code looks like, and would hire an engineer if they could. With that framing, whenever I see someone say "Opus 4.6 is nuts. Everything I throw at it works. Frontend, backend, algorithms—it does not matter." I know for a fact that "everything" in that person's mind is very limited in scope.
Also, I just realized that there was an em-dash in that comment. So there's that. Wasn't even written by a person.
> I know for a fact that "everything" in that persons mind is very limited in scope.
I agree, and I think it's quite telling what people are impressed by. Someone elsewhere said that Opus 4.6 is a better programmer than they are and... I mean, I kinda believe it, but I think it says way more about them than it does about Opus.
> people who are blown away by these new releases either don't spend a lot of time writing code, or are being paid to be blown away
Careful, or you're going to get slapped by the stupid astroturfing rule... but you're correct. Also there's the sunk cost fallacy, post purchase rationalization, choice supportive bias, hell look at r/MyBoyfriendIsAI... some people get very attached to these bots, they're like their work buddies or pets, so you don't even need to pay them, they'll glaze the crap out it themselves.
Curious what language and stack. And have people at your company had marginally more success with greenfield projects like prototypes? I guess that’s what you’re describing, though it sounds like it’s a directory in a monorepo maybe?
This was in Go, but my org also uses Typescript, and Elixir.
I’ve had plenty of success with greenfield projects myself, but using the Copilot agent and Opus 4.5 and 4.6. I completely vibe-coded a small game for my 4-year-old in 2 hours. It’s probably 20% of the way to being production-ready if I wanted to release it, but it works and he loves it.
And yes people have had success with very simple prototypes and demos at work.
You probably just don't have the hang of it yet. It's very good but it's not a mind reader and if you have something specific you want, it's best to just articulate that exactly as best you can ("I want a test harness for <specific_tool>, which you can find <here>"). You need to explain that you want tests that assert on observable outcomes and state, not internal structure, use real objects not mocks, property based testing for invariants, etc. It's a feedback loop between yourself and the agent that you must develop a bit before you start seeing "magic" results. A typical session for me looks like:
- I ask for something highly general and claude explores a bit and responds.
- We go back and forth a bit on precisely what I'm asking for. Maybe I correct it a few times and maybe it has a few ideas I didn't know about/think of.
- It writes some kind of plan to a markdown file. In a fresh session I tell a new instance to execute the plan.
- After it's done, I skim the broad strokes of the code and point out any code/architectural smells.
- I ask it to review its own work and then critique that review, etc. We write tests.
Perhaps that sounds like a lot but typically this process takes around 30-45 minutes of intermittent focus and the result will be several thousand lines of pretty good, working code.
I absolutely have the hang of Claude, and I still find that it can make those ridiculous mistakes, like replicating logic into a test rather than testing the function directly, or talking to a local pg instance that was stale but still running. I have a ton of skills and pre-written prompts for testing practices, but over longer contexts it will forget and do these things, or get confused.
You can minimize these problems with TLC but ultimately it just will keep fucking up.
Don't know what to tell you. Sounds like you're holding it wrong. Based on the current state of things I would try to get better at holding it the right way.
My favorite is when you need to rebuild/restart outside of claude and it will "fix the bug" and argue with you about whether or not you actually rebuilt and restarted whatever it is you're working on. It would rather call you a liar than realize it didn't do anything.
This is a pretty annoying problem. I just solve it by asking Claude to always run the right build command after each batch of modifications.
With the back and forth refining I find it very useful to tell Claude to 'ask questions when uncertain' and/or to 'suggest a few options on how to solve this and let me choose / discuss'
This has made my planning / research phase so much better.
Yes pretty much my workflow. I also keep all my task.md files around as part of the repo, and they get filled up with work details as the agent closes the gates. At the end of each one I update the project memory file, this ensures I can always resume any task in a few tokens (memory file + task file == full info to work on it).
I mean I’ve been using AI close to 4 years now and I’ve been using agents off and on for over a year now. What you’re describing is exactly what I’m doing.
I’m not seeing anyone at work either out of hundreds of devs who is regularly cranking out several thousand lines of pretty good working code in 30-45 minutes.
What’s an example of something you built today like this?
Fair, that's optimistic, and it depends what you're doing. Looking at a personal project, I had a PR this week at +3000/-500 that I feel quite good about; it took about two nights of roughly an hour each to shape it into what I needed (a control plane for a Polymarket trading engine). Though if I'm being fair, this was an outlier, only possible because I very carefully built the core of the engine to support this in advance; most of the 3k LoC was "boilerplate" in the sense that it just manipulates existing data structures rather than building entirely new abstractions. There are definitely some very hard-fought +175/-25 changes in this repo as well.
Definitely for my day job it's more like a few hundred LoC per task, and they take longer. That said, at work there are structural factors preventing larger changes, code review, needing to get design/product/coworker input for sweeping additions, etc. I fully believe it would be possible to go faster and maintain quality.
Those numbers are much more believable, but now we’re well into maybe a 2-3x speed up. I can easily write 500 LOC in an hour if I know exactly what I’m building (ignoring that LOC is a terrible metric).
But now I have to spend more time understanding what it wrote, so best case scenario we’re talking maybe a 50% speed up to a part of my job that I spent maybe 10-20% on.
Making very big assumptions that this doesn’t add long term maintenance burdens or result in a reduction of skills that makes me worse at reviewing the output, it’s cool technology.
On par with switching to a memory managed language or maybe going from J2EE to Ruby on Rails.
Thinking in terms of a "speed up multiplier" undersells it completely. The speed-up on a task I would never have even attempted is infinite. For my recent +3000 PR on my Polymarket engine control plane, I had no idea how these kinds of things are typically done. It would have taken me many hours to think through an implementation and hours of online research to assemble an understanding of typical best practices. Now with AI I can dispatch many parallel agents to examine virtually all public resources on this at once.
Basically if it's been done before in a public facing way, you get a passable version of that functionality "for free". That's a huge deal.
1. You think you have something following typical best practices. You have no way to verify that without taking the time to understand the problem and solution yourself.
2. If you’d done 1, you’d have the knowledge yourself next time the problem came up and could either write it yourself or skip the verifications step.
I’m not saying there aren’t problems out there where the problem is hard to solve but easy to verify. And for those use cases LLMs are terrific.
But many problems have the inverse property. And many problems that look like the first type are actually the second.
LLMs are also shockingly good at generating solutions that look plausible, independent of correctness or suitability, so it’s almost always harder to do the verification step than it seems.
The control plane is already operational and does what I need. Copying public designs solved a few problems I didn't even know I had (awkward command and control UX) and seems strictly superior to what I had before. I could have taken a lot longer on this - probably at least a week, to "deeply understand the problem and solution". But it's unclear what exactly that would have bought me. If I run into further issues I will just solve them at that time.
So what is the issue exactly? This pattern just seems like a looser form of using a library versus building from scratch.
I find that Opus misses a lot of details in the code base when I want it to design a feature or something. It jumps to a basic solution which is actually good but might affect something elsewhere.
GPT 5.4 on Codex CLI has been much more reliable for me lately. I used to have Opus write and Codex review; I now do the opposite (I actually have Codex write and both review in parallel).
So on the latest models for my use case gpt > opus but these change all the time.
Edit: also, the harness is shit. Claude Code has been slow, weird, and a resource hog. It refuses to read the now-standardized .agents dirs, so I need symlink gymnastics, and it hides as much info as it can. Codex CLI is working much better lately.
Codex CLI is so much more pleasant to use than CC. I cancelled my CC subscription after the OpenCode thing, but somewhat ironically have recently found myself naturally trying the native Codex CLI client first more often over OpenCode.
Kinda funny how you don't actually need to use coercion if you put in the engineering work to build a product that's competitive on its own technical merits...
> If you're not using AI you are cooked. You just don't realize it yet.
Truth. But not just “using”.
Because here’s where this ship has already landed: humans will not write code, humans will not review code.
I see mostly rage against this idea, but it is already here. Resistance is futile. There will be no “hand crafted software” shops. You have at most 3-4 years left if you think this is your job.
People should still understand the code, because sometimes the AI solution really is wrong and I have to shove my hand into its guts and force it to use my solution, or even explain the reasoning.
People should be studying architecture, because now I can orchestrate stuff that used to take teams, stuff I would have thrown away as a non-viable idea. Now I can just do it. But no, you will still be reviewing code.
My experience is that it gets you 80-90% of the way at 20x the speed, but coaxing it into fixing the remaining 10-20% happens at a staggeringly slow speed.
All programming is like this to some extent, but Claude's 80/20 behavior is so much more extreme. It can almost build anything in 15-30 minutes, but after those 15-30 minutes are up, it's only "almost built". Then you need to spend hours, days, maybe even weeks getting past the "almost".
Big part of why everyone seems to be vibe coding apps, but almost nobody seems to be shipping anything.
I had Opus 4.6 running on a backend bug for hours. It got nowhere. Turned out the problem was in AWS X-ray swizzling the fetch method and not handling the same argument types as the original, which led to cryptic errors.
I had Opus 4.6 tell me I was "seeing things wrong" when I tried to have it correct some graphical issues. It got stuck in a loop of re-introducing the same bug every hour or so in an attempt to fix the issue.
I'm not disagreeing with your experience, but in my experience it is largely the same as what I had with Opus 4.5 / Codex / etc.
Haha, reminds me of an unbelievably aggravating exchange with Codex (GPT 5.4 / High) where it was unflinchingly gaslighting me about undesired behavior still occurring after a change it made that it was adamant simply could not be happening.
It started by insisting I was repeatedly making a typo and still would not budge even after I started copy/pasting the full terminal history of what I was entering and the unabridged output, and eventually pivoted to darkly insinuating I was tampering with my shell environment as if I was trying to mislead it or something.
Ultimately it turned out that it forgot it was supposed to be applying the fixes to the actual server instead of the local dev environment, and had earlier in the conversation switched from editing directly over SSH to pushing/pulling the local repo to the remote due to diffs getting mangled.
I am starting to believe it’s not Opus but developers getting better at using LLMs across the board, without realizing they are just getting much better at using these tools.
I also thought it was Opus 4.5 (I also tested a lot with 4.6), and then in February I switched to only using auto mode in the coding IDEs. Those modes do not use Opus (most of the time), and I’m ending up with similar results after a very rough learning curve.
Now, switching back to Opus, I notice that I get more out of it, but it’s no longer a huge difference. In a lot of cases Opus is actually in the way, after learning to prompt more effectively with cheaper models.
The big difference now is that I’m paying $60-90/month for 40-50 hours of weekly usage, while I was inching toward $1,000 with Opus. I chose these auto modes because they don’t dig into usage-based pricing or throttling, which is a pretty sweet deal.
Maybe so, but personally it seemed to be referred to as a "specification" or "spec" for a long time, and then suddenly around maybe 5 years ago I started to hear people use "PRD". I'm not sure what caused the change.
Seems commonly used in Big Tech - first time I heard it was in my current job. Now it's seared into my brain since it's used so much. Among many other acronyms which I won't bore you with.
I managed to get it into the classic AI loop once.
It was a problem with the calculation of filling a topographical water basin with sedimentation, where the calculation is discrete (i.e. turn-based), plus the edge case where both water and sediment would overflow the basin. To keep it simple: the facts were A, B, and C, and it oscillated between explanation 1, which refuted C, explanation 2, which refuted A, and explanation 3, which refuted B.
I'll credit Opus's training stability that all 3 of my attempts consistently fell into this loop, so I decided to directly order it to use a brute-force solution that avoided (but didn't solve) the problem.
With a human, there's no way that loop would survive a second pass, at least for the majority of us. But there was just no way to get through to Opus 4.6.
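For concreteness, the brute-force shape I mean is something like the sketch below. The geometry, the inflow rule, and the overflow handling are all my own illustration (the original problem also tracked sediment, which is omitted here):

```python
# Brute-force sketch of a discrete (turn-based) basin fill. All names
# and the overflow rule are illustrative assumptions, not the original code.
def simulate_fill(heights, rim_height, inflow_per_turn, turns):
    """Pour a fixed volume per turn into a 1-D basin profile; anything
    beyond the capacity below the rim overflows and is discarded."""
    capacity = sum(max(rim_height - h, 0) for h in heights)
    stored = 0.0
    overflow = 0.0
    for _ in range(turns):
        stored += inflow_per_turn
        if stored > capacity:          # the overflow edge case
            overflow += stored - capacity
            stored = capacity
    return stored, overflow

# A basin with capacity 5 fed 1 unit/turn for 10 turns ends up full,
# with 5 units spilled: simulate_fill([0, 1, 0], 2, 1.0, 10) -> (5.0, 5.0)
```

Stepping through each turn explicitly avoids the oscillating "which fact do I refute" reasoning entirely, which is exactly why ordering a brute-force approach broke the loop.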
Opus helped me brick my RPi CM4 today. It glibly apologized for telling me to use an e instead of a 6 in a boot loader sequence.
I spent an hour or so unraveling the mess. My feelings about these tools are growing more and more conflicted. They are here to stay, obviously.
I’m honestly uncertain about the junior engineers I’m working with who are more productive than they might be otherwise, but are gaining zero (or very little) experience. It’s like the future is a world where the entire programming sphere is dominated by the clueless non technical management that we’ve all had to deal with in small proportion a time or two.
> I’m honestly uncertain about the junior engineers I’m working with who are more productive than they might be otherwise, but are gaining zero (or very little) experience.
Well, (economic) progress means being able to do more with less. A Fordian-style conveyor belt factory can churn out cars with relatively unskilled labour.
Economising on human capital is economising on a scarce input.
We had these kinds of shifts before. Compare also how planes used to have a pilot, copilot and flight engineer. We don't have that anymore, but it used to be a place for people to learn. But pilot education has adapted.
Or check how spreadsheet software has removed a lot of the worst rote work in finance. That change happened perhaps in the 1980s. Finance has adapted.
> Opus helped me brick my RPi CM4 today. It glibly apologized for telling me to use an e instead of a 6 in a boot loader sequence.
Yes, these things do best when they have a (simulated) environment they can make mistakes in and that can give them clear and fast feedback.
> Yes, these things do best when they have a (simulated) environment they can make mistakes in and that can give them clear and fast feedback.
This always felt like a reason to throw it at coding. With code's rigid syntax, you'll know quickly and cheaply whether what was written passes an absolute minimal level of quality.
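A minimal version of that cheap feedback gate, sketched in Python (the helper name is made up): parse the generated code before doing anything more expensive.

```python
# Cheapest possible quality gate for generated code: does it even parse?
# Real setups would chain linters, type checkers, and tests after this.
import ast

def passes_syntax_check(source: str) -> bool:
    """Return True if `source` is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b:\n    return a + b\n"   # unbalanced parenthesis
# passes_syntax_check(good) -> True, passes_syntax_check(bad) -> False
```

The point is that this check costs milliseconds, so an agent can get clear, fast feedback on every edit.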
Anecdotal from my own experience, but someone might have done a study by now: these experiments are much cheaper and quicker to run on AIs than on undergrad students (let alone professional devs).
Also pre-AI: when I set up nice property tests, I could develop much better even while a bit tired.
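A minimal property-test sketch of the kind I mean, using only the standard library (the encode/decode pair is a toy example, not from any real project); libraries like hypothesis do this more rigorously, with input shrinking.

```python
# Minimal property-test sketch: generate random inputs and check an
# invariant that must hold for all of them, so even a tired developer
# gets broad feedback from one test.
import random
import string

def run_length_encode(s):
    """Toy function under test (a made-up example)."""
    out = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out

def run_length_decode(pairs):
    return "".join(ch * n for ch, n in pairs)

def check_roundtrip(trials=500):
    # Property: decoding an encoding returns the original string.
    for _ in range(trials):
        s = "".join(random.choices(string.ascii_lowercase,
                                   k=random.randint(0, 30)))
        assert run_length_decode(run_length_encode(s)) == s
    return True
```

One property like this covers far more cases than a handful of hand-picked examples.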
I've seen a few instances of where Claude showed me a better way to do something and many many more instances of where it fails miserably.
Super simple problem:
I had a ZMK keyboard layout definition I wanted to convert to QMK for a different keyboard that had one key less, so it just had to trim one outer key. It took about 45 minutes of back and forth to get it right; I could have done it manually in 30 minutes tops, including looking up docs for everything.
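The trimming step itself is mechanical; a rough sketch, assuming the keymap has already been parsed into rows of key names (the layout, key names, and function are hypothetical, not real ZMK/QMK syntax):

```python
# Sketch: drop the outermost key from each row when retargeting a layout
# to a keyboard with one fewer column. Real ZMK/QMK keymaps are C-like
# files; this assumes they have been parsed into a list of rows.
def trim_outer_column(rows, side="right"):
    """Remove one key per row from the chosen outer edge."""
    idx = -1 if side == "right" else 0
    trimmed = []
    for row in rows:
        row = list(row)
        del row[idx]
        trimmed.append(row)
    return trimmed

zmk_rows = [
    ["TAB", "Q", "W", "E"],
    ["CTRL", "A", "S", "D"],
]
qmk_rows = trim_outer_column(zmk_rows, side="right")
# qmk_rows -> [["TAB", "Q", "W"], ["CTRL", "A", "S"]]
```

The hard part the LLM kept fumbling wasn't this transformation but the surrounding keymap syntax, which is exactly what doc lookup solves quickly by hand.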
Capability isn't the impressive part; it's the tenacity/endurance.
Codex doesn't seem that far behind. I use the top model available for API-key use, and it's gotten faster this month even at the max effort level (not like a cheetah, more like not so damn painful anymore). Plus, it also forks agents in parallel, for speed and to avoid polluting the main context, i.e. it will fork explorer agents while investigating (kind of amusing because they're named after famous scientists).
We're never getting UBI. See the latest interview with the Palantir CEO, where he talks about white collar workers having to take more hands-on jobs that they may not feel as satisfied with, i.e. tending their manors and compounds.
RIP widespread middle class. It was a good 80-year run.
We know this to be true with a reasonable degree of certainty
> likely a society
This one, not so much. We could potentially have pretty vibrant societies even if everyone is not ultra rich, not going on international vacations, not having access to buy things from the other end of the world subsidized by economies of scale.
Don't confuse UBI and employment or even income though. If we find ourselves replacing or exceeding current productivity without humans working in the system we have to fundamentally rethink our system.
You likely wouldn't need money at all in that future, for example. What does money really mean when everyone is guaranteed to have all the basics covered? Is money really helping to store value created via labor when there is no labor? And is money providing price discovery when the cost of resources and manufacturing is moving toward zero?
If labor is replaced with tech, and I think that's a big if, I don't see any outcome other than a totalitarian dystopia that will fail much like the Soviet Union.
The replacement of human labor with tech is speculation. I don't see any way a future where we have UBI because humans no longer work for a living ends well.
Sure, I'm talking about the future so it's speculative, but I'd love to hear a scenario where it works well sustainably and doesn't turn into a totalitarian dystopia.
I’ll put out a suggestion: pair with Codex or Deepthink for audit and review; Opus is still prone to … enthusiastic architectural decisions. I promise you will be at least thankful and at most go ‘wtf?’ at some audit outputs.
Also shout out to beads - I highly recommend you pair it with beads from yegge: opus can lay out a large project with beads, and keep track of what to do next and churn through the list beautifully with a little help.
Nice. Yeah I have them connect through beads, which combined with a git log is a lot of information - it feels smoother to me than this looks. But I agree with the sentiment. Codex isn't my favorite for understanding and implementing. But I appreciate the intelligence and pickiness very much.
Just yesterday I asked it to repeat a very simple task 10 times. It ended up doing it 15 times. It wasn't a problem per se, just a bit jarring that it was unable to follow such simple instructions (it even repeated my desire for 10 repetitions at the start!).
Opus 4.6 is AGI in my book. They won’t admit it, but it’s absolutely true. It shows initiative in not only getting things right but also adding improvements that the original prompt didn't request that match the goals of the job.
Not even close. There are still tons of architectural design issues that I'd find it completely useless at, tons of subtle issues it won't notice.
I never run agents by themselves; every single edit they do is approved by me. And, I've lost track of the innumerable times I've had to step in and redirect them (including Opus) to an objectively better approach. I probably should keep a log of all that, for the sake of posterity.
I'll grant you that for basic implementation of a detailed and well-specced design, it is capable.
I don’t know if Opus is AGI but on a broader note, that’s how we will get AGI. Not some consciousness like people are expecting. It’s just going to be chatbot that’s very hard to stump and starts making actual scientific breakthroughs and solving long standing problems.
I'll be more likely to agree that anything is AGI when it doesn't have such obvious and common brittleness. These LLMs all go off the rails when the context window gets large. Their context is also easy to "poison", so it's better to roll back conversations that went bad than to try to steer them back to the light.
There are probably more examples, but to me AGI must move beyond the above issues. Though frankly, the context-window problem might just be a symptom of a poor harness; still, it illustrates my general issue with them being considered AGI as it stands today.
Claude 4.6 is getting crazy good though, I'll give you that.
I'm late to the party and I'm just getting started with Anthropic models. I have been finding Sonnet decent enough, but it seems to have trouble naming variables correctly (it's not just that most names are poor and undescriptive; sometimes it names things wrong, which is confusing), or it will unnecessarily declare and re-declare variables, or encode and decode rather than use the value that's already there, etc. Is Opus better at this?
The replies to this really make me think that some people are getting left behind the AI age. Colleges are likely already teaching how to prompt, but a lot of existing software devs just don't get it. I encourage people who aren't having success with AI to watch some youtube videos on best practices.
I'm working in a codebase of 200+ "microservices", separate repos, each deployed as multiple FaaS, CQRS-style. None of it my choice, everything precedes me, many repos I know nothing of.
Little to no code re-use between them.
Any trace of "business logic" is so distributed across multiple repos that I have no possible use for LLM codegen, unless I can somehow feed it ALL the codebase.
I've tried generating some tests, but they always miss the mark, as in the system under test is almost always the wrong one.
I guess LLMs are cool for greenfield, but when the brownfield is really brown, there's no use for them.
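Mechanically, "feed it ALL the codebase" could look like the sketch below (the extensions, separator format, and byte cap are illustrative assumptions); whether 200+ repos fit in any context window is the real problem.

```python
# Sketch: concatenate source files from many repos into one annotated
# context dump. Paths, extensions, and the size cap are made-up defaults.
from pathlib import Path

def build_context(repo_dirs, extensions=(".py",), max_bytes=1_000_000):
    """Collect matching files under each repo into one string,
    stopping once the size cap is reached."""
    chunks = []
    total = 0
    for repo in repo_dirs:
        for path in sorted(Path(repo).rglob("*")):
            if not path.is_file() or path.suffix not in extensions:
                continue
            text = path.read_text(errors="replace")
            total += len(text)
            if total > max_bytes:      # context windows are finite
                return "".join(chunks)
            chunks.append(f"# === {path} ===\n{text}\n")
    return "".join(chunks)
```

With business logic smeared across 200+ repos, the cap gets hit long before the relevant logic is all in view, which is the failure mode described above.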
Okay the process is simple. You're going to go to another website, called YouTube. Don't be alarmed. First read all the steps so you don't miss any, once you start going to the other site you won't be able to see this one. You might want to write these down on a piece of paper first. Okay here we go:
1. Click in the bar at the top of the page that says ycombinator.com
2. type this in: youtube.com
3. press enter
4. There will be a box at the top that says "search", click that
5. type in "tips and tricks for agentic coding"
6. press enter
7. a list of videos should appear, watch them all
Yes, this can happen. Some people just have the worst luck, and will always find the bad one. This is why Greene made "stay away from unlucky people" law 10 out of 48. It's really that important.
I have no idea how much demand there is for fine-tuned models. Is it big? Are people actively looking for endpoints for fine-tuned models? Why? Mostly out of curiosity; I personally never had the need.
What I want from an LLM is smart, super cheap, fast, and private. I wonder if we will ever get there. Like having a cheap Cerebras machine at home with oss 400B models on it.
For consumers, we want to just pass on the price-to-performance ratio. For enthusiasts and companies, we do see people who want their own models / the ability to use the massive amounts of data they have.
I wonder what happens if I apply the same strategy to an automated shop: Claude Code periodically proposes updates and automatically implements them, with revenue as the target function. I'll give it a try.