> The idea that new code is better than old is patently absurd. Old code has been used. It has been tested. Lots of bugs have been found, and they’ve been fixed.
This quote is completely and totally irrelevant. Nobody is saying they should code a new Outlook. If they did code something, it would be significantly smaller in scope and rigorously tested like spacebound programs in the past were. "New space-engineering-grade code created with actual engineering practices" is absolutely going to be more reliable than "old bloated commercial shitware". But I guess software engineering is a lost art, so it can't be helped.
It's also going to take a hell of a lot longer and cost more than buying an Outlook license. If I were lead on that project, you'd have an uphill battle convincing me to spend $100k+ on an email solution unless you could point to specific, serious deficiencies in the existing off-the-shelf solutions.
Software Engineering is far from a lost art: part of the practice is intelligently making cost-benefit decisions.
The current solution is literally causing problems in space. Space-grade engineering is expensive, but having things go wrong on your already very expensive mission is even more expensive.
Sure, but people who didn't know better until this particular incident do not deserve the title "engineer". Being able to classify and manage risks before they happen is engineering 101.
That problem would be much less likely with a minimalist, battle-tested OSS solution whose maintainers and users have decidedly different priorities than those governing something like Outlook or even Thunderbird.
The higher the stakes the more valuable minimalism becomes.
Actually, this could be a case where it's useful. Even if it only catches half the complaints, that's still a lot of data, far more than ordinary telemetry used to collect.
Opus doubled in speed with version 4.5, leading me to speculate that they had promoted a Sonnet-sized model. The new, faster Opus was the same speed as Gemini 3 Flash running on the same TPUs. I think Anthropic's margins are probably the highest in the industry, but they have to chop that up with Google by renting their TPUs.
People used to bet on ships sinking and sailors drowning.
Till they learned better.
Edit:
This was common until Parliament passed the Marine Insurance Act of 1745.
Before that, speculators could take out "wagering policies" on vessels they had no connection to. This created "coffin ships" - unseaworthy vessels sent to sea because the insurance payout for a wreck was worth more than the ship itself. The law introduced "insurable interest," meaning you cannot bet on a disaster unless you stand to lose something if it happens. This removed the incentive for sabotage and murder for profit.
Modern prediction markets are heading toward the same problem. Betting on train delays or bridge collapses without having any stake gives bad actors a reason to cause them. If the cost of sabotage is lower than the payout, the market effectively pays for the disaster to happen.
That was far crazier than I expected going into it... To the point I've seen Hollywood movies with far more believable plots that people would find unrealistic.
I do this too, but then you need some method to handle it, because now you have to read and test and verify multiple work streams. It can become overwhelming. In the past week I had the following problems from parallel agents:
Gemini running a benchmark: everything ran smoothly for an hour, but on verification it had hallucinated the model used for judging, invalidating the whole run.
Another task used Opus and I manually specified the model to use. It still used the wrong model.
This type of hallucination has happened to me at least 4-5 times in the past fortnight using Opus 4.6 and gemini-3.1-pro. GLM-5 does not seem to hallucinate as much.
So if you are not actively monitoring your agent and making the corrections, you need something else that is.
You need a harness, yes, and you need quality gates the agent can't tamper with, ones that just kick the work back with a stern message to fix the problems. Otherwise you're wasting your time reviewing incomplete work.
Your point being? A proper harness will mostly catch things like that. Even a low-end model can be employed to write test plans and do consistency checks that mostly weed out stuff like that. Hence: you need a harness, or you'll spend your time worrying about dumb stuff like this.
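The kick-back loop described above can be sketched in a few lines. This is a minimal illustration, not anyone's actual harness: `run_agent` and `check` are hypothetical callables standing in for your agent invocation and for a verification step (a test suite, a consistency check by a cheap model, etc.) that runs outside the agent's reach.

```python
def run_quality_gate(task, run_agent, check, max_retries=3):
    """Run the agent, verify its work, and kick failures back.

    run_agent(prompt) sends a prompt to the coding agent (hypothetical).
    check() returns (ok, report), where report describes any failures;
    crucially, it runs outside the agent's control, so the agent
    cannot edit or skip the gate itself.
    """
    prompt = task
    for _ in range(max_retries):
        run_agent(prompt)
        ok, report = check()
        if ok:
            return True
        # Kick the work back with a stern message and the evidence.
        prompt = (
            "Your previous attempt failed verification. "
            "Fix ALL of the following problems before resubmitting:\n"
            + report
        )
    return False
```

The point of the structure is that only work that passes `check` ever reaches human review; everything else loops back to the agent automatically.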
Glancing at what it's doing is part of your multitasking rounds.
Also, instead of just prompting, it helps to have the AI first write a quick plan of exactly what it will do, including class names, branch names, file locations, specific tests, etc., before I hit go, since the plan is smaller and quicker to correct than the code.
That takes more wall clock time per agent, but gets better results, so fewer redo steps.
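One way to make that plan-first step mechanical is to have the agent emit the plan as structured data and refuse to "hit go" until it is complete. A minimal sketch, assuming the plan comes back as JSON; the field names here are my own invention, pick whatever your prompt asks for:

```python
import json

# Hypothetical fields mirroring the plan contents mentioned above.
REQUIRED_FIELDS = ["branch", "files", "classes", "tests"]

def review_plan(plan_json):
    """Parse the agent's plan and reject it if anything is missing.

    Raises ValueError listing the missing fields, so you correct the
    small plan instead of the large diff it would have produced.
    """
    plan = json.loads(plan_json)
    missing = [f for f in REQUIRED_FIELDS if not plan.get(f)]
    if missing:
        raise ValueError(f"Plan incomplete, missing: {missing}")
    return plan
```

A complete plan passes through untouched; an incomplete one fails fast, before any code is written.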