Hacker News | nylonstrung's comments

It sounds like a "cursed problem". Are there any contemporary techniques that show any promise?

Great article. I foresee people rediscovering 'Test Driven Development', probably with a new buzzword slapped on it

Reinforcement learning with program feedback

Agentic Reassurance Patterns

I'm not sold on diffusion models.

Other labs like Google have them, but those models have simply trailed the Pareto frontier for the vast majority of use cases

Here's more detail on how price/performance stacks up

https://artificialanalysis.ai/models/mercury-2


I’d push back a bit on the Pareto point.

On speed/quality, diffusion has actually moved the frontier. At comparable quality levels, Mercury is >5× faster than similar AR models (including the ones referenced on the AA page). So for a fixed quality target, you can get meaningfully higher throughput.

That said, I agree diffusion models today don’t yet match the very largest AR systems (Opus, Gemini Pro, etc.) on absolute intelligence. That’s not surprising: we’re starting from smaller models and gradually scaling up. The roadmap is to scale intelligence while preserving the large inference-time advantage.


This understates the possible headroom as technical challenges are addressed - text diffusion is significantly less developed than autoregression with transformers, and Inception are breaking new ground.

Very good point: if as much energy and money as has gone into ChatGPT-style transformer LLMs were put into diffusion, there's a good chance it would outperform in every dimension

I changed my mind: this would be perfect for a fast edit model ala Morph Fast Apply https://www.morphllm.com/products/fastapply

It looks like they are offering this in the form of "Mercury Edit" and I'm keen to try it
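For anyone unfamiliar with the "fast apply" idea: the edit model emits an abbreviated file with markers standing in for unchanged spans, and a fast merge step expands those markers from the original. Here's a toy sketch of that merge; the marker string and the anchor-matching heuristic are my own illustration, not Morph's or Inception's actual algorithm:

```python
# Toy "fast apply" merge sketch. Assumes each marker is followed by an
# "anchor" line that appears verbatim in the original file; real apply
# models handle replacements and fuzzy anchors far more robustly.
MARKER = "# ... existing code ..."

def fast_apply(original: str, edit: str) -> str:
    """Expand MARKER lines in `edit` with the matching spans of `original`."""
    orig = original.splitlines()
    lines = edit.splitlines()
    out, pos, i = [], 0, 0
    while i < len(lines):
        if lines[i].strip() == MARKER:
            if i + 1 < len(lines):
                anchor = lines[i + 1]
                # Copy unchanged original lines up to the anchor.
                while pos < len(orig) and orig[pos] != anchor:
                    out.append(orig[pos])
                    pos += 1
            else:
                # Trailing marker: keep the rest of the original.
                out.extend(orig[pos:])
                pos = len(orig)
        else:
            out.append(lines[i])
            # Keep the original-file cursor in sync on unchanged lines.
            if pos < len(orig) and orig[pos] == lines[i]:
                pos += 1
        i += 1
    return "\n".join(out)
```

The appeal for a diffusion model is that emitting the short edit is cheap, and the mechanical merge is nearly free.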


It's extremely similar to the fake "agentic" crypto plays a year ago

Where Goatseus Maximus and stuff supposedly created coins and invested autonomously.

Obviously it was BS but it fueled a huge amount of attention and speculation


I hate these Lovable-generated slopsites

BAR is incredible, probably the best RTS right now

This is way too dense; you need to distill your thesis and most interesting ideas into a short post if you expect people to spend time reading a 417-page PDF

You're crazy if you think the target demo of "business leaders" and "thought leaders" isn't going to dump it into their favorite LLM first thing and prompt their way to a summary.

So much water and resources being wasted by "thought leaders" posting performative BS on LinkedIn (just count "It is not X, it is Y" style posts).

The "muse vs. writer" framing is a good start, but the real issue is the source of inspiration. An AI prompted on a blank slate will only ever generate a sophisticated average of its training data. The workflow is broken. A better system doesn't start with "What should I write?" but with "What have I learned?" Using AI to synthesize your unique takeaways from high-signal content you've already consumed—a podcast, a talk—is how you scale authenticity, not just words.

I'm the founder of Castifai.com, which is built for this. It systematizes the "muse" by creating a workflow that starts with content you consume (talks, podcasts) and turns your insights into authentic drafts, solving the input problem.


This isn't a content problem; it's a systems problem. The pressure to create without a pipeline for genuine insights leads to these templates. Authentic thought leadership should be a byproduct of a consumption and synthesis workflow, not a forced, separate task. I've been working on solving this - first for myself and then for others - by building a tool for this called Castifai. It's a consumption-first workflow that helps turn insights from content you already consume into authentic posts, so you're sharing what you know, not just filling a quota. (I'm the founder). You can try it at castifai.com

Directionally correct, but it's important to note that the water wasted sustaining the insufferable human is much higher than the water wasted producing the tokens

I'm not the author, I just got sent the link by someone else :)

This wasn't a16z monolithically speaking as a firm, it was Anish Acharya talking on a podcast.

Seems like he's focused on fintech and not involved in many of their LLM investments


It has all the trappings of NIH syndrome.

Reinventing the wheel without explaining why existing tools didn't work

Creating buzzwords ("blueprints", "devboxes") for concepts that are not novel and already have common terms

Yet they embrace MCP of all things as a transport layer: the one part of the common "agentic" stack that genuinely sucks and actually needs to be reinvented


They mention "Why did we build it ourselves" in the part1 series: https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-...

However, it is also light on material. I would like to hear more technical details; they're probably intentionally secretive about it.

I do, however, believe that building an agent highly optimized for your own codebase and process is possible. In fact, I am pretty sure many companies do this; it's just not yet in the ether.

Otherwise, one of the most interesting bits from the article was

> Over 1,300 Stripe pull requests (up from 1,000 as of Part 1) merged each week are completely minion-produced, human-reviewed, but containing no human-written code.


"human reviewed"

"LGTM..."

I feel like code review is already hard and underdone; the 'velocity' here is only going to make that worse.

I am also curious how this works when the new crop of junior devs does not have enough experience to review code, yet is no longer getting that experience from writing it.

Time will tell I guess.


Agents can already do the review by themselves. I'd be surprised if they review all of the code by hand; they probably can't mention it due to the regulatory nature of the field. From what I have seen, agentic review tools are already at the 80th to 90th percentile: out of 10 randomly picked engineers, they will provide more useful comments than most of them.

the problem with LLM code review is that it's good at checking local consistency and minor bugs, but it generally can't tell you if you are solving the wrong problem or if your approach is a bad one for non-technical reasons.

This is an enormous drawback and makes LLM code review more akin to a linter at the moment.


I mean, if the model can reason about making changes in a large-scale repository, then this implies it can also reason about a change somebody else made, no? I kind of agree and disagree with you at the same time, which is why I said most of the engineers, but I believe we are heading towards models being able to completely autonomously write and review their own changes.
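The autonomous write-then-review loop is mechanically simple; the open question is whether the review pass catches anything. A toy sketch, where `llm` is a stand-in callable (prompt -> str), not any real API:

```python
# Toy sketch of a model writing and then reviewing its own change.
# `llm` is a hypothetical prompt->str callable; prompts are made up.
def write_with_self_review(task: str, llm, max_rounds: int = 3) -> str:
    draft = llm(f"Implement: {task}")
    for _ in range(max_rounds):
        review = llm(f"Review this change for bugs:\n{draft}")
        if "LGTM" in review:  # reviewer is satisfied, stop iterating
            return draft
        draft = llm(f"Revise the change per this review:\n{review}\n---\n{draft}")
    return draft  # give up after max_rounds and return the last draft
```

The skeptic's point maps onto this directly: if the same weights produce both `draft` and `review`, the loop mostly catches local bugs, not wrong-problem errors.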

There's a good chance that in the long run LLMs can become good at this, but this would require them e.g. being plugged into the meetings and so on that led to a particular feature request. To be a good software engineer, you need all the inputs that software engineers get.

If you read thoroughly through the Stripe blog, you will see that they already feed their model this or a similar type of information. Being plugged into the meetings might just mean feeding the model the meeting minutes, or letting the model listen to the meeting and transcribe it. Both seem possible even today.

What are the common terms for those? (I have heard "devbox" across multiple companies, and I'm not in the LLM world enough to know the other parts.)

I was an early MCP hater, but one thing I will say for it is that it's useful as a common interface for secure centralization. I can control auth and policy centrally via an MCP gateway in a way that would be much harder if I had to stitch together API proxies, CLIs, etc. to provide capabilities.
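Concretely, the win is having one enforcement point in front of all tools. A toy sketch of that pattern; everything here (Policy, Gateway, the tool names) is hypothetical and uses no real MCP SDK:

```python
# Toy sketch of central auth/policy at a gateway in front of tool backends.
from dataclasses import dataclass, field

@dataclass
class Policy:
    # tool name -> roles allowed to call it
    allowed_roles: dict[str, set[str]] = field(default_factory=dict)

    def check(self, role: str, tool: str) -> bool:
        return role in self.allowed_roles.get(tool, set())

class Gateway:
    def __init__(self, policy: Policy, upstreams: dict):
        self.policy = policy
        self.upstreams = upstreams  # tool name -> handler callable

    def call_tool(self, role: str, tool: str, **args):
        # One enforcement point, instead of policy scattered across
        # per-backend API proxies and CLIs.
        if not self.policy.check(role, tool):
            raise PermissionError(f"role {role!r} may not call {tool!r}")
        return self.upstreams[tool](**args)
```

Because every tool call funnels through `call_tool`, adding audit logging or rate limits later is one change, not N.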

>Reinventing the wheel without explaining why existing tools didn't work

Won't that be the new normal with all these AI agents?

No frameworks, no libraries, just let AI create everything from scratch again


resume driven development

Agree that routing is becoming the critical layer here. vLLM's IRIS is really promising for this: https://blog.vllm.ai/2026/01/05/vllm-sr-iris.html

There's already some good work on router benchmarking which is pretty interesting

