On speed/quality, diffusion has actually moved the frontier. At comparable quality levels, Mercury is >5× faster than similar AR models (including the ones referenced on the AA page). So for a fixed quality target, you can get meaningfully higher throughput.
That said, I agree diffusion models today don’t yet match the very largest AR systems (Opus, Gemini Pro, etc.) on absolute intelligence. That’s not surprising: we’re starting from smaller models and gradually scaling up. The roadmap is to scale intelligence while preserving the large inference-time advantage.
This understates the possible headroom as the technical challenges are addressed: text diffusion is significantly less developed than autoregression with transformers, and Inception is breaking new ground.
Very good point - if as much energy/money as has gone into ChatGPT-style transformer LLMs were put into diffusion, there's a good chance it would outperform in every dimension.
This is way too dense. You need to distill your thesis and interesting ideas down into a short post if you expect people to spend time reading a 417-page PDF.
You're crazy if you think the target demo of "business leaders" and "thought leaders" isn't going to dump it into their favorite LLM first thing and prompt their way to a summary.
The "muse vs. writer" framing is a good start, but the real issue is the source of inspiration. An AI prompted on a blank slate will only ever generate a sophisticated average of its training data. The workflow is broken. A better system doesn't start with "What should I write?" but with "What have I learned?" Using AI to synthesize your unique takeaways from high-signal content you've already consumed—a podcast, a talk—is how you scale authenticity, not just words.
I'm the founder of Castifai.com, which is built for this. It systematizes the "muse" by creating a workflow that starts with content you consume (talks, podcasts) and turns your insights into authentic drafts, solving the input problem.
This isn't a content problem; it's a systems problem. The pressure to create without a pipeline for genuine insights leads to these templates. Authentic thought leadership should be a byproduct of a consumption and synthesis workflow, not a forced, separate task.
I've been working on solving this - first for myself and then for others - by building a tool called Castifai. It's a consumption-first workflow that helps turn insights from content you already consume into authentic posts, so you're sharing what you know, not just filling a quota. (I'm the founder.) You can try it at castifai.com
However, it is also light on material. I would like to hear more technical details; they're probably intentionally secretive about it.
I do understand, however, that building an agent that is highly optimized for your own codebase/process is possible. In fact, I'm pretty sure many companies do this; it's just not yet in the ether.
Otherwise, one of the most interesting bits from the article was
> Over 1,300 Stripe pull requests (up from 1,000 as of Part 1) merged each week are completely minion-produced, human-reviewed, but containing no human-written code.
I feel like code review is already hard and underdone; the "velocity" here is only going to make that worse.
I am also curious how this works when the new crop of junior devs doesn't yet have enough experience to review code, but also isn't getting that experience from writing it.
Agents can already do the review by themselves. I'd be surprised if they review all of the code by hand; they probably can't say so due to the regulatory nature of the field. But from what I have seen, agentic review tools are already between the 80th and 90th percentile: out of ten randomly picked engineers, one will provide more useful comments than most of them.
the problem with LLM code review is that it's good at checking local consistency and minor bugs, but it generally can't tell you if you are solving the wrong problem or if your approach is a bad one for non-technical reasons.
This is an enormous drawback and makes LLM code review more akin to a linter at the moment.
I mean, if the model can reason about making changes to a large-scale repository, doesn't that imply it can also reason about a change somebody else made? I kind of agree and disagree with you at the same time, which is why I said "most of the engineers," but I believe we are heading toward models that can completely autonomously write and review their own changes.
There's a good chance that in the long run LLMs can become good at this, but this would require them e.g. being plugged into the meetings and so on that led to a particular feature request. To be a good software engineer, you need all the inputs that software engineers get.
If you read through the Stripe blog thoroughly, you will see that they already feed their model this kind of information. Being plugged into the meetings might just mean feeding the model the meeting minutes, or letting it listen in and transcribe the meeting. Both seem possible even today.
I was an early MCP hater, but one thing I will say about it is that it's useful as a common interface for secure centralization. I can control auth and policy centrally via a MCP gateway in a way that would be much harder if I had to stitch together API proxies, CLIs, etc to provide capabilities.
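To make the "gateway as a central enforcement point" idea concrete, here's a minimal Python sketch of a gateway that applies a role-based policy before forwarding tool calls to backends. All the names here (`MCPGateway`, `Policy`, the example tools) are hypothetical stand-ins for illustration, not part of any real MCP SDK; a real deployment would sit in front of actual MCP servers and also handle auth tokens, auditing, and rate limits.

```python
# Illustrative sketch: one chokepoint for auth/policy across many tool backends.
# Names are hypothetical; this is not a real MCP SDK.
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class Policy:
    # role -> set of tool names that role may invoke
    allowed_tools: Dict[str, Set[str]] = field(default_factory=dict)

    def permits(self, role: str, tool: str) -> bool:
        return tool in self.allowed_tools.get(role, set())


@dataclass
class MCPGateway:
    policy: Policy
    # tool name -> backend handler (stand-ins for downstream MCP servers)
    backends: Dict[str, Callable[[dict], dict]] = field(default_factory=dict)

    def call_tool(self, role: str, tool: str, args: dict) -> dict:
        # Central enforcement: one place to deny, audit, or rate-limit,
        # instead of stitching policy into every API proxy and CLI.
        if not self.policy.permits(role, tool):
            return {"error": f"{role} is not allowed to call {tool}"}
        return self.backends[tool](args)


# Usage: interns can search tickets but not delete repos.
gateway = MCPGateway(
    policy=Policy({
        "intern": {"search_tickets"},
        "admin": {"search_tickets", "delete_repo"},
    }),
    backends={
        "search_tickets": lambda a: {"results": [f"ticket matching {a['q']}"]},
        "delete_repo": lambda a: {"deleted": a["name"]},
    },
)
print(gateway.call_tool("intern", "search_tickets", {"q": "billing"}))
print(gateway.call_tool("intern", "delete_repo", {"name": "core"}))  # denied centrally
```

The point of the sketch is the single `call_tool` chokepoint: swapping a backend or tightening a role's permissions happens in one place, which is the centralization benefit the comment describes.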