For a real-world example of the challenges of harnessing LLMs, look at Apple. Over a year ago they had a big product launch focused on "Apple Intelligence" that was supposed to make heavy use of LLMs for agentic workflows. But all we've really gotten since then are a couple of minor tools for making emojis, summarizing notifications, and proofreading. And they even had to roll back the notification summaries for a while for being wildly "out of control". [1] And in this year's iPhone launch the AI marketing was toned down significantly.
I think Apple execs genuinely underestimated how difficult it would be to get LLMs to perform up to Apple's typical standards of polish and control.
> to perform up to Apple's typical standards of polish and control.
i no longer believe they have kept to those standards in general. the ux/ui used to be a top priority, but the quality control has certainly gone down over the years [1]. the company is now driven more by supply-chain and business-minded optimizations than by what to give to the end user.
at the same time, what one can do using AI correlates strongly with what one does with their devices in the first place. a Windows Recall-like feature for iPadOS might have been interesting (if not equally controversial), but not that useful, because even to this day it remains quite restrictive for most atypical tasks.
>> to perform up to Apple's typical standards of polish and control.
>i no longer believe they have kept on to the standards in general.
I 100% agree with this. If I compare AI's ability to speed up the baseline for me in terms of programming Golang (hard/tricky tasks clearly still require human input - watch out for I/O ops) with Apple's lack of ability to integrate it in even the simplest of ways... things are just peculiar on the Apple front. Bit similar to how MS seems to be gradually losing the ability to produce a version of Windows that people want to run due to organisational infighting.
Personally, I’ve never seen an AI flow of any kind that would meet the quality of a typical ‘corporate’ acceptable flow. As in, reliably works, doesn’t go crazy randomly, etc.
I’ve seen a lot of things that look like they’re working for a demo, but shortly after starting to use it? Trash. Not every time (and it’s getting a little better), but often enough that personally I’ve found them a net drain on productivity.
And I literally work in this space.
Personally, I find Apple’s hesitation here a breath of fresh air, because I’ve come to absolutely hate Windows - and everybody doing vibe code messes that end up being my problem.
> Personally, I find Apple’s hesitation here a breath of fresh air
it does not appear to me as hesitation, but rather an example of how they have recently been unable to deliver on their marketing promises.
calling a suite of incomplete features "Apple Intelligence" means that they had much higher expectations internally, similar to how they refined products as second movers in other instances. they have a similar situation with XR now.
> I’ve never seen an AI flow of any kind that would meet the quality of a typical ‘corporate’ acceptable flow. As in, reliably works, doesn’t go crazy randomly, etc.
Jump [1] built a multi-million dollar business exactly on this, a service used by corporations in financial consultancy.
The regular ChatGPT 5 seems pretty reliable to me? I ~never get crazy output unless I'm pasting a jailbreak prompt I saw on twitter. It might not always meet my standards, but that's true of a lot of things.
Maybe not the same thing, but ChatGPT 5 was driving me insane in Visual Studio Copilot last week. I seemingly couldn't stop it from randomly changing bits of code, to the point where it was apologising and then doing the same in the next change even when told not to.
I've now changed to asking where things are in the code base and how they work then making changes myself.
With Apple it's incredibly obvious that most software product development is nowadays handled by outsourced/offshored contractors who simply do not use the products. At least I hope that's the case; it would be disastrous if the state of iOS/watchOS were the result of their in-house, onshore talent.
It's such a testament to how good they used to be that years and years of dropping the ball still leaves them better than everyone else. Maybe they were actually just much better than anyone was willing to pay for, and the market just didn't reward the attention to detail.
Like most AI products it feels like they started with a solution first and went searching for the problems. Text messages being too long wasn't a real problem to begin with.
There are some good parts to Apple Intelligence though. I find the priority notifications feature works pretty well, and the photo cleanup tool handles small things nicely, like removing your finger from the corner of a photo, though it's not going to work on huge tasks like removing a whole person from a photo.
> it's not going to work on huge tasks like removing a whole person from a photo.
I use it for removing people who wander into the frame quite often. It probably won't work for someone close up, but it's great for removing a tourist who spends ten minutes taking selfies in front of a monument.
Honestly I love the priority notifications and the notification summaries. The thing that drives me absolutely insane is the fact that when I view the notification by clicking on it from somewhere other than the "While in the Reduce Interruptions focus" section, it doesn't clear. Because of this, I always have infinite notifications.
I want to open WhatsApp and open the message and have it clear the notif. Or at least click the notif from the normal notif center and have it clear there. It kills me.
Really those should have been filtered out by the spam filter. If it's made it all the way to your inbox it's not surprising it got marked as a priority since phishing emails are written to look urgent, something which if real would be a priority notification.
Do you know if Apple is using their new tools to do mail filtering? It's an interesting choice if they are, since it's a genuine problem with a mature (but always evolving) solution.
Well yeah, but that's in part a problem with always-on doorbell cameras. On paper they're illegal in many countries (privacy laws: you can't just put up a camera and record anyone out in public); in practice, the police ask people to put their doorbell cameras in a registry so they can request footage if needs be.
Anyway, I get wanting to see who's ringing your doorbell in e.g. apartment buildings, and that extending to a house, especially if you have a bigger one. But is there a reason those cameras need to be on all the time?
At least in the USA it’s legal to record public spaces. So recording the street and things that can be seen from it is legal, but pointing a camera over your neighbors fence is not.
And a lot of people don't share that opinion, so this isn't the law in a lot of countries. If you wanted to suggest that it is a problem that US companies try to extend the law of their home country to other parts of the world, then I endorse that.
It isn't creepy, it's super annoying if you don't live in the woods. Got a Ring doorbell and turned them off a few hours after installation; it was driving me nuts.
That makes... That makes just enough sense to become nonsense, rather than mere noise.
I mean, I could imagine a person with no common sense almost making the same mistake: "I have a list of 5 notifications of a person standing on the porch, and no notifications about leaving, so there must be a 5 person group still standing outside right now. Whadya mean, 'look at the times'?"
> A biologist, a physicist and a mathematician were sitting in a street cafe watching the crowd. Across the street they saw a man and a woman entering a building. Ten minutes later they reappeared together with a third person.
> - They have multiplied, said the biologist.
> - Oh no, an error in measurement, the physicist sighed.
> - If exactly one person enters the building now, it will be empty again, the mathematician concluded.
It does feel like somebody forgot that "from the first sentence or two of the email, you can tell what it's about" was already a rule of good writing...
Maybe they remembered that a lot of people aren't actually good writers. My brother will send 1000-word emails that meander through subjects like what he ate for breakfast to eventually get to the point of scheduling a meeting about negotiating a time for help with moving a sofa. Mind you, I see him several times a week, so he's not lonely; this is just the way he writes. Then he complains endlessly about his coworkers using AI to summarize his emails. When told that he needs to change how he writes and cut right to the point, he adopts the "why should I change, they're the ones who suck" mentality.
So while Apple's AI summaries may have been poorly executed, I can certainly understand the appeal and motivation behind such a feature.
I mean...this depends very heavily on what the purpose of the writing is.
If it's to succinctly communicate key facts, then you write it quickly.
- Discovered that Bilbo's old ring is, in fact, the One Ring of Power.
- Took it on a journey southward to Mordor.
- Experienced a bunch of hardship along the way, and nearly failed at the end, but with Sméagol's contribution, successfully destroyed the Ring and defeated Sauron forever.
....And if it's to tell a story, then you write The Lord of the Rings.
Now, that's very true! But it's a far cry from implying that all or most humanities teachers are all about writing florid essays when 3 bullet points will do.
There’s a thread here that could be pulled - something about using AI to turn everyone into exactly who you want to communicate with in the way you want.
Probably a sci-fi story about it, if not, it should be written.
I think people read texts because they want to read them, and when they don't want to read the texts they are also not even interested in reading the summaries.
Why do I think this? ...in the early 2000s my employer had a company-wide license for a document summarizer tool that was rather accurate and easy to use, but nobody ever used it.
The obvious use case is “I don’t want to read this but I am required to read this (job)” - the fact that people don’t want to use it even there is telling, imo.
Even bending over that far backwards to find a useful example comes up empty.
Those kinds of emails are so uncommon they’re absolutely not worth wasting this level of effort on. And if you’re in a sorry enough situation where that’s not the case, what you really need is the outside context the model doesn’t know. The model doesn’t know your office politics.
No one cares about the terms of service. And if they actually do, they will need to read every word very carefully to know if they are in legal trouble. A possibly wrong summary of a terms of service document is entirely and completely useless.
It's not even that they are useless, they are actively wrong. I could post pages upon pages of screenshots of the summaries being literally wrong about the content of the messages it summarised.
I find it weird that we even think we need notification summaries. If the notification body text is long or complex enough to benefit from summarizing, then the person who wrote that text has failed at the job. Notifications are summaries.
> I think Apple execs genuinely underestimated how difficult it would be to get LLMs to perform up to Apple's typical standards of polish and control
Not only Apple, this is happening across the industry. Executives' expectations of what AI can deliver are massively inflated by Amodei et al. essentially promising human-level cognition with every release.
The reality is, aside from coding assistants and chatbot interfaces (à la ChatGPT), we've yet to see AI truly transform polished ecosystems like smartphones and OSes, for a reason.
The article says App Intents, not Apple Events. Apple Events would be the natural thing, but it's an abandoned ecosystem that would require them to walk back the past decade, so of course they won't do that.
My wife was in China recently and was sending back pictures of interesting things - one came in while I was driving and my iPhone read out a description of the picture that had been sent - "How cool is that!" I thought.
However, when I stopped driving and looked at the picture the AI generated description was pretty poor - it wasn't completely wrong but it really wasn't what I was expecting given the description.
It’s been surprisingly accurate at times - “a child holding an apple” in a crowded picture - and then sometimes somewhat wrong.
What really kills me is “a screenshot of a social media post” come on it’s simple OCR read the damn post to me you stupid robot! Don’t tell me you can’t, OCR was good enough in the 90s!
The description said "People standing in front of impressive scenery" (or something like that) - it got the scenery part correct but the people are barely visible and really small.
Apple's whole brand is built around tight control, predictable behavior, and a super polished UX, which is basically the opposite of how LLMs behave out of the box.
Which is ironic, given all I really want from Siri is an advanced-voice-chat-level ChatGPT experience - being able to carry on about 90% of a natural conversation with GPT, while Siri vacillates wildly between 1) simply not responding, 2) misunderstanding, and 3) understanding but refusing to engage - feels awful.
> get LLMs to perform up to Apple's typical standards of polish and control.
I reject this spin (which is the Apple PR explanation for their failure). LLMs already do far better than Apple’s 2025 standards of polish. Contrast things built outside Apple. The only thing holding Siri back is Apple’s refusal to build a simple implementation where they expose the APIs to “do phone things” or “do home things” as a tool call to a plain old LLM (or heck, build MCP so LLM can control your device). It would be straightforward for Apple to negotiate with a real AI company to guarantee no training on the data, etc. the same way that business accounts on OpenAI etc. offer. It might cost Apple a bunch of money, but fortunately they have like 1000 bunches of money.
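As a rough sketch of what I mean (not Apple's actual APIs - the tool name and the phone-side hook here are made up for illustration, and this just uses the standard OpenAI tool-calling interface):

    import json
    from openai import OpenAI

    client = OpenAI()  # e.g. a business account with a no-training-on-data agreement

    # Hypothetical "do phone things" capability exposed to a plain old LLM as a tool.
    tools = [{
        "type": "function",
        "function": {
            "name": "set_alarm",
            "description": "Set an alarm on the user's phone.",
            "parameters": {
                "type": "object",
                "properties": {
                    "time": {"type": "string", "description": "24h time, e.g. '07:30'"},
                    "label": {"type": "string"},
                },
                "required": ["time"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Wake me up at 7:30 tomorrow"}],
        tools=tools,
    )

    # The model replies with a structured tool call; the OS side executes it.
    for call in resp.choices[0].message.tool_calls or []:
        if call.function.name == "set_alarm":
            args = json.loads(call.function.arguments)
            # hand args["time"] / args.get("label") to the real alarm API here

The hard part is the business negotiation and the plumbing from tool calls into the OS, not the model itself.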
I could also imagine that Apple execs might be too proud to use someone else's AI, and so wanted to train their own from scratch, but ultimately failed to do this. Totally agree that this smells like a people failure rather than a technology failure
reminds me of the attempts that companies in the game industry made to get away from Steam in the 2010s-2020s. turns out having your game developers pivot to building a proprietary virtual software market, and then competing with an established titan, is not an easy task.
Apple’s experience has almost nothing to do with “harnessing” LLMs, and everything to do with their wildly misjudged assumption they could run a viable model on a phone. Useful LLMs require their own power plants and can only be feasibly run in the cloud, or in a limited manner on powerful equipment like a 5090. Apple seems to have misunderstood that the “large” in large language model isn’t just a metaphor.
The thought that a company like Apple, which surely put hundreds of engineers to work on these tools and went through multiple iterations of their capabilities, would launch the capabilities... only for its executives to realize after release that current AI is not mature enough to add significant commercial value to their products, is almost comical.
The reality is that if they hadn’t announced these tools and joined the make-believe AI bubble, their stock price would have crashed. It’s okay to spend $400 million on a project, as long as you don’t lose $50 billion in market value in an afternoon.
I'm happy they ate shit here because I like my Mac not getting Copilot bullshit forced into it, but apparently Apple had two separate teams competing against each other on this topic. Supposedly a lot of politics got in the way of delivering a good product, combined with the general difficulty of building LLM products.
I do prefer that Apple is opting to have everything run on device so you aren’t being exposed to privacy risks or subscriptions. Even if it means their models won’t be as good as ones running on $30,000 GPUs.
It also means that when the VC money runs dry, it's sustainable to run those models on-device vs. losing money running on those $$$$$ GPUs (or requiring consumers to opt for expensive subscriptions).
If you have, say, 16GB of GPU RAM and around 64GB of RAM and a reasonable CPU, then you can make decent use of LLMs. I'm not an Apple jockey, but I think you normally have something like that available, and so you will have a good time, provided you curb your expectations.
I'm not an expert but it seems that the jump from 16 to 32GB of GPU RAM is large in terms of what you can run and the sheer cost of the GPU!
If you have 32GB of local GPU RAM and gobs of RAM you can run some pretty large models locally, or lots of small ones for differing tasks.
I'm not too sure about your privacy/risk model, but owning a modern phone is a really bad starter for ten! You have to decide what that means for you, and that's your thing and yours alone.
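To make that concrete, here's a minimal sketch of the kind of local setup I mean, using llama-cpp-python; the model filename and sizes are just placeholders - pick a quantized GGUF that fits your VRAM:

    # Minimal local-LLM sketch (pip install llama-cpp-python); numbers are illustrative.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/some-14b-instruct-q4_k_m.gguf",  # a 4-bit 14B model fits in ~16GB of VRAM
        n_gpu_layers=-1,  # offload all layers to the GPU; lower this if you run out of VRAM
        n_ctx=8192,       # context window; longer contexts cost more memory
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Summarize the trade-offs of running LLMs locally."}],
        max_tokens=256,
    )
    print(out["choices"][0]["message"]["content"])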
> Apple had two separate teams competing against each other on this topic
That is a sign of very bad management. Overlapping responsibilities kill motivation, as winning the infighting becomes more important than creating a good product. Low morale and a blaming culture are the result of such "internal competition". Instead, leadership should do their work: align goals, set clear priorities, and make sure that everybody rows in the same direction.
It’s how Apple (relatively famously?) developed the iPhone, so I’d assume they were using this as a model.
> In other words, should he shrink the Mac, which would be an epic feat of engineering, or enlarge the iPod? Jobs preferred the former option, since he would then have a mobile operating system he could customize for the many gizmos then on Apple’s drawing board. Rather than pick an approach right away, however, Jobs pitted the teams against each other in a bake-off.
But that's not the same thing, right? That means having two teams competing to develop the next product. That's not two organisations handling the same responsibilities. You may still end up with problems from infighting, but if there is a clear end date for that competition and no lasting effects for the "losers", this kind of "competition" will have very different effects than setting up two organisations that fight over some responsibility.
> Distrust between the two groups got so bad that earlier this year one of Giannandrea’s deputies asked engineers to extensively document the development of a joint project so that if it failed, Federighi’s group couldn’t scapegoat the AI team.
> It didn’t help the relations between the groups when Federighi began amassing his own team of hundreds of machine-learning engineers that goes by the name Intelligent Systems and is run by one of Federighi’s top deputies, Sebastien Marineau-Mes.
This is a pretty good article, and worth reading if you aren't aware that Apple has seemingly mostly abandoned the vision of on-device AI (I wasn't aware of this)
[1] https://www.bbc.com/news/articles/cge93de21n0o