These numbers are impressive, to say the least. It looks like Google has produced a beast that will raise the bar even higher. What's even more impressive is how Google came into this game late and went from producing a few flops to being the leader (arguably, they already earned that title with 2.5 Pro).
Google was never really late. Where people perceived Google to have dropped the ball was in its productization of AI. Google's Bard branding stumble was so (hilariously) bad that it threw a lot of people off the scent.
My hunch is that, aside from "safety" reasons, the Google Books lawsuit left some copyright wounds that Google did not want to reopen.
Google's productization is still rather poor. If I want to use OpenAI's models, I go to their website, look up the price, and pay it. For Google's, I need to figure out whether I want AI Studio or Google Cloud Code Assist or AI Ultra, etc., and if this is for commercial use where I need to prevent Google from training on my data, figuring out which options work is extra complicated.
As of a couple weeks ago (the last time I checked), if you are signed in to multiple Google accounts and cannot accept the non-commercial terms for one of them in AI Studio, the site is horribly broken: the text showing which account you're being asked to accept the terms for is blurred, and you can't switch accounts without agreeing first.
In Google’s very slight defense, Anthropic hasn’t even tried to make a proper sign in system.
Like, kind of unreasonably good. You'd expect some perfunctory Electron app that just barely wraps the website. But no, you get something that feels incredibly polished (more so than a lot of recent apps from Apple) and has powerful integrations into other apps, including text editors and terminals.
Bard was horrible compared to the competition of the time.
Gemini 1.0 was strictly worse than GPT-3.5 and was unusable due to "safety" features.
Google followed that up with 1.5, which was still worse than GPT-3.5 and unbelievably far behind GPT-4. Around the same time, Google had their "black Nazi" scandals.
With Gemini 2.0, Google finally had a model that was at least useful for OCR, and with the Flash series a model that, while not up to par in capabilities, was sufficiently inexpensive that it found uses.
Only with Gemini 2.5 did Google catch up with SoTA, coming within spitting distance of the leading models.
Google did indeed drop the ball, very, very badly.
I suspect that Sergey coming back helped immensely, somehow. I suspect that he was able to tame some of the more dysfunctional elements of Google, at least for a time.
I feel like 1.5 was still pretty good -- my school blocked ChatGPT at the time but didn't bother with anything else, so I was using it more than anything else for general research help, and it was fine. That blocking is probably the biggest reason I use Gemini 90% of the time now, because school can never block Google Search, and AI Mode is in that now. That, and the Android integration.
To be fair, for my use case (apart from GitHub Copilot stuff with Claude 4.5 Sonnet) I've never noticed too big of a difference between the actual models, and am more inclined to judge them by their ancillary services and speed, which Google excels in.
Oh, I remember the times when I compared Gemini with ChatGPT and Claude. Gemini was so far behind, it was barely usable. And now they are pushing the boundaries.
You could argue that chat-tuning of models falls more along the lines of product competence. I don't think there was a doubt about the upper ceiling of what people thought Google could produce; it was more "when will they turn on the tap" and "can Pichai be the wartime general to lead them?"
Google was catastrophically traumatized throughout the org when they had that photos AI mislabel black people as gorillas. They turned the safety and caution knobs up to 12 after that for years, really until OpenAI came along and ate their lunch.
Oh, they were so late that there were internal leaked ("leaked"?) memos about a couple of grad students with a $100 budget outdoing their lab a couple of years ago. They picked themselves up real nice, but it took a serious reorg.
Apple is struggling with _productizing_ LLMs for the mass market, which is a separate task from training a frontier LLM.
To be fair to Apple, so far the only mass-market LLM use case is a simple chatbot, and they don't seem to be interested in that. It remains to be seen whether what Apple wants to do ("private" LLMs with access to your personal context, acting as intimate personal assistants) is even possible to do reliably. It sounds useful, and I do believe it will eventually be possible, but no one is there yet.
They did botch the launch by announcing the Apple Intelligence features before they were ready, though.
They may want to use third-party models, or just wait for AI to become more stable and see how people actually use it, instead of adding slop to the core of their product.
This is revisionist history. Apple wanted to fully jump in. They even rebranded AI as Apple Intelligence and announced a horde of features which turned out to be vaporware.
There are no leaders. Every other month a new LLM comes out and outperforms the previous ones by a small margin; the benchmarks always look good (probably because the models are trained on the answers), but then in practice they are basically indistinguishable from the previous ones (take GPT-4 vs GPT-5). We've been in this loop since around the release of GPT-4, when all the main players started this cycle.
The biggest strides in the last 6-8 months have been in generative AIs, specifically for animation.
I hope they keep the pricing similar to 2.5 Pro's. Currently I pay per token, and that and GPT-5 are close to the sweet spot for me, but Sonnet 4.5 feels too expensive for larger changes. I've also been moving around 100M tokens per week with Cerebras Code (they moved to GLM 4.6), but the flagship models still feel better when I need help with more advanced debugging, or with some exemplary refactoring to then feed as an example to a dumber/faster model.
It's not like they're making their money from this though. All AI work is heavily subsidised, for Alphabet it just happens that the funding comes from within the megacorp. If MS had fully absorbed OpenAI back when their board nearly sunk the boat, they'd be in the exact same situation today.
They're not making money, but they're in a much better position than Microsoft/OpenAI because of TPUs. TPUs are much cheaper than Nvidia cards both to purchase and to operate, so Google's AI efforts aren't running at as much of a loss as everyone else's. That's why they can do things like offer Gemini 3 Pro for free.
A lot of major providers offer their cutting-edge model for free in some form these days; that's merely a market-penetration strategy. At the end of the day (if you look at cloud prices), TPUs are only about 30% cheaper, and Nvidia produces orders of magnitude more cards. So Google will certainly need more time to train and globally deploy inference for their frontier models. For example, I doubt they could do with TPUs what xAI did with Nvidia cards.
What makes me even more curious is the following:
> Model dependencies: This model is not a modification or a fine-tune of a prior model
So did they start from scratch with this one?