I believe my point is that the M1 seems to spend its gate budget where it actually helps real-world software performance, while the x86 vendors spend theirs elsewhere.
AMD in particular wasted an entire generation of Zen by having too few branch target buffer (BTB) entries. Zen1 performance on real software (i.e. not SPEC) was tragic, similar to Barcelona on a per-clock comparison. They learned the lesson and dramatically increased the BTB entry count in Zen2, and again in Zen3. But the question in my mind is: why would you save a few pennies per unit, even across millions of units, by shaving off half the BTB entries? It doesn't make sense. They must have been guided by the wrong benchmark software.
I doubt it was about saving pennies. Zen1 EPYC was a huge package with expensive assembly and copious silicon area. But that budget was spread across 32 cores, so a larger BTB probably had to come at the expense of something else.
What 'real software' are you thinking of? Anything in particular? Just curious, not looking to argue.
(sorry I changed my comment around the same time you replied)
Very large, branchy programs with very short basic blocks. This describes all the major workloads at Google, according to their paper at https://research.google/pubs/pub48320/
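To make that concrete, here's the kind of toy that reproduces the failure mode. This is my own sketch, not something from the paper: the ~4096 branch sites and the iteration count are arbitrary numbers I picked. It just has the shape they describe, thousands of distinct static branches with tiny basic blocks in between.

    /* btb_stress.c -- illustrative toy, not a rigorous benchmark.
       Build:    cc -O2 -o btb_stress btb_stress.c
       Observe:  perf stat -e branches,branch-misses ./btb_stress  */
    #include <stdio.h>
    #include <time.h>

    /* The conditional store keeps the compiler from if-converting the
       branch into a branchless cmov (C11 forbids inventing writes). */
    static volatile unsigned long sink;

    /* Each expanded `if` is a distinct static branch instruction, so
       B4K puts ~4096 branch sites in the hot loop. A core whose BTB is
       smaller than the taken-branch footprint keeps missing in the BTB
       and restalls its front end; a core with a big BTB mostly hits. */
    #define B1   if (x & 1) x = x * 3 + 1; else { x >>= 1; sink = x; }
    #define B4   B1 B1 B1 B1
    #define B16  B4 B4 B4 B4
    #define B64  B16 B16 B16 B16
    #define B256 B64 B64 B64 B64
    #define B1K  B256 B256 B256 B256
    #define B4K  B1K B1K B1K B1K

    int main(void) {
        unsigned long x = 12345;       /* arbitrary odd seed */
        clock_t t0 = clock();
        for (int i = 0; i < 50000; i++) {
            B4K                        /* ~4096 branch sites per pass */
            if (x == 0) x = 12345;     /* unsigned wraparound could in
                                          principle reach 0; reseed */
        }
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("x=%lu, %.3f s\n", x, secs);
        return 0;
    }

One caveat: the expanded loop body is also big enough to spill L1i, so this conflates BTB pressure with i-cache pressure. Then again, that combination is exactly what the paper says the real workloads look like. Sweeping the footprint (replace B4K with B1K, B256, ...) and comparing perf stat branch-miss rates gives a rough picture; vendor-specific front-end events are more precise.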