> What the VLIW of Itanium needed and never really got was proper compiler support.
This is kinda under-selling it. The fundamental problem with statically-scheduled VLIW machines like Itanium is that they put all of the complexity in the compiler. Unfortunately, it turns out it's just really hard to make a good static scheduler!
In contrast, dynamically-scheduled out-of-order superscalar machines work great but put all the complexity in silicon. The transistor overhead was expensive back in the day, so statically-scheduled VLIWs seemed like a good idea.
What happened was that static scheduling stayed really hard while the transistor overhead for dynamic scheduling became irrelevantly cheap. "Throw more hardware at it" won handily over "Make better software".
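To make the tradeoff concrete, here's a minimal C sketch (my own illustration, not from the thread) of why static scheduling stayed hard: the latency of the pointer-chasing load below depends on cache state the compiler can never see, while an out-of-order core simply resolves it at run time.

```c
#include <stddef.h>

struct node { struct node *next; long value; };

/* Pointer-chasing sum: each iteration starts with a load whose latency
 * is a few cycles on an L1 hit and hundreds of cycles on a DRAM miss. */
long sum_list(const struct node *n) {
    long sum = 0;
    while (n != NULL) {
        /* A VLIW compiler must decide at build time how many independent
         * instructions to bundle after this load to hide its latency,
         * without knowing which latency it is actually hiding.
         * An out-of-order core just keeps issuing younger, independent
         * work while the load is outstanding, whatever the latency
         * turns out to be. */
        sum += n->value;
        n = n->next;
    }
    return sum;
}
```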
No, VLIW is even worse than this. Describing it as a compiler problem undersells the issue. VLIW is not tractable for a multitasking / multi-tenant system due to cache residency issues. The compiler cannot efficiently schedule instructions without knowing what is in cache, but it can't know what's going to be in cache if it doesn't know what's occupying the adjacent task time slices. Add virtualization and it's a disaster.
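A minimal sketch of that point, using two hypothetical co-scheduled tasks of my own invention (task_a and task_b): even a perfect profile of task_a's cache behaviour is invalidated by whatever ran in the adjacent time slice.

```c
#include <string.h>

#define TABLE_WORDS  (32 * 1024)        /* ~256 KiB lookup table   */
#define STREAM_BYTES (64 * 1024 * 1024) /* 64 MiB streaming buffer */

static long table[TABLE_WORDS];
static char stream[STREAM_BYTES];

/* Task A: a static scheduler would love to assume `table` stays
 * cache-resident and plan around short, fixed load latencies. */
long task_a(const int *indices, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += table[indices[i]];
    return sum;
}

/* Task B: a co-tenant in the next time slice that sweeps a large
 * buffer and evicts Task A's table. Nothing visible in Task A's
 * source (or binary) tells the compiler this will happen. */
void task_b(void) {
    memset(stream, 0, sizeof stream);
}
```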
If it's pure TFLOPs you're after, you do want a more or less statically scheduled GPU. But for CPU workloads, even the low-power efficiency cores in phones these days are out of order, and the size of reorder buffers in high-performance CPU cores keeps growing. If you try to run a CPU workload on GPU-like hardware, you'll just get pitifully low utilization.
So it's clearly true that the transistor overhead of dynamic scheduling is cheap compared to the (as-yet unsurmounted) cost of doing static scheduling for software that doesn't lend itself to that approach. But it's probably also true that dynamic scheduling is expensive compared to ALUs, or else we'd see more GPU-like architectures using dynamic scheduling to broaden the range of workloads they can run with competitive performance. Instead, it appears the most successful GPU company largely just keeps throwing ALUs at the problem.
I think OP meant "transistor count overhead" and that's true. There are bazillions of transistors available now. It does take a lot of power, and returns are diminishing, but there are still returns, even more so than just increasing core count. Overall what matters is performance per watt, and that's still going up.