Most high-performance workloads are limited by memory-bandwidth these days. Even...

Most high-performance workloads are limited by memory-bandwidth these days. Even in HPC that became the primary bottleneck for a large percentage of workloads in the 2000s. High-performance data infrastructure is largely the same. You can drive 200 GB/s of I/O on a server in real systems today.

The memory-bandwidth bound cases is where thread-per-core tends to shine. It was the problem in HPC that thread-per-core was invented to solve and it empirically had significant performance benefits. Today we use it in high-scale databases and other I/O intensive infrastructure if performance and scalability are paramount.

That said, it is an architecture that does not degrade gracefully. I've seen more thread-per-core implementations in the wild that were broken by design than ones that were implemented correctly. It requires a commitment to rigor and thoroughness in the architecture that most software devs are not used to.