Hey, big fan of the vantage instance finder. Would it be possible to add instance type labels similar to what AWS calls them on their website - "storage-optimized", "general-purpose" etc.?
I often find this useful for quickly comparing similar instance types, e.g. m7g vs. m8g vs. m9g.
I personally suffer from the same issue on my X1G9. The best fix so far is unloading and reloading all the related kernel modules, which at least means throttling only drops the clock to 1.2 GHz.
Yes, -O3 tends to include a lot of features that increase code size, like aggressive loop unrolling. If you are jumping around a large amount of code, -O3 generally performs more poorly than -O2, but if you are running a tight loop (like HPC code), -O3 is better.
In the past, when I worked on a very performance-sensitive codebase that was also limited in scope, we compiled with -Os and did all the loop optimizations we wanted manually (and with pragmas). That produced faster code than -O2 or -O3.
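To make "manually (and with pragmas)" a bit more concrete, here is a minimal sketch of the two styles, assuming GCC. It is not the code from that codebase: the function names are made up for illustration, and `#pragma GCC unroll` is GCC-specific (GCC 8 and later).

```c
#include <stddef.h>

/* Hand-unrolled by 4: four independent accumulators shorten the
   dependency chain, and a scalar tail loop handles leftover elements. */
long sum_manual_unroll4(const int *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}

/* Same reduction, but the unroll factor is requested from the compiler
   for this one loop only, so the rest of the file can stay at -Os. */
long sum_pragma_unroll4(const int *a, size_t n)
{
    long s = 0;
#pragma GCC unroll 4
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}
```

The point of doing it this way is that code growth is opt-in per loop rather than applied everywhere. Whether either variant actually beats plain -O2 depends on the workload, so measure.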
Regarding unrolling, -O3 includes -floop-unroll-and-jam but not -funroll-loops. You may want one or the other, maybe both, depending on circumstances. I don't see much benefit from the available pragmas on HPC-type code except for OpenMP, and "omp simd" isn't necessary to get vectorization in the places I've seen people say it is. Mileage always varies somewhat, of course. (Before second-guessing anything, use -fopt-info.)
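As a small illustration of checking instead of guessing, here is a sketch of a loop that GCC will typically vectorize at -O3 without any "omp simd" pragma. The function is a generic saxpy written for this example, not code from anyone's project.

```c
#include <stddef.h>

/* `restrict` promises that x and y do not overlap, so the compiler does
   not need a runtime aliasing check before taking the vectorized path. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Assuming the file is called saxpy.c, compiling with `gcc -O3 -fopt-info-vec-optimized -c saxpy.c` reports the loops that were vectorized, and `-fopt-info-vec-missed` reports the ones that were not, with a reason. The exact wording of the output differs between GCC versions.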
Modern x86 CPUs have micro-op caches that hold small loops (about 50 instructions) and medium loops (~2k instructions). Also, the bottleneck is usually instruction decoding (Alder Lake made huge changes there, so this might change).
In other words, loop unrolling is, more often than not, harmful.
Probably just some tweaks to -O2 would be enough. After all, people are selecting -Os over -O2 because they see better performance, and that should not be happening.
In the application I referred to, PGO was also used. However, that only applies -Os to cold code, and if what you're doing is very branchy, it can help even in the hot path.
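For anyone who hasn't seen what "size-optimized cold code" looks like without a profile, here is a rough manual analogue, assuming GCC or Clang. With real PGO the hot/cold split comes from the profile (build with -fprofile-generate, run, rebuild with -fprofile-use). Here it is hinted by hand with the `cold` function attribute, which GCC documents as optimizing the function for size rather than speed. The function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical rare path. `cold` tells the compiler this is unlikely to
   run; `noinline` keeps it out of the hot loop's code footprint. */
__attribute__((cold, noinline))
static int handle_rare_negative(int value)
{
    return -value;
}

/* Hot path: the common case stays compact, the rare case is a call. */
int process(const int *data, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] < 0)
            total += handle_rare_negative(data[i]);
        else
            total += data[i];
    }
    return total;
}
```

This only covers the cold side. It does nothing for branchy code in the hot path, which is where compiling everything with -Os can still help.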
I agree with you that one can very often get distracted by single events; however, knowing that you are frontend/backend bound isn't all that much more helpful either.
For the frontend you can guess that PGO, BOLT, or huge pages might help, but it's still a blind guess without knowing what to look at next.
Intel's TMA is the only helpful thing here really. Bit sad that AMD and ARM don't provide a way to calculate something TMA-like themselves.