But it's mostly true. Trust the compiler. Compilers are very, very good. They have a better understanding of the performance characteristics of the current platform with the current code than the programmer does. Another benefit is that you do not need to maintain and update manual optimisations once an assumption has changed. __forceinline does more harm than good.
(Some compilers these days have whole-program optimisation and outlining, which make the decision of when to inline and when not to inline even harder for humans.)
In some ways they are, and in some ways they're not.
I recently made a loop 5x faster by writing it slightly differently. Reason? MSVC decided to emit code that messes up store forwarding (very much a microarchitectural detail). Spelling out the pointer derefs produced much better code.
More specifically, the loop was loading ARGB values and storing them as BGR (yes, blitting on the CPU, don't ask). MSVC tried to be clever by storing the lower 16 bits of the ARGB value to the stack and then reading the individual bytes for writing. CPUs of course don't (usually) go to main memory when you write to memory and then read it, due to store forwarding. But that only works if your stores and loads are the same sizes - which 16 vs 8 bits are not. So the compiler somehow managed to make a 3 byte twiddle memory bound.
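For illustration, here is a rough sketch of two ways such a loop could be written. This is not the original code: the function names, signatures, and the 0xAARRGGBB pixel layout are assumptions made for the example, and the second variant assumes a little-endian machine so that the bytes sit in memory as B, G, R, A. Which form a given compiler handles well is something only the generated assembly will tell you.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical reconstruction for illustration only.
// Source pixels: 32-bit ARGB values (0xAARRGGBB).
// Destination: tightly packed B, G, R bytes (3 bytes per pixel).

// Variant 1: load the whole pixel, then pick the channels out with shifts.
void argb_to_bgr_shifts(const std::uint32_t* src, std::uint8_t* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        const std::uint32_t px = src[i];
        dst[3 * i + 0] = static_cast<std::uint8_t>(px);        // B
        dst[3 * i + 1] = static_cast<std::uint8_t>(px >> 8);   // G
        dst[3 * i + 2] = static_cast<std::uint8_t>(px >> 16);  // R
    }
}

// Variant 2: spell out the byte-sized pointer dereferences.
// Assumes the source bytes are laid out as B, G, R, A (little-endian ARGB).
void argb_to_bgr_bytes(const std::uint8_t* src, std::uint8_t* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[3 * i + 0] = src[4 * i + 0];  // B
        dst[3 * i + 1] = src[4 * i + 1];  // G
        dst[3 * i + 2] = src[4 * i + 2];  // R (the alpha byte at +3 is skipped)
    }
}
```

Both variants compute the same result; the point is only that semantically equivalent source can lead to quite different machine code.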
Profile, and don't be afraid of reading some assembly.
I agree that in an ideal world function boundaries would be primarily for readability and at most a minor hint to compilers. But I also think that you give compilers way too much credit. Compilers are still generally applying one relatively simple rule after another. Often that will get good results, but it sometimes fails in surprising ways because the compiler will not be able to predict the effect of one optimization on later ones. This is where hints such as "always inline" are useful, and that doesn't even have anything to do with the target platform.
And as for PGO, yes, that's useful, but it's also not a silver bullet: a) as mentioned above, adjusting thresholds for local optimizations (which is what PGO affects) is not always enough, and b) getting a representative profile is not trivial, and the profile also needs to be kept up to date.
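As a minimal sketch of what such a hint looks like in practice: the macro name FORCE_INLINE and the helper functions below are made up for this example. The underlying spellings are `__forceinline` on MSVC and `__attribute__((always_inline))` on GCC and Clang. Even these are strong requests rather than guarantees in every situation.

```cpp
#include <cstddef>
#include <cstdint>

// FORCE_INLINE is a hypothetical macro name chosen for this example.
#if defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define FORCE_INLINE inline __attribute__((always_inline))
#else
#  define FORCE_INLINE inline  // fallback: ordinary inline, no forcing
#endif

// Small helper that we want folded into every caller so that later
// optimizations see the loop body as a whole.
FORCE_INLINE std::uint32_t drop_alpha(std::uint32_t argb) {
    return argb & 0x00FFFFFFu;
}

// Hypothetical caller: sums the RGB parts of a pixel buffer.
std::uint32_t sum_rgb(const std::uint32_t* px, std::size_t n) {
    std::uint32_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += drop_alpha(px[i]);
    }
    return total;
}
```

Whether the hint actually helped is still something to confirm with a profiler and the disassembly.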