
This, like almost every counterpoint I've faced thus far, is simply wrong. It is manufactured reality.

ICC 8 added auto-vectorization, and that very first auto-vectorizing version is also where the "GenuineIntel" branch for such vectorized code appeared, because, despite all of the fiction stated otherwise, vectorizing is actually a very hard task (hence why Intel maintains such a lead, and why people are still griping about this nine years after it came about).

I am hardly standing up for Intel, but this is a Reddit-level conversation, where people simply say what they hope is true.



I don't understand your "because" statement. Intel added a check for Intel CPUs because vectorization is difficult. That's a complete non sequitur as far as I can tell. It makes as much sense as saying that I baked a chocolate cake because it rained yesterday.

Yes, various optimizations, including auto-vectorization, are difficult. Why does that mean Intel had to add a check for Intel CPUs in their compiler?


I'm a glutton for punishment, I suppose.

The Intel compiler makes tight, fast x86[^1]. It ALSO can optionally generate auto-vectorized code paths for specific Intel architectures, for that small amount of niche code that can be vectorized. It is not simply "has feature versus doesn't have feature"; instead it chooses the usage profile of features based on the runtime architecture. Each architecture has significant nuances, setup and teardown costs, etc., and anyone who says "they should just feature sniff" does not understand the factors (though that certainly doesn't stop them from having an opinion). Saying that because they don't do the latter for AMD processors means they "crippled" them is nonsensical.
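Very roughly, the difference is the one sketched below. This is a simplified illustration using GCC/Clang CPU builtins, not Intel's dispatcher, and the per-architecture kernels are hypothetical stand-ins:

    /* Sketch of architecture-based dispatch versus plain feature
     * sniffing.  Not Intel's code: the kernels are stand-ins, and a
     * real dispatcher weighs far more factors per microarchitecture. */
    #include <stddef.h>

    static float dot_generic(const float *a, const float *b, size_t n) {
        float s = 0.0f;
        for (size_t i = 0; i < n; i++)
            s += a[i] * b[i];
        return s;
    }

    /* In a real build these would be separately tuned, vectorized
     * versions; here they alias the generic one to keep the sketch
     * self-contained. */
    static float dot_core2(const float *a, const float *b, size_t n) {
        return dot_generic(a, b, n);
    }
    static float dot_nehalem(const float *a, const float *b, size_t n) {
        return dot_generic(a, b, n);
    }

    typedef float (*dot_fn)(const float *, const float *, size_t);

    static dot_fn pick_dot(void) {
        __builtin_cpu_init();  /* populate GCC/Clang's CPU model cache */
        /* Feature sniffing would ask __builtin_cpu_supports("sse4.2");
         * an architecture-based dispatcher asks which microarchitecture
         * it is running on and picks a profile tuned for that. */
        if (__builtin_cpu_is("nehalem"))
            return dot_nehalem;
        if (__builtin_cpu_is("core2"))
            return dot_core2;
        return dot_generic;
    }

    int main(void) {
        float a[4] = {1, 2, 3, 4}, b[4] = {1, 1, 1, 1};
        return pick_dot()(a, b, 4) == 10.0f ? 0 : 1;
    }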

Just to be clear, I have heavily used the Intel compiler for back-office financial applications. I'm not just repeating some opinion I happened across. Nor do I have any particular love for Intel.

Further, once you understand that Intel targets specific Intel architectures with every branch path, saying "well, just run it on all things" shows that you simply don't understand the discussion, or the architecture-based dispatcher. Yeah, "just run it" might run perfectly fine, and for a contrived example it might even yield better runtimes, but it can also yield runtime errors or actual performance losses.

As I have repeatedly stated, we should expect great cross-architecture, cross-platform auto-vectorization (including on ARM, which has vectorization via NEON) from the dominant compilers, including GCC, LLVM, and VC. But somehow it always returns to the Intel compiler, nine years after they publicly stated "Yeah, this is for Intel targets".
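To make concrete what I mean by that niche code, here is a sketch of the kind of loop any of those compilers should vectorize on its own; the function is just an example, nothing Intel-specific:

    /* A loop the auto-vectorizers in GCC, LLVM, and VC can all handle.
     * Build with something like "gcc -O3 -march=native" (GCC enables
     * auto-vectorization at -O3) and inspect the assembly to see the
     * vectorized body; on ARM the same source can go through the NEON
     * backend.  Purely an illustration of the regular, niche code that
     * benefits. */
    void saxpy(float *restrict y, const float *restrict x, float a, int n) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }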

^1 - So much so that in almost all of these conversations, the people who complain about the Intel compiler still use it, because it still generates the fastest code for AMD processors, vectorization or not. Which is pretty bizarre, really.


Well, your explanation seems completely at odds with what is currently the top-voted comment in this discussion. The linked discussion of the patch he built to fix the problem indicates that the dispatcher does just do CPU feature detection. Here is the URL for reference:

http://www.swallowtail.org/naughty-intel.shtml

According to that, the code simply does a feature check for SSE, SSE2, and SSE3. Except it also does a check for "GenuineIntel" and treats its absence as "no SSE of any kind" even if the CPU otherwise indicates that it does SSE. That check is completely unnecessary and does nothing but slow (or crash!) the code on non-Intel CPUs.
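In other words, as far as I can tell the dispatch logic being described amounts to roughly the following. This is a sketch reconstructed from that write-up using GCC's <cpuid.h> helpers, not the actual disassembly:

    #include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */
    #include <string.h>

    enum isa_level { ISA_BASELINE, ISA_SSE, ISA_SSE2, ISA_SSE3 };

    static int vendor_is_genuine_intel(void) {
        unsigned eax, ebx, ecx, edx;
        char vendor[13];
        if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
            return 0;
        /* The vendor string is spread across EBX, EDX, ECX in that order. */
        memcpy(vendor + 0, &ebx, 4);
        memcpy(vendor + 4, &edx, 4);
        memcpy(vendor + 8, &ecx, 4);
        vendor[12] = '\0';
        return strcmp(vendor, "GenuineIntel") == 0;
    }

    static enum isa_level detect_isa(void) {
        unsigned eax, ebx, ecx, edx;

        /* The contested part: everything is gated on the vendor string,
         * so a non-Intel CPU falls through to the baseline path even if
         * its CPUID feature bits report SSE/SSE2/SSE3. */
        if (!vendor_is_genuine_intel())
            return ISA_BASELINE;

        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return ISA_BASELINE;
        if (ecx & bit_SSE3) return ISA_SSE3;
        if (edx & bit_SSE2) return ISA_SSE2;
        if (edx & bit_SSE)  return ISA_SSE;
        return ISA_BASELINE;
    }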

If you still think that's wrong, could you post the relevant code to show it?


That link doesn't actually show what it does to determine whether to use SSE or SSE2 (much less SSE3 and beyond). That it derives a boolean value is not the same as feature detection.

Further, the bulk of that entry was from 2004, which is pertinent: at the time the new Pentium 4 was the first Intel processor with SSE2, and the SSE implementation on the Pentium III was somewhat of a disaster -- it was only half width (128-bit operations were simulated through two 64-bit operations, a specific handicap compilers could optimize around for the P3), and it shared resources with the floating-point unit. So the feature flag, coupled with "GenuineIntel", was all they needed to know for the two possible Intel variants with support.

Since then the dispatcher and options have grown dramatically more complex as the number of architectures and permutations have exploded.


Well, here's a complete analysis of the function:

http://publicclu2.blogspot.com/2013/05/analysis-of-intel-com...

Unfortunately, it doesn't show the raw assembly. But in the absence of any information to the contrary, I'm perfectly happy to trust this pseudocode. It shows a bunch of feature checks, preceded by a single "GenuineIntel" check. The code that's gated on "GenuineIntel" would work just fine on non-Intel CPUs. It might sometimes produce sub-optimal results, but overall it'll be fine. There are some CPU family checks, but my understanding is that non-Intel CPUs return the same values that Intel CPUs do for similar architectures/capabilities.
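For reference, those family checks boil down to decoding a couple of fields from CPUID leaf 1; roughly, per the standard decoding (a sketch, not the compiler's actual code):

    #include <cpuid.h>
    #include <stdio.h>

    int main(void) {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return 1;

        unsigned base_family = (eax >> 8)  & 0xf;
        unsigned base_model  = (eax >> 4)  & 0xf;
        unsigned ext_family  = (eax >> 20) & 0xff;
        unsigned ext_model   = (eax >> 16) & 0xf;

        /* Intel's documented decoding: the extended fields only kick in
         * for family 0xF (and, for the model, family 6 as well). */
        unsigned family = (base_family == 0xf) ? base_family + ext_family
                                               : base_family;
        unsigned model  = (base_family == 0xf || base_family == 0x6)
                              ? (ext_model << 4) | base_model
                              : base_model;

        printf("family 0x%x, model 0x%x\n", family, model);
        return 0;
    }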

We have multiple people saying that the code runs faster if the "GenuineIntel" checks are removed, and we have pseudocode for the function in question that shows a bunch of feature detection with a bit of CPU family detection, neither of which is at all Intel-specific. And then we have you, who can't seem to substantiate your claims at all.

If you have actual code or other reasonable evidence to support what you're saying, I'd love to see it. But right now, I'm not buying it.



