The short answer is that they all do it well. For a loop like this compiled with -O3:
for (int i = 0; i < COUNT; i++) {
    compilerSum += randomInt[i]/CONST;
}
ICC 14.0.3 takes 4.1 cycles per iteration, Clang 3.4.1 takes 3.8 cycles, and GCC 4.8.1 manages to spend only 1.3 cycles for each division!
Apparently GCC not only used a shortcut for the division, but also managed to vectorize the loop. Here's the inner loop it produced:
GCC's vectorized loop uses the usual "round-up" method, but we can easily implement the distributive method there by mixing 32- and 64-bit arithmetic:
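In scalar C the idea looks roughly like this (a sketch for d = 7 with the standard magic constant, not the vectorized version):

    #include <stdint.h>

    /* Scalar sketch of the distributive idea for d = 7 (not the vector code):
     * the full multiplier M = 2^32 + m doesn't fit in 32 bits, but
     * floor(M*n / 2^32) = n + floor(m*n / 2^32), so the widening multiply stays
     * 32x32->64 and the leftover "+ n" plus the final shift happen in 64 bits. */
    static uint32_t div7(uint32_t n)
    {
        const uint32_t m = 0x24924925u;                    /* low word of ceil(2^35 / 7) */
        uint32_t hi = (uint32_t)(((uint64_t)m * n) >> 32); /* floor(m*n / 2^32) */
        return (uint32_t)(((uint64_t)n + hi) >> 3);        /* floor(M*n / 2^35) = n/7 */
    }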
The shifts can be further tweaked to allow using a blend instead of a shuffle, in case there's a penalty for crossing from the integer to the floating-point domain. Saturating increment also works well, despite there not being a PADDUSD instruction; it can be done in two cycles:
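One SSE2-only way to build that saturating increment (a sketch, possibly not the exact two-cycle sequence) is to compare against all-ones and add 1 only in the unsaturated lanes:

    #include <emmintrin.h>

    /* Sketch of an SSE2-only saturating increment for unsigned 32-bit lanes
     * (possibly not the exact two-cycle sequence meant above): add 1 to every
     * lane that isn't already 0xFFFFFFFF. */
    static __m128i sat_inc_epu32(__m128i x)
    {
        __m128i ones   = _mm_set1_epi32(-1);
        __m128i is_max = _mm_cmpeq_epi32(x, ones);   /* all-ones where x == UINT32_MAX */
        /* andnot gives ~is_max, i.e. -1 in the unsaturated lanes; x - (-1) = x + 1 */
        return _mm_sub_epi32(x, _mm_andnot_si128(is_max, ones));
    }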
Nice work! It seems to extend fine to AVX2 for doing 8 divides at a time in less than 1 cycle per division, if for some reason you need to divide a whole array by the same constant. That need may be rare, but it's impressive considering how traditionally expensive division has been.
I don't know if you will have a domain penalty for the vshufps, but I think the answer is yes, and that the blend would be better. The blend can also run on p015, versus only p5 for the shufps, so it might pipeline better. I hadn't realized until now that vshufps had different semantics than vpshufd and is able to combine from two different registers. If there is no penalty, this could be quite a useful instruction.
Any reason for using 'w' for your 32-bit elements rather than 'd' for double-word as used in the instruction names?
According to Agner Fog, data domain penalties with SHUFPS are nonexistent, or at least pretty hard to trigger, on Haswell. Previous chips might complain, though. Problem with blends is that they require SSE4.1+, unlike the current version that is fully SSE2 (modulo AVX encoding niceties).
I chose 'w' for no real reason other than thinking of 'word' at the time.
I just tested, and I can now confirm that there is no domain switching penalty for using vshufps versus vpshufd on integer data for XMM and YMM on Haswell, nor for XMM on Sandy Bridge (no YMM comparison possible there because vpshufd requires AVX2).
Nice analysis! There is a similar problem in base-255 pixel math, with a similar solution. You want 255 * 255 = 255 when you multiply pixels, so you are doing divide-by-255 (not-256), which in 8 bits is somewhat "uncooperative".
I wrote up a few solutions many years ago, one of which is Sree Kotay's idea: adding one to the numerator (what this article says but without the awesome proof): http://stereopsis.com/doubleblend.html
It would be great if compilers could do all this for you, with a proof like this to go with it.
I would think the better (symmetric and slightly more accurate) solution would be to add the MSB of the result to the LSB and then round to nearest (by adding 0x80 before shifting)? E.g.:
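Something along these lines (a sketch of what I mean):

    #include <stdint.h>

    /* Sketch: multiply two 8-bit values, add 0x80 to round to nearest, then fold
     * the high byte into the low byte before shifting. Gives round(a*b / 255)
     * exactly for all a, b in [0, 255]. */
    static uint8_t mul_div255_round(uint8_t a, uint8_t b)
    {
        uint32_t t = (uint32_t)a * b + 0x80;
        return (uint8_t)((t + (t >> 8)) >> 8);
    }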
I know many compilers use this optimization. However, I recently used it in a case where the divisor was a variable restricted to a few possible values. I expanded it into a switch statement in which each possible divisor became a compile-time constant, allowing this optimization. The processor in question has no hardware divide, so this made for significant performance gains.
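Roughly like this (a simplified sketch; the divisor set here is just illustrative):

    #include <stdint.h>

    /* Simplified sketch of the switch trick: every case divides by a
     * compile-time constant, so the compiler can emit the multiply-and-shift
     * sequence instead of a slow software divide. */
    static uint32_t div_by_known(uint32_t n, uint32_t d)
    {
        switch (d) {
        case 3:  return n / 3;
        case 5:  return n / 5;
        case 7:  return n / 7;
        case 10: return n / 10;
        default: return n / d;   /* generic (slow) fallback */
        }
    }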
My first question was: how slow is hardware divide? The article didn't answer that, but there are many resources on the net that do (more or less, it's tough to precisely benchmark these very sophisticated CPUs). One paper is at [1].
The general conclusion is that simple operations like add can be done in 1 CPU cycle, and they pipeline easily for a throughput of 3 independent adds per cycle. Modern CPUs also pipeline integer multiply, so while the latency is perhaps 4 cycles, the throughput is one multiply per cycle.
But hardware division can be slow, really really slow. A Pentium 4 might have a latency of 80 cycles for a 32-bit operation and 160 cycles for a 64-bit operation. And there's no pipelining; multiple divides don't appear to be executed in parallel. Even newer CPUs like the Sandy Bridge architecture have a latency of 26/92 cycles for 32/64-bit operations.
So it's not surprising that compilers try to avoid doing hardware divide.
Note that while most compilers already replace fixed-divisor divisions with multiplications, this algorithm presents a more optimized solution for certain "problem" divisors, notably 7.
A compiler I use, based on lcc, did this automatically without even enabling optimizations.
The division is transformed into multiplication and a right shift.
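For instance, a sketch of that kind of transformation for an unsigned divide by 10 (not necessarily the constants that compiler picks):

    #include <stdint.h>

    /* 0xCCCCCCCD is ceil(2^35 / 10), so a widening multiply followed by a right
     * shift of 35 reproduces floor(n / 10) for every 32-bit n. */
    static uint32_t div10(uint32_t n)
    {
        return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
    }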
I don't think the average programmer needs to care about this with modern compilers, but it's nice to know if you're implementing one or coding in asm.
Or if you're dealing with data that isn't compile-time constant, but can be expected to remain unchanged for the duration of something that needs to be optimized all the way.
Compilers have done this sort of thing for a very long time. Here is a well-known paper from 1994 that already tries to do it for more arbitrary constants.
No, they haven't. This extends the algorithm you reference to cover the special case of certain "magic numbers" which don't fit nicely in a machine register. From the article:
It does not seem to appear in the literature, nor is it implemented in gcc, llvm, or icc, so fish is optimistic that it is original.
Actually, looking more carefully at the Fish webpage I think they cite 7 as an example because 7 does not divide 2^32+/-1 (or indeed 2^64+/-1). If it did, the quotient would be a "magic number".
However, 7 does divide 2^33 - 1. This "magic number" does fit in a 32 bit word. It's just that division by 2^33 - 1 isn't as trivial as division by 2^32 +/- 1 (it's not much worse though).
Multiplication by/storage of a 33 bit number also isn't a problem anyway, because you can use the "IEEE trick" of storing the leading 1 bit implicitly. Multiplication by such a number can be done with one multiply and one addition. Not that you actually need to do this for the divisor 7.
The original Montgomery-Granlund work (apparently the basis of GCC division by invariant integers for quite a while) was to always find a quasi-"magic number" which fits in a word. So that was definitely not the innovation of Fish.
The followup paper of Moller-Granlund does even better by using a mullow instead of a full mul or mulhigh, which is usually available sooner on modern processors.
Fish seems to have found some highly optimised sequences. I'm not entirely clear on how new the ideas themselves are. It's clearly not the same as Montgomery-Granlund, but Montgomery-Granlund use (B^2 - 1)/d (where admittedly d has to be first "normalised"), while Fish uses (2^p*B)/d, fiddling the dividend and doing some post-shifting.
I'm not even sure if Fish is better than Moller-Granlund, though the latter was 2009, not 1994.
I don't want to take away from the contribution of Fish. But it just needs to be viewed in the context of rather a lot of previous work.
You don't need the numbers to fit into a register to use the results from the paper I referenced. The same is true of the followup paper https://gmplib.org/~tege/division-paper.pdf
I agree compilers have not used such algorithms for numbers bigger than a register "for a long time". But bignum libraries certainly have.
Note that I did not say "compilers implemented precisely what fish implemented" for a long time, but compilers did "this sort of thing" for a long time.
There's another paper referred to by the original one I cited, namely this one http://dl.acm.org/citation.cfm?id=4990 (1985) which gives lots of algorithms for division by specific integers.
Alverson also developed algorithms for division by more problematic divisors in 1991:
This was in fact implemented in hardware in the 64-bit Tera computer and was the work that inspired Montgomery and Granlund. In about 1994 they implemented their algorithms in gcc.
The important point I was making is that the Montgomery-Granlund work does not rely on "magic numbers" (divisors of 2^B +/- 1).
In fact, in a certain sense, such "algorithms" go back hundreds of years. If you look in Dickson's History of the Theory of Numbers under "divisibility rules", you'll essentially find dozens of rules, which can be adapted for any base (where "base" could mean 2^32, 2^64, 2^128, 2^(32*n), etc.).
Many divisibility rules basically just consist of an algorithm for computing the remainder quickly and computing the quotient quickly, using a small number of additions, subtractions and multiplications by small constants.
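For instance, the base-2^32 analogue of casting out nines gives the remainder modulo 2^32 - 1 as a digit sum (a sketch of the idea):

    #include <stddef.h>
    #include <stdint.h>

    /* Since 2^32 is congruent to 1 (mod 2^32 - 1), a multi-word number is
     * congruent to the sum of its 32-bit "digits", just as a decimal number is
     * congruent to its digit sum mod 9. */
    static uint32_t mod_2pow32_minus_1(const uint32_t *limbs, size_t n)
    {
        uint64_t sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += limbs[i];
        while (sum >> 32)                          /* fold carries back in */
            sum = (sum & 0xFFFFFFFFu) + (sum >> 32);
        if (sum == 0xFFFFFFFFu)                    /* 2^32 - 1 itself is congruent to 0 */
            sum = 0;
        return (uint32_t)sum;
    }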
That's what I was referring to when I said "this sort of thing".
I put the other two here: https://gist.github.com/nkurz/b1c3e01498dd11fda42f
I didn't look closely enough to identify which methods they were using. Can someone else tell?