Faster Unsigned Division by Constants (2011) (ridiculousfish.com)
65 points by simonster on Sept 25, 2014 | 26 comments


I just tested GCC, ICC, and Clang to see how well they perform these optimizations on their own. Code for my test is here: https://gist.github.com/nkurz/6844adbc58a03eb3420f

The short answer is that they all do it well. For a loop like this compiled with -O3:

    for (int i = 0; i < COUNT; i++) {
        compilerSum += randomInt[i]/CONST;
    }
ICC 14.0.3 takes 4.1 cycles per iteration, Clang 3.4.1 takes 3.8 cycles, and GCC 4.8.1 manages to spend only 1.3 cycles for each division!

Apparently GCC not only found a shortcut for the division, but also managed to vectorize the loop. Here's the inner loop it produced:

  4019f0:       66 0f 6f 00             movdqa (%rax),%xmm0
  4019f4:       48 83 c0 10             add    $0x10,%rax
  4019f8:       48 39 c5                cmp    %rax,%rbp
  4019fb:       66 0f 6f c8             movdqa %xmm0,%xmm1
  4019ff:       66 0f 6f e0             movdqa %xmm0,%xmm4
  401a03:       66 0f 62 c8             punpckldq %xmm0,%xmm1
  401a07:       66 0f 6a e0             punpckhdq %xmm0,%xmm4
  401a0b:       66 0f f4 cb             pmuludq %xmm3,%xmm1
  401a0f:       66 0f f4 e3             pmuludq %xmm3,%xmm4
  401a13:       0f c6 cc dd             shufps $0xdd,%xmm4,%xmm1
  401a17:       66 0f fa c1             psubd  %xmm1,%xmm0
  401a1b:       66 0f 72 d0 01          psrld  $0x1,%xmm0
  401a20:       66 0f fe c1             paddd  %xmm1,%xmm0
  401a24:       66 0f 72 d0 02          psrld  $0x2,%xmm0
  401a29:       66 0f fe d0             paddd  %xmm0,%xmm2
  401a2d:       75 c1                   jne    4019f0 <main+0xb0>
I put the other two here: https://gist.github.com/nkurz/b1c3e01498dd11fda42f

I didn't look closely enough to identify which methods they were using. Can someone else tell?


GCC's vectorized loop is using the usual "round-up" method, but we can easily implement the distributive method there by mixing 32- and 64-bit arithmetic:

    ; xmm7 = 0 magic 0 magic
    ; xmm6 = 0
    ; xmm5 = 0 (accumulator)
    division_loop:
      vmovdqu  xmm4, [rax]         ; xmm4 = w3 w2 w1 w0
      
      vpunpckldq xmm0, xmm4, xmm6  ; xmm0 = 0 w1 | 0 w0
      vpunpckhdq xmm2, xmm4, xmm6  ; xmm2 = 0 w3 | 0 w2
      vpmuludq xmm1, xmm0, xmm7    ; xmm1 = w1 * magic | w0 * magic
      vpmuludq xmm3, xmm2, xmm7    ; xmm3 = w3 * magic | w2 * magic
      vpaddq   xmm1, xmm1, xmm0    ; xmm1 = w1 * magic + w1 | w0 * magic + w0
      vpaddq   xmm3, xmm3, xmm2    ; xmm3 = w3 * magic + w3 | w2 * magic + w2
      vpsrlq   xmm1, xmm1, XX      ; XX = shift constant
      vpsrlq   xmm3, xmm3, XX      ; XX = shift constant
      
      vshufps    xmm1, xmm1, xmm3, 10001000b ; back to w3 w2 w1 w0
      vpaddd     xmm5, xmm5, xmm1  ; add to accumulator 
      
      add rax, 16
      cmp rax, rbp
    jnz division_loop
The shifts can be further tweaked to allow using a blend instead of a shuffle, in case there's a penalty for switching between the integer and floating-point domains. Saturating increment also works well, despite there being no PADDUSD instruction; it can be done in two cycles:

      ; xmm6 = -1 -1 -1 -1
      vpcmpeqd xmm3, xmm4, xmm6 ; xmm3[i] = (xmm4[i] == -1 ? -1 : 0) 
      vpsubd   xmm4, xmm4, xmm6 ; xmm4[i] += 1  
      vpaddd   xmm4, xmm4, xmm3 ; xmm4[i] += xmm3[i]


Nice work! Seems like it extends fine to AVX2 for doing 8 divides at a time in less than 1 cycle per division, if for some reason you need to divide a whole array by the same constant. A rare use case, but impressive considering how traditionally expensive division has been.

I don't know if you will have a domain penalty for the vshufps, but I think the answer is yes, and that the blend would be better. The blend can also run on p015, versus only p5 for the shufps, so it might pipeline better. I hadn't realized until now that vshufps had different semantics than vpshufd and is able to combine from two different registers. If there is no penalty, this could be quite a useful instruction.

Any reason for using 'w' for your 32-bit elements rather than 'd' for double-word as used in the instruction names?


According to Agner Fog, data domain penalties with SHUFPS are nonexistent, or at least pretty hard to trigger, on Haswell. Previous chips might complain, though. Problem with blends is that they require SSE4.1+, unlike the current version that is fully SSE2 (modulo AVX encoding niceties).

I chose 'w' for no real reason other than thinking of 'word' at the time.


I just tested, and I can now confirm that there is no domain switching penalty for using vshufps versus vpshufd on integer data for XMM and YMM on Haswell, nor for XMM on Sandy Bridge (no YMM comparison possible there because vpshufd on YMM registers requires AVX2).


Fish got his code committed into clang. I'm not sure about gcc, but it might have copied it from there.


Nice analysis! There is a similar problem in base-255 pixel math, with a similar solution. You want 255 * 255 = 255 when you multiply pixels, so you are doing divide-by-255 (not-256), which in 8 bits is somewhat "uncooperative".

I wrote up a few solutions many years ago, one of which is Sree Kotay's idea: adding one to the numerator (what this article says but without the awesome proof): http://stereopsis.com/doubleblend.html

It would be great if compilers could do all this for you, with a proof like this to go with it.
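The "add one to the numerator" trick for exact divide-by-255 can be sketched in C like this (`div255` and `pixel_mul` are hypothetical names; the identity is exact for the 16-bit products that come out of 8-bit pixel multiplies, i.e. x <= 65534, though I'd defer to the linked write-up for the authoritative version):

    #include <stdint.h>

    /* Exact floor(x / 255) for x in [0, 65534]: add 1 to the numerator
     * and fold the high byte back in, replacing the divide with shifts
     * and adds. */
    static inline uint32_t div255(uint32_t x) {
        return (x + 1 + (x >> 8)) >> 8;
    }

    /* Blend two 8-bit channels: (a * b) / 255, so 255 * 255 == 255. */
    static inline uint8_t pixel_mul(uint8_t a, uint8_t b) {
        return (uint8_t)div255((uint32_t)a * b);
    }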


I would think a better (symmetric and slightly more accurate) solution would be to add the MSB of the result to the LSB and then round to nearest (by adding 0x80 before shifting). E.g.:

    ((0xFF*0xFF = 0xFE01) + 0xFE + 0x80 = 0xFF7F) >> 8 = 0xFF
    ((0xC0*0xC0 = 0x9000) + 0x90 + 0x80 = 0x9110) >> 8 = 0x91
    ((0xB4*0xB5 = 0x7F44) + 0x7F + 0x80 = 0x8043) >> 8 = 0x80
    ((0xB4*0xB4 = 0x7E90) + 0x7E + 0x80 = 0x7F8E) >> 8 = 0x7F
    ((0x80*0x80 = 0x4000) + 0x40 + 0x80 = 0x40C0) >> 8 = 0x40
    ((0x00*0x00 = 0x0000) + 0x00 + 0x80 = 0x0080) >> 8 = 0x00
(To be precise, you should also add the MSB to the right of the LSB, etc., since 1/.FF = 1.01010101..., but now you're getting lost in the noise.)
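The worked examples above can be sketched in C as follows (`div255_round` is a hypothetical name; as the parenthetical above notes, this only approximates true round-to-nearest, since the next-order correction term is dropped, so it can be off by one in rare edge cases):

    #include <stdint.h>

    /* Approximate round-to-nearest x / 255 for a 16-bit product x:
     * add the high byte of x to its low byte, add 0x80 for rounding,
     * then take the high byte. */
    static inline uint32_t div255_round(uint32_t x) {
        return (x + (x >> 8) + 0x80) >> 8;
    }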

EDIT: screwed up some math; thanks gjm11.


But 1/.FF = 1.01010101..., not 1.02030405... .

(Multiplying the RHS by FF gives FF.FFFFFFFFFF etc. = 100. 1.020304... is 1/(.FF)^2.)


Whoops, good catch! Edited.


I know many compilers use this optimization. However, I recently used it for a case where the divisor was a variable but restricted to a few possible values. I expanded it into a switch statement where each possible divisor became a compile-time constant, allowing for this optimization. The processor had no hardware divide, so this made for significant performance gains.
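The shape of that transformation is roughly this (hypothetical divisor set and function name; the point is that each case divides by a literal, so the compiler can emit a multiply-and-shift per branch):

    #include <stdint.h>

    /* d is a runtime variable, but known to be one of a few values.
     * Dividing by a literal in each case lets the compiler replace the
     * divide with a multiply-and-shift -- a big win on cores without a
     * hardware divider. */
    uint32_t div_small_set(uint32_t n, uint32_t d) {
        switch (d) {
        case 3:  return n / 3;
        case 5:  return n / 5;
        case 7:  return n / 7;
        case 10: return n / 10;
        default: return n / d;   /* generic fallback */
        }
    }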


I think all modern compilers do this now. I know the D compiler does:

https://github.com/D-Programming-Language/dmd/blob/master/sr...


Modern compilers do some variant of fast integer division by constants, but I think most use the "round-up" algorithm described in an earlier post in this series (http://ridiculousfish.com/blog/posts/labor-of-division-episo...) and a variety of other places and not the "round-down" algorithm described in this post when the multiplier is a bit too big. You'd know better than I, but it looks like this is the round-up algorithm: https://github.com/D-Programming-Language/dmd/blob/master/sr...
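For a cooperative divisor the round-up algorithm is just a widening multiply and a shift. A sketch for d = 5, using the well-known magic number ceil(2^34 / 5) = 0xCCCCCCCD, which happens to fit in 32 bits:

    #include <stdint.h>

    /* Round-up method: q = floor(n * ceil(2^34 / 5) / 2^34) equals
     * n / 5 for every 32-bit n, because the magic number fits in a
     * 32-bit register. */
    static inline uint32_t div5(uint32_t n) {
        return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 34);
    }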


My first question was: how slow is hardware divide? The article didn't answer that, but there are many resources on the net that do (more or less, it's tough to precisely benchmark these very sophisticated CPUs). One paper is at [1].

The general conclusion is that simple operations like add can be done in 1 CPU cycle. They can be easily pipelined for a throughput of 3 different add operations in 1 cycle. Modern CPUs also pipeline integer multiply so, while the latency is perhaps 4 cycles, the throughput is 1 multiply in 1 CPU cycle.

But hardware division can be slow, really really slow. A Pentium 4 might have a latency of 80 cycles for a 32 bit operation, 160 cycles for a 64 bit operation. And there's no pipelining, multiple divides don't appear to be done in parallel. Even newer CPUs like the Sandy Bridge architecture have a latency of 26/92 cycles for 32/64 bit operations.

So it's not surprising that compilers try to avoid doing hardware divide.

[1] https://gmplib.org/~tege/x86-timing.pdf


I hadn't realized hardware division was that slow, yow.

I suppose that part of this is that we now have 64-bit operations everywhere - the length of the critical path for a 64-bit divide is absurd.


Note that while most compilers already replace fixed-divisor divisions with multiplications, this algorithm presents a more optimized solution for certain "problem" divisors, notably 7.
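For d = 7 the article's round-down variant increments the dividend instead of oversizing the multiplier. A sketch (the magic number floor(2^33 / 7) = 0x49249249 and its exactness condition are specific to this divisor; the article derives when this is safe in general):

    #include <stdint.h>

    /* Round-down method with incremented dividend:
     * q = floor((n + 1) * floor(2^33 / 7) / 2^33) equals n / 7 for all
     * 32-bit n. Widening to 64 bits absorbs the possible n + 1 overflow. */
    static inline uint32_t div7_round_down(uint32_t n) {
        return (uint32_t)((((uint64_t)n + 1) * 0x49249249u) >> 33);
    }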


A compiler I use, based on lcc, did this automatically without even enabling optimizations.

The division is transformed into multiplication and a right shift.

I don't think the usual programmer needs to care about this with modern compilers, but it is nice to know if you are implementing one or you are coding in asm.


Or if you're dealing with data that isn't compile-time constant, but can be expected to remain unchanged for the duration of something that needs to be optimized all the way.


Compilers have done this sort of thing for a very long time. Here is a well-known paper from 1994 that already tries to do it for more arbitrary constants.

https://gmplib.org/~tege/divcnst-pldi94.pdf


No, they haven't. This extends the algorithm you reference to cover the special case of certain "magic numbers" which don't fit nicely in a machine register. From the article:

It does not seem to appear in the literature, nor is it implemented in gcc, llvm, or icc, so fish is optimistic that it is original.


Actually, looking more carefully at the Fish webpage I think they cite 7 as an example because 7 does not divide 2^32+/-1 (or indeed 2^64+/-1). If it did, the quotient would be a "magic number".

However, 7 does divide 2^33 - 1. This "magic number" does fit in a 32 bit word. It's just that division by 2^33 - 1 isn't as trivial as division by 2^32 +/- 1 (it's not much worse though).

Multiplication by/storage of a 33 bit number also isn't a problem anyway, because you can use the "IEEE trick" of storing the leading 1 bit implicitly. Multiplication by such a number can be done with one multiply and one addition. Not that you actually need to do this for the divisor 7.
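A sketch of that implicit-leading-bit multiply: to multiply n by a 33-bit constant M = 2^32 + m, store only the low word m and reconstruct the product with one multiply and one add (the 128-bit type, a GCC/Clang extension, is just for illustration; real code would keep only the high words):

    #include <stdint.h>

    /* Multiply 32-bit n by the 33-bit constant M = 2^32 + m_low,
     * storing only m_low: n * M = (n << 32) + n * m_low,
     * i.e. one multiply and one add. The 65-bit result needs
     * unsigned __int128. */
    static inline unsigned __int128 mul_by_33bit(uint32_t n, uint32_t m_low) {
        return ((unsigned __int128)n << 32) + (unsigned __int128)n * m_low;
    }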

The original Montgomery-Granlund work (apparently the basis of GCC division by invariant integers for quite a while) was to always find a quasi-"magic number" which fits in a word. So that was definitely not the innovation of Fish.

The followup paper of Moller-Granlund does even better by using a mullow instead of a full mul or mulhigh, which is usually available sooner on modern processors.

Fish seems to have found some highly optimised sequences. I'm not entirely clear on how new the ideas themselves are. It's clearly not the same as Montgomery-Granlund, but Montgomery-Granlund use (B^2 - 1)/d (where admittedly d has to be "normalised" first), while Fish uses (2^p*B)/d, fiddles the dividend, and does some post-shifting.

I'm not even sure if Fish is better than Moller-Granlund, though the latter was 2009, not 1994.

I don't want to take away from the contribution of Fish. But it just needs to be viewed in the context of rather a lot of previous work.


Thanks, that's a very detailed analysis.


You don't need the numbers to fit into a register to use the results from the paper I referenced. The same is true of the followup paper https://gmplib.org/~tege/division-paper.pdf

I agree compilers have not used such algorithms for numbers bigger than a register "for a long time". But bignum libraries certainly have.

Note that I did not say "compilers implemented precisely what fish implemented" for a long time, but compilers did "this sort of thing" for a long time.

There's another paper referred to by the one I cited, namely http://dl.acm.org/citation.cfm?id=4990 (1985), which gives lots of algorithms for division by specific integers.

Alverson also developed algorithms for division by more problematic divisors in 1991:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.33.1...

This was in fact implemented in hardware in the 64 bit Terra computer and was the work that inspired Montgomery and Granlund. In about 1994 they implemented their algorithms in gcc.

The important point I was making is that the Montgomery-Granlund work does not rely on "magic numbers" (divisors of 2^B +/- 1).

In fact, in a certain sense, such "algorithms" go back hundreds of years. If you look in Dickson's History of the Theory of Numbers under "divisibility rules", you'll find dozens of rules, which can be adapted for any base (where "base" could mean 2^32, 2^64, 2^128, 2^(32*n), etc.).

Many divisibility rules basically just consist of an algorithm for computing the remainder quickly and computing the quotient quickly, using a small number of additions, subtractions and multiplications by small constants.

That's what I was referring to when I said "this sort of thing".

I give a bunch of such rules here: http://wbhart.github.io/SimpleMath/divisrules.html

where obviously there I give the rules for base 10. But some of them were even originally stated for general bases.


This is from 2011; has it made it into gcc since? Is it part of LLVM now, since they tested with that?


It looks like LLVM trunk still uses a subtraction and a shift for "uncooperative" divisors: https://github.com/llvm-mirror/llvm/blob/1bdc6c4519d9ad254b2...
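That subtraction-and-shift fixup is the classic sequence from Hacker's Delight; for the uncooperative divisor 7 in 32 bits it looks like this sketch:

    #include <stdint.h>

    /* Round-up method for d = 7: the 33-bit magic ceil(2^35 / 7)
     * doesn't fit in a register, so multiply by its low 32 bits
     * (0x24924925), then patch up the missing 2^32 * n term with an
     * overflow-free add: q = (((n - t) >> 1) + t) >> 2. */
    static inline uint32_t div7_fixup(uint32_t n) {
        uint32_t t = (uint32_t)(((uint64_t)n * 0x24924925u) >> 32); /* mulhi */
        return (((n - t) >> 1) + t) >> 2;
    }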




