Can it? I understand it’s always possible to decompose that to if (a[i] > 0) ret...

tylerhou · on Dec 29, 2023

In the branchless version, the CPU has to wait for the comparison to resolve before it can start executing the add for the next loop iteration. However, if the branch is predictable, the CPU can assume the result of the conditional and does not need to wait to find out whether to add one. I wrote a more in-depth comment about why this is true a few months ago: https://news.ycombinator.com/item?id=37245594

If I alter the code slightly to do `result += (a[i] == 0) * 2`, gcc emits a branch if the comparison is predictable: https://godbolt.org/z/df3fsoYK8

Here is a benchmark: https://quick-bench.com/q/NSGHu_wfhrMXp0-pZQp9qybCIok. Note how the branchless version takes the same time for the random and the zeroes vector, while the branch version is faster when the branch is predictable but slower when the branch is not predictable.
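
For readers who don't want to open the links, here is a rough sketch of the two source-level shapes being compared. The function names are made up for illustration and this is not the code behind the godbolt or quick-bench links. Also keep in mind that the compiler is free to turn either form into a branch or into branchless code, so the source shape is only a hint.

```c
#include <stddef.h>

/* Hypothetical example: written with an explicit `if`.
 * A compiler may lower this to a conditional jump, which is cheap
 * when the branch predictor guesses right and expensive when it doesn't. */
size_t count_zeros_branchy(const int *a, size_t n) {
    size_t result = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] == 0) {
            result += 2;
        }
    }
    return result;
}

/* Hypothetical example: written as arithmetic on the comparison result.
 * If this is lowered without a jump, every add depends on the comparison
 * of the same iteration, so there is nothing to mispredict but also
 * nothing to speculate past. */
size_t count_zeros_arith(const int *a, size_t n) {
    size_t result = 0;
    for (size_t i = 0; i < n; i++) {
        result += (size_t)(a[i] == 0) * 2;
    }
    return result;
}
```

Both functions return the same value for the same input; only the generated code (and therefore the timing on predictable versus random data) can differ.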

cesarb · on Dec 29, 2023

> Comparison operations are basic primitives that usually store their result into a register.

In one of the most common processor architectures (the x86 family), comparison operations do store their result into a register, but it's the flags register, which can't be used directly in arithmetic operations. So you have to follow a comparison operation with either a conditional branch or a conditional move (and earlier processors in the x86 family didn't have conditional moves).

> So I guess my question is, while it’s technically possible for a compiler to compile this into a branching operation, under what circumstances would a compiler actually choose to do that, given there isn’t a clear benefit?

It depends on the compiler heuristics and on the surrounding code; for instance, it might decide that "compare; conditional branch over next instruction; increment" is better than "copy to second register; increment second register; compare; conditional move from second register", because it uses one less register (the x86 family is register starved) and one less instruction (relevant when optimizing for size).
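
As a minimal sketch of the kind of code this choice applies to (the function name is hypothetical), a single conditional increment can be lowered either way. The comments restate the two instruction sequences described above; which one a given compiler picks depends on its version, flags, and the surrounding code.

```c
/* Hypothetical example of a conditional increment.
 *
 * Branching lowering (as described above):
 *   compare; conditional branch over next instruction; increment
 *
 * Conditional-move lowering (as described above):
 *   copy to second register; increment second register;
 *   compare; conditional move from second register
 */
int bump_if_positive(int count, int x) {
    if (x > 0) {
        count++;
    }
    return count;
}
```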

xoranth · on Dec 29, 2023

> So you have to follow a comparison operation with either a conditional branch or a conditional move (and earlier processors in the x86 family didn't have conditional moves).

The x86 family has the `setCC` instructions [^1] that move bits from the flags register to a general purpose one. Example from godbolt, see `setg`:

https://c.godbolt.org/z/MY37oP9vz

[^1]: https://www.felixcloutier.com/x86/setcc
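
A small sketch of source code where this tends to show up (function names are hypothetical, and this is not necessarily the code behind the godbolt link): a signed comparison whose 0/1 result is used as a value. On x86-64, compilers commonly lower `a > b` to a `cmp` followed by `setg` and a zero-extension, but the exact output depends on the compiler and flags.

```c
/* Hypothetical example: the comparison result is returned as an int (0 or 1).
 * Signed "greater than" corresponds to the `g` condition code on x86. */
int is_greater(int a, int b) {
    return a > b;
}

/* Hypothetical example: the 0/1 result feeds directly into arithmetic,
 * so no conditional branch or conditional move is needed in principle. */
int add_if_greater(int total, int a, int b) {
    return total + (a > b);
}
```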

gpderetta · on Dec 29, 2023

SETcc is indeed what GCC typically uses. You can also play tricks with the carry flag and ADC but I don't think I have ever seen GCC do it.

xoranth · on Dec 30, 2023

The latest version can [^1], though anecdotally I've seen clang/LLVM being smarter about it.

[^1]: https://c.godbolt.org/z/vP8edfen7
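
For context on the carry-flag trick mentioned above, here is a minimal sketch (the function name is hypothetical, and this is not necessarily the code behind the godbolt link). On x86, `cmp a, b` sets the carry flag exactly when `a < b` as unsigned values, and `adc` adds its operand plus the carry flag, so an unsigned "add one if below" can in principle be done as `cmp` followed by `adc total, 0`. Whether a particular compiler actually emits that is up to its version and flags.

```c
/* Hypothetical example: adds 1 to total when a < b (unsigned).
 * The unsigned comparison is what maps onto the carry flag;
 * a signed comparison would use different flags. */
unsigned add_if_below(unsigned total, unsigned a, unsigned b) {
    return total + (a < b);
}
```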