It’s not terribly bad, because modern CPUs execute instructions out of order. As far as I can tell, there’s no single dependency chain running through all the instructions in the loop body, so some of these FMAs are going to run in parallel in your ISPC version. Still, I would expect manually vectorized code to be slightly faster.
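
To illustrate what I mean by dependency chains, here is a minimal scalar sketch. It is not your ISPC kernel, and the function names `dot_single_chain` and `dot_four_chains` are made up for this example. In the first function every FMA needs the result of the previous one, so the CPU has to wait out the FMA latency each time. In the second there are four independent chains, so an out-of-order core is free to overlap them.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical example: one accumulator, so each FMA depends on the
// result of the previous FMA (a single dependency chain).
float dot_single_chain(const float* a, const float* b, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        acc = std::fma(a[i], b[i], acc);
    return acc;
}

// Hypothetical example: four accumulators, so the four FMAs in the loop
// body do not depend on each other and can be in flight at the same time.
float dot_four_chains(const float* a, const float* b, std::size_t n) {
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 = std::fma(a[i + 0], b[i + 0], acc0);
        acc1 = std::fma(a[i + 1], b[i + 1], acc1);
        acc2 = std::fma(a[i + 2], b[i + 2], acc2);
        acc3 = std::fma(a[i + 3], b[i + 3], acc3);
    }
    // Leftover elements go into the first accumulator.
    for (; i < n; ++i)
        acc0 = std::fma(a[i], b[i], acc0);
    // Combine the partial sums once, at the very end.
    return (acc0 + acc1) + (acc2 + acc3);
}
```

Note that the two versions add the products in a different order, so for general floating-point input the results can differ in the last bits. How much the second form actually helps depends on the FMA latency and throughput of the particular CPU, so measure before relying on it.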