This sounds interesting - could you elaborate further on how rejection sampling ...

mananaysiempre · on June 9, 2022

I mean, as far as correctness as concerned, of course it works, but performance-wise in a raytracer I was toying around with it was a dismal failure compared to the (traditionally shunned) analytic approach.

The issue, as far as I understand it, is that LuaJIT only handles traces of two shapes: linear code; or linear loop prologue followed by linear loop body. The conditions of any branches (that could not be determined statically) are evaluated and checked against their tracing-time values, but a failed check (“guard”) throws you out of the compiled trace and back into the interpreter. This includes normal termination of a loop and misspeculated types as well as explicit conditionals.

Of course, handling only these two patterns of control flow is too limiting, so if a guard fails too often the trace compiler can start a new trace from that exit and tie its other end to the original one if it comes back. Aside from handling things like polymorphic functions which get called for more than one combination of types and predictable branches in loops that still get taken occasionally (but too rarely for unrolling), this also turns out to handle nested loops (as explained somewhere in the docs): the inner loop get traced first as a loop, then the outer loop body gets traced as a linear block that ties the end of the inner one back to its beginning, going around from the middle of the outer loop to its end, then from the beginning of its next iteration to the middle.

The catch is that register allocation or optimizations can’t see across trace boundaries, so those side traces get second-class treatment (in particular, only the inner loop gets subjected to loop-invariant code hoisting and strength reduction, as the outer ones are not viewed as loops). Additionally, the part that decides what to trace and how (is it a loop, is it hot, should it be unrolled, etc.) is an opaque pile of heuristics which is generally well tuned, but gets more confused and slower to converge as the nesting level increases.

Turns out having your innermost loop be rejection sampling which usually terminates quickly (under the unrolling threshold) but with a varying number of iterations, in a raytracer that’s already bound to have two or three levels of nested loops, plays merry hell on all of this. Occasionally LuaJIT couldn’t eliminate the GC in the now-“outer” loop and ran 5x slower, occasionally it just fell back to the interpreter and ran 20–100x slower, but generally I think that counts as a failure when the motivation is to avoid libm because it’s slow.

(For similar reasons, the small-vector library used there is almost entirely sad repetitive code along the lines of

  c.x = f(a.x, b.x)
  if one component then return c end
  c.y = f(a.y, b.y)
  if two components then return c end
  ...

because that does not count against the thresholds when you have dozens of these operations per actual iteration.)

ufo · on June 10, 2022

This is a cool story! Is that vector library public or is it an internal project? I'd love to take a look at the rest of it...

mananaysiempre · on June 10, 2022

The raytracer itself is not in a state I’d want to inflict on anybody. The vector library is just HLSL-style small vectors with subpar error reporting, there’s like half a dozen of those floating around, but if you want it, sure, go wild: http://ix.io/3ZQk (CC0). I’d only ask you to hold off on posting it to LuaRocks under that name because I want to do that myself eventually.

ufo · on June 10, 2022

Thanks! Sure, I'm not planning on reposting the code, I just collect LuaJIT stories.