> The problem is 95% about laying out the instruction dispatching code for the branch predictor to work optimally.
A fun fact I learned while writing this post is that that's no longer true! Modern branch predictors can predict through a single indirect jump pretty accurately, as long as the run is long enough and the interpreted code itself has stable behavior!
Here's a paper that studied this (for both real hardware and a simulated branch predictor): https://inria.hal.science/hal-01100647/document
My experiments on this project anecdotally agree; they didn't make it into the post, but I also explored a few of the interpreters through hardware CPU counters and `perf stat`, and branch misprediction never showed up as a dominant factor.
Yes, this was already becoming true around the time I was writing the linked article. And I also read the paper. :-) I also remember I had access to a pre-Haswell era Intel CPUs vs something a bit more recent, and could see that the more complicated dispatcher no longer made as much sense.
Conclusion: the rise of popular interpreter-based languages led to CPUs with smarter branch predictors.
How do you reconcile that with the observation that moving to a computed-goto style provides better codegen in Zig[1]? They make the claim that using their “labeled switch” (which is essentially computed goto) gives you multiple dispatch branches, which improves branch-predictor performance. They even get a 13% speedup in their parser from moving from a plain switch to this style. If modern CPUs are good at predicting through a single branch, I wouldn't expect this feature to make any difference.
While it's unlikely to be as neat as this, the blog post we're all commenting on is an "I thought we had a 10-15% speedup, but it turned out to be an LLVM optimisation misbehaving" story. And Zig (for now) uses LLVM for optimised builds too.
(post author here) Oh, this is something I could have called out explicitly: the tail-calling interpreter relies on a feature (the `preserve_none` calling convention) that only landed in clang-19, which means you can only test it on that version. That coincidence (that clang-19 added both this feature and the regression) is part of why this was so easy to miss at first, and why I had to "triangulate" with so many different benchmarks to be confident I understood what was going on.
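For anyone trying to picture the two dispatch shapes being discussed in this thread, here's a minimal sketch of a toy three-opcode VM (hypothetical opcodes, in Rust rather than the C most production interpreters use): the first style funnels every opcode through one central indirect branch, while the second re-dispatches inside each handler, which is the shape that computed goto, Zig's labeled switch, and the tail-calling interpreter all aim to produce.
```
#[derive(Clone, Copy)]
enum Op {
    Inc,
    Dec,
    Halt,
}

// Style 1: a single dispatch point. Every opcode transfers control through
// the same match, so the CPU sees one indirect branch for the whole loop.
fn run_switch(code: &[Op]) -> i64 {
    let mut acc = 0i64;
    let mut pc = 0usize;
    loop {
        match code[pc] {
            Op::Inc => { acc += 1; pc += 1; }
            Op::Dec => { acc -= 1; pc += 1; }
            Op::Halt => return acc,
        }
    }
}

// Style 2: each handler looks up and calls the next handler itself, so every
// opcode body ends in its own indirect call site, letting the predictor learn
// per-opcode patterns. Note that Rust does not guarantee these become tail
// calls; this only illustrates the code shape.
type Handler = fn(&[Op], usize, i64) -> i64;

const HANDLERS: [Handler; 3] = [op_inc, op_dec, op_halt];

fn op_inc(code: &[Op], pc: usize, acc: i64) -> i64 {
    HANDLERS[code[pc + 1] as usize](code, pc + 1, acc + 1)
}
fn op_dec(code: &[Op], pc: usize, acc: i64) -> i64 {
    HANDLERS[code[pc + 1] as usize](code, pc + 1, acc - 1)
}
fn op_halt(_code: &[Op], _pc: usize, acc: i64) -> i64 {
    acc
}

fn run_threaded(code: &[Op]) -> i64 {
    HANDLERS[code[0] as usize](code, 0, 0)
}

fn main() {
    let code = [Op::Inc, Op::Inc, Op::Dec, Op::Halt];
    assert_eq!(run_switch(&code), 1);
    assert_eq!(run_threaded(&code), 1);
    println!("both dispatch styles agree");
}
```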
I actually did a v0 writeup in Go, but I wanted a language that had a bit more powerful type system and more support for a fluent/functional style in some of the expressions. I was optimizing for what I know and felt was highly expressive; I've since gotten a bunch of feedback from people who are interested in the content but not comfortable reading Rust, so perhaps it wasn't the best choice.
What does the `.0` calling convention mean in Rust? e.g.
```
for (i, r) in right.0.iter().enumerate() {
    out.0[i] += r;
}
```
Aside from that, I thought it was fairly legible. Great write-up by the way. Squashing things into state helps get rid of some of the spookiness created by matrix multiplication and back-propagation. I also really appreciated seeing the explanation on the actual MLP part of the transformer as that is typically assumed to be prior knowledge in other tutorials.
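For what it's worth, `.0` isn't a calling convention; it's how Rust accesses the fields of a tuple struct (the common "newtype" pattern). A minimal sketch with a hypothetical wrapper type, not necessarily the post's actual definition:
```
// Hypothetical newtype wrapper around a vector of values; the single
// unnamed field is accessed as `.0`.
struct Tensor(Vec<f32>);

fn add_assign(out: &mut Tensor, right: &Tensor) {
    // Same shape as the snippet above: `.0` reaches into the wrapped Vec.
    for (i, r) in right.0.iter().enumerate() {
        out.0[i] += r;
    }
}

fn main() {
    let mut out = Tensor(vec![1.0, 2.0, 3.0]);
    let right = Tensor(vec![0.5, 0.5, 0.5]);
    add_assign(&mut out, &right);
    assert_eq!(out.0, vec![1.5, 2.5, 3.5]);
}
```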
We used precisely this optimization in [sorbet](https://sorbet.org), a brand-new type checker for Ruby, which also contains a high-performance LSP server.
We wrote the entire thing (and tested it, using ASAN, fuzzers, and other techniques) to avoid leaking memory, and then strategically inserted [the equivalent of a Rust `mem::forget`](https://github.com/sorbet/sorbet/blob/0aae56e73c7680ec6053b3...) at the end of the `main` driver in standalone mode, to avoid calling those destructors when we're about to exit anyways.
This optimization is definitely still relevant for new systems today.
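For anyone who hasn't seen the trick in Rust form, here's a minimal sketch (hypothetical types, not Sorbet's actual C++): build whatever large, destructor-heavy state the run needs, then deliberately forget it at the end of `main` so its destructors never run before the process exits anyway.
```
use std::mem;

// Hypothetical stand-in for a large, destructor-heavy structure
// (in Sorbet's case, the C++ ASTs and symbol tables).
struct AnalysisState {
    symbols: Vec<String>,
    files: Vec<Vec<u8>>,
}

fn run(state: &AnalysisState) {
    println!(
        "checked {} files, {} symbols",
        state.files.len(),
        state.symbols.len()
    );
}

fn main() {
    let state = AnalysisState {
        symbols: (0..100_000).map(|i| format!("sym{i}")).collect(),
        files: vec![vec![0u8; 4096]; 1_000],
    };

    run(&state);

    // We're about to exit: skip the (potentially expensive) recursive Drop
    // of all those Vecs and Strings and let the OS reclaim the memory.
    mem::forget(state);
}
```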
Better yet, just run this on a t2.micro on a throwaway EC2 account. Doesn't matter if they own the box, they get literally nothing they couldn't get for free from Amazon anyways.
Zero-width matches (and empty lines) were a huge source of stupid edge-case bugs in livegrep[1], and being rigorous about maintaining this mental model definitely helped a lot.
I get into this a bit later on, but I think the exact same model applies to pointers: in most cases you're much better off thinking of pointers as pointing at the zero-width points between elements rather than at the elements themselves.
I'm also having trouble meshing this way of thinking about indexes with the idea of a fixed bit-width, discretely addressable RAM, which would suggest that there is nothing "between" two storage elements.
I find it very useful, however, for imagining what the returned insertion-point index of a binary search would mean when the item you are looking for cannot be found.
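Rust's standard library makes this concrete: `slice::binary_search` reports a failed lookup as `Err(i)`, where `i` is exactly one of those zero-width positions between (or before/after) the elements.
```
fn main() {
    let xs = [10, 20, 30, 40];

    // 25 isn't present; the Err value is the gap between xs[1] and xs[2].
    assert_eq!(xs.binary_search(&25), Err(2));

    // The same index works directly as an insertion point.
    let mut v = xs.to_vec();
    if let Err(i) = v.binary_search(&25) {
        v.insert(i, 25);
    }
    assert_eq!(v, [10, 20, 25, 30, 40]);

    // Boundary cases: "before everything" and "after everything" are
    // also zero-width positions (0 and len()).
    assert_eq!(xs.binary_search(&5), Err(0));
    assert_eq!(xs.binary_search(&99), Err(4));
}
```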
Isn't it more obvious that the indices are the addresses of the locations if the boxes are arranged vertically? Besides, the boxes should contain their values, not their addresses; an address is attached to a box, and to one box only. Better off indeed thinking of pointers as living in completely different boxes, with their values pointing at the zero-width sides of other boxes :)
This is a great question, and it's definitely a problem we have.
We don't have a single answer we use for every system we work on, but we employ a few common patterns, ranging from just keeping hard-coded strings containing the expected output, up to and including implementing our own fake versions of external infrastructure. We have, for example, our own faked ISO-8583 [1] authorization service, which some of our tests run against to get a degree of end-to-end testing.
Back-testing is also incredibly valuable: We have repositories of every conversation or transaction we've ever exchanged with the banking networks, and when making changes to parsers or interpreters, we can compare their output against the old version on all of that historical data.
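As a rough sketch of what that kind of back-testing harness can look like (hypothetical `parse_v1`/`parse_v2` functions and a made-up message format, not the actual system): replay every archived message through both parser versions and flag any divergence.
```
// Hypothetical parsed representation and two parser versions.
#[derive(Debug, PartialEq)]
struct Parsed {
    message_type: String,
    amount_cents: u64,
}

// The "old", known-good parser.
fn parse_v1(raw: &str) -> Option<Parsed> {
    let (mtype, amount) = raw.split_once('|')?;
    Some(Parsed {
        message_type: mtype.to_string(),
        amount_cents: amount.parse().ok()?,
    })
}

// The "new" parser under test; in a real back-test this is the changed code.
fn parse_v2(raw: &str) -> Option<Parsed> {
    let mut parts = raw.splitn(2, '|');
    let mtype = parts.next()?;
    let amount = parts.next()?;
    Some(Parsed {
        message_type: mtype.trim().to_string(),
        amount_cents: amount.trim().parse().ok()?,
    })
}

fn main() {
    // Stand-in for the historical corpus (in reality, read from the archive).
    let corpus = ["0100|12500", "0110|12500", "0100| 9900"];

    let mut mismatches = 0;
    for raw in corpus {
        let (old, new) = (parse_v1(raw), parse_v2(raw));
        if old != new {
            mismatches += 1;
            println!("divergence on {raw:?}: old={old:?} new={new:?}");
        }
    }
    println!("{} / {} messages diverged", mismatches, corpus.len());
}
```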
>Back-testing is also incredibly valuable: We have repositories of every conversation or transaction we've ever exchanged with the banking networks, and when making changes to parsers or interpreters, we can compare their output against the old version on all of that historical data.
Are you referring to test data or actual live transaction data? The latter would seem like a huge liability and target for hackers.
Live data, but they're stored redacted, and/or with sensitive data (e.g. credit card numbers) replaced with opaque tokens that reference an encrypted store that's carefully access-controlled.
Do you have any policy or decision on how long you plan to store that data? What I'm wondering is: are the transactions currently "stored indefinitely"? (I'm referring to both data stores, the tokenized one and the encrypted one.)
> ranging from just keeping hard-coded strings containing the expected output, up to and including implementing our own fake versions of external infrastructure.
This sounds very familiar; we rely on external credit systems pretty heavily. We started by mocking service responses and including the response XML in our unit tests. Now we have a service simulator that returns expected values and has record/playback capability. It's not ideal, and responses get outdated occasionally, but we haven't found a more elegant way to handle it yet.
Our test-running infrastructure spins up a pool of database instances on each worker machine, one for each worker process. The test spin-up and teardown code handles schema management, hooking into our DB access layer to create and clean up database tables only if they're used by a given test.
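A rough sketch of the per-worker wiring (hypothetical env-var name and naming scheme, not the actual harness): each worker process derives its own database name, and a test only sets up and tears down the tables it declares.
```
use std::env;

// Each test declares which tables it touches; setup/teardown only handle those.
struct TestDb {
    db_name: String,
    tables: Vec<&'static str>,
}

impl TestDb {
    fn new(tables: &[&'static str]) -> Self {
        // Hypothetical: the test runner exports a distinct id per worker process.
        let worker = env::var("TEST_WORKER_ID").unwrap_or_else(|_| "0".to_string());
        let db = TestDb {
            db_name: format!("test_db_{worker}"),
            tables: tables.to_vec(),
        };
        for t in &db.tables {
            // In the real harness this would issue CREATE TABLE against the
            // worker's database; here we just print the statement.
            println!("CREATE TABLE {}.{} (...)", db.db_name, t);
        }
        db
    }
}

impl Drop for TestDb {
    fn drop(&mut self) {
        for t in &self.tables {
            println!("DROP TABLE {}.{}", self.db_name, t);
        }
    }
}

fn main() {
    // A test that only needs the `transactions` table pays only for that table.
    let _db = TestDb::new(&["transactions"]);
    // ... run the test body against _db.db_name ...
}
```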
Why do we think the blur is sufficient? The blur loses information, but there are presumably a small number of watermarked versions and a lot of frames. It's almost certainly possible for HBO to reverse-engineer the blur algorithm (I'd bet it's a stock blur option in a common tool), then run it over all the originals and see which one produces the best post-blur match.
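A toy sketch of that matching step, with 1-D signals standing in for frames and a simple box blur standing in for whatever filter was actually used (all names hypothetical): blur each candidate watermarked original and pick the one closest to the leaked copy.
```
// Toy 1-D "frames"; a real pipeline would do this on images.
fn box_blur(frame: &[f64], radius: usize) -> Vec<f64> {
    (0..frame.len())
        .map(|i| {
            let lo = i.saturating_sub(radius);
            let hi = (i + radius + 1).min(frame.len());
            frame[lo..hi].iter().sum::<f64>() / (hi - lo) as f64
        })
        .collect()
}

// Sum of squared differences as a crude similarity score.
fn distance(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum()
}

fn main() {
    // Hypothetical per-recipient watermarked originals: same content,
    // slightly different embedded marks.
    let originals = vec![
        vec![1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0],
        vec![1.0, 2.0, 3.5, 4.0, 5.0, 4.0, 3.0],
        vec![1.0, 2.0, 3.0, 4.0, 5.5, 4.0, 3.0],
    ];

    // The leak: original #1 run through the (guessed) blur.
    let leaked = box_blur(&originals[1], 1);

    // Blur every candidate with the same guessed parameters and score it
    // against the leak; over many frames the right recipient stands out.
    let scores: Vec<f64> = originals
        .iter()
        .map(|orig| distance(&box_blur(orig, 1), &leaked))
        .collect();

    let best = scores
        .iter()
        .enumerate()
        .min_by(|(_, a), (_, b)| a.partial_cmp(b).unwrap())
        .map(|(i, _)| i);

    println!("best match: original {:?}", best); // expect Some(1)
}
```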