> The problem is 95% about laying out the instruction dispatching code for the branch predictor to work optimally.
A fun fact I learned while writing this post is that that's no longer true! Modern branch predictors can predict through a single indirect jump pretty accurately, as long as the run is long enough and the interpreted code itself has stable behavior!
Here's a paper that studied this (for both real hardware and a simulated branch predictor): https://inria.hal.science/hal-01100647/document
My experiments on this project anecdotally agree; they didn't make it into the post, but I also explored a few of the interpreters through hardware CPU counters and `perf stat`, and branch misprediction never showed up as a dominant factor.
Yes, this was already becoming true around the time I was writing the linked article. And I also read the paper. :-) I also remember I had access to a pre-Haswell era Intel CPUs vs something a bit more recent, and could see that the more complicated dispatcher no longer made as much sense.
Conclusion: the rise of popular interpreter-based languages led to CPUs with smarter branch predictors.
How do you reconcile that with the observation that moving to a computed-goto style provides better codegen in Zig[1]? They make the claim that using their “labeled switch” (which is essentially computed goto) gives you multiple dispatch branches, which improves branch-predictor performance. They even get a 13% speedup in their parser from moving from a plain switch to this style. If modern CPUs are good at predicting through a single branch, I wouldn't expect this feature to make any difference.
While it's unlikely to be as neat as this, the blog post we're all commenting on is an "I thought we had a 10-15% speedup, but it turned out to be an LLVM optimisation misbehaving" story. And Zig (for now) uses LLVM for optimised builds too.
(post author here) Oh, this is something I could have called out explicitly: the tail-calling interpreter relies on a feature (the `preserve_none` calling convention) that only landed in clang-19, which means you can only test it on that version. That coincidence (that clang-19 added both this feature and the regression) is part of why this was so easy to miss at first, and why I had to "triangulate" with so many different benchmarks to be confident I understood what was going on.
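For anyone trying to picture the two dispatch shapes being discussed in this thread, here's a minimal sketch of a toy three-opcode VM (hypothetical opcodes, in Rust rather than the C most production interpreters use): the first style funnels every opcode through one central indirect branch, while the second re-dispatches inside each handler, which is the shape that computed goto, Zig's labeled switch, and the tail-calling interpreter all aim to produce.
```
#[derive(Clone, Copy)]
enum Op {
    Inc,
    Dec,
    Halt,
}

// Style 1: a single dispatch point. Every opcode transfers control through
// the same match, so the CPU sees one indirect branch for the whole loop.
fn run_switch(code: &[Op]) -> i64 {
    let mut acc = 0i64;
    let mut pc = 0usize;
    loop {
        match code[pc] {
            Op::Inc => { acc += 1; pc += 1; }
            Op::Dec => { acc -= 1; pc += 1; }
            Op::Halt => return acc,
        }
    }
}

// Style 2: each handler looks up and calls the next handler itself, so every
// opcode body ends in its own indirect call site, letting the predictor learn
// per-opcode patterns. Note that Rust does not guarantee these become tail
// calls; this only illustrates the code shape.
type Handler = fn(&[Op], usize, i64) -> i64;

const HANDLERS: [Handler; 3] = [op_inc, op_dec, op_halt];

fn op_inc(code: &[Op], pc: usize, acc: i64) -> i64 {
    HANDLERS[code[pc + 1] as usize](code, pc + 1, acc + 1)
}
fn op_dec(code: &[Op], pc: usize, acc: i64) -> i64 {
    HANDLERS[code[pc + 1] as usize](code, pc + 1, acc - 1)
}
fn op_halt(_code: &[Op], _pc: usize, acc: i64) -> i64 {
    acc
}

fn run_threaded(code: &[Op]) -> i64 {
    HANDLERS[code[0] as usize](code, 0, 0)
}

fn main() {
    let code = [Op::Inc, Op::Inc, Op::Dec, Op::Halt];
    assert_eq!(run_switch(&code), 1);
    assert_eq!(run_threaded(&code), 1);
    println!("both dispatch styles agree");
}
```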
I actually did a v0 writeup in Go, but I wanted a language that had a bit more powerful type system and more support for a fluent/functional style in some of the expressions. I was optimizing for what I know and felt was highly expressive; I've since gotten a bunch of feedback from people who are interested in the content but not comfortable reading Rust, so perhaps it wasn't the best choice.
What does the `.0` calling convention mean in Rust? e.g.
```
for (i, r) in right.0.iter().enumerate() {
    out.0[i] += r;
}
```
Aside from that, I thought it was fairly legible. Great write-up by the way. Squashing things into state helps get rid of some of the spookiness created by matrix multiplication and back-propagation. I also really appreciated seeing the explanation on the actual MLP part of the transformer as that is typically assumed to be prior knowledge in other tutorials.
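For what it's worth, `.0` isn't a calling convention; it's how Rust accesses the fields of a tuple struct (the common "newtype" pattern). A minimal sketch with a hypothetical wrapper type, not necessarily the post's actual definition:
```
// Hypothetical newtype wrapper around a vector of values; the single
// unnamed field is accessed as `.0`.
struct Tensor(Vec<f32>);

fn add_assign(out: &mut Tensor, right: &Tensor) {
    // Same shape as the snippet above: `.0` reaches into the wrapped Vec.
    for (i, r) in right.0.iter().enumerate() {
        out.0[i] += r;
    }
}

fn main() {
    let mut out = Tensor(vec![1.0, 2.0, 3.0]);
    let right = Tensor(vec![0.5, 0.5, 0.5]);
    add_assign(&mut out, &right);
    assert_eq!(out.0, vec![1.5, 2.5, 3.5]);
}
```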
We used precisely this optimization in [sorbet](https://sorbet.org), a brand-new type checker for Ruby, which also contains a high-performance LSP server.
We wrote the entire thing (and tested it, using ASAN, fuzzers, and other techniques) to avoid leaking memory, and then strategically inserted [the equivalent of a Rust `mem::forget`](https://github.com/sorbet/sorbet/blob/0aae56e73c7680ec6053b3...) at the end of the `main` driver in standalone mode, to avoid calling those destructors when we're about to exit anyways.
This optimization is definitely still relevant for new systems today.
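For anyone who hasn't seen the trick in Rust form, here's a minimal sketch (hypothetical types, not Sorbet's actual C++): build whatever large, destructor-heavy state the run needs, then deliberately forget it at the end of `main` so its destructors never run before the process exits anyway.
```
use std::mem;

// Hypothetical stand-in for a large, destructor-heavy structure
// (in Sorbet's case, the C++ ASTs and symbol tables).
struct AnalysisState {
    symbols: Vec<String>,
    files: Vec<Vec<u8>>,
}

fn run(state: &AnalysisState) {
    println!(
        "checked {} files, {} symbols",
        state.files.len(),
        state.symbols.len()
    );
}

fn main() {
    let state = AnalysisState {
        symbols: (0..100_000).map(|i| format!("sym{i}")).collect(),
        files: vec![vec![0u8; 4096]; 1_000],
    };

    run(&state);

    // We're about to exit: skip the (potentially expensive) recursive Drop
    // of all those Vecs and Strings and let the OS reclaim the memory.
    mem::forget(state);
}
```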
Better yet, just run this on a t2.micro on a throwaway EC2 account. Doesn't matter if they own the box, they get literally nothing they couldn't get for free from Amazon anyways.
Zero-width matches (and empty lines) were a huge source of stupid edge-case bugs in livegrep[1], and being rigorous about maintaining this mental model definitely helped a lot.
I get into this a bit later on, but I think the exact same model applies to pointers: in most cases you're much better off thinking of pointers as pointing at the zero-width points between elements rather than at the elements themselves.
I'm also having trouble meshing this way of thinking about indexes with the idea of a fixed bit-width, discretely addressable RAM, which would suggest that there is nothing "between" two storage elements.
I find it very useful, however, for imagining what the returned insertion-point index of a binary search would mean when the item you are looking for cannot be found.
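Rust's standard library makes this concrete: `slice::binary_search` reports a failed lookup as `Err(i)`, where `i` is exactly one of those zero-width positions between (or before/after) the elements.
```
fn main() {
    let xs = [10, 20, 30, 40];

    // 25 isn't present; the Err value is the gap between xs[1] and xs[2].
    assert_eq!(xs.binary_search(&25), Err(2));

    // The same index works directly as an insertion point.
    let mut v = xs.to_vec();
    if let Err(i) = v.binary_search(&25) {
        v.insert(i, 25);
    }
    assert_eq!(v, [10, 20, 25, 30, 40]);

    // Boundary cases: "before everything" and "after everything" are
    // also zero-width positions (0 and len()).
    assert_eq!(xs.binary_search(&5), Err(0));
    assert_eq!(xs.binary_search(&99), Err(4));
}
```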
Isn't it more obvious that the indices are the addresses of the locations if the boxes are arranged vertically? Besides, the boxes should contain their values, not their addresses; an address is attached to a box, and to one box only. Better off indeed thinking of pointers as living in completely different boxes, with their values pointing at the zero-width sides of other boxes :)
This is a great question, and it's definitely a problem we have.
We don't have a single answer we use for every system we work on, but we employ a few common patterns, ranging from just keeping hard-coded strings containing the expected output, up to and including implementing our own fake versions of external infrastructure. We have, for example, our own faked ISO-8583 [1] authorization service, which some of our tests run against to get a degree of end-to-end testing.
Back-testing is also incredibly valuable: We have repositories of every conversation or transaction we've ever exchanged with the banking networks, and when making changes to parsers or interpreters, we can compare their output against the old version on all of that historical data.
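As a rough sketch of what that kind of back-testing harness can look like (hypothetical `parse_v1`/`parse_v2` functions and a made-up message format, not the actual system): replay every archived message through both parser versions and flag any divergence.
```
// Hypothetical parsed representation and two parser versions.
#[derive(Debug, PartialEq)]
struct Parsed {
    message_type: String,
    amount_cents: u64,
}

// The "old", known-good parser.
fn parse_v1(raw: &str) -> Option<Parsed> {
    let (mtype, amount) = raw.split_once('|')?;
    Some(Parsed {
        message_type: mtype.to_string(),
        amount_cents: amount.parse().ok()?,
    })
}

// The "new" parser under test; in a real back-test this is the changed code.
fn parse_v2(raw: &str) -> Option<Parsed> {
    let mut parts = raw.splitn(2, '|');
    let mtype = parts.next()?;
    let amount = parts.next()?;
    Some(Parsed {
        message_type: mtype.trim().to_string(),
        amount_cents: amount.trim().parse().ok()?,
    })
}

fn main() {
    // Stand-in for the historical corpus (in reality, read from the archive).
    let corpus = ["0100|12500", "0110|12500", "0100| 9900"];

    let mut mismatches = 0;
    for raw in corpus {
        let (old, new) = (parse_v1(raw), parse_v2(raw));
        if old != new {
            mismatches += 1;
            println!("divergence on {raw:?}: old={old:?} new={new:?}");
        }
    }
    println!("{} / {} messages diverged", mismatches, corpus.len());
}
```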
>Back-testing is also incredibly valuable: We have repositories of every conversation or transaction we've ever exchanged with the banking networks, and when making changes to parsers or interpreters, we can compare their output against the old version on all of that historical data.
Are you referring to test data or actual live transaction data? The latter would seem like a huge liability and target for hackers.
Live data, but they're stored redacted, and/or with sensitive data (e.g. credit card numbers) replaced with opaque tokens that reference an encrypted store that's carefully access-controlled.
Do you have any policy or decision on how long you plan to store that data? What I'm wondering is: are the transactions currently "stored indefinitely"? (I'm referring to both data stores, the tokenized one and the encrypted one.)
> ranging from just keeping hard-coded strings containing the expected output, up to and including implementing our own fake versions of external infrastructure.
This sounds very familiar; we rely on external credit systems pretty heavily. We started by mocking service responses and including the response XML in our unit tests. Now we have a service simulator that returns expected values and has record/playback capability. It's not ideal, and responses get outdated occasionally, but we haven't found a more elegant way to handle it yet.
Our test-running infrastructure spins up a pool of database instances on each worker machine, one for each worker process. The test spin-up and teardown code handles schema management, hooking into our DB access layer to create and clean up database tables only if they're used by a given test.
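A rough sketch of the per-worker wiring (hypothetical env-var name and naming scheme, not the actual harness): each worker process derives its own database name, and a test only sets up and tears down the tables it declares.
```
use std::env;

// Each test declares which tables it touches; setup/teardown only handle those.
struct TestDb {
    db_name: String,
    tables: Vec<&'static str>,
}

impl TestDb {
    fn new(tables: &[&'static str]) -> Self {
        // Hypothetical: the test runner exports a distinct id per worker process.
        let worker = env::var("TEST_WORKER_ID").unwrap_or_else(|_| "0".to_string());
        let db = TestDb {
            db_name: format!("test_db_{worker}"),
            tables: tables.to_vec(),
        };
        for t in &db.tables {
            // In the real harness this would issue CREATE TABLE against the
            // worker's database; here we just print the statement.
            println!("CREATE TABLE {}.{} (...)", db.db_name, t);
        }
        db
    }
}

impl Drop for TestDb {
    fn drop(&mut self) {
        for t in &self.tables {
            println!("DROP TABLE {}.{}", self.db_name, t);
        }
    }
}

fn main() {
    // A test that only needs the `transactions` table pays only for that table.
    let _db = TestDb::new(&["transactions"]);
    // ... run the test body against _db.db_name ...
}
```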
Why do we think the blur is sufficient? The blur loses information, but there are presumably a small number of watermarked versions and a lot of frames. It's almost certainly possible for HBO to reverse-engineer the blur algorithm (I'd bet it's a stock blur option in a common tool), then run it over all the originals and see which one produces the best post-blur match.
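A toy sketch of that matching step, with 1-D signals standing in for frames and a simple box blur standing in for whatever filter was actually used (all names hypothetical): blur each candidate watermarked original and pick the one closest to the leaked copy.
```
// Toy 1-D "frames"; a real pipeline would do this on images.
fn box_blur(frame: &[f64], radius: usize) -> Vec<f64> {
    (0..frame.len())
        .map(|i| {
            let lo = i.saturating_sub(radius);
            let hi = (i + radius + 1).min(frame.len());
            frame[lo..hi].iter().sum::<f64>() / (hi - lo) as f64
        })
        .collect()
}

// Sum of squared differences as a crude similarity score.
fn distance(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum()
}

fn main() {
    // Hypothetical per-recipient watermarked originals: same content,
    // slightly different embedded marks.
    let originals = vec![
        vec![1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0],
        vec![1.0, 2.0, 3.5, 4.0, 5.0, 4.0, 3.0],
        vec![1.0, 2.0, 3.0, 4.0, 5.5, 4.0, 3.0],
    ];

    // The leak: original #1 run through the (guessed) blur.
    let leaked = box_blur(&originals[1], 1);

    // Blur every candidate with the same guessed parameters and score it
    // against the leak; over many frames the right recipient stands out.
    let scores: Vec<f64> = originals
        .iter()
        .map(|orig| distance(&box_blur(orig, 1), &leaked))
        .collect();

    let best = scores
        .iter()
        .enumerate()
        .min_by(|(_, a), (_, b)| a.partial_cmp(b).unwrap())
        .map(|(i, _)| i);

    println!("best match: original {:?}", best); // expect Some(1)
}
```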