> You should first check if the compiler doesn't do it for you.
No, compiler cannot rewrite your bytecode into a threaded code.
> Otherwise you're just making your code harder to maintain and less portable for zero gain.
No. Threaded code is almost always faster, even if run only once.
> a proper jump table
I'm not talking about jump tables. I'm talking about replacing the bytecode indices with jump addresses once (yes, with a precalculated table) and then doing an indirect jump for each translated opcode from the input stream.
I don't follow you, what do computed gotos have to do with threaded code?
I'm not sure I understand the code you've linked but "goto (void )(jumptbl_base + *pc++)" looks very much like my big "match" statement except that I match only a subset of the 32 bit instruction.
I don't see why the compiler wouldn't be able to rewrite a bit switch/match statement as a jump table without having to spell it out directly.
> I'm not talking about jump tables. I'm talking about replacing the bytecode indices with jump addresses once (yes, with a precalculated table) and then doing an indirect jump for each translated opcode from the input stream.
> I don't follow you, what do computed gotos have to do with threaded code?
1. Build a jump table
2. Read a bytecode
3. Replace all the instruction codes in the bytecode stream with values from the jump table (i.e., build a threaded code from your bytecode).
4. In a loop, take the next value at PC and jump to it - it's already an address. No table lookups (just once, when you load your bytecode). So it's much faster than any switch can be.
EDIT: probably you did not notice that for most of the targets jumptbl_base = 0, and an indirection is removed.
Oh I think I get it, thanks (although I still don't understand what "threaded" means in this context).
The only downside I can see is that you'd significantly increase the size of the code. On a 64bit architecture you trade each byte for a 64bit address, effectively multiplying by eight the footprint in the data cache. A 256 entry LUT on the other hand will fit snugly in cache and the lookup shouldn't be very costly.
Also if I understood you correctly what you're proposing doesn't have much to do with the "computed gotos" extension.
> The only downside I can see is that you'd significantly increase the size of the code
This is why adding the jumptbl_base on 64-bit platforms. Pointers are still 32-bit, just with an added offset (and a tiny overhead).
> doesn't have much to do with the "computed gotos" extension
You cannot implement indirect threaded code without a computed goto. You can implement a direct threaded code, of course (generating jump or call instructions directly), but this is a totally different thing.
Why do you want any inlining? It's a threaded code. You know an address of each instruction handler, so you can replace an opcode with this address and eliminate a switch and any table lookups altogether.
No, compiler cannot rewrite your bytecode into a threaded code.
> Otherwise you're just making your code harder to maintain and less portable for zero gain.
No. Threaded code is almost always faster, even if run only once.
> a proper jump table
I'm not talking about jump tables. I'm talking about replacing the bytecode indices with jump addresses once (yes, with a precalculated table) and then doing an indirect jump for each translated opcode from the input stream.
An example: https://github.com/ocaml/ocaml/blob/trunk/byterun/interp.c
(see the Next macro definition)