I don't understand this focus on micro-performance details... considering that all of this is about an interpretation approach, which is always going to be slow, relatively speaking. The big speedup would be to JIT it all; then you don't need to care about the structuring of switch loops etc.
You'd be surprised at how little speedup you get from simply JIT-compiling the Python bytecode. It's so high-level that most interesting stuff happens in the layers below anyway.
Because it is fairly easy to do: it's a code transform that's mostly mechanical. It also improves code quality, which is unusual for an optimization. So if that nets you those extra few percent, why not?