It's not very friendly to the optimizer, though. To match the performance of a templated generic version, the optimizer has to eliminate the vtable dispatch, which means it needs to (1) convert the recursive sort function to a loop; (2) inline the sort function; (3) promote the interface to the stack via escape analysis; (4) run scalar replacement of aggregates (SROA) on the interface; (5) constant-propagate the resulting function pointers to their use sites; (6) inline the now-constant functions. Doing all of that takes a pretty powerful optimization framework with a well-tuned pipeline, like GCC's or LLVM's.
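To make the contrast concrete, here's a rough sketch of the two shapes being compared. I'm assuming a Go-style setting, which the comment above doesn't actually state, and every name in it (`Sorter`, `IntSlice`, `Ordered`, `quickDynamic`, `quickStatic`) is made up for illustration. The first version calls `Less` and `Swap` through an interface inside a recursive function, so each call is an indirect dispatch until the optimizer proves otherwise. The second has the comparison written directly on the element type, so there's nothing to devirtualize.

```go
package main

import "fmt"

// Sorter is a hypothetical interface in the style of Go's sort.Interface.
type Sorter interface {
	Len() int
	Less(i, j int) bool
	Swap(i, j int)
}

// IntSlice is a hypothetical concrete type implementing Sorter.
type IntSlice []int

func (s IntSlice) Len() int           { return len(s) }
func (s IntSlice) Less(i, j int) bool { return s[i] < s[j] }
func (s IntSlice) Swap(i, j int)      { s[i], s[j] = s[j], s[i] }

// quickDynamic sorts the index range [lo, hi) of data with a simple
// quicksort. Every Less and Swap goes through the interface, and the
// function is recursive, which is what makes it hard to specialize.
func quickDynamic(data Sorter, lo, hi int) {
	if hi-lo < 2 {
		return
	}
	p := hi - 1 // use the last element as the pivot
	store := lo
	for i := lo; i < p; i++ {
		if data.Less(i, p) { // indirect call
			data.Swap(i, store) // indirect call
			store++
		}
	}
	data.Swap(store, p)
	quickDynamic(data, lo, store)
	quickDynamic(data, store+1, hi)
}

// Ordered is a hypothetical constraint listing types that support "<".
type Ordered interface {
	~int | ~int64 | ~float64 | ~string
}

// quickStatic is the same algorithm written generically. The comparison
// and swap are plain operations on T, with no method table involved.
func quickStatic[T Ordered](s []T) {
	if len(s) < 2 {
		return
	}
	p := len(s) - 1
	store := 0
	for i := 0; i < p; i++ {
		if s[i] < s[p] {
			s[i], s[store] = s[store], s[i]
			store++
		}
	}
	s[store], s[p] = s[p], s[store]
	quickStatic(s[:store])
	quickStatic(s[store+1:])
}

func main() {
	a := IntSlice{5, 2, 9, 1, 5, 6}
	quickDynamic(a, 0, a.Len())
	fmt.Println(a) // [1 2 5 5 6 9]

	b := []int{5, 2, 9, 1, 5, 6}
	quickStatic(b)
	fmt.Println(b) // [1 2 5 5 6 9]
}
```

Both give the same result. The difference is only in how much work the compiler has to do before the inner loop of `quickDynamic` looks like the inner loop of `quickStatic`. Whether a given compiler gets there depends on its pipeline, and how fully it specializes the generic version is an implementation detail too.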