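I'm under the impression that part of what makes jemalloc fast is that it has a richer API than glibc malloc: non-standard entry points such as mallocx and sdallocx let the caller pass extra information (requested alignment, the size of the block being freed) that the allocator can use for optimization. That extra API isn't drop-in compatible, so an allocator limited to the libc malloc interface has no way to receive the same hints, and code that uses it also stymies tools like valgrind.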