In my experience language shootouts in general are often terrible by design -- with these being just a prime example.
A lot of the problem is in the questions they're designed to answer versus the questions people use them to answer.
For example, if I'm comparing Python and C, I typically want to know "how much slower would my program be in Python?", not "how much slower is my program in Python if I spent so much time hyper-optimizing it that I might as well have written it in C?"
But the test cases usually try to answer the latter, not the former.
It might be more reasonable than it seems at first glance. It's true that it's good to know how fast typical code runs, but there's another important question: when I run into performance problems and need to optimize a bottleneck, how fast can I make it before I have to resort to non-portable code or C extensions that complicate my build process?
Anybody writing, for example, Python code to solve these sorts of problems in the real world would instantly reach for numpy, which, while not part of the core language distribution, is pretty close to being a standard library for most Python programmers. I'm sure several of the other languages have similar libraries that are being ignored in these benchmarks. Without taking things like that into account, these results don't say much that's useful about real-world performance.
Languages and implementations also vary in how much overhead each switch into C costs you, especially inside loops, which the pidigits benchmark, for example, does sort of measure.
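To make the "switching to C" point concrete, here is a rough sketch using only the Python standard library. It computes the same sum of squares three ways: the loop and the arithmetic both in the interpreter, the loop in the interpreter with one call into a C-implemented function per element, and the whole loop pushed into C-implemented builtins. The function names and the workload are made up for illustration, and this is not how the shootout itself measures anything. I'm deliberately not quoting numbers, since they depend heavily on the implementation (CPython vs. PyPy, for instance).

```python
import operator
import timeit


def loop_in_python(values):
    # Loop and multiplication are both driven by the interpreter.
    total = 0
    for v in values:
        total += v * v
    return total


def c_call_per_element(values):
    # Loop runs in the interpreter, but each iteration makes one
    # call into the C-implemented operator.mul.
    total = 0
    for v in values:
        total += operator.mul(v, v)
    return total


def loop_in_c(values):
    # map() and sum() are C-implemented in CPython, so the per-element
    # work never returns to interpreted Python code.
    return sum(map(operator.mul, values, values))


def time_variants(values, repeat=5, number=20):
    # Best-of-N seconds per call for each variant (hypothetical helper).
    timings = {}
    for fn in (loop_in_python, c_call_per_element, loop_in_c):
        best = min(timeit.repeat(lambda: fn(values), repeat=repeat, number=number))
        timings[fn.__name__] = best / number
    return timings


if __name__ == "__main__":
    data = list(range(100_000))
    for name, seconds in time_variants(data).items():
        print(f"{name}: {seconds * 1e3:.3f} ms per call")
```

All three return the same value, so any difference in the timings comes from where the loop runs and how often it crosses the Python/C boundary, which is the kind of cost that varies between implementations.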
Because there's always a "for what" in there. As an analogy, consider MMA. On the surface, it seems to answer the question "what's the best martial art?". But really, it only answers the question "what is the best martial art for fighting a single opponent in an octagon-shaped ring, in front of an audience, aiming for a submission?" And the answer is of course Brazilian Jiu-Jitsu, exactly the answer the founders of UFC wanted...
Former cop Rory Miller writes about this in his book: the police experimented with BJJ and found it useless. Why? Because in BJJ you pin your opponent on his back, since it makes a better show for the audience, but as a cop you always pin your opponent on his front so you can handcuff him!
Ok, sure, but I'd rather see a rough attempt at getting some numbers than just throwing your hands up in the air and saying "gee, that's a hard problem".
Also, I think that we all know enough about programming and languages and their many uses that we can talk directly about it, rather than about an analogy.
Why does somebody saying "I wouldn't use this." have to provide an alternative? If such a statement is backed by reasons I find it interesting to read. They're saving me trouble trying it out, just like any other review.
"So you can either complain that they're not good, or you can try and improve them."
Yes, that sentence is literally correct. But it sounds like it's saying one option is not useful. And I still haven't heard a single reason why reviews of benchmarks are bad.
Responding to a criticism with "those who can't do criticize" is super, super boring. It's been done to death. You're just tarring all criticism with an overly broad brush. If it's bad criticism why is it worth responding to? And if it's plausible criticism why aren't you focusing on the actual details?
Maybe there's a space for "if you spent an average amount of time optimizing" :) For example my PyPy optimizations took maybe 4 hours total, with 0 time spent looking at assembly.
The question marks seem to suggest that you're poking a hole in grandparent's argument, but it seems like you're both in agreement that the shootout is misused. What am I missing?