". I'm not sure it ever makes sense to do comparative benchmarking, since you can't test on arbitrarily large inputs (which is what's important when looking at algorithm complexity)"
Of course it does, if you actually want the algorithm to be used in the real world.
It will get used in the real world in cases where the size of the input data justifies it. The main selling point of a better algorithm (and the main reason users would use it) is usually improved time complexity, not actual running time. Users pick it when the data is large enough to warrant it, after possibly running their own benchmarks.
If you can't even show cases where the size of the input data gives you a better running time, nobody will ever use it.
"The main selling point of a better algorithm (and the main reason users would use it) is usually improved time complexity, not actual running time. "
I'm not sure where you get this. I have literally never seen anyone pick an algorithm whose only benefit is "improved time complexity" when "actual running time" is worse for the majority of real-world cases.
In particular, there are plenty of algorithms with "improved time complexity" but higher constant factors, and they don't get used.
As a concrete example: most computations of dominators in compilers use theoretically worse (O(n log n) instead of near-linear inverse-Ackermann) but practically faster algorithms.
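To make that concrete, here is a minimal sketch of the same phenomenon: the simple iterative dominator algorithm in the style of Cooper, Harvey and Kennedy. Its worst-case bound is worse than the sophisticated Lengauer-Tarjan variants the parent alludes to, but on typical control-flow graphs it is very fast and far simpler, which is why compilers tend to prefer this family of algorithms. The CFG representation (a dict mapping each node to its successor list) is just for illustration.

```python
def immediate_dominators(cfg, entry):
    """Return {node: immediate dominator} for nodes reachable from entry."""
    # Postorder numbering via DFS (recursive, fine for a sketch).
    postorder, seen = [], set()

    def dfs(node):
        seen.add(node)
        for succ in cfg.get(node, []):
            if succ not in seen:
                dfs(succ)
        postorder.append(node)

    dfs(entry)
    number = {node: i for i, node in enumerate(postorder)}
    rpo = list(reversed(postorder))  # reverse postorder

    # Predecessor map, restricted to reachable nodes.
    preds = {node: [] for node in postorder}
    for node in postorder:
        for succ in cfg.get(node, []):
            if succ in number:
                preds[succ].append(node)

    idom = {entry: entry}

    def intersect(a, b):
        # Walk both fingers up the current dominator tree until they meet.
        while a != b:
            while number[a] < number[b]:
                a = idom[a]
            while number[b] < number[a]:
                b = idom[b]
        return a

    changed = True
    while changed:
        changed = False
        for node in rpo:
            if node == entry:
                continue
            processed = [p for p in preds[node] if p in idom]
            new_idom = processed[0]
            for p in processed[1:]:
                new_idom = intersect(p, new_idom)
            if idom.get(node) != new_idom:
                idom[node] = new_idom
                changed = True
    return idom


# Tiny usage example: a diamond CFG where node 0 dominates everything.
cfg = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(immediate_dominators(cfg, 0))  # {3: 0, 1: 0, 2: 0, 0: 0}
```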
Also, you can't possibly know when data is large enough to warrant it if nobody has taken the time to benchmark it.
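And "large enough" is something you measure, not derive. A minimal sketch of that kind of measurement: time two interchangeable implementations across a range of input sizes and report where (or whether) the asymptotically better one starts winning. `simple_impl` and `clever_impl` here are placeholders for whatever two algorithms you are actually comparing; an insertion sort versus the built-in sort is used only so the sketch runs as-is.

```python
import random
import timeit

def clever_impl(data):   # better asymptotics: O(n log n)
    return sorted(data)

def simple_impl(data):   # worse asymptotics: O(n^2)
    out = []
    for x in data:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out

def crossover(sizes, repeats=5):
    """Report, per input size, which implementation is faster on this machine."""
    for n in sizes:
        data = [random.random() for _ in range(n)]
        t_simple = min(timeit.repeat(lambda: simple_impl(data), number=1, repeat=repeats))
        t_clever = min(timeit.repeat(lambda: clever_impl(data), number=1, repeat=repeats))
        winner = "simple" if t_simple < t_clever else "clever"
        print(f"n={n:>7}: simple={t_simple:.6f}s  clever={t_clever:.6f}s  -> {winner}")

crossover([10, 100, 1000, 10000])
```

The point is not this particular pair of functions; it's that the crossover point (if there is one) only falls out of measurements like these, not out of the big-O bounds alone.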