That’s the problem: these metrics often come from overfitted or in-sample data, and are completely unrealistic when it comes to expected generalization performance.
I’m at the point where I never trust performance metrics anymore. Or rather, the worse they are, the more I trust them!
I feel like you might be conflating a couple of things, though I'm not a DS so could be off base here.
My reading of the OP's description is that the vendors were offering interpolative predictions but did not use a train/test split of the data. This is in contrast to extrapolative predictions, which I would call out-of-sample.
So, because they didn't use a train/test split, they achieved extremely good accuracy: they were testing on the same data they trained on. Even if the predictions are "in-sample" in the interpolation sense, you still can't evaluate on the same data you trained on.
I would imagine predictive work reports metrics like precision and recall computed out-of-sample, i.e. on held-out data the model never saw during training.
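For concreteness, here's a minimal sketch of the difference (the dataset, model, and split size are all illustrative assumptions, not the vendors' actual setup): the same model scored on its own training data versus a held-out split, with precision and recall reported only on the held-out portion.

    # Minimal sketch: in-sample vs. held-out evaluation (synthetic data, illustrative only)
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score, precision_score, recall_score

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

    # "In-sample" evaluation: scoring the data the model was trained on.
    # An unpruned tree can memorize it, so accuracy looks near-perfect.
    print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))

    # Out-of-sample evaluation: scoring data the model never saw.
    # This is the only number that says anything about generalization.
    y_pred = model.predict(X_test)
    print("test accuracy: ", accuracy_score(y_test, y_pred))
    print("test precision:", precision_score(y_test, y_pred))
    print("test recall:   ", recall_score(y_test, y_pred))

The train accuracy will typically come out far higher than the test numbers, which is exactly the kind of gap a vendor quoting in-sample metrics is hiding.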