The rigorous mathematical justification that this trick works better than not using it is that speech recognition works better when it is used. It is for theory to explain reality rather than the other way around.
Specifically, what justifies this is that the parameter is optimized via cross-validation, so within the space that we're optimizing over, we know that we're picking a good parameter. If it's not equal to 1, then it turns out that the "rigorous" probabilistic model was missing something that turned out to be useful.
In this case, it's that the probabilistic model only incorporates naively measured probabilities, and does not account for the error in those measured values. This is rather tricky to do right - you need to know a lot about the true prior distributions as well as the error distributions in order to account for it, and generally we don't have enough information to do this effectively.
So we "cheat", and just use the probabilistic model as a starting point to launch an optimization against.
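To make "launch an optimization against" concrete, here's a minimal sketch of tuning the fudge factor on held-out data. It assumes the factor is an exponent on the language model, so in log space the score is log f(A|W) + lm_scale * log g(W). The numbers, the names (`HELD_OUT`, `pick`, `tune_lm_scale`) and the grid are all made up for illustration, and it uses a single held-out set where a real setup would cross-validate.

```python
# Toy held-out set: one list per utterance, each candidate transcript is
# (acoustic log-prob, language-model log-prob, is_correct). Numbers are invented.
HELD_OUT = [
    [(-10.0, -6.0, False), (-11.0, -3.0, True)],
    [(-20.0, -9.0, False), (-25.0, -5.0, True)],
    [(-15.0, -8.0, True), (-18.0, -7.0, False)],
]


def pick(candidates, lm_scale):
    """Return the candidate maximizing log f(A|W) + lm_scale * log g(W)."""
    return max(candidates, key=lambda c: c[0] + lm_scale * c[1])


def accuracy(dataset, lm_scale):
    """Fraction of utterances whose top-scoring candidate is the correct one."""
    hits = sum(pick(cands, lm_scale)[2] for cands in dataset)
    return hits / len(dataset)


def tune_lm_scale(dataset, grid):
    """Grid-search the scale with the best held-out accuracy (first one on ties)."""
    return max(grid, key=lambda s: accuracy(dataset, s))


grid = [0.0, 0.5, 1.0, 1.5, 2.0, 4.0]
best = tune_lm_scale(HELD_OUT, grid)
print(best, accuracy(HELD_OUT, 1.0), accuracy(HELD_OUT, best))
```

On this toy data the "rigorous" setting lm_scale = 1 gets one utterance wrong, while a larger scale gets all three right and a much larger one starts making mistakes again. That's the whole argument in miniature: the unweighted model is just one point in the space you're searching.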
Stated differently, it is rigorous if you accept that A and W are what we think they are and that we want to find w`, which we can write as w` = argmax over W of P(A|W)P(W).
The problem is that we can't know P(A|W) or P(W). We must choose a family of distributions (a model) that we believe they live in and then perform statistical estimation to find the best representative of those families, call them f`(A|W) and g`(W). We can introduce error here in three ways: (1) choosing too small a family, (2) having too little data to find the best representative of our family, and (3) having finite resources and thus giving up before we reach the best one we could have found.
(1), (2), and (3) are in contention. For instance, we can accept more of (1) by choosing a smaller, simpler family so that we can quickly process more data and reduce (2) and (3). It's a battle of tradeoffs.
The fudge factor is introduced to compensate for the fact that P(W) is somewhat easier to estimate than P(A|W) (but don't get me wrong, they're both technically difficult). That means that f`(A|W) tends to be blurrier than g`(W) (has lower Fisher information, perhaps). Considering f` and g` to be on equal ground is then pretty silly, so giving more power to the sharper model improves classification accuracy.
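As a rough illustration of what "giving more power" does, here's a small sketch assuming the fudge factor is an exponent on g`(W). Raising a distribution to a power above 1 and renormalizing concentrates mass on the entries it already preferred, so its opinion counts for more in the product. The distribution and the names (`sharpen`, `entropy`) are invented for the example. The renormalization is only there to make the effect visible, since the argmax doesn't care about it.

```python
import math


def sharpen(probs, power):
    """Raise each probability to `power` and renormalize to sum to 1."""
    scaled = [p ** power for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


def entropy(probs):
    """Shannon entropy in nats; lower means a more peaked distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


g = [0.5, 0.3, 0.2]          # made-up language-model probabilities for three hypotheses
g_sharp = sharpen(g, 2.0)    # the same model with a fudge factor of 2

print(g_sharp)
print(entropy(g), entropy(g_sharp))
```

The exponent doesn't change the ranking under g` alone. It only widens the gaps, so g` wins more of its disagreements with f`.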
---
The point being, it's not actually surprising that the fudge factor exists. It's more of a conflation of ideas and notation that makes it seem confusing.
(P.S. read f` as "f star", the asterisks were being eaten)
There isn't one. It's an engineering solution that just happens to work pretty often in ASR.
Complex P(W)'s are almost certainly fit by MLE. The models used are super complex, so MLE is often the only practical estimate. There are MaxEnt and Bayes methods, but I think people generally trade them away for computational tractability, since enormous models wouldn't gain much from prior information when there's just so much data available.
The power law smells a bit like P(W) were itself some maximum-likelihood estimate.