I really like the idea of having many different metrics that you automatically track for every test you run. I've been telling people to do that for years, and saying that the fact that standard A/B test frameworks don't is good enough reason to roll your own.
However, suppose you are following 20 metrics and you run 20 tests. That is 400 metric/test combinations, and at a 1% false-positive rate you should expect roughly 4 of them to show 99% confidence purely by chance, on metrics that have nothing to do with the change you made. This is just a side effect of having many tests and many metrics.
Therefore, if you find yourself in that situation, you should be strongly predisposed to assume that an unexpected result on a metric that seems unconnected to your test really is due to random chance, because the odds of weird chance results are higher than you would have guessed.
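For concreteness, here is a rough back-of-the-envelope sketch of that arithmetic (the 20-and-20 figures come from the comment above; treating the 400 checks as independent is an assumption, but it is good enough for an estimate):

    # Expected number of spurious "99% confident" readings when every
    # metric is pure noise, assuming the metric/test checks are independent.
    tests, metrics = 20, 20             # figures from the comment above
    alpha = 0.01                        # 99% confidence threshold
    comparisons = tests * metrics       # 400 metric/test combinations

    expected_false_positives = comparisons * alpha       # about 4
    prob_at_least_one = 1 - (1 - alpha) ** comparisons   # about 0.98

    print(f"expected spurious hits: {expected_false_positives:.1f}")
    print(f"P(at least one spurious hit): {prob_at_least_one:.2f}")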
Amen to that. This is an issue we've been very aware of recently, and we're discussing various possibilities for mitigating it.
The suggestion to use historical graphs instead of single significance numbers reduces, but does not eliminate, the likelihood of making a mistake due to one of these chance results.
Historical graphs do not actually help you at all. The conditional probability of seeing a particular historical graph is fixed once you know the current number of observations and the significance number. In particular, the historical graph gives you no information about the true underlying probability beyond what you already have from the number of observations and the current significance number.
Say your significance number currently claims 99% significance in favor of alternative A.
A historical graph can show you that 12 hours ago, there was a stat sig in favor of alternative B (we've seen this plenty of times).
Some metrics take a while to stabilize, and the more metrics you have, the more likely you are to run into these particular situations (as you mention above). A graph helps you understand recent variability in the metric.
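For what it's worth, here is a small A/A simulation (not anyone's actual system; the conversion rate, batch size, and number of peeks are invented) showing how a running significance readout on pure noise can cross the 99% threshold in one direction or the other as the graph evolves:

    # Both arms share the same true conversion rate, yet the running
    # two-proportion z statistic drifts as data accumulates. Peeking at
    # the graph many times gives many chances for a spurious
    # "significant" reading, in either direction.
    import math
    import random

    random.seed(7)
    P_TRUE = 0.05        # identical conversion rate in both arms
    PEEKS = 200          # how many times we look at the running graph
    BATCH = 500          # visitors added per arm between looks

    conv_a = conv_b = n = 0
    hits_a = hits_b = 0
    for peek in range(1, PEEKS + 1):
        n += BATCH
        conv_a += sum(random.random() < P_TRUE for _ in range(BATCH))
        conv_b += sum(random.random() < P_TRUE for _ in range(BATCH))
        pooled = (conv_a + conv_b) / (2 * n)
        se = math.sqrt(2 * pooled * (1 - pooled) / n)
        z = (conv_a / n - conv_b / n) / se
        if z > 2.576:        # two-sided 99% threshold, favoring A
            hits_a += 1
        elif z < -2.576:     # favoring B
            hits_b += 1

    print(f"peeks 'significant' for A: {hits_a}, for B: {hits_b} (true difference: none)")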
Recent variability shouldn't matter. Assuming you don't have some external factor driving variability (like an email program), the only statistically important fact that past variability gives you is an indirect estimate of how many people you have in the test. But that is a number you already have direct access to.
If this does not seem true for you, then you need to review how you are doing your stats, because something sounds fishy. Perhaps, for example, you are plugging in the number of times the target event happened instead of the number of people that the target event happened to? In that case your observations are correlated, which throws off your statistical tests. There are a lot of ways to do the stats wrong, and I like to point out that if you do, you'll come up with wrong answers - and believe them.
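To make that per-event vs. per-user pitfall concrete, here is a tiny sketch with fabricated visit records (the IDs and counts are made up; the point is only how the denominator changes):

    # If a few users fire the target event many times, counting events
    # instead of unique users inflates the apparent sample size and shrinks
    # the standard error, so results look more significant than they are.
    import math

    # (user_id, did the target event fire on this visit?) -- "u3" fires it 5 times.
    visits = [("u1", 1), ("u2", 0), ("u3", 1), ("u3", 1), ("u3", 1),
              ("u3", 1), ("u3", 1), ("u4", 0), ("u5", 1), ("u6", 0)]

    def rate_and_se(successes, trials):
        p = successes / trials
        return p, math.sqrt(p * (1 - p) / trials)

    # Wrong: treat every visit as an independent observation.
    p_evt, se_evt = rate_and_se(sum(hit for _, hit in visits), len(visits))

    # Right: one observation per person (did this user ever fire the event?).
    per_user = {}
    for uid, hit in visits:
        per_user[uid] = max(per_user.get(uid, 0), hit)
    p_usr, se_usr = rate_and_se(sum(per_user.values()), len(per_user))

    print(f"per-event: rate={p_evt:.2f}, se={se_evt:.3f}, n={len(visits)}")
    print(f"per-user:  rate={p_usr:.2f}, se={se_usr:.3f}, n={len(per_user)}")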
Agreed. You're absolutely right that it's an indirect estimate of how many people you have in the test and doesn't add any extra information.
The only thing I'm suggesting is that looking at a graph that is fluctuating wildly up and down can be more helpful and more accessible than asking someone who isn't an expert in stats (I raise my hand) to look at a participant count and immediately understand its effect on this specific experiment's variability.
No, we are tracking the number of people that the target event happened to -- but we've also tracked the number of times it happens, and as mentioned in the article, we are aware that these specific metrics are extremely outlier-prone and are looking into ways of improving this. All advice welcome, your tips are much appreciated.
I do this with Optimizely (YC W10), which has a similar arbitrary-goal JavaScript API. I scatter Optimizely goal calls throughout our tracking scripts, and any test we happen to be running will pick them up.
I really like the suggestion to do A/A tests. A lot of Seth Roberts-style n=1 studies I've done on my nutrition, athletic performance, memory, etc. led me to completely erroneous conclusions because I didn't set the significance threshold high enough before acting on an experiment.
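A quick way to see why the threshold matters, in the spirit of those A/A tests: simulate the same-vs-same experiment many times and count how often each threshold is crossed by chance alone (every parameter below is invented for illustration):

    # A/A simulation: both arms have the identical true rate, so every
    # "significant" result is a false positive. The threshold you pick
    # directly sets how often that happens.
    import math
    import random

    random.seed(1)
    P_TRUE = 0.10       # same true rate for both arms
    N = 1000            # visitors per arm
    RUNS = 1000         # simulated A/A experiments

    def false_positive_rate(z_threshold):
        hits = 0
        for _ in range(RUNS):
            a = sum(random.random() < P_TRUE for _ in range(N))
            b = sum(random.random() < P_TRUE for _ in range(N))
            pooled = (a + b) / (2 * N)
            se = math.sqrt(2 * pooled * (1 - pooled) / N)
            if abs(a / N - b / N) / se > z_threshold:
                hits += 1
        return hits / RUNS

    print("95% threshold:", false_positive_rate(1.960))   # roughly 5% of runs
    print("99% threshold:", false_positive_rate(2.576))   # roughly 1% of runs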