How frequently do you see in the headlines that scientists have discovered that tomato juice reduces the chances of Parkinson’s disease, that red wine does or does not reduce the risk of heart disease, or that salmon is good for your brain? While statements like these may well be true, they tend to come together as a random collection of disconnected datasets assessed using standard statistical tools.
Of course, therein lies at least one major rub inherent to this piecemeal approach. If I come up with twenty newsworthy illnesses and then devise a clinical trial for each one to assess the effectiveness of some substance in fighting it, I am quite likely to come up with at least one statistically significant result. This is in fact true even if the substance I am providing does absolutely nothing. While the placebo effect could account for this, the more important reason is much more basic:
Statistical evaluation in clinical trials is done using a method called hypothesis testing. Let’s say I want to evaluate the effect of pomegranate juice on memory. I recruit two groups of volunteers and devise some kind of memory test, give the juice to half the volunteers and an indistinguishable placebo to the others, then administer the test and collect the scores. Now, it is possible that – entirely by chance – one group will outperform the other, even if both groups are randomly selected and all the trials are done double-blind. As such, statisticians start with the hypothesis that pomegranate juice does nothing: this is called the null hypothesis. Then, they look at the data and ask how likely it is that results like these would have appeared even if pomegranate juice really does nothing. The less plausible the data are under the null hypothesis, the more confident we can be in rejecting it.
If, for instance, we gave this test to two million people, all randomly selected, and the ones who got the pomegranate juice did twice as well in almost every case, it would seem very unlikely that pomegranate juice has no effect. The question, then, is where to set the boundary between data that is consistent with the null hypothesis and data that allows us to reject it. For largely arbitrary reasons, the threshold is usually set at 95%. That means we reject the null hypothesis – pomegranate juice does nothing – only when there is a 5% or smaller chance of seeing data at least this lopsided if the null hypothesis were actually true.
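For readers who would rather see this as a few lines of code than as equations, here is a minimal sketch of the pomegranate-juice test. Everything in it – the group sizes, the made-up score distributions, and the use of a standard two-sample t-test from scipy – is my own illustrative assumption, not anything from a real trial.

```python
# A minimal sketch of a two-sample hypothesis test, using made-up scores.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

# Hypothetical memory-test scores for the two groups (0-100 scale).
juice_group = rng.normal(loc=72, scale=10, size=50)    # drank pomegranate juice
placebo_group = rng.normal(loc=70, scale=10, size=50)  # drank the placebo

# The t-test asks: if the null hypothesis were true (juice does nothing),
# how likely is a difference in means at least this large?
t_stat, p_value = stats.ttest_ind(juice_group, placebo_group)

print(f"p-value: {p_value:.3f}")
if p_value < 0.05:  # the conventional 95% confidence threshold
    print("Reject the null hypothesis: the difference looks significant.")
else:
    print("Fail to reject the null: the data are consistent with no effect.")
```

The only decision the test makes is whether the observed difference would be surprising if the juice did nothing; it says nothing about why any difference exists.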
More simply, let’s imagine that we are rolling a die and trying to evaluate whether it is fair or not. If we roll it twice and get two sixes, we might be a little bit suspicious. If we roll it one hundred times and get all sixes, we will become increasingly convinced the die is rigged. It’s always possible that we keep getting sixes by random chance, but the probability of that explanation falls with each additional six we roll. How much evidence we demand before deciding that the die is rigged is, in essence, our confidence level.[1]
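To make that shrinking probability concrete, here is a tiny calculation of my own: how likely an unbroken run of sixes is if the die really is fair. No simulation is needed, just arithmetic.

```python
# How likely is a run of all sixes if the die is fair?
for n_rolls in [2, 5, 10, 20, 100]:
    p_all_sixes = (1 / 6) ** n_rolls
    print(f"{n_rolls:3d} sixes in a row: probability {p_all_sixes:.3g} if the die is fair")
```

Two sixes in a row happen about once in 36 tries, which is unremarkable; a hundred in a row is so improbable under the fair-die hypothesis that no sensible person would cling to it.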
The upshot of this, going back to my twenty diseases, is that if you do these kinds of studies over and over again, you will incorrectly identify a statistically significant effect about 5% of the time. Because that’s the confidence level you have chosen, you will, on average, get that many false positives (instances where you identify an effect that doesn’t actually exist). You could set the confidence level higher, but that requires larger and more expensive studies. Indeed, moving from 95% confidence to 99% or higher can require a substantially larger sample. That is cheap enough when you’re rolling dice, but it gets extremely costly when you have hundreds of people being experimented upon.
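The simulation below – again my own sketch, with arbitrary group sizes and score distributions – makes the same point directly: even when the “treatment” does nothing at all, roughly 5% of studies will cross the p < 0.05 line by chance.

```python
# A sketch of the twenty-diseases problem: run many trials of a substance
# that does nothing, and count how often we "discover" an effect anyway.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
n_studies = 10_000   # repeat the experiment many times to see the long-run rate
group_size = 50
alpha = 0.05         # the conventional 95% confidence threshold

false_positives = 0
for _ in range(n_studies):
    # Both groups are drawn from the same distribution: the substance does nothing.
    treated = rng.normal(loc=70, scale=10, size=group_size)
    control = rng.normal(loc=70, scale=10, size=group_size)
    _, p_value = stats.ttest_ind(treated, control)
    if p_value < alpha:
        false_positives += 1

print(f"False positive rate: {false_positives / n_studies:.3f}  (expected ~{alpha})")
```

Run twenty studies of useless substances and, on average, one of them will hand you a headline.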
My response to all of this is to demand the presence of some comprehensible causal mechanism. If we test twenty different kinds of crystals to see if adhering one to a person’s forehead helps their memory, we should expect about one in twenty to appear to work at a 95% confidence level, purely by chance. Yet we have no reasonable scientific explanation of why any of them should. When we have a statistically established correlation but no causal understanding, we should be cautious indeed. Of course, it’s difficult to learn these kinds of things from the sort of news story I was describing at the outset.
[1] If you’re interested in the mathematics behind all of this, just take a look at the first couple of chapters of any undergraduate statistics book. As soon as I broke out any math here, I’d be liable to scare off the kind of people who I am trying to teach this to – people absolutely clever enough to understand these concepts, but who feel intimidated by them.