The Grand Locus / Life for statistical sciences

## The most dangerous number

I have always been amazed by faith in statistics. The research community itself shakes in awe before the totems of statistics. One of its most powerful idols is the 5% level of significance. I never knew how it could access such a level of universality, but I can I venture a hypothesis. The first statistical tests, such as Student's t test were compiled in statistical tables that gave reference values for only a few levels of significance, typically 0.05, 0.01 and 0.001. This gave huge leverage to editors and especially peer-reviewers (famous for their abusive comments) to reject a scientific work on the ground that it is not even substantiated by the weakest level of significance available. The generation of scientists persecuted for showing p-values equal to 0.06 learned this bitter lesson, and taught it back when they came to the position of reviewer. It then took very little to transform a social punishment into the established truth that 0.06 is simply not significant.

And frankly, I think it was a good thing to enforce a minimum level of statistical reliability. The part I disagree with is the converse statement, namely that a p-value lower than 0.05 is significant and indicative of a highly non random underlying phenomenon. This is not true. And yet as soon as they can show one p-value equal to 0.05, it is impossible to convince scientists that there still might be nothing interesting in their datasets. This behaviour is known as confirmation bias. People give more importance to the observations and the arguments that confirm their beliefs, and tend to unconsciously disregard the rest. But for an unbiased mind like yours, the catch will be obvious.

By definition, when the null hypothesis is true, i.e. when nothing interesting is happening in a dataset, there is a 5% chance that the p-value will be lower than 0.05. It is not immediately obvious, but a direct consequence of this is that on average any scientific or medical hypothesis will be validated if 20 scientists work on it. If the hypothesis is false (here I mean that the null is true), each scientist has a 5% chance of validating it by mistake. If the hypothesis is true, each scientist has a > 5% chance of validating it. Whether the hypothesis is true or not, we expect that at least one scientist will validate it. I could not illustrate this issue better than the xkcd comics that I reproduce at the end of the post. There are 20 colors of jelly beans, so at level 5% we indeed expect that eating one color will correlate with acne. The same would hold true for cancer, heart failure, winning the lottery or having a name that starts with 'A'.

This situation is known as the multiple testing problem. Put simply, if you test the same hypothesis several times, you will reject the null and validate the hypothesis. And this is where confirmation bias kicks in. You naturally give more credence to the evidence that supports your hypothesis, which means that you will convince yourself of anything if you keep testing it. Another bias, much more important than the confirmation bias is the publication bias. It is very difficult, or impossible to publish negative academic results, where negative usually stands for the null being true. We have seen that a hypothesis tested by 20 scientists will typically be confirmed, so there will be one report that confirms it, and no report that disproves it, even though there are 19 studies that came to this conclusion.

The multiple testing problem becomes even more acute in genomics, where the data deluge is so big that a single researcher typically performs tens of thousands of statistical tests per day, with the help of computers. This means that hundreds of meaningless patterns will pop up in the study and feed the confirmation biases of anybody who comes into contact with the dataset. Fortunately, this problem is not new, neither restricted to the field of genomics, and clever solutions have been proposed, which I will discuss in my next post.