The Grand Locus / Life for statistical sciences

## Focus on: multiple testing

With this post I inaugurate the *focus on* series, where I go into much more depth than usual. I might as well have called it *the gory details*, but *focus on* sounds more elegant. If you are more into easy reading you might be in for a shock, so the *focus on* label is also here as a warning sign, letting you skip the post altogether if you are not interested in the details. For those who are, this way please...

In my previous post I exposed the multiple testing problem. Every null hypothesis, true or false, has at least a 5% chance of being rejected (assuming you work at a 95% confidence level). By testing the same hypothesis several times, you increase the chances that it will be rejected at least once, which introduces a bias because that one rejection is much more likely to be noticed, and then published. However, being aware of the illusion does not dissipate it. For that you need insight and statistical tools.
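To see how fast those chances grow, here is a quick back-of-the-envelope computation (mine, not from the original post), using the 5% level mentioned above:

```python
# Probability that at least one of n independent tests of a true null
# hypothesis gets rejected at the 5% level: 1 - (1 - alpha)^n.
alpha = 0.05

for n in (1, 5, 10, 14, 20, 60):
    p_at_least_one = 1 - (1 - alpha) ** n
    print(f"{n:2d} tests: P(at least one rejection) = {p_at_least_one:.3f}")
```

With 14 independent tests the probability of at least one spurious rejection already exceeds 50%, and with 60 tests it is above 95%.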

### Fail-safe $n$ to measure publication bias

Suppose $n$ independent research teams test the same null hypothesis, which happens to be true — so not interesting. This means that the...
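The excerpt is cut short here, but since the heading names it, here is a sketch of the classic fail-safe $N$ of Rosenthal (1979): the number of unpublished null results it would take to drag a combined significance below the threshold. The function and the example z-scores are my own illustration, not taken from the post.

```python
def fail_safe_n(z_scores, z_crit=1.645):
    """Rosenthal's fail-safe N: how many unpublished studies averaging
    z = 0 would pull the combined (Stouffer) z-score below z_crit."""
    s = sum(z_scores)   # combined evidence of the k published studies
    k = len(z_scores)
    # Solve sum(z) / sqrt(k + X) = z_crit for X.
    return max(0.0, (s / z_crit) ** 2 - k)

# Five published studies, each just barely significant.
print(fail_safe_n([2.0, 2.1, 1.9, 2.3, 2.0]))  # about 34 hidden studies
```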

## The most dangerous number

I have always been amazed by faith in statistics. The research community itself shakes in awe before the totems of statistics. One of its most powerful idols is the 5% level of significance. I never knew how it could attain such a level of universality, but let me venture a hypothesis. The first statistical tests, such as Student's t-test, were compiled in statistical tables that gave reference values for only a few levels of significance, typically 0.05, 0.01 and 0.001. This gave huge leverage to editors and especially peer-reviewers (famous for their abusive comments) to reject a scientific work on the grounds that it was not substantiated even at the weakest level of significance available. The generation of scientists persecuted for showing p-values equal to 0.06 learned this bitter lesson, and taught it in turn when they rose to the position of reviewer. It then took very little to transform a social punishment into the established truth that 0.06 is simply not significant.

And frankly, I think it was a good thing to enforce a minimum level of statistical reliability. The part I disagree with is the converse statement...

## The reverend’s gambit

Two years after the death of Reverend Thomas Bayes in 1761, the famous theorem that bears his name was published. Legend has it that he sensed the devilish nature of his result and was too afraid of the Church's reaction to publish it during his lifetime. Two hundred and fifty years later, the theorem still sparks debate, though nowadays among statisticians.

Bayes theorem is the object of an academic fight between the so-called frequentist and Bayesian schools. Actually, more shocking than this profound disagreement is the overall tolerance for both points of view. After all, Bayes theorem is a theorem. Mathematicians do not argue over the Pythagorean theorem: either there is a proof or there isn't, and there is no arguing about that.

So what's wrong with Bayes theorem? Well, it's the hypotheses. According to the frequentists, the theorem is right, it is just not applicable under the conditions in which the Bayesians use it. In short, the theorem says that if $A$ and $B$ are events, the probability of $A$ given that $B$ occurred is $P(A|B) = P(B|A)P(A)/P(B)$. The focus of the fight is the term $P(B...
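As a concrete illustration of the formula (the numbers are mine, not the post's): take a screening test with 99% sensitivity and a 5% false positive rate, in a population where 1 person in 1,000 is a carrier.

```python
# Bayes theorem: P(A|B) = P(B|A) * P(A) / P(B).
# A = "is a carrier", B = "tests positive". All numbers are illustrative.
p_A = 0.001            # prior: 1 in 1,000 is a carrier
p_B_given_A = 0.99     # sensitivity of the test
p_B_given_notA = 0.05  # false positive rate

# Law of total probability gives the denominator P(B).
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes theorem gives the posterior.
p_A_given_B = p_B_given_A * p_A / p_B
print(f"P(carrier | positive) = {p_A_given_B:.4f}")  # about 0.0194
```

Even with a positive test, the probability of being a carrier stays below 2%, because the prior $P(A)$ is so small.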

## Why p-values are crap

I remember my statistics classes as a student. To do a t-test we had to carry out a series of tedious calculations and in the end look up the value in a table. Making those tables cost an enormous amount of sweat from talented statisticians, so you had only three tables, for three significance levels: 5%, 1% and 0.1%. This explains the common way to indicate significance in scientific papers, with one (*), two (**) or three (***) stars. Today, students use computers to do the calculations, so the star notation probably looks like mysterious folklore and the idea of using a statistical table is downright unthinkable. And this is a good thing, because computing those t statistics by hand was a pain. But statistical software also paved the way for the invasion of p-values in the scientific literature.

To understand what is wrong with p-values, we will need to go deeper into the theory of statistical testing, so let us review the basic principles. Every statistical test consists of a null hypothesis, a test statistic (a score) and a decision rule, plus the often forgotten alternative hypothesis. A statistical test is an investigation protocol to...
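The excerpt stops here, but to make those three ingredients concrete, here is a minimal sketch (my own, not from the post) of a one-sample z-test, with each component labeled:

```python
import math
import random

random.seed(1)
sample = [random.gauss(0.3, 1.0) for _ in range(50)]  # toy data

# Null hypothesis: the true mean is 0 (standard deviation known, equal to 1).
mu0, sigma, n = 0.0, 1.0, len(sample)

# Test statistic (the "score"): the standardized sample mean, which follows
# a standard normal distribution under the null hypothesis.
z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))

# Decision rule: reject the null at the 5% level (two-sided) when |z|
# exceeds the critical value 1.96.
print(f"z = {z:.2f} ->", "reject" if abs(z) > 1.96 else "do not reject")
```

The alternative hypothesis hides inside the decision rule: rejecting for large $|z|$ is the right move only if the alternative shifts the mean away from 0.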