•

the Blog

## The curse of large numbers (Big Data considered harmful)

By Guillaume Filion, filed under
statistics,
hypothesis testing,
big data,
p-values.

• 10 February 2018 •

According to the legend, King Midas got the sympathy of the Greek god Dionysus who offered to grant him a wish. Midas asked that everything he touches would turn into gold. At first very happy with his choice, he realized that he had brought on himself a curse, as his food turned into gold before he could eat it.

This legend on the theme “be careful what you wish for” is a cautionary tale about using powers you do not understand. The only “powers” humans ever acquired were technologies, so one can think of this legend as a warning against modernization and against the fact that some things we take for granted will be lost in our desire for better lives.

In data analysis and in bioinformatics, modernization sounds like “Big Data”. And indeed, Big Data is everything we asked for. No more expensive underpowered studies! No more biased small samples! No more invalid approximations! No more p-hacking! Data is good, and more data is better. If we have too much data, we can always throw it away. So what can possibly go wrong with Big Data?

Enter the Big Data world and everything you touch turns...

## One or two tails?

By Guillaume Filion, filed under
statistics,
p-hacking.

• 11 December 2016 •

Here is a discussion that I recently had with my colleague John. He approached me with the following request:

“I sent a manuscript to Nature and it is going quite well. Actually the reviewers are rather positive, but one of them asks us to justify better why we used a one-tailed *t* test to find the main result. What should I write in the methods section?

— It depends. Why did you use a one-tailed *t* test?

— Well, we first tried the standard *t* test, but it was borderline significant. My student realized that if we used the one-tailed *t* test, the result was significant so we settled for this variant. We specified this clearly in the text, and I am now surprised that I have to justify it. Isn’t it just an accepted variant of the *t* test?

— To be honest, I understand your confusion. The guidelines are rather ill-defined. Actually, Nature journals make it worse by requesting this information for *every* test, even for those that are only one-tailed like the chi-square.

— OK, but what should I do now? For instance, how do *you* justify using a one-tailed *t...*

## Did Mendel fake his results?

By Guillaume Filion, filed under
statistics,
p-hacking,
genetics,
fraud.

• 11 April 2016 •

You went to high school and you learned genetics. You heard about a certain Gregor Mendel who crossed peas and came up with the idea that there is a dominant and a recessive allele. You did not particularly like the guy because there would always be a question about peas with recessive and dominant alleles at the exam. But you grew up, became wiser and just as you started to like him, you heard from someone that he faked his data. You felt disoriented for a while, why annoy you with this stuff at school if it is wrong? But then you came to the conclusion that he just got lucky and that he was right for the wrong reasons. After all, he was just a monk on gardening duties, why would you expect him to understand anything about real science?

### Gregor Mendel

Gregor Mendel was a monk, but he was also a trained scientist. He studied assiduously for twelve years (including about seven years on physics and mathematics), to then become a teacher of physics and natural sciences at the gymnasium of Brno. He prepared his most famous experiment for two years, meticulously checking and choosing his...

## Bayesian networks and causation

By Guillaume Filion, filed under
statistics,
Bayesian networks,
causes,
correlation.

• 20 June 2015 •

The first thing you learn in statistics is that “correlation does not imply causation”. As obvious as it sounds, most human mistakes fall in this category, and not only in statistics. The major difficulty with this question is that it is fairly easy to define correlation, but it is much harder to define causation, let alone quantify it. No surprise many statisticians just avoid talking about causation to stay out of the danger zone.

However, for Judea Pearl, this is not a satisfactory answer. In his book Causality: Models, Reasoning, and Inference, he expresses his opinion vividly.

I see no greater impediment to scientific progress than the prevailing practice of focusing all our mathematical resources on probabilistic and statistical inferences while leaving causal considerations to the mercy of intuition and good judgment.

This book lays the foundation of the now popular Bayesian networks. The key idea is that you can distinguish correlation from causation if you can observe several independent causes. For instance, suppose that patients suffering from a certain type of cancer are often immunodeficient. You wonder whether immunodeficiency is a cause or a consequence of this cancer type.

Say that variable A is whether patients have...

## (Mis)using the KS test for p-hacking

By Guillaume Filion, filed under
statistics,
p-hacking.

• 20 September 2014 •

**Update:**I have published a more academic version of this story in GigaScience, under the title The signed Kolmogorov-Smirnov test: why it should not be used. The reviewers (Garrett Jenkinson and Desmond Campbell) have pointed out that the

*t*-test is more appropriate than the Wilcoxon-Mann-Whitney test as a replacement of the signed Kolmogorov-Smirnov test. They also mentioned that the signed Kolmogorov-Smirnov is short on power, which is yet another reason not to use it. A great thing about GigaScience is that reviews are open, so you can access the discussion in the pre-publication history of the article.

A colleague of mine (let’s call him John) recently put me in a difficult situation. John is a very good immunologist who, as nearly everybody in the field, had to embrace the “omics” revolution. Spirited and curious, he has taken the time to look more closely into statistics and he now has an understanding of most popular parametric and nonparametric tests. One day, he came to me with the following situation.

“I have this gene expression data, you see... I know that a gene is up-regulated, but it is just not significant...