The Grand Locus / Life for statistical sciences

Subscribe...


the Blog

Bayesian networks and causation

The first thing you learn in statistics is that “correlation does not imply causation”. As obvious as it sounds, most human mistakes fall in this category, and not only in statistics. The major difficulty with this question is that it is fairly easy to define correlation, but it is much harder to define causation, let alone quantify it. No surprise many statisticians just avoid talking about causation to stay out of the danger zone.

However, for Judea Pearl, this is not a satisfactory answer. In his book Causality: Models, Reasoning, and Inference, he expresses his opinion vividly.

I see no greater impediment to scientific progress than the prevailing practice of focusing all our mathematical resources on probabilistic and statistical inferences while leaving causal considerations to the mercy of intuition and good judgment.

This book lays the foundation of the now popular Bayesian networks. The key idea is that you can distinguish correlation from causation if you can observe several independent causes. For instance, suppose that patients suffering from a certain type of cancer are often immunodeficient. You wonder whether immunodeficiency is a cause or a consequence of this cancer type.

Say that variable A is whether patients have...






What is bioinformatics about?

A brief note published a few weeks ago initiated a discussion on the blogosphere about who is a bioinformatician and who is not. According to Wikipedia

bioinformatics combines computer science, statistics, mathematics, and engineering to study and process biological data.

To see how the community defines itself, I downloaded the abstracts of Bioinformatics from 2014 (for a total around 1000 articles, extracted the most relevant keywords and put the top 100 terms in a word cloud where a word size shows its frequency*.

Obviously, bioinformatics is about data, mostly gene/protein sequences and expression. It is also good to know that bioinformatics likes genomes and networks, and that it has more affinities with structural biology than with evolution.

The favourite organism of bioinformaticians is Homo sapiens, actually it is the only one mentioned in the word cloud, and when bioinformaticians work on a disease, it is cancer.

When bioinformaticians describe their work, it is “novel” and “new”, and what they talk about is “biological”, “different”, “multiple” and “single” (the last two are usually followed by “sequence alignment” and “nucleotide polymorphism”). It is also “functional” and “available”, but somewhat less. I expected it to be “fast” and “accurate”, but...






Why do bioinformatics?

I never planned to do bioinformatics. It just happened because I liked the time in front of my computer. Still, as every sane individual, I sometimes think that I could do something else with my life, and I wonder whether I am doing the right thing. On this topic, I recently came across the famous farewell to bioinformatics by Frederick J. Ross, which is worth reading, and of which the most emblematic quote is definitely the following.

My attitude towards the subject after all my work in it can probably be best summarized thus: Fuck you, bioinformatics. Eat shit and die.

There is nothing to agree or disagree in this quote, but Frederick gives further detail about his point of view in the post. In short, bioinformaticians are bad programmers, and community-level obfuscation maintains the illusion.

By making the tools unusable, by inventing file format after file format, by seeking out the most brittle techniques and the slowest languages, by not publishing their algorithms and making their results impossible to replicate, the field managed to reduce its productivity by at least 90%, probably closer to 99%.

There are indeed many issues in the bioinformatics community and I...






If cars were made by bioinformaticians...

1. Cars would have nice names

Here is what an abstract describing a car would look like.

Transporting people to defined locations of interest is a challenge of significant economic importance. To achieve this goal, people usually use cars or public transport. However, these solutions are suboptimal in several conditions. For instance, when people are extremely close to their target location, both cars and public transport are inappropriate, which limits their practical use. Here we present CaЯ (vehiCle for chAnging geo-cooЯdinates), a fast and accurate tool as an alternative to existing vehicles.

2. Cars would be fast and accurate

Bioinformaticians develop fast and accurate software. Their cars would be just the same. Here is what a typical benchmark sections would look like.

To show that CaЯ is faster and more accurate than existing alternatives, we benchmarked CaЯ against Volvo XC90 and Ferrari F430. In the first series of tests, we measured the time to lower the front windows of the vehicles. The average run duration was 2.3 seconds for CaЯ, 3.1 seconds for Volvo XC90 and 3.9 seconds for Ferrari F430, which demonstrates that CaЯ is substantially faster than Volvo XC90 and Ferrari...






Longest runs and DNA alignments

The problem of sequence alignment gets a lot of attention from bioinformaticians (the list of alignment software counts more than 200 entries). Yet, the statistical aspect of the problem is often neglected. In the post Once upon a BLAST, David Lipman explained that the breakthrough of BLAST was not a new algorithm, but the careful calibration of a heuristic by a sound statistical framework.

Inspired by this idea, I wanted to work out the probability of identifying best hits in the problem of long read alignments. Since this is a fairly general result and that it may be useful for many similar applications, I post it here for reference.

Longest runs of 1s

I start with generalities on series of 0s and 1s and focus on the distribution of the longest run of 1s. This general problem has many applications, and I will explain why it is important for sequence alignment in the next section.

Assume that we have a Bernoulli sequence of length $(n)$ such that the probability of a 1 is $(p)$ and the probability of a 0 is $(1-p = q)$. Let $(X_0)$ be the longest run of 1s in the series. As $(n)$ increases...






Why Linux is awesome

Before you rush to the comments and express your opinion about this title, let me make something clear. I do not expect anybody to erase their operating system and install Linux after reading this post. Actually, I do not care whether you use Linux or another operating system. All I want is to share what I have learned by using Linux, and why it made me a better scientist.

The mouse that infects your brain

About a year after I started to use Linux, I was surprised to realize how uncomfortable working on Mac and Windows had become. I could not quite pinpoint the problem, but I had the vague feeling that something was missing. Yet everything seemed to be there. When I could finally formulate it, I realized that what was bugging me was the discomfort that all the possible options had been preconceived for me. I could click on option A, I could click on option B, and if I liked neither of those, there was no option C.

But most surprising was that I had never realized this before, because I had no idea of all the things you can do with your computer. I...






Nature on code sharing

A recent Nature editorial entitled “Code share” discusses an update in Nature’s policy regarding the use of software. Interestingly, the subtitle is

Papers in Nature journals should make computer code accessible where possible.

Yes, finally! The last decade was a transition period, which, in the history of bioinformatics will probably be known as the “bioinformatics revolution”. Following the completion of the first genome projects, the demand for bioinformatics rose steadily, to the detriment of biochemistry and genetics, which have now fallen from grace. Something as traumatic cannot happen in a day, and it cannot happen without pain. Actually, the transition is still ongoing and this regularly causes difficulties of all kinds in biology.

One of the most perverse effects of the massive popularization of bioinformatics is that senior scientists were not properly trained for it. This led to an implicit view that bioinformatics is a tool, somewhat like a microscope or a FACS. This explains why the materials and methods section of the first papers using bioinformatics was often reduced to something like “all the bioinformatics analysis were performed using R”. In other words, “we got some bioinformatics software and asked a qualified technician to use it...






Bioinformatics without Excel

About a year after setting up my laboratory, an observation suddenly hit me. All the job applicants were biologists who wanted to do bioinformatics. I was myself trained as an experimental biologist and started bioinformatics during my post-doc. They saw in my laboratory the opportunity to do the same. Indeed, “how did you become a bioinformatician?” is a question that I hear very often.

For lack of a better plan, most people grab a book about Linux or sign up for a Coursera class, try to do a bit every day and... well, just learn bioinformatics. I have seen extremely few people succeed this way. The content inevitably becomes too difficult, motivation decreases and other commitments take over. I will not lie, self-learning bioinformatcs is hard and it is frustrating... but it can be fun if you know how to do it. And most importantly, if you understand your worst enemy: yourself.

Here is a small digest of how it happened for me. I do not mean that this is the only way. I simply hope that this will be useful to those who seriously want to dive into bioinformatics.

Step 1. Get out of your...






A flurry of copycats on PubMed

It started with a search for trends on PubMed. I am not sure what I expected to find, but it was nothing like the “CISCOM meta-analyses”. Here is the story of how my colleague Lucas Carey (from Universitat Pompeu Fabra) and myself discovered a collection of disturbingly similar scientific papers, and how we got to the bottom of it.

Pattern breaker

CISCOM is the medical publication database of the Research Council for Complementary Medicine. Available since 1995, it used to be mentioned in 2 to 3 papers per year, until Feburary 2014 when the number of hits started to skyrocket. Since then, “CISCOM” surfs a tsunami of one new hit per week.

But this is not what drew my attention, such waves are not unheard of on PubMed. For instance, the progression of CRISPR/Cas9, is more impressive. It was the titles of the hits that convinced me that something fishy was going on: all of them are on the model “something and something else: a meta-analysis”.

The strange pattern caught my attention, but I somehow missed its significance and put this in the back of my mind. It was only later that Lucas convinced me...






(Mis)using the KS test for p-hacking

Update: I have published a more academic version of this story in GigaScience, under the title The signed Kolmogorov-Smirnov test: why it should not be used. The reviewers (Garrett Jenkinson and Desmond Campbell) have pointed out that the t-test is more appropriate than the Wilcoxon-Mann-Whitney test as a replacement of the signed Kolmogorov-Smirnov test. They also mentioned that the signed Kolmogorov-Smirnov is short on power, which is yet another reason not to use it. A great thing about GigaScience is that reviews are open, so you can access the discussion in the pre-publication history of the article.

A colleague of mine (let’s call him John) recently put me in a difficult situation. John is a very good immunologist who, as nearly everybody in the field, had to embrace the “omics” revolution. Spirited and curious, he has taken the time to look more closely into statistics and he now has an understanding of most popular parametric and nonparametric tests. One day, he came to me with the following situation.

“I have this gene expression data, you see... I know that a gene is up-regulated, but it is just not significant...






»

the Blog
Best of
Archive
About

the Lab
The team
Research lines
Work with us

Blog roll
Simply stats
ACGT
atcgeek
opiniomics
cryptogenomicon
Bits of DNA