
the Blog

## Lost in punctuation

By Guillaume Filion, filed under
information,
series: IMDB reviews,
information retrieval,
IMDB,
movies.

• 26 May 2012 •

What is the difference between The Shawshank Redemption and Superbabies: Baby Geniuses 2? Besides all other differences, *The Shawshank Redemption* is the best movie in the world and *Superbabies: Baby Geniuses 2* is the worst, according to IMDB users (check a sample scene of *Superbabies: Baby Geniuses 2* if you believe that the worst movie of all time is Plan 9 from Outer Space or Manos: the Hands of Fate).

IMDB users not only rank movies, they also write reviews, and this is where things turn really awesome! Give Internet users the space and freedom to express themselves and you get Amazon's Tuscan whole milk or Food Network's late night bacon recipe. By now IMDB reviews have secured their place in the Internet pantheon, as you can check on absolutedreck.com or shittyimdbreviews.tumblr.com. But as far as I am aware, nobody has taken this data seriously and tried to understand what IMDB reviewers have to say. So let's scratch the surface.

I took a random sample of exactly 6,000 titles from the ~ 200,000 feature films on IMDB. This is less than 3% of the total, but this amount is sufficient to...

## Poetry and optimality

By Guillaume Filion, filed under
information,
independence,
probability.

• 21 May 2012 •

Claude Shannon was one hell of a scientist. His work in the field of information theory (and in particular his famous noisy channel coding theorem) shaped the modern technological landscape, but also gave profound insight into the theory of probability.

In my previous post on statistical independence, I argued that causality is not a statistical concept, because all that matters to statistics is the sampling of events, which may not reflect their occurrence. On the other hand, the concept of information fits gracefully in the general framework of Bayesian probability and gives a key interpretation of statistical independence.

Shannon defines the information of an event $(A)$ with probability $(P(A))$ as $(-\log P(A))$. For years, this definition baffled me with its simplicity and its abstruseness. Yet it is actually intuitive. Let us call $(\Omega)$ the system under study and $(\omega)$ its state. You can think of $(\Omega)$ as a set of possible messages and of $(\omega)$ as the true message transmitted over a channel, or (if you are Bayesian) of $(\Omega)$ as a parameter set and $(\omega)$ as the true value of the parameter. We have total information about the system if we know $(\omega)$. If instead, all...
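As a quick numerical illustration of the definition $(-\log P(A))$, here is a minimal Python sketch. The function name `information` and the base-2 logarithm (which gives a result in bits) are assumptions made for the example; the definition itself does not fix the base.

```python
import math

def information(p: float) -> float:
    """Shannon information of an event with probability p, in bits.

    Computed as log2(1 / p), which equals -log2(p).
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return math.log2(1 / p)

# A certain event carries no information; rarer events carry more.
for p in (1.0, 0.5, 0.25, 0.125):
    print(f"P(A) = {p}: {information(p)} bits")
```

The logarithm turns products of probabilities into sums: with these values, the information of an event with probability 0.5 × 0.25 is the sum of the two individual informations.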

## The fallacy of (in)dependence

By Guillaume Filion, filed under
information,
causality,
probability,
independence.

• 04 May 2012 •

In the post Why p-values are crap I argued that independence is a key assumption of statistical testing and that it almost never holds in practical cases, explaining how p-values can be insanely low even in the absence of effect. However, I did not explain how to test independence. As a matter of fact I did not even *define* independence because the concept is much more complex than it seems.

Apart from the singular case of Bayes' theorem, which I referred to in my previous post, the many conflicts of probability theory have been settled by axiomatization. Instead of saying what probabilities *are*, the current definition says what properties they have. Likewise, independence is defined axiomatically by saying that events $(A)$ and $(B)$ are independent if $(P(A \cap B) = P(A)P(B))$, or in English, if the probability of observing both is the product of their individual probabilities. Not very intuitive, but if we recall that $(P(A|B) = P(A \cap B)/P(B))$, we see that an alternative formulation of the independence of $(A)$ and $(B)$ is $(P(A | B) = P(A))$. In other words, if $(A)$ and $(B)$ are independent, observing...
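The product rule $(P(A \cap B) = P(A)P(B))$ can be checked by brute force on a small sample space. The sketch below assumes a single roll of a fair six-sided die; the events and the helper names `prob` and `independent` are made up for illustration.

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die (an illustrative assumption).
OMEGA = frozenset(range(1, 7))

def prob(event):
    """Probability of an event (a set of outcomes) under the uniform measure."""
    return Fraction(len(event & OMEGA), len(OMEGA))

def independent(a, b):
    """True if P(A and B) equals P(A) * P(B)."""
    return prob(a & b) == prob(a) * prob(b)

even = {2, 4, 6}
at_most_four = {1, 2, 3, 4}
at_most_three = {1, 2, 3}

# P(even and <= 4) = 1/3, and P(even) * P(<= 4) = 1/2 * 2/3 = 1/3.
print(independent(even, at_most_four))
# P(even and <= 3) = 1/6, but P(even) * P(<= 3) = 1/2 * 1/2 = 1/4.
print(independent(even, at_most_three))
```

Exact fractions are used instead of floats so that the equality test is not disturbed by rounding.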

## The reverend’s gambit

By Guillaume Filion, filed under
Bayesian statistics,
probability,
p-values.

• 22 April 2012 •

Two years after the death of Reverend Thomas Bayes in 1761, the famous theorem that bears his name was published. Legend has it that he felt the devilish nature of his result and was too afraid of the reaction of the Church to publish it during his lifetime. Two hundred and fifty years later, the theorem still sparks debate, but among statisticians.

Bayes' theorem is the object of the academic fight between the so-called frequentist and Bayesian schools. Actually, more shocking than this profound disagreement is the overall tolerance for both points of view. After all, Bayes' theorem is a theorem. Mathematicians do not argue over the Pythagorean theorem: either there is a proof or there isn't. There is no *arguing* about that.

So what's wrong with Bayes' theorem? Well, it's the hypotheses. According to the frequentist, the theorem is right, it is just not applicable in the conditions used by the Bayesian. In short, the theorem says that if $(A)$ and $(B)$ are events, the probability of $(A)$ given that $(B)$ occurred is $(P(A|B) = P(B|A) P(A)/P(B))$. The focus of the fight is the term $(P(B...
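To make the formula $(P(A|B) = P(B|A) P(A)/P(B))$ concrete, here is a minimal sketch in Python. The die example and the function name `bayes` are assumptions chosen for illustration, in a setting where every term is known and nobody disputes the result.

```python
from fractions import Fraction

def bayes(p_b_given_a, p_a, p_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if p_b == 0:
        raise ValueError("P(B) must be positive")
    return p_b_given_a * p_a / p_b

# One roll of a fair die. A: the die shows 2. B: the die shows an even number.
p_a = Fraction(1, 6)
p_b = Fraction(1, 2)
p_b_given_a = Fraction(1)  # a 2 is always even

print(bayes(p_b_given_a, p_a, p_b))
```

Knowing that the roll is even leaves three equally likely faces, which agrees with the value the formula returns.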

## Why p-values are crap

By Guillaume Filion, filed under
R,
random walks,
probability,
p-values.

• 03 April 2012 •

I remember my statistics classes as a student. To do a t-test we had to carry out a series of tedious calculations and in the end look up the value in a table. Making those tables cost an enormous amount of sweat from talented statisticians, so you had only three tables, for three significance levels: 5%, 1% and 0.1%. This explains the common way to indicate significance in scientific papers, with one (*), two (**) or three (***) stars. Today, students use computers to do the calculations, so the star notation probably appears as mysterious folklore and the idea of using a statistical table is properly unthinkable. And this is a good thing because computing those t statistics by hand was a pain. But statistical software also paved the way for the invasion of p-values in the scientific literature.

To understand what is wrong with p-values, we will need to go deeper into the theory of statistical testing, so let us review the basic principles. Every statistical test consists of a null hypothesis, a test statistic (a score) and a decision rule, plus the often forgotten alternative hypothesis. A statistical test is an investigation protocol to...
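The three ingredients (null hypothesis, test statistic, decision rule) can be sketched with the table-lookup procedure described above. This is only an illustration under stated assumptions: the measurements are made up, the function names are hypothetical, and the critical values are the standard two-sided thresholds of Student's t for 9 degrees of freedom.

```python
import math
import statistics

# Two-sided critical values of Student's t for 9 degrees of freedom,
# as one would read them from a printed table: level -> threshold.
CRITICAL_T_DF9 = {0.05: 2.262, 0.01: 3.250, 0.001: 4.781}

def t_statistic(sample, mu0):
    """Test statistic: distance of the sample mean from mu0, in standard errors."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.mean(sample) - mu0) / standard_error

def stars(t, critical=CRITICAL_T_DF9):
    """Decision rule: one star per significance level at which H0 is rejected."""
    return "*" * sum(abs(t) > threshold for threshold in critical.values())

# Made-up measurements (n = 10, so 9 degrees of freedom).
# Null hypothesis H0: the true mean is 5.0.
sample = [5.1, 4.9, 5.6, 5.8, 5.2, 5.5, 5.3, 5.7, 5.4, 5.5]
t = t_statistic(sample, mu0=5.0)
print(f"t = {t:.2f} {stars(t)}")
```

Note that the table only answers whether the score passes each threshold; it never produces a p-value, which is exactly the difference between the old procedure and what software reports today.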

## The Brownian labyrinth

By Guillaume Filion, filed under
stochastic processes,
R,
random walks.

• 28 March 2012 •

Architecture and art show that human culture often uses the same basic shapes. Among them, the labyrinth stands out for its complexity. Made famous by the Greek myth of Theseus and the Minotaur, labyrinths are found in virtually every culture and every era. The Wikipedia entry on labyrinths shows different designs, but they all have in common the intricate folding of a path onto itself, in such a way that the distance you have to walk inside the labyrinth is much larger than your actual displacement in space.

Fiction of all genres is also full of labyrinths. Perhaps one of the most vivid appearances of the labyrinth theme in literature is The Garden of Forking Paths by Borges. In this short story, Borges evokes a perfect labyrinth: a special book that, like a gamebook, follows every possible ramification of the plot, and not just one. In some passages the hero dies, in others he lives, in such a way that one can read the novel in infinitely many ways.

An invisible labyrinth of time. To me, a barbarous Englishman, has been entrusted the revelation of this diaphanous mystery. After more than a hundred years, the details are...

## Drunk man walking

By Guillaume Filion, filed under
stochastic processes,
R,
probability,
random walks.

• 15 March 2012 •

Lotteries fascinate the human mind. In The Lottery in Babylon, Jorge Luis Borges describes a city where the lottery takes a progressively dominant part in people’s lives, to the extent that every decision, even life and death, becomes subject to the lottery.

In this story, Borges brings us face to face with the discomfort that the concept of randomness creates in our mind. Paradoxes are like lighthouses: they indicate a dangerous reef, where the human mind can easily slip and fall into madness, but they also show us the way to greater understanding.

One of the oldest paradoxes of probability theory is the so-called Saint Petersburg paradox, which has been teasing statisticians since 1713. Imagine I offered you the following game: if you toss ‘tails’, you gain $1, and as long as you keep tossing ‘tails’, you double your gains. The first ‘heads’ ends the spree and determines how much you gain. So you could gain $0, $1, $2, $4, $8... with probability 1/2, 1/4, 1/8, 1/16, 1/32, etc. What is a fair price for me to charge you to play the Saint Petersburg lottery?

Probability theory says that the...
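The gains and probabilities listed above can be tabulated exactly. The following Python sketch is an illustration only: the function names are made up, and the game is artificially capped at a maximum number of tosses so that the sum is finite.

```python
from fractions import Fraction

def outcomes(max_tosses):
    """(gain, probability) pairs for games that end within max_tosses tosses."""
    pairs = []
    for tails in range(max_tosses):
        gain = 0 if tails == 0 else 2 ** (tails - 1)
        # `tails` tails followed by one heads: tails + 1 fair tosses in total.
        pairs.append((gain, Fraction(1, 2 ** (tails + 1))))
    return pairs

def truncated_expectation(max_tosses):
    """Expected gain, counting only games that end within max_tosses tosses."""
    return sum(gain * p for gain, p in outcomes(max_tosses))

print(outcomes(5))
for n in (5, 9, 17, 33):
    print(n, truncated_expectation(n))
```

Every winning outcome contributes the same amount, 1/4, to the sum (for instance $8 × 1/32), so the capped expectation keeps growing as the cap is raised.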

## What’s in a title?

By Guillaume Filion, filed under
PubMed,
journals,
information retrieval.

• 10 March 2012 •

Trying to come up with a name for the blog, I wondered what a good title should be. If you have ever written a scientific article, you probably found yourself in the same situation. You try to surf the trend, mix in carefully selected buzzwords and present the work from its sexiest side. Sexy, that is, to the veterans. Admittedly, not everyone will crave to read “Epithelial cell adhesion molecule (EpCAM) complex proteins promote transcription factor-mediated pluripotency reprogramming” (no offense intended, I just took the first title that showed up in PubMed).

Meta-analysis of scientific literature tells us a lot about how science and scientific discourse change over time. A simple title word analysis of the articles published in Nature over an 8-year interval shows how some topics fell from grace, whereas others rose to the top.

The struggle for hype allows us to tell what scientists and editors find exciting at a given time. To play with this idea, I collected all the titles of the Nature articles published in 2002 and in 2010, and ran Wordle on them. The size of a word in the cloud is proportional to its number of occurrences in the corpus...
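The counting step behind such a word cloud can be sketched in a few lines of Python. This is not Wordle's own procedure, only a minimal illustration: the two titles and the stopword list below are hypothetical, not the Nature data used for the post.

```python
import re
from collections import Counter

# Hypothetical titles, NOT the actual Nature titles from the post.
titles = [
    "Structure of a protein complex",
    "A protein kinase structure",
]
# A tiny illustrative stopword list; a real analysis would use a longer one.
STOPWORDS = {"a", "an", "and", "in", "of", "the"}

def word_counts(titles):
    """Count how often each non-stopword occurs across all titles."""
    words = []
    for title in titles:
        words += [w for w in re.findall(r"[a-z]+", title.lower())
                  if w not in STOPWORDS]
    return Counter(words)

counts = word_counts(titles)
print(counts.most_common(2))
```

Running the same function on the titles of two different years and comparing the counts gives a crude picture of which words rose and which fell.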

## I’m the boss!!

By Guillaume Filion, filed under
motivation,
journals.

• 08 March 2012 •

“You know how scientists will communicate in the future, don’t you?

— Of course I do!”

It is a shameless lie. I have no clue what Frédéric has in mind, but I don’t want to look stupid.

“And you Vincent, I bet you know it too... right?”

On this afternoon of 2004, somewhere on the south coast of Madagascar, Vincent gives one of his majestic *puzzled* looks. That was exactly what Frédéric had hoped for.

“Well, in the future, scientists will no longer publish in journals. They will have public lab notes. They will post their results on their personal Internet page day by day... like a blog. Peer scientists will be allowed to leave their comments, criticize the protocols and the results. In short, the information will go directly from the producers to the consumers, and it will spread much better because science will become open source.”

I believed it then.

But now I have just become an independent researcher. I have my own team. I am the boss!! And I realize that Frédéric was wrong. To stay in research you need a good track record. And as far as track...
