The Grand Locus / Life for statistical sciences

the Blog

## Did Sweden cheat at Eurovision?

In my previous post, I promised to go deeper into IMDB reviews, but I will defer that until I have dealt with a more pressing issue.

Last Saturday was the Eurovision song contest. Somehow, my girlfriend managed to convince me to sit through it (apologies to my fellow disincarnated academic researchers for such treachery to our quest for knowledge).

Outside the epic performance of Ireland, already consecrated in the pantheon of memes, the show was plain boring. More specifically, it was redundant. Many songs were duplicates of each other and most were clear wannabes of successful artists (poor Amy, if you had seen what Italy did to you).

It was such a surprise, a shock I should say, that Sweden won the contest with the song Euphoria. Not that it was bad. Rather, it was exactly like the songs we have been hearing every summer for more than 20 years. So, is this me getting old and no longer able to recognize good music, or is there something fishy going on? I realized that the voting process is completely opaque and that nothing guarantees that the IT counts the votes in a fair way. It would be...

## Lost in punctuation

What is the difference between The Shawshank Redemption and Superbabies: Baby Geniuses 2? Besides all other differences, The Shawshank Redemption is the best movie in the world and Superbabies: Baby Geniuses 2 is the worst, according to IMDB users (check a sample scene of Superbabies: Baby Geniuses 2 if you believe that the worst movie of all time is Plan 9 from Outer Space or Manos: the Hands of Fate).

IMDB users not only rank movies, they also write reviews, and this is where things turn really awesome! Give Internet users the space and freedom to express themselves and you get Amazon's Tuscan whole milk or Food Network's late night bacon recipe. By now IMDB reviews have secured their place in the Internet pantheon, as you can check from absolutedreck.com or shittyimdbreviews.tumblr.com. But as far as I am aware, nobody has taken this data seriously and tried to understand what IMDB reviewers have to say. So let's scratch the surface.

I took a random sample of exactly 6,000 titles from the ~200,000 feature films on IMDB. This is less than 3% of the total, but this amount is sufficient to...

## Poetry and optimality

Claude Shannon was a hell of a scientist. His work in the field of information theory (and in particular his famous noisy channel coding theorem) shaped the modern technological landscape, but also gave profound insight into the theory of probability.

In my previous post on statistical independence, I argued that causality is not a statistical concept, because all that matters to statistics is the sampling of events, which may not reflect their occurrence. On the other hand, the concept of information fits gracefully in the general framework of Bayesian probability and gives a key interpretation of statistical independence.

Shannon defines the information of an event $A$ with probability $P(A)$ as $-\log P(A)$. For years, this definition baffled me for its simplicity and its abstruseness. Yet it is actually intuitive. Let us call $\Omega$ the system under study and $\omega$ its state. You can think of $\Omega$ as a set of possible messages and of $\omega$ as the true message transmitted over a channel, or (if you are Bayesian) of $\Omega$ as a parameter set and $\omega$ as the true value of the parameter. We have total information about the system if we know $\omega$. If instead, all...
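To make the definition $-\log P(A)$ concrete, here is a minimal Python sketch. The function name `self_information` and the choice of base 2 (so that the unit is the bit) are assumptions made for illustration; the definition itself does not fix the base of the logarithm.

```python
import math


def self_information(p):
    """Information, in bits, of an event with probability p: -log2(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)


# A certain event carries no information, a rare one carries a lot.
for p in (1.0, 0.5, 1 / 1024):
    print(f"P(A) = {p:.6f}  information = {self_information(p):.1f} bits")

# For two independent events the probabilities multiply,
# so their informations add up.
p_a, p_b = 0.5, 0.25
joint = self_information(p_a * p_b)
separate = self_information(p_a) + self_information(p_b)
print(math.isclose(joint, separate))
```

The last lines hint at why the logarithm is a natural choice: it turns the product of probabilities of independent events into a sum of informations.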

## The fallacy of (in)dependence

In the post Why p-values are crap, I argued that independence is a key assumption of statistical testing and that it almost never holds in practical cases, explaining how p-values can be insanely low even in the absence of any effect. However, I did not explain how to test independence. As a matter of fact, I did not even define independence, because the concept is much more complex than it seems.

Apart from the singular case of Bayes' theorem, which I referred to in my previous post, the many conflicts of probability theory have been settled by axiomatization. Instead of saying what probabilities are, the current definition says what properties they have. Likewise, independence is defined axiomatically by saying that events $A$ and $B$ are independent if $P(A \cap B) = P(A)P(B)$, or in English, if the probability of observing both is the product of their individual probabilities. Not very intuitive, but if we recall that $P(A|B) = P(A \cap B)/P(B)$, we see that an alternative formulation of the independence of $A$ and $B$ is $P(A | B) = P(A)$. In other words, if $A$ and $B$ are independent, observing...
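The two formulations, $P(A \cap B) = P(A)P(B)$ and $P(A|B) = P(A)$, can be checked by brute force on a small example. The sketch below assumes a toy sample space of two fair six-sided dice; the events and helper names (`prob`, `conditional`, `is_independent`) are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: the 36 equally likely outcomes of two fair dice.
OMEGA = list(product(range(1, 7), repeat=2))


def prob(event):
    """Exact probability of an event, given as a predicate on outcomes."""
    return Fraction(sum(1 for w in OMEGA if event(w)), len(OMEGA))


def conditional(a, b):
    """P(A | B) = P(A and B) / P(B); assumes P(B) > 0."""
    return prob(lambda w: a(w) and b(w)) / prob(b)


def is_independent(a, b):
    """Check the axiomatic definition P(A and B) == P(A) * P(B)."""
    return prob(lambda w: a(w) and b(w)) == prob(a) * prob(b)


def first_is_even(w):
    return w[0] % 2 == 0


def second_is_six(w):
    return w[1] == 6


def sum_is_twelve(w):
    return w[0] + w[1] == 12


# The two dice do not influence each other: both formulations agree.
print(is_independent(first_is_even, second_is_six))
print(conditional(first_is_even, second_is_six) == prob(first_is_even))

# A six on the second die changes the odds of a total of twelve.
print(is_independent(second_is_six, sum_is_twelve))
print(conditional(sum_is_twelve, second_is_six), prob(sum_is_twelve))
```

Using `Fraction` keeps the arithmetic exact, so the equality test is not blurred by floating-point rounding. With real data the probabilities would have to be estimated, and the equality could only hold approximately.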