The Grand Locus / Life for statistical sciences

the Blog

## Is there a gene for alcoholism? (1)

This is usually the next thing I hear when I say that I am a geneticist. Behind this question and its variants lies a profound and natural interrogation, which could be phrased as "how much of me is the product of my genes?" I have made a habit of not answering that question and instead highlighting its inanity by lecturing people about genetics. So, for once, and exclusively on my blog, here is the tl;dr answer: no, there is not. Now comes the lecture about genetics.

I will start with mental retardation (unrelated to my opinion of those claims, really) and more precisely with the fragile X syndrome. James Watson, the co-discoverer of the structure of DNA and the pioneer of the Human Genome Project, declared:

I think it was the first triumph of the Human Genome Project. With fragile X we've got just one protein missing, so it's a simple problem. So, you know, if I were going to work on something with the thought that I were going to solve it, oh boy, I'd work on fragile X.

In other words, there seems to be a gene for mental retardation. The incidence...

## The chaos and the doubt

Probability is said to be born of the correspondence between Pierre de Fermat and Blaise Pascal, some time in the middle of the 17th century. Somewhat surprisingly, many texts retrace the history of the concept up until the 20th century; yet it has gone through major transformations since then. Probability always describes what we don't know about the world, but the focus has shifted from the world to what we don't know.

Henri Poincaré investigates in Science et Méthode (1908) why chance would ever happen in a deterministic world. Like most of his contemporaries, Poincaré believed in absolute determinism: there is no phenomenon without a cause, even though our limited minds may fail to understand or see it. He distinguishes two flavors of randomness, of which he gives examples.

If a cone stands on its point we know that it will fall but we do not know which way (...) A very small cause, which escapes us, determines a considerable effect that we can not but see, and then we say that this effect is due to chance.

And a little bit later he continues.

How do we represent a container filled with gas? Countless molecules...

## The autistic computer

I was the shadow of the waxwing slain
By the false azure in the windowpane

What did Vladimir Nabokov see in the first verses of Pale Fire? Was it "weathered wood" or "polished ebony"? As a synesthete, his perception of words, letters and numbers was always tinted with a certain color. Synesthesia, the leak of a sensation into another, is a relatively rare condition. It was known to be more frequent among artists, such as the composer Alexander Scriabin or the painter David Hockney, but it turns out that it might also be frequent among autists. This might even be the reason that some of them have a savant syndrome (a phenomenon first popularized by the movie Rain Man).

One of those autistic savants, Daniel Tammet, explains in the video below how he sees the world and how this allows him to carry out extraordinary intellectual tasks.

In his talk, Daniel Tammet explains how he performs a multiplication by analogical thinking. Because he sees a pattern in the numbers, he gives the problem another interpretation, another meaning, where the solution is effortless. This would happen at the level of the semantic representation (i.e. when the brain deciphers...

## The geometry of style

This is it! I have been preparing this post for a very long time and I will finally tell you what is so special about IMDB user 2467618, also known as planktonrules. But first, let me take you back where we left off in this post series on IMDB reviews.

In the first post I analyzed the style of IMDB reviews to learn which features best predict the grade given to a movie (a kind of analysis known as feature extraction). Surprisingly, the punctuation and the length of the review are more informative than the vocabulary. Reviews that give a medium mark (i.e. around 5/10) are longer and thus contain more full stops and commas.

Why would reviewers spend more time on a movie rated 5/10 than on a movie rated 10/10? There are at least two possibilities, which are not mutually exclusive. Perhaps the absence of a strong emotional response (good or bad) makes the reviewer more descriptive. Alternatively, the reviewers who give extreme marks may not be the same as those who give medium marks. The underlying question is how much does the style of a single reviewer change with his/her...

## Are you human?

On the Internet, nobody knows you're a dog.

This is the text of a famous cartoon by Peter Steiner that I reproduced below. This picture marked a turning point in the use of identity on the Internet, when it was realized that you don't have to tell the truth about yourself. The joke in the cartoon pushes it to the limit, as if you do not even have to be human. But is there anything other than humans on the Internet?

Actually yes. The Internet is full of robots or web bots. Those robots are not pieces of metal like Robby the robot. Instead, they are computer scripts that issue network requests and process the response without human intervention. How much of the world traffic those web bots represent is hard to estimate, but sources cited on Wikipedia mention that the vast majority of email is spam (usually sent by spambots), so it might be that humans issue a minority of requests on the Internet.

In my previous post I mentioned that computers do not understand humans. For the same reasons, it is sometimes difficult for a server to determine whether it is processing a request...

## The elements of style

Let us continue this series of posts on IMDB reviews. In the previous post I used mutual information to identify a consistent trend in the reviews: very positive and very negative reviews are shorter than average reviews by about 2 sentences. But how can we give a full description of the style of reviews? And, what is style anyway?

Let's refer to the definition.

style /stīl/: A manner of doing something.

So style covers every feature of the text, from lexical (use of the vocabulary) to semantic (meaning attributed to expressions). The question of style has kept the field of Natural Language Processing (NLP) very busy because style is a strong indicator of the content of a text. What is it about? How reliable is it? Who is the author? However, most of the emphasis is on the syntax, because semantics is still a long and painful way ahead. Alan Turing, by his claim that a machine is able to think if it is able to communicate with humans in their natural languages (the Turing test), sparked a general interest in the question of language in the field of artificial intelligence. A bunch of chatting robots...

## Did Sweden cheat at Eurovision?

In my previous post, I promised to go deeper into IMDB reviews, but I defer it until I deal with a more pressing issue.

Last Saturday was the Eurovision song contest. Somehow, my girlfriend managed to convince me to sit through it (apologies to my fellow disincarnated academic researchers for such a treachery to our quest for knowledge).

Outside the epic performance of Ireland, already consecrated in the pantheon of memes, the show was plain boring. More specifically, it was redundant. Many songs were duplicates of each other and most were clear wannabes of successful artists (poor Amy, if you had seen what Italy did to you).

It was such a surprise, a shock I should say, that Sweden won the contest with the song Euphoria. Not that it was bad. Rather, that it was exactly like the songs we have been hearing every summer for more than 20 years. So, is this me getting old and not being able to recognize what's good music, or is there something fishy going on? I realized that the voting process is completely opaque and that nothing says that the IT counts the votes in a fair way. It would be...

## Lost in punctuation

What is the difference between The Shawshank Redemption and Superbabies: Baby Geniuses 2? Besides all other differences, The Shawshank Redemption is the best movie in the world and Superbabies: Baby Geniuses 2 is the worst, according to IMDB users (check a sample scene of Superbabies: Baby Geniuses 2 if you believe that the worst movie of all time is Plan 9 from Outer Space or Manos: the Hands of Fate).

IMDB users not only rank movies, they also write reviews and this is where things turn really awesome! Give Internet users the space and freedom to express themselves and you get Amazon's Tuscan whole milk or Food Network's late night bacon recipe. By now IMDB reviews have secured their place in the Internet pantheon as you can check from absolutedreck.com or shittyimdbreviews.tumblr.com. But as far as I am aware, nobody has taken this data seriously and tried to understand what IMDB reviewers have to say. So let's scratch the surface.

I took a random sample of exactly 6,000 titles from the ~ 200,000 feature films on IMDB. This is less than 3% of the total, but this amount is sufficient to...

## Poetry and optimality

Claude Shannon was one hell of a scientist. His work in the field of information theory (and in particular his famous noisy channel coding theorem) shaped the modern technological landscape, but also gave profound insight into the theory of probabilities.

In my previous post on statistical independence, I argued that causality is not a statistical concept, because all that matters to statistics is the sampling of events, which may not reflect their occurrence. On the other hand, the concept of information fits gracefully in the general framework of Bayesian probability and gives a key interpretation of statistical independence.

Shannon defines the information of an event with probability $P(A)$ as $-\log P(A)$. For years, this definition baffled me for its simplicity and its abstruseness. Yet it is actually intuitive. Let us call $\Omega$ the system under study and $\omega$ its state. You can think of $\Omega$ as a set of possible messages and of $\omega$ as the true message transmitted over a channel, or (if you are Bayesian) of $\Omega$ as a parameter set and $\omega$ as the true value of the parameter. We have total information about the system if we know $\omega$. If instead, all...
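As a rough illustration of the definition $-\log P(A)$ quoted above, here is a minimal Python sketch. The function name `information` and the choice of base 2 (which gives bits) are assumptions made for this example; the excerpt does not fix the base of the logarithm.

```python
import math


def information(p: float, base: float = 2.0) -> float:
    """Information -log(p) of an event with probability p.

    The base is an assumption of this sketch: base 2 gives bits.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # -log(p) is the same as log(1 / p)
    return math.log(1.0 / p, base)


coin_flip = information(1 / 2)   # a fair coin landing heads
die_face = information(1 / 6)    # a fair die showing a given face
certainty = information(1.0)     # an event that always happens

# Independent events multiply their probabilities, so their information adds up.
both = information((1 / 2) * (1 / 6))
```

The rarer the event, the more information its observation carries, and an event that is certain carries none. The logarithm is what turns the product of probabilities of independent events into a sum of information.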

## The fallacy of (in)dependence

In the post Why p-values are crap I argued that independence is a key assumption of statistical testing and that it almost never holds in practical cases, explaining how p-values can be insanely low even in the absence of effect. However, I did not explain how to test independence. As a matter of fact I did not even define independence because the concept is much more complex than it seems.

Apart from the singular case of Bayes' theorem, which I referred to in my previous post, the many conflicts of probability theory have been settled by axiomatization. Instead of saying what probabilities are, the current definition says what properties they have. Likewise, independence is defined axiomatically by saying that events $A$ and $B$ are independent if $P(A \cap B) = P(A)P(B)$, or in English, if the probability of observing both is the product of their individual probabilities. Not very intuitive, but if we recall that $P(A|B) = P(A \cap B)/P(B)$, we see that an alternative formulation of the independence of $A$ and $B$ is $P(A | B) = P(A)$. In other words, if $A$ and $B$ are independent, observing...
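The product definition above can be checked by brute force on a small sample space. The sketch below assumes a toy example that is not in the original post: two fair dice, with events written as predicates on the pair of faces. The names `OMEGA`, `prob` and `independent` are hypothetical helpers, and exact fractions are used so that the equality test is not blurred by rounding.

```python
from fractions import Fraction
from itertools import product

# Assumed toy sample space: the 36 equally likely outcomes of two fair dice.
OMEGA = list(product(range(1, 7), repeat=2))


def prob(event) -> Fraction:
    """Exact probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for w in OMEGA if event(w)), len(OMEGA))


def independent(a, b) -> bool:
    """Check the definition P(A and B) == P(A) * P(B)."""
    return prob(lambda w: a(w) and b(w)) == prob(a) * prob(b)


def first_is_six(w):
    return w[0] == 6


def second_is_even(w):
    return w[1] % 2 == 0


def sum_is_twelve(w):
    return w[0] + w[1] == 12


# The two dice do not influence each other: 1/12 == 1/6 * 1/2.
dice_are_independent = independent(first_is_six, second_is_even)

# A double six requires a six on the first die: 1/36 != 1/6 * 1/36.
six_and_twelve_independent = independent(first_is_six, sum_is_twelve)

# Alternative formulation: P(A | B) == P(A) when A and B are independent.
p_a_given_b = prob(lambda w: first_is_six(w) and second_is_even(w)) / prob(second_is_even)
```

Here `dice_are_independent` is true and `six_and_twelve_independent` is false, and `p_a_given_b` equals `prob(first_is_six)`, which is the conditional formulation of the same statement.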
