The Grand Locus / Life for statistical sciences


the Blog

One shade of authorship attribution

This article is neither interesting nor well written.

Everybody in the academia has a story about reviewer 3. If the words above sound familiar, you will definitely know what I mean, but for the others I should give some context. No decent scientific editor will accept to publish an article without taking advice from experts. This process, called peer review, is usually anonymous and opaque. According to an urban legend, reviewer 1 is very positive, reviewer 2 couldn't care less, and reviewer 3 is a pain in the ass. Believe it or not, the quote above is real, and it is all the review consists of. Needless to say, it was from reviewer 3.

For a long time, I wondered whether there is a way to trace the identity of an author through the text of a review. What methods do stylometry experts use to identify passages from the Q source in the Bible, or to know whether William Shakespeare had a ghostwriter?

The 4-gram method

Surprisingly, the best stylistic fingerprints have little to do with literary style. For instance, lexical richness and complexity of the language are very difficult to exploit efficiently. The unconscious foibles...

The geometry of style

This is it! I have been preparing this post for a very long time and I will finally tell you what is so special about IMDB user 2467618, also known as planktonrules. But first, let me take you back where we left off in this post series on IMDB reviews.

In the first post I analyzed the style of IMDB reviews to learn which features best predict the grade given to a movie (a kind of analysis known as feature extraction). Surprisingly, the puncutation and the length of the review are more informative than the vocabulary. Reviews that give a medium mark (i.e. around 5/10) are longer and thus contain more full stops and commas.

Why would reviewers spend more time on a movie rated 5/10 than on a movie rated 10/10? There is at least two possibilities, which are not mutually exclusive. Perhaps the absence of a strong emotional response (good or bad) makes the reviewer more descriptive. Alternatively, the reviewers who give extreme marks may not be the same as those who give medium marks. The underlying question is how much does the style of a single reviewer change with his/her...

Are you human?

On the Internet, nobody knows you're a dog.

This is the text of a famous cartoon by Peter Steiner that I reproduced below. This picture marked a turning point in the use of identity on the Internet, when it was realized that you don't have to tell the truth about yourself. The joke in the cartoon pushes it to the limit, as if you do not even have to be human. But is there anything else than humans on the Internet?

Actually yes. The Internet is full of robots or web bots. Those robots are not pieces of metal like Robby the robot. Instead, they are computer scripts that issue network requests and process the response without human intervention. How much of the world traffic those web bots represent is hard to estimate, but sources cited on Wikipedia mention that the vast majority of email is spam (usually sent by spambots), so it might be that humans issue a minority of requests on the Internet.

In my previous post I mentioned that computers do not understand humans. For the same reasons, it is sometimes difficult for a server to determine whether it is processing a request...

The elements of style

Let us continue this series of posts on IMDB reviews. In the previous post I used mutual information to identify a consistent trend in the reviews: very positive and very negative reviews are shorter than average reviews by about 2 sentences. But how can we give a full description of the style of reviews? And, what is style anyway?

Let's refer to the definition.

style /stīl/: A manner of doing something.

So style covers every feature of the text, from lexical (use of the vocabulary) to semantic (meaning attributed to expressions). The question of style has kept the field of Natural Language Processing (NLP) very busy because this is a strong indicator of the content of a text. What is it about? How reliable is it? Who is the author? However, most of the emphasis is on the syntax, because semantics is still a long and painful way ahead. Alan Turing, by his claim that a machine is able to think if it is able to communicate with humans in their natural languages (the Turing test), sparked a general interest for the question of language in the field of artificial intelligence. A bunch of chatting robots...

Lost in punctuation

What is the difference between The Shawshank Redemption and Superbabies: Baby Geniuses 2? Besides all other differences, The Shawshank Redemption is the best movie in the world and Superbabies: Baby Geniuses 2 is the worst, according to IMDB users (check a sample scene of Superbabies: Baby Geniuses 2 if you believe that the worst movie of all times is Plan 9 from Outer Space or Manos: the Hands of Fate).

IMDB users not only rank movies, they also write reviews and this is where things turn really awesome! Give Internet users the space and freedom to express themselves and you get Amazon's Tuscan whole milk or Food Network's late night bacon recipe. By now IMDB reviews have secured their place in the Internet pantheon as you can check from or But as far as I am aware, nobody has taken this data seriously and try to understand what IMDB reviewers have to say. So let's scratch the surface.

I took a random sample of exactly 6,000 titles from the ~ 200,000 feature films on IMDB. This is less than 3% of the total, but this amount is sufficient to...

the Blog
Best of

the Lab
The team
Research lines
Work with us

Blog roll
Simply stats
Bits of DNA