The Grand Locus / Life for statistical sciences



Lost in punctuation

What is the difference between The Shawshank Redemption and Superbabies: Baby Geniuses 2? Besides all other differences, The Shawshank Redemption is the best movie in the world and Superbabies: Baby Geniuses 2 is the worst, according to IMDB users (check a sample scene of Superbabies: Baby Geniuses 2 if you believe that the worst movie of all time is Plan 9 from Outer Space or Manos: the Hands of Fate).

IMDB users not only rank movies, they also write reviews, and this is where things turn really awesome! Give Internet users the space and freedom to express themselves and you get Amazon's Tuscan whole milk or Food Network's late night bacon recipe. By now IMDB reviews have secured their place in the Internet pantheon. But as far as I am aware, nobody has taken this data seriously and tried to understand what IMDB reviewers have to say. So let's scratch the surface.

I took a random sample of exactly 6,000 titles from the ~200,000 feature films on IMDB. This is less than 3% of the total, but this amount is sufficient to capture the major trends. I will not describe the crawler that I used to download the reviews. Suffice it to say that it is too dirty to be proud of, and that I may post about the art of crawling in the future. Two thirds of the movies did not have any review, but I still gathered a total of 37,461 reviews for 1,916 films. That makes a corpus about 8 times the size of the full Harry Potter series.

How do you get the number of characters in the full Harry Potter series? First, you need to get the PDF files from somewhere. My suggestion is TPB. Then you need to scrape off all the useless PDF markup and keep only the text, which is done by pdftotext, a nice Linux utility. Then you just count the characters with the good old word count utility wc. In a Linux terminal it looks like this:

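cd path/to/Harry_Potter_series/
# pdftotext converts one file per call, so loop over the PDFs.
for f in *.pdf; do pdftotext "$f"; done
# Print line, word and byte counts for each text file.
wc *.txt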

And here is the output:

   6402   78439  454146 The Sorcerer's Stone.txt
   6971   86256  509034 The Chamber of Secrets.txt
   8926  108050  644246 The Prisoner of Azkaban.txt
  14991  192569 1132498 The Goblet of Fire.txt
  20732  259237 1535991 The Order of the Phoenix.txt
  13560  170482 1014375 The Half-Blood Prince.txt
  17631  200290 1159420 The Deathly Hallows.txt
  89213 1095323 6449710 total

That is a total of about 1.1 million words and 6.4 million characters.

Reading the reviews one by one does not go very far, but it is still the best way to discover gems like this review of Airport, which says:

If you like the 707, Boeing, or airplanes in general, it's an excellent movie. A must-see!

At that point, I wanted to understand what makes a good movie good in the opinion of reviewers. The music? The acting? The script? By knowing what reviewers talk about in movies they rank high, I should be able to find out. This is where I turned to information theory, which I introduced in my previous post. There I explained that independence of two events or random variables realizes an optimum which guarantees maximum expected information, which means maximum knowledge of the system under study. But what happens if those variables are not independent? In that case, the shortage of information quantifies the redundancy between the variables and is called the mutual information. In that sense, there is no information loss when variables are dependent. The state of the system itself is less well known, but the knowledge of a variable says something about the others.

The mutual information score is computed as the difference of expected information with and without the assumption of independence. For two variables it is $\sum_{x,y} P(x,y) \log \frac{P(x,y)}{P(x)P(y)}$, where $x$ and $y$ run through all possible values. For each word of the corpus, by taking as $x$ its frequency in a review and as $y$ the score given to the movie, I could compute how informative a word is.
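As an illustration, here is a minimal Python sketch of that computation. It is not the code used for this post: the function name `mutual_information` and the toy data are made up for the example, and it assumes the standard definition in which each term of the sum is weighted by $P(x,y)$, with all probabilities estimated from raw counts.

```python
from collections import Counter
from math import log2


def mutual_information(pairs):
    """Empirical mutual information, in bits, of a list of (x, y) pairs.

    Hypothetical helper for illustration: x could be the number of times
    a word appears in a review, y the grade given by the reviewer.
    """
    n = len(pairs)
    joint = Counter(pairs)                 # counts of each (x, y)
    px = Counter(x for x, _ in pairs)      # marginal counts of x
    py = Counter(y for _, y in pairs)      # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # P(x,y) * log2(P(x,y) / (P(x) P(y))), written with raw counts.
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi


# Toy data (invented): the word count fully determines the grade.
dependent = [(1, 1), (1, 1), (0, 10), (0, 10)]
# Toy data (invented): the word count says nothing about the grade.
independent = [(0, 1), (0, 10), (1, 1), (1, 10)]

print(mutual_information(dependent))
print(mutual_information(independent))
```

With independent variables the score is zero, and it grows as knowing one variable tells more about the other. Ranking words by this score is one way to obtain a list of "most informative" words.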

The most informative word in that sense turned out to be "worst", which is used 30 times more often in reviews with a grade of 1 than in reviews with a grade of 10, followed by "bad", "waste", "great", "horrible", "boring", "pretty", "best", "awesome", "terrible"... Quite a bummer. It seems reviewers express more vividly that they like or dislike a movie than why they do so.
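A ratio such as "30 times more often" can be obtained by comparing how often a word occurs per token in two groups of reviews. The sketch below shows one plausible way to do it. The helper `word_rate`, the simple tokenizer and the example sentences are all assumptions made for illustration, not the method actually used here.

```python
import re
from collections import Counter


def word_rate(texts, word):
    """Occurrences of `word` per token over a list of review texts.

    Hypothetical helper: tokens are lowercase runs of letters and
    apostrophes, which is a deliberately crude tokenizer.
    """
    tokens = [t for text in texts for t in re.findall(r"[a-z']+", text.lower())]
    return Counter(tokens)[word] / len(tokens) if tokens else 0.0


# Invented examples standing in for reviews with grade 1 and grade 10.
grade_1 = ["The worst movie ever", "Worst acting, worst script"]
grade_10 = ["Not the worst, actually great", "Great great movie"]

ratio = word_rate(grade_1, "worst") / word_rate(grade_10, "worst")
print(ratio)
```

Normalizing by the number of tokens matters because the groups do not contain the same amount of text. Raw counts alone would mostly reflect which grades are most common.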

But the situation did not turn out hopeless. The most surprising outcome was that punctuation is very informative. The question mark is used 3 times more frequently in reviews with a grade of 1 than in reviews with a grade of 10, suggesting that reviewers use questions in a negative, aggressive way. Commas are even more informative. Interestingly, they show a hump pattern: reviews associated with grades 5, 6 and 7 have on average 12 commas, whereas reviews associated with grades 1, 2 and 10 have on average fewer than 10. The same hump pattern appears for full stops. The conclusion is that when reviewers have a strong feeling for a movie (positive or negative), they keep it short, but when they have mixed feelings, they are more verbose.
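The hump pattern comes from a simple per-grade average, which can be sketched as follows. The function `mean_count_by_grade` and the toy reviews are hypothetical and only meant to show the shape of the computation on (grade, text) pairs.

```python
from collections import defaultdict


def mean_count_by_grade(reviews, mark=","):
    """Average number of occurrences of `mark` per review, for each grade.

    `reviews` is an iterable of (grade, text) pairs (assumed format).
    """
    totals = defaultdict(int)   # total occurrences of the mark per grade
    counts = defaultdict(int)   # number of reviews per grade
    for grade, text in reviews:
        totals[grade] += text.count(mark)
        counts[grade] += 1
    return {g: totals[g] / counts[g] for g in sorted(totals)}


# Invented toy reviews, not taken from the corpus.
toy = [
    (1, "Awful. Do not watch."),
    (6, "Decent acting, weak script, nice music, but too long."),
    (6, "Fine, I guess."),
    (10, "A masterpiece!"),
]

print(mean_count_by_grade(toy))
```

Passing `mark="?"` or `mark="."` gives the same kind of table for question marks and full stops.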

What is going on? Do reviewers feel that they have to apologize at length for giving a 5 (but not for a 1 or 10)? Or perhaps reviewers change style depending on how much they liked the movie. When the movie makes them emotional, in a positive or negative way, they would write short and snappy reviews. And when it makes them analytical and doubtful, they would dissect all the aspects of the movie at length. Or perhaps these are simply not the same people. Some reviewers go for an all-or-none approach, whereas others, more critical, tend to give average grades but tell more about the movie.

How to distinguish the two? I thought it was impossible, until by accident I discovered something extraordinary about one of the reviewers of my sample. But this is a story I will tell in my next post...
