The Grand Locus / Life for statistical sciences



the Blog

Meet planktonrules

Some of you may remember planktonrules from my series on IMDB reviews. For those who missed it, planktonrules is an outlier. In my attempt to understand what IMDB reviewers call a good movie, I realized that one reviewer in particular had written a lot of reviews. When I say a lot, I mean 14,800 in the last 8 years. With such a trove, I could not resist the temptation to use his reviews to analyze the variation in style between users, and to build a classifier that recognizes his reviews.

I finally got in contact with Martin Hafer (planktonrules' real name) this year, and since he had planned to visit Barcelona, we set up a meeting in June. I have to admit that I expected him to be a sort of weirdo, or a cloistered sociopath. The reality turned out to be much more pleasant; we had an entertaining chat, speaking very little about movie reviews. He also pointed out to me that doing statistics on what people write on the Internet is a bit weird... True that.

Anyway, as an introduction, here is a mini interview with planktonrules. You can find out more...






One shade of authorship attribution

This article is neither interesting nor well written.

Everybody in academia has a story about reviewer 3. If the words above sound familiar, you will definitely know what I mean, but for the others I should give some context. No decent scientific editor will agree to publish an article without taking advice from experts. This process, called peer review, is usually anonymous and opaque. According to an urban legend, reviewer 1 is very positive, reviewer 2 couldn't care less, and reviewer 3 is a pain in the ass. Believe it or not, the quote above is real, and it constitutes the entire review. Needless to say, it was from reviewer 3.

For a long time, I wondered whether there is a way to trace the identity of an author through the text of a review. What methods do stylometry experts use to identify passages from the Q source in the Bible, or to know whether William Shakespeare had a ghostwriter?

The 4-gram method

Surprisingly, the best stylistic fingerprints have little to do with literary style. For instance, lexical richness and complexity of the language are very difficult to exploit efficiently. The unconscious foibles...
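To give a rough idea of what this looks like in practice, here is a minimal sketch of counting character 4-grams, assuming the 4-grams in question are overlapping windows of four characters (the standard setup in stylometry); the function names and the sample text are mine, not part of the original analysis.

```python
from collections import Counter

def char_ngrams(text, n=4):
    """Slide an overlapping window of n characters over the text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def profile(text, n=4, top=5):
    """Relative frequencies of the most common character n-grams.
    Profiles like this one can be compared between authors."""
    counts = Counter(char_ngrams(text, n))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.most_common(top)}

sample = "This article is neither interesting nor well written."
print(profile(sample))
```

Comparing such profiles between a disputed text and reference texts from candidate authors is the core of the approach; the details of the distance measure vary between studies.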






The geometry of style

This is it! I have been preparing this post for a very long time and I will finally tell you what is so special about IMDB user 2467618, also known as planktonrules. But first, let me take you back to where we left off in this post series on IMDB reviews.

In the first post I analyzed the style of IMDB reviews to learn which features best predict the grade given to a movie (a kind of analysis known as feature extraction). Surprisingly, the punctuation and the length of the review are more informative than the vocabulary. Reviews that give a medium mark (i.e. around 5/10) are longer and thus contain more full stops and commas.
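As a toy illustration of the kind of surface features involved, one could extract review length and punctuation counts like this (a minimal sketch with hypothetical names; the actual feature set used in the analysis was richer):

```python
def review_features(review):
    """Surface features of a review: word count plus
    full-stop and comma counts, as discussed above."""
    return {
        "length": len(review.split()),
        "full_stops": review.count("."),
        "commas": review.count(","),
    }

print(review_features("Great movie. Really, really great."))
# {'length': 5, 'full_stops': 2, 'commas': 1}
```

Features this shallow are easy to compute at scale, which matters when the corpus contains tens of thousands of reviews.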

Why would reviewers spend more time on a movie rated 5/10 than on a movie rated 10/10? There are at least two possibilities, which are not mutually exclusive. Perhaps the absence of a strong emotional response (good or bad) makes the reviewer more descriptive. Alternatively, the reviewers who give extreme marks may not be the same as those who give medium marks. The underlying question is how much the style of a single reviewer changes with his/her...





