•
the Blog
One shade of authorship attribution
By Guillaume Filion, filed under
planktonrules,
Python,
machine learning,
R,
IMDB,
series: IMDB reviews,
automatic authorship attribution.
• 23 March 2013 •
This article is neither interesting nor well written.
Everybody in the academia has a story about reviewer 3. If the words above sound familiar, you will definitely know what I mean, but for the others I should give some context. No decent scientific editor will accept to publish an article without taking advice from experts. This process, called peer review, is usually anonymous and opaque. According to an urban legend, reviewer 1 is very positive, reviewer 2 couldn't care less, and reviewer 3 is a pain in the ass. Believe it or not, the quote above is real, and it is all the review consists of. Needless to say, it was from reviewer 3.
For a long time, I wondered whether there is a way to trace the identity of an author through the text of a review. What methods do stylometry experts use to identify passages from the Q source in the Bible, or to know whether William Shakespeare had a ghostwriter?
The 4-gram method
Surprisingly, the best stylistic fingerprints have little to do with literary style. For instance, lexical richness and complexity of the language are very difficult to exploit efficiently. The unconscious foibles...
Are you human?
By Guillaume Filion, filed under
Python,
Information retrieval,
movies,
IMDB,
series: IMDB reviews,
crawler.
• 08 July 2012 •
On the Internet, nobody knows you're a dog.
This is the text of a famous cartoon by Peter Steiner that I reproduced below. This picture marked a turning point in the use of identity on the Internet, when it was realized that you don't have to tell the truth about yourself. The joke in the cartoon pushes it to the limit, as if you do not even have to be human. But is there anything else than humans on the Internet?
Actually yes. The Internet is full of robots or web bots. Those robots are not pieces of metal like Robby the robot. Instead, they are computer scripts that issue network requests and process the response without human intervention. How much of the world traffic those web bots represent is hard to estimate, but sources cited on Wikipedia mention that the vast majority of email is spam (usually sent by spambots), so it might be that humans issue a minority of requests on the Internet.
In my previous post I mentioned that computers do not understand humans. For the same reasons, it is sometimes difficult for a server to determine whether it is processing a request...