Are you human?

By Guillaume Filion, filed under Python, Information retrieval, movies, IMDB, series: IMDB reviews, crawler.

• 08 July 2012 •

On the Internet, nobody knows you're a dog.

This is the caption of a famous cartoon by Peter Steiner that I reproduced below. The cartoon marked a turning point in how identity is used on the Internet: people realized that you do not have to tell the truth about yourself. The joke pushes this to the limit, as if you did not even have to be human. But is there anything other than humans on the Internet?

Actually, yes. The Internet is full of robots, or web bots. Those robots are not pieces of metal like Robby the Robot. Instead, they are computer scripts that issue network requests and process the responses without human intervention. How much of the world's traffic those web bots represent is hard to estimate, but sources cited on Wikipedia mention that the vast majority of email is spam (usually sent by spambots), so it might be that humans issue a minority of requests on the Internet.

In my previous post I mentioned that computers do not understand humans. For the same reasons, it is sometimes difficult for a server to determine whether it is processing a request issued by a human or by a robot. In the early days of the Internet, even million-dollar sites like Yahoo! and Amazon were not protected against massive floods of malicious requests and were taken down by a 17-year-old schoolboy. CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart) appeared around that time. As the name says, a CAPTCHA is a reverse Turing test. The gist of a reverse Turing test is to process the request only if the client can perform a task that is very easy for a human and very difficult for a computer (like recognizing distorted text, spotting a house in a picture or understanding a simple sentence).

In short, telling robots apart from humans is important for security reasons. Yet most servers do not want to deny access to every robot, because search engines like Google use robots called web spiders or web crawlers to index web pages. The current agreement is that servers should indicate their policy in a page called robots.txt (out of curiosity you can check the robots.txt page of the blog, but it only contains the address of the site map). The content tells the robots which pages they should not request, but does not prevent them from doing so in any way. Not surprisingly, most spammers or petty hackers do not take the time to read the robots.txt page... perhaps some do not even know it exists. So robots could issue a request to every page anyway, right? Well, let's check that out. In the technical section below I show a very simple Python script to get the reviews of user 2467618 on IMDB.

If Python is installed on your computer (which is usually the case on Mac or Linux), you can start a Python session (version 2.x) and use the module urllib as shown.

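# Python 2.x: urllib.urlopen fetches the page with urllib's default headers.
import urllib
content = urllib.urlopen(
    'http://www.imdb.com/user/ur2467618/comments?order=date&start=0'
).read()
# Save the response so that it can be opened in a browser.
with open('downloaded_content.html', 'w') as f:
    f.write(content)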

You can now open the file downloaded_content.html with your favorite browser to see what it contains. The file is written to the directory where you started Python (your home directory if you just opened a terminal).

In case you are not Python-proficient, you can check out what the script retrieves here. Among other things, you will notice that it says "Access denied http: 403". Sure enough, the robots.txt file of IMDB says that requests to /user are disallowed. So how do servers protect themselves from unwanted queries issued by robots?
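As an aside, a well-behaved robot would have read that rule before sending anything. Here is a minimal sketch of such a check, written for Python 3 (unlike the Python 2.x scripts in this post) with the standard module urllib.robotparser. The rules in ROBOTS_TXT are a made-up excerpt that only mirrors the /user rule mentioned above, not the actual IMDB file, and the helper name build_parser is mine.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical excerpt of a robots.txt file. A real robot would download
# the site's actual file instead of hard-coding rules like this.
ROBOTS_TXT = """\
User-agent: *
Disallow: /user
"""

def build_parser(robots_txt):
    """Return a parser loaded with the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)
reviews_url = 'http://www.imdb.com/user/ur2467618/comments?order=date&start=0'
home_url = 'http://www.imdb.com/'

# The reviews page falls under /user, so a polite robot should skip it.
print(parser.can_fetch('*', reviews_url))  # False
# The home page is not covered by any rule in the excerpt.
print(parser.can_fetch('*', home_url))     # True
```

A robot that honors the convention would skip the reviews page, but nothing forces it to.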

There is no universal answer, but in the case of IMDB, as in many others, the answer lies in the HTTP headers. You might notice that the "Access denied" page says at the bottom "Browser: Python-urllib/1.17". The issue here is that by default, urllib is honest about the user agent, which the server can easily spot and reject. If we decide to lie about our user agent and claim that we issue the request through Chrome, we would do as indicated in the following technical part instead.

Here is an improved version of the script, which sets the HTTP headers to pretend that we are using Chrome to request the page.

# Python 2.x
import urllib2
import cookielib
# Keep cookies between requests, as a browser would.
cookies = cookielib.LWPCookieJar()
handlers = [
    urllib2.HTTPHandler(),
    urllib2.HTTPSHandler(),
    urllib2.HTTPCookieProcessor(cookies),
]
opener = urllib2.build_opener(*handlers)
# Header values copied from a request issued by Chrome.
headers = {
    'Accept': 'text/html,application/xhtml+xml,'
              'application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Accept-Language': 'en-US,en;q=0.8,fr;q=0.6',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.19 '
                  '(KHTML, like Gecko) Ubuntu/12.04 '
                  'Chromium/18.0.1025.151 Chrome/18.0.1025.151 '
                  'Safari/535.19',
}
request = urllib2.Request(
    url='http://www.imdb.com/user/ur2467618/comments?order=date&start=0',
    headers=headers
)
connection = opener.open(request)
content = connection.read()
# We advertised gzip in Accept-Encoding and the content comes back
# gzip-compressed, so write the raw bytes in binary mode.
with open('downloaded_content.html.gz', 'wb') as f:
    f.write(content)

As you can check by decompressing the file downloaded_content.html.gz, we get the same content as if we had issued the request from Chrome.
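The decompression can also be done from Python. The sketch below is written for Python 3 (not the Python 2.x of the scripts above) and uses a hypothetical helper, decode_body, which looks at the value of the Content-Encoding response header before decompressing. It only handles gzip and uncompressed bodies. The script above also advertises deflate and sdch, which this sketch deliberately rejects rather than guess at. The page is compressed locally so that the example runs without a network connection.

```python
import gzip

def decode_body(raw, content_encoding):
    """Return the decoded body, given the raw bytes and the
    Content-Encoding header value (None if the header is absent)."""
    encoding = (content_encoding or '').strip().lower()
    if encoding == 'gzip':
        return gzip.decompress(raw)
    if encoding in ('', 'identity'):
        # The server did not compress the body.
        return raw
    raise ValueError('unsupported Content-Encoding: %r' % content_encoding)

# Stand-in for a server response, compressed here instead of downloaded.
page = b'<html><body>Reviews</body></html>'
compressed = gzip.compress(page)

print(decode_body(compressed, 'gzip') == page)  # True
print(decode_body(page, None) == page)          # True
```

With a live Python 3 response object, the header value would come from response.headers.get('Content-Encoding').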

To download the page, we had to set the HTTP headers to their values when we request the page from Chrome, which makes the code substantially more complicated. We can get the values of those headers very easily in Chrome by clicking on the wrench tool in the top-right corner and then choosing Tools > Developer Tools. This displays a console which has a "Network" item where all requests are analyzed and where you can find the values of all the HTTP headers, data and cookies. You can also use Firebug on Firefox, or a sniffer like Wireshark to analyze the traffic to and from your browser.
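Conversely, you can check from Python what a script is about to announce about itself, before any request leaves your machine. This is a minimal sketch for Python 3 (the scripts above are Python 2.x), using only urllib.request. Nothing is sent over the network: the code only inspects the default headers of an opener and the headers attached to a Request object. The User-Agent string is the Chrome one used in the script above.

```python
import urllib.request

# What a plain opener announces by default: a User-agent naming urllib.
default_headers = dict(urllib.request.build_opener().addheaders)
print(default_headers['User-agent'])

# Headers attached to a request we build ourselves (not sent here).
request = urllib.request.Request(
    url='http://www.imdb.com/user/ur2467618/comments?order=date&start=0',
    headers={
        'Accept-Language': 'en-US,en;q=0.8,fr;q=0.6',
        'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.19 '
                      '(KHTML, like Gecko) Ubuntu/12.04 '
                      'Chromium/18.0.1025.151 Chrome/18.0.1025.151 '
                      'Safari/535.19',
    },
)
# urllib normalizes header names, so 'User-Agent' is stored as 'User-agent'.
sent = dict(request.header_items())
print(sent['User-agent'])
```

Comparing this output with what the browser's network panel shows is a quick way to spot a header you forgot to set.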

By setting those headers to the values they have when the request is issued by a human through a browser, it is much harder to recognize that the request actually comes from a script. One thing that most servers still check is the frequency and the regularity of the requests. I don't know any human who would issue over nine thousand requests less than a second apart. Neither do system administrators, and this is why they might block the IP address those requests are issued from. Basically, by masquerading HTTP headers and breaking the regularity patterns in the requests, it becomes very difficult to distinguish humans from robots without CAPTCHAs.
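One simple way to break the regularity is to wait a random amount of time before each request. The sketch below is for Python 3, and it is an illustration rather than the code I actually ran: the function names, the delay bounds (2 to 10 seconds) and the fake fetch function are all assumptions. Slower, irregular requests are also gentler on the server.

```python
import random
import time

def jittered_delays(n_requests, min_delay=2.0, max_delay=10.0, seed=None):
    """Return one random waiting time (in seconds) per request."""
    rng = random.Random(seed)
    return [rng.uniform(min_delay, max_delay) for _ in range(n_requests)]

def fetch_all(urls, fetch, sleep=time.sleep, seed=None):
    """Call fetch(url) for each URL, pausing a random time before each call."""
    results = []
    for url, delay in zip(urls, jittered_delays(len(urls), seed=seed)):
        sleep(delay)
        results.append(fetch(url))
    return results

# Dry run: a fake fetch, and a "sleep" that only records the delays
# instead of waiting, so the example finishes immediately.
waits = []
pages = fetch_all(['page1', 'page2', 'page3'],
                  fetch=str.upper, sleep=waits.append, seed=0)
print(pages)       # ['PAGE1', 'PAGE2', 'PAGE3']
print(len(waits))  # 3
```

In a real run, fetch would be a function that downloads one page and sleep would be left at its default, time.sleep.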

If you ever wondered, the answer is yes: this is what I did to fetch the reviews from IMDB. But then, why would I ever be interested in user 2467618? This is what I will expand on in my next post.
