The Grand Locus / Life for statistical sciences

## Did Sweden cheat at Eurovision?

In my previous post, I promised to go deeper into IMDB reviews, but I will defer that until I have dealt with a more pressing issue.

Last Saturday was the Eurovision song contest. Somehow, my girlfriend managed to convince me to sit through it (apologies to my fellow disincarnated academic researchers for such treachery to our quest for knowledge).

Apart from the epic performance of Ireland, already consecrated in the pantheon of memes, the show was plain boring. More specifically, it was redundant. Many songs were duplicates of each other and most were clear wannabes of successful artists (poor Amy, if you had seen what Italy did to you).

So it came as a surprise, a shock I should say, when Sweden won the contest with the song Euphoria. Not that it was bad. Rather, it was exactly like the songs we have been hearing every summer for more than 20 years. So, is this me getting old and no longer being able to recognize good music, or is there something fishy going on? I realized that the voting process is completely opaque and that nothing guarantees that the IT system counts the votes fairly. It would be so easy for the organizers to rig the contest and sell first place to whichever country offers the most. Just one line of code in the script that counts the votes and you're done! Nobody would ever notice!!

This is where I decided to investigate. Following an original idea of my girlfriend's, I set out to download all the tweets of that day (May 26, 2012) mentioning "eurovision". Assuming that the Twitter community is an unbiased sample of the voters, we predicted that Sweden should be mentioned very often, or at least more often than the other countries. I put the code I used to download the tweets in the following technical section.

To run this script, you need to get two Python modules from the Internet.

    git clone git@github.com:sixohsix/twitter
    # Install it if need be.
    git clone git@github.com:ptwobrussell/Mining-the-Social-Web

Twitter has a GET API that is fairly easy to use, so these modules are not strictly required to perform the queries, but they definitely make things easier. Here is the full Python script that I used to download the tweets.

    # -*- coding: utf-8 -*-

    import sys
    import twitter
    import json
    from time import sleep
    from twitter__util import makeTwitterRequest

    # Query terms.
    terms = {
        'domain': 'search.twitter.com',
        'q': 'eurovision',
        'until': '2012-05-27',
        'since': '2012-05-26',
        'rpp': 100,
    }

    # Welcome to Twitter.
    t = twitter.Twitter(domain='search.twitter.com')

    # User may specify a point to resume.
    try:
        max_id = terms['max_id'] = int(sys.argv[1]) - 1
    except IndexError:
        max_id = float("inf")

    out_file = open('eurovision_tweets.txt', 'a')

    n_queries = 0
    while True:
        n_queries += 1
        success = False
        # Try each query up to three times before giving up.
        for retry in range(3):
            try:
                query = makeTwitterRequest(t, t.search, **terms)
                success = True
                break
            except Exception:
                sys.stderr.write('error, retrying (%d)\n' % (retry+1))
                sleep(30)
        if not success:
            # Print the last ID reached so that the download can be resumed.
            sys.stderr.write('max_id: %s\n' % max_id)
            sys.exit(1)
        fetched_tweets = query['results']
        sys.stderr.write('query %d OK fetched %d tweets\n' % \
              (n_queries, len(fetched_tweets)))
        if not fetched_tweets:
            # Got them all!!
            break
        for tweet in fetched_tweets:
            tweet_id = tweet['id']
            # Keep only the fields needed for the analysis.
            mini_tweet = {
                'id': tweet_id,
                'text': tweet['text'],
                'created_at': tweet['created_at'],
            }
            json.dump(fp=out_file, obj=mini_tweet)
            out_file.write(',\n')
            if tweet_id < max_id:
                max_id = tweet_id
        # Next query: only tweets older than the oldest one seen so far.
        terms['max_id'] = max_id - 1
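The script appends one JSON object per line, each followed by a comma, so the output file as a whole is not a valid JSON document. Here is a minimal sketch of how it could be read back, assuming Python 3 and the file name used above. The helper names `parse_tweet_lines` and `load_tweets` are made up for illustration and are not part of the original analysis.

```python
import json


def parse_tweet_lines(lines):
    """Yield one dict per non-empty line.

    Each line is expected to hold a JSON object followed by a comma,
    which is what the download script writes.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.endswith(','):
            # Drop the trailing comma added after each object.
            line = line[:-1]
        yield json.loads(line)


def load_tweets(path='eurovision_tweets.txt'):
    """Read the whole file written by the download script into a list."""
    with open(path, encoding='utf-8') as in_file:
        return list(parse_tweet_lines(in_file))
```

Because `json.dump` escapes newlines inside the tweet text, each object stays on a single line, which is what makes the line-by-line approach workable.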

The hostilities started at 00:00:07 UTC with ...

Eurovision Song Contest 2012: Final börjar på SVT1 klockan 21:00 #SVT1 http://t.co/dfGynpGQ

... and ended at 23:59:59 UTC with

RT @Queen_UK: Is that what they actually look like in Finland? #eurovision

In between, 2,886,410 tweets were posted. How do we go about analyzing them? In principle that's very simple: count all the tweets that mention Sweden and compare that number to the counts for other countries. But there is a 'but'. When it became clear that Sweden had won, there was probably massive tweeting about it, which inflates the count tremendously. So we need to see how much people were tweeting about Sweden before the results.

Let us start by looking at the timing of tweeting activity on Saturday, May 26. Already we realize that the problem is more complicated than it looks. The first tweet of the day shown above (in Swedish) reminds us that Babel is a reality in Europe and that we have to deal with multiple languages. If English speakers were not thrilled by the performance of Sweden, chances are that there will be few tweets containing "Sweden". So we need to make sure that we cover the main languages of Europe. I chose English, German, Spanish and French as my working languages... sorry for the others.

I looked for "Sweden" in the tweets with the following regular expression:

grep -i 'swed\|su[^d]de\|suecia\|schweden'

The other ones, like "Russia" and "Moldova", are easier because they start with "russ" and "mold" in nearly every language (Spanish "Rusia", with a single "s", is an exception that the "russ" pattern misses).
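For readers who prefer to stay in Python, the same matching can be sketched with the `re` module. The patterns below are direct translations of the grep expressions mentioned in the text. The names `COUNTRY_PATTERNS`, `countries_mentioned` and `count_mentions` are invented for this illustration, and this is not the code that produced the numbers in this post.

```python
import re

# Case-insensitive patterns, translated from the grep expressions.
COUNTRY_PATTERNS = {
    'Sweden': re.compile(r'swed|su[^d]de|suecia|schweden', re.IGNORECASE),
    'Russia': re.compile(r'russ', re.IGNORECASE),
    'Moldova': re.compile(r'mold', re.IGNORECASE),
}


def countries_mentioned(text):
    """Return the set of countries whose pattern matches the text."""
    return {name for name, pattern in COUNTRY_PATTERNS.items()
            if pattern.search(text)}


def count_mentions(texts):
    """Count, for each country, the number of texts that mention it."""
    counts = dict.fromkeys(COUNTRY_PATTERNS, 0)
    for text in texts:
        for name in countries_mentioned(text):
            counts[name] += 1
    return counts
```

A tweet that mentions two countries is counted once for each, which mirrors running one grep per country over the same file.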

I plotted the results in the figure below, which says it all. It shows the total number of tweets, which peaked at 25,000 per minute around 19:30 UTC and went down to almost nothing by the end of the day. We can clearly see massive tweeting about a country at the time of its performance. I show here Russia, Sweden and Moldova: Russia because it came second, and Moldova because it was the last act, which tells us when people started to vote, or to tweet about their vote.
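The per-minute totals behind such a plot can be obtained by binning the tweets on their timestamp. The sketch below assumes that `created_at` is an RFC 2822 style string such as `Sat, 26 May 2012 19:30:12 +0000`. That format is an assumption about the data rather than something shown in this post, and `tweets_per_minute` is a hypothetical helper name.

```python
from collections import Counter
from email.utils import parsedate_to_datetime


def tweets_per_minute(tweets):
    """Count tweets per minute, keyed by 'HH:MM'.

    Assumes each tweet is a dict whose 'created_at' field is an
    RFC 2822 style date string (an assumption about the data).
    """
    counts = Counter()
    for tweet in tweets:
        when = parsedate_to_datetime(tweet['created_at'])
        counts[when.strftime('%H:%M')] += 1
    return counts
```

Calling `most_common(1)` on the result gives the busiest minute of the day.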

As you can see, there was more tweeting about Russia during its performance, and the tweeting about Moldova, which finished a mediocre 11th out of 26, is broadly similar to that about Sweden. But if we look at the tweeting activity between 21:00 and 21:30 UTC, we see that Russia and Sweden were higher than Moldova (and other countries, as far as I could check). So this is the time when people expressed their opinion on Twitter. In the end, Sweden scored 12,705 tweets during this half hour, and Russia scored 13,917.
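Counting mentions within the half hour between 21:00 and 21:30 UTC boils down to filtering on the timestamp before matching the text. Here is a minimal sketch, again assuming that `created_at` is an RFC 2822 style string with an explicit UTC offset. The names `count_in_window`, `VOTE_START` and `VOTE_END` are made up for the illustration.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# The half hour discussed in the text.
VOTE_START = datetime(2012, 5, 26, 21, 0, tzinfo=timezone.utc)
VOTE_END = datetime(2012, 5, 26, 21, 30, tzinfo=timezone.utc)


def count_in_window(tweets, pattern, start, end):
    """Count tweets matching `pattern` with start <= created_at < end.

    Assumes 'created_at' looks like 'Sat, 26 May 2012 21:15:00 +0000'
    (an assumption about the data, not shown in the post).
    """
    regex = re.compile(pattern, re.IGNORECASE)
    n = 0
    for tweet in tweets:
        when = parsedate_to_datetime(tweet['created_at'])
        if start <= when < end and regex.search(tweet['text']):
            n += 1
    return n
```

The window is half-open, so a tweet stamped exactly 21:30:00 is left out, which avoids counting it twice if adjacent windows are compared.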

Does this mean that Sweden cheated and that Russia should have won? Probably not. The strange voting system of Eurovision, which gives as many voting points to Luxembourg as to Germany, can easily create an imbalance between the points and the sheer number of voters. Still, it seems that the Russian "babushki" gave an overall stronger impression.
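To see how such an imbalance can arise, here is a toy calculation with entirely invented numbers. It uses a simplified version of the rule in which each country ranks the songs by its own vote count and awards 12, 10, 8, 7, ... 1 points to its top ten. The countries, songs and vote counts are hypothetical and have nothing to do with the 2012 results.

```python
# Points a country awards to its first, second, ... tenth favourite.
POINTS = [12, 10, 8, 7, 6, 5, 4, 3, 2, 1]


def award_points(votes_by_country):
    """Sum the points each song receives when every country ranks the
    songs by its own vote count and awards points on the scale above."""
    totals = {}
    for votes in votes_by_country.values():
        ranking = sorted(votes, key=votes.get, reverse=True)
        for song, points in zip(ranking, POINTS):
            totals[song] = totals.get(song, 0) + points
    return totals


def raw_totals(votes_by_country):
    """Sum the raw votes for each song across all countries."""
    totals = {}
    for votes in votes_by_country.values():
        for song, n in votes.items():
            totals[song] = totals.get(song, 0) + n
    return totals


# Invented example: two small countries prefer song A,
# one large country prefers song B.
toy = {
    'small_1': {'A': 600, 'B': 400},
    'small_2': {'A': 600, 'B': 400},
    'large': {'A': 40000, 'B': 60000},
}
```

With these made-up numbers, `award_points(toy)` gives song A 34 points against 32 for song B, while `raw_totals(toy)` gives B 60,800 votes against 41,200 for A. The song with fewer voters wins on points, which is the kind of discrepancy a raw tweet count cannot rule out.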

In conclusion, Sweden probably did not cheat. It's just that I can't appreciate what's good music when I hear it!