The Grand Locus / Life for statistical sciences

The elements of style

Let us continue this series of posts on IMDB reviews. In the previous post I used mutual information to identify a consistent trend in the reviews: very positive and very negative reviews are shorter than average reviews by about 2 sentences. But how can we give a full description of the style of reviews? And what is style, anyway?

Let's refer to the definition.

style /stīl/: A manner of doing something.

So style covers every feature of the text, from lexical (use of the vocabulary) to semantic (meaning attributed to expressions). The question of style has kept the field of Natural Language Processing (NLP) very busy because style is a strong indicator of the content of a text. What is it about? How reliable is it? Who is the author? However, most of the emphasis is on syntax, because understanding semantics is still a long and painful way ahead. Alan Turing, by his claim that a machine is able to think if it is able to communicate with humans in their natural languages (the Turing test), sparked a general interest in the question of language in the field of artificial intelligence. A bunch of chatting robots come to light every year, but so far their performances are more hilarious than impressive. Still, even if computers do a pretty bad job at understanding a text, they can do a reasonable job at describing and classifying it.

One of the classical approaches of NLP is to identify texts with the points of a geometric space. The text space has such a large dimension that we cannot represent it graphically in our universe, but we can still work with the intuition that two points close to each other represent texts with the same style, whereas two points far apart represent texts with very different styles.

The most immediate way to give a geometric description of a text is to compute the frequencies of its symbols and declare that these numbers are the coordinates of the text. For instance, the text "to be or not to be" would be described by the vector (2/6, 2/6, 1/6, 1/6) in the space defined by the words ("to", "be", "or", "not"). In this representation, the order of the words has disappeared. Only their frequencies matter, which is why this representation is called a bag of words.
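As a minimal sketch of this idea (Python 3, standard library only), the snippet below turns a text into a bag of words and measures how far apart two bags are. Splitting on whitespace and using the Euclidean distance are simplifying assumptions made for the illustration, and the names bag_of_words and distance are hypothetical.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Map each word of the text to its relative frequency."""
    tokens = text.lower().split()  # naive whitespace tokenization
    counts = Counter(tokens)
    return {word: n / len(tokens) for word, n in counts.items()}

def distance(bag_a, bag_b):
    """Euclidean distance between two bags (a missing word counts as 0)."""
    words = set(bag_a) | set(bag_b)
    return math.sqrt(sum((bag_a.get(w, 0.0) - bag_b.get(w, 0.0)) ** 2
                         for w in words))

hamlet = bag_of_words("to be or not to be")
shuffled = bag_of_words("be to not or be to")
other = bag_of_words("to sleep perchance to dream")

print(hamlet)                      # "to" and "be" get 2/6, "or" and "not" get 1/6
print(distance(hamlet, shuffled))  # 0.0, because word order is ignored
print(distance(hamlet, other))     # positive, the two texts differ
```

Shuffling the words leaves the bag unchanged, which is exactly the loss of word order mentioned above.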

In practice, though, the conversion is done in a more subtle way. There are four steps to extract a reasonable bag of words from a text.

1. Tokenize the text
The tokens of a text are the symbolic units that compose it (the words). A text is usually split into sentences before being tokenized into words. In English and other European languages, this step is straightforward (even though expressions such as "Dr. Pepper is sugar-free" are not so easy to tokenize), but in other languages, such as Chinese, the separation between the tokens might be context-dependent, making the task much more difficult.
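To see why "Dr. Pepper is sugar-free" is awkward, here is a deliberately naive tokenizer built on regular expressions (Python 3, standard library only). It is a sketch for illustration, not the tokenizer that NLTK provides, and the names naive_sentences and naive_words are hypothetical.

```python
import re

def naive_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def naive_words(sentence):
    """Extract runs of letters and digits, keeping inner hyphens and apostrophes."""
    return re.findall(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*", sentence)

text = "Dr. Pepper is sugar-free. I like it."
sentences = naive_sentences(text)
print(sentences)                  # the abbreviation "Dr." is wrongly cut off
print(naive_words(sentences[1]))  # "sugar-free" is kept as a single token
```

The period of the abbreviation is mistaken for the end of a sentence, which is the kind of case that a real sentence tokenizer has to handle.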

2. Remove the stop words
Stop words are uninformative and very abundant words of a language. In English, "the", "is", "at" are considered stop words and are simply ignored. Stop words can be either generic or specialized. For example, PubMed provides its own list of stop words, i.e. words that never participate in the search for articles. Among them are "kg", "km" and "thus", which are usually absent from the lists of stop words for plain English. In a general corpus, these words denote a text of scientific nature, but in a scientific corpus, they bear no discriminating information.
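The difference between a generic and a specialized list can be sketched as follows (Python 3, standard library only). The two sets are tiny illustrative lists built from the words quoted above, not the actual NLTK or PubMed lists, and remove_stop_words is a hypothetical helper.

```python
# Tiny illustrative lists, not the actual NLTK or PubMed stop word lists.
GENERIC_STOP_WORDS = {"the", "is", "at"}
SCIENTIFIC_STOP_WORDS = GENERIC_STOP_WORDS | {"kg", "km", "thus"}

def remove_stop_words(tokens, stop_words):
    """Drop every token whose lower-cased form is in stop_words."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["Thus", "the", "sample", "is", "2", "kg"]
generic = remove_stop_words(tokens, GENERIC_STOP_WORDS)
scientific = remove_stop_words(tokens, SCIENTIFIC_STOP_WORDS)
print(generic)     # "Thus" and "kg" survive the generic list
print(scientific)  # only "sample" and "2" survive the specialized list
```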

3. Stem the words
Stemming is the process of reducing words to their root. Many words end differently in different grammatical contexts, and stemming is a way to make sure that the same word is not counted as two different words. For example, in "I say" and "he says", the word "say" is written in two different ways, which is why both forms are reduced to "say" and recognized as a single word. Each language has its own stemming algorithms. In English, Porter's stemming algorithm is the most popular and gives good results.
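The principle can be illustrated with a toy suffix stripper (Python 3, standard library only). To be clear, this is a crude stand-in and not Porter's algorithm, which applies a much longer series of rules. The function toy_stem and its list of suffixes are assumptions made for the example.

```python
def toy_stem(word):
    """Strip one common English suffix, keeping at least 3 letters.

    A crude stand-in for illustration only, not Porter's algorithm.
    """
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for word in ("say", "says", "walk", "walks", "walked", "walking"):
    print(word, "->", toy_stem(word))
```

With this rule "say" and "says" collapse to the same token, and so do the four forms of "walk". A real stemmer is needed as soon as the words get less regular.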

4. Compute tf-idf
Actually, word frequencies are not commonly used to describe a text because they do not take into account the specificities of the corpus. A text is considered a part of a corpus, i.e. the place it was sampled from, from which it inherits some lexical properties. Usually the emphasis is not on the style of a text in absolute terms, but rather on the style of the text relative to the corpus. The empirical score tf-idf stands for "term frequency times inverse document frequency". The first term, "tf", does not need further clarification: it is simply the frequency of the stemmed token in the text. The second term, "idf", is the logarithm of the inverse proportion of documents that contain this term. This is a way to buffer the frequency of frequent terms of the corpus. If the term is present in every text, the logarithm will be 0 and the overall tf-idf score will be 0. Likewise, if the term is frequent in the corpus, the corresponding tf-idf will be low, even in texts that use it often. Conversely, if the term is a hapax (a term that occurs only once in a corpus), it can be considered a distinctive feature of the text and will have a high tf-idf, even if its frequency within the text is minimal.
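The definition above can be sketched in a few lines (Python 3, standard library only). The snippet takes tf as the relative frequency of a token in a document and idf as log(N / n), where N is the number of documents and n the number of documents containing the token. Other variants of tf-idf exist, and the toy corpus of already stemmed tokens is made up for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {token: tf-idf} dict per document (a list of tokens)."""
    n_docs = len(docs)
    # Number of documents in which each token appears at least once.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        scores.append({
            token: (n / len(tokens)) * math.log(n_docs / doc_freq[token])
            for token, n in counts.items()
        })
    return scores

# Made-up corpus of three tiny "reviews", already tokenized and stemmed.
docs = [
    ["good", "movi", "good", "plot"],
    ["bad", "movi"],
    ["movi", "bore", "plot"],
]
scores = tf_idf(docs)
print(scores[0]["movi"])  # 0.0, the token is present in every document
print(scores[0]["plot"])  # small, the token is in 2 documents out of 3
print(scores[1]["bad"])   # largest, the token occurs once in the corpus
```

The token "movi" is in every document so its score vanishes, whereas "bad", which occurs only once in the whole corpus, gets the highest score of the three.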

To process and compare the IMDB reviews I wrote such a pipeline, which I reproduce in the technical section below. In this case I actually used word frequencies instead of tf-idf, because I found the code complex enough as it is.

Here is a Python script that I wrote to parse the HTML content of the reviews. It requires NLTK (Natural Language ToolKit) and BeautifulSoup (an HTML parser). As its imports show, it also uses dateutil and is written for Python 2 and BeautifulSoup 3. This code is very much tailored to work with reviews downloaded from IMDB, after some cleaning of the irrelevant HTML information. I have uploaded a sample of 10 processed review texts here. If you name the script wordbag.py and you have properly installed NLTK and BeautifulSoup, you can then run the script in your terminal as shown below.

python wordbag.py imdb_reviews_sample.txt


Be warned, though, that the script will most likely not run so easily with other inputs. Here it is.

# -*- coding: utf-8 -*-

import sys
import re
import json
import nltk
import datetime as dt
import htmlentitydefs

from BeautifulSoup import BeautifulSoup as bs
from dateutil.parser import parse
from nltk.stem.porter import PorterStemmer

# Get the stop words from NLTK.
stopwords = nltk.corpus.stopwords.words('english')
Porter = PorterStemmer()

def main():
   docs = []
   in_file = sys.argv[1]
   with open(in_file, 'r') as html:
      soup = bs(html.read())
   # Reviews are within a <div> tag. All other <div> have been removed.
   for rev in soup.findAll('div'):
      doc = {}
      movie = rev.find('span', attrs={'class':'movie'}).getText()
      doc['movie'] = unescape(movie)
      where = rev.find('link', attrs={'rel':'canonical'})
      what = re.search('tt\d{7}', dict(where.attrs)['href']).group()
      doc['movieid'] = what
      try:
         p = rev.findAll('p')
         head = p.pop(0) # First <p> tag.
         body = p.pop()  # Last <p> tag. Contains the review text.
      except IndexError:
         # No review for that film.
         docs.append(doc)
         continue
      title = unescape(head.find('b').getText())
      doc["title"] = title
      authref = head.find('a')
      authid = re.search('ur\d{7}', dict(authref.attrs)['href']).group()
      auth = unescape(authref.getText())
      doc['authid'] = authid
      doc['auth'] = auth
      # Define doc '_id' field as the concatenation of movie
      # id and author id (for consistent sorting in CouchDB).
      doc['_id'] = doc['movieid'] + doc['authid']
      use = date = None
      for small in [x.getText() for x in head.findAll('small')]:
         if re.search('found the following review useful', small):
            use = re.search('(\d+) out of (\d+)', small).groups()
            doc['use'] = use
            continue
         else:
            try:
               date = parse(small)
               doc['date'] = date.strftime('%Y-%m-%d')
               break
            except Exception:
               continue
      # The grade is within an <img> tag.
      img = head.find('img')
      grade = dict(img.attrs)['alt'] if img else None
      doc['grade'] = grade
      # Here comes the text of the review. Here we just clean the html.
      bodytext = unescape(unicode(body)).replace('<br />', '\n')
      doc['body'] = bodytext.replace('<p> ', '').replace('</p>', '')
      docs.append(doc)
      # Here we tokenize, lower-case, remove the stop words and stem.
      st = [s.lower() for s in nltk.tokenize.sent_tokenize(doc['body'])]
      words = [Porter.stem(word) for s in st for word \
            in nltk.tokenize.word_tokenize(s) \
            if not word in stopwords]
      # And finally compute the frequencies.
      doc['fdist'] = dict(nltk.FreqDist(words))
   json.dump(docs, sys.stdout, indent=4)

def unescape(text):
   """Copied from http://effbot.org/zone/re-sub.htm#unescape-html"""
   def fixup(m):
      text = m.group(0)
      if text[:2] == '&#':
         # character reference
         try:
            if text[:3] == '&#x':
               return unichr(int(text[3:-1], 16))
            else:
               return unichr(int(text[2:-1]))
         except ValueError:
            pass
      else:
         # named entity
         try:
            text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
         except KeyError:
            pass
      return text # leave as is
   return re.sub('&#?\w+;', fixup, text)

if __name__ == '__main__':
   main()
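The script writes a JSON list with one entry per review, where the field fdist holds the word counts. Entries for films without a review have no such field. As a rough sketch of how this output could be used downstream (Python 3, standard library only), the snippet below adds up the counts over all reviews. The JSON string is a made-up stand-in showing only some of the fields, and total_counts is a hypothetical helper.

```python
import json
from collections import Counter

# Made-up stand-in for the output of the script (only some fields shown).
sample_output = """[
  {"_id": "tt0000001ur0000001", "fdist": {"great": 2, "film": 1}},
  {"_id": "tt0000001ur0000002", "fdist": {"film": 3, "bore": 1}},
  {"movieid": "tt0000002"}
]"""

def total_counts(docs):
    """Sum the per-review word counts, skipping entries without a review."""
    totals = Counter()
    for doc in docs:
        totals.update(doc.get("fdist", {}))
    return totals

docs = json.loads(sample_output)
totals = total_counts(docs)
print(totals.most_common(2))
```

With real data one would replace the string by the content of the file produced by redirecting the output of the script.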