•

the Blog

## One shade of authorship attribution

By Guillaume Filion, filed under
planktonrules,
Python,
machine learning,
R,
IMDB,
series: IMDB reviews,
automatic authorship attribution.

• 23 March 2013 •

This article is neither interesting nor well written.

Everybody in the academia has a story about *reviewer 3*. If the words above sound familiar, you will definitely know what I mean, but for the others I should give some context. No decent scientific editor will accept to publish an article without taking advice from experts.
This process, called peer review, is usually anonymous and opaque. According to an urban legend, reviewer 1 is very positive, reviewer 2 couldn't care less, and reviewer 3 is a pain in the ass. Believe it or not, the quote above is real, and it is all the review consists of. Needless to say, it was from reviewer 3.

For a long time, I wondered whether there is a way to trace the identity of an author through the text of a review. What methods do stylometry experts use to identify passages from the Q source in the Bible, or to know whether William Shakespeare had a ghostwriter?

### The 4-gram method

Surprisingly, the best stylistic fingerprints have little to do with literary style. For instance, lexical richness and complexity of the language are very difficult to exploit efficiently. The unconscious foibles...

## Focus on: multiple testing

By Guillaume Filion, filed under
multiple testing,
familywise error rate,
p-values,
R,
false discovery rate,
series: focus on,
hypothesis testing.

• 14 September 2012 •

With this post I inaugurate the focus on series, where I go much more in depth than usual. I could as well have called it *the gory details*, but *focus on* sounds more elegant. You might be in for a shock if you are more into easy reading, so the *focus on* is also here as a warning sign so that you can skip the post altogether if you are not interested in the detail. For those who are, this way please...

In my previous post I exposed the multiple testing problem. Every null hypothesis, true or false, has at least a 5% chance of being rejected (assuming you work at 95% confidence level). By testing the same hypothesis several times, you increase the chances that it will be rejected at least once, which introduces a bias because this one time is much more likely to be noticed, and then published. However, being aware of the illusion does not dissipate it. For this you need insight and statistical tools.

### Fail-safe $(n)$ to measure publication bias

Suppose $(n)$ independent research teams test the same null hypothesis, which happens to be true — so not interesting. This means that the...

## Why p-values are crap

By Guillaume Filion, filed under
R,
random walks,
probability,
p-values.

• 03 April 2012 •

I remember my statistics classes as a student. To do a t-test we had to carry out a series of tedious calculations and in the end look up the value in a table. Making those tables cost an enormous amount of sweat from talented statisticians, so you had only three tables, for three significance levels: 5%, 1% and 0.1%. This explains the common way to indicate significance in scientific papers, with one (*), two (**) or three (***) stars. Today, students use computers to do the calcultations so the star notation probably appears as a mysterious folklore and the idea of using a statistical table is properly unthinkable. And this is a good thing because computing those t statistics by hand was a pain. But statistical softwares also paved the way for the invasion of p-values in the scientific literature.

To understand what is wrong with p-values, we will need to go deeper in the theory of statistical testing, so let us review the basic principles. Every statistical test consists of a null hypothesis, a test statistic (a score) and a decision rule — plus the often forgotten alternative hypothesis. A statistical test is an investigation protocol to...

## The Brownian labyrinth

By Guillaume Filion, filed under
stochastic processes,
R,
random walks.

• 28 March 2012 •

Architecture and art show that human culture often uses the same basic shapes. Among them, labyrinth is an outsider for its complexity. Made famous by the Greek myth of Theseus and the Minotaur, labyrinths are found in virtually every culture and every era. The Wikipedia entry of labyrinth shows different designs, but they all have in common the intricate folding of a path onto itself, in such a way that the distance you have to walk inside the labyrinth is much larger than your actual displacement in space.

Fictions of all genres are also fraught with labyrinths. Perhaps one of the most vivid appearance of the labyrinth theme in literature is The Garden of Forking Paths by Borges. In this short story, Borges evokes a perfect labyrinth. Like in gamebooks, this special book that follows every possible ramification of the plot, and not just one. In some passages the hero dies, in some others he lives, in such a way that one can read the novel in infinitely many ways.

An invisible labyrinth of time. To me, a barbarous Englishman, has been entrusted the revelation of this diaphanous mystery. After more than a hundred years, the details are...

## Drunk man walking

By Guillaume Filion, filed under
stochastic processes,
R,
probability,
random walks.

• 15 March 2012 •

Lotteries fascinate the human mind. In the The Lottery in Babylon, Jorge Luis Borges describes a city where the lottery takes a progressively dominant part in people’s life, to the extent that every decision, even life and death, becomes subject to the lottery.

In this story, Borges brings us face to face with the discomfort that the concept of randomness creates in our mind. Paradoxes are like lighthouses, they indicate a dangerous reef, where the human mind can easily slip and fall into madness, but they also show us the way to greater understanding.

One of the oldest paradoxes of probability theory is the so called Saint Petersburg paradox, which has been teasing statisticians since 1713. Imagine I offered you to play the following game: if you toss ‘tails’, you gain $1, and as long as you toss ‘tails’, you double your gains. The first ‘heads’ ends the spree and determines how much you gain. So you could gain $0, $1, $2, $4, $8... with probability 1/2, 1/4, 1/8, 1/16, 1/32 etc. What is the fair price I can ask you to play the Saint Petersburg lottery?

Probability theory says that the...