The Grand Locus / Life for statistical sciences

the Blog

## Focus on: the Kullback-Leibler divergence

The story of the Kullback-Leibler divergence starts in a top secret research facility. In 1951, right after the war, Solomon Kullback and Richard Leibler were working as cryptanalysts for what would soon become the National Security Agency. Three years earlier, Claude Shannon had shaken the academic world by formulating the modern theory of information. Kullback and Leibler immediately saw how this could be useful in statistics and they came up with the concept of information for discrimination, now known as relative entropy or Kullback-Leibler divergence.

The concept was introduced in an oringinal article, and later expanded by Kullback in the book Information Theory and Statistics. It has now found applications in most aspects of information technologies, and most prominently artificial neural networks. In this post, I want to give an advanced introduction on this concept, hoping to make it intuitive.

### Discriminating information

The original motivation given by Kullback and Leibler is still the best way to expose the main idea, so let us follow their rationale. Suppose that we hesitate between two competing hypotheses $H_1$ and $H_2$. To make things more concrete, say that we have an encrypted message $x$ that may come from two possible...

## A tutorial on t-SNE (1)

In this tutorial, I would like to explain the basic ideas behind t-distributed Stochastic Neighbor Embedding, better known as t-SNE. There are tons of excellent material out there explaining how t-SNE works. Here, I would like to focus on why it works and what makes t-SNE special among data visualization techniques.

If you are not comfortable with formulas, you should still be able to understand this post, which is intended to be a gentle introduction to t-SNE. The next post will peek under the hood and delve into the mathematics and the technical detail.

### Dimensionality reduction

One thing we all agree on is that we each have a unique personality. And yet it seems that five character traits are sufficient to sketch the psychological portrait of almost everyone. Surely, such portraits are incomplete, but they capture the most important features to describe someone.

The so-called five factor model is a prime example of dimensionality reduction. It represents diverse and complex data with a handful of numbers. The reduced personality model can be used to compare different individuals, give a quick description of someone, find compatible personalities, predict possible behaviors etc. In many...

## A tutorial on Burrows-Wheeler indexing methods (3)

This post is part of a series of tutorials on indexing methods based on the Burrows-Wheeler transform. The first part describes the theoretical background, the second part shows a naive C implementation, and this part shows a more advanced implementation with compression.

The code is written in a very naive style, so you should not use it as a reference for good C code. Once again, the purpose is to highlight the mechanisms of the algorithm, disregarding all other considerations. That said, the code runs so it may be used as a skeleton for your own projects.

The code is available for download as a Github gist. As in the second part, I recommend playing with the variables, and debugging it with gdb to see what happens step by step.

### Constructing the suffix array

First you should get familiar with the first two parts of the tutorial in order to follow the logic of the code below. The file learn_bwt_indexing_compression.c does the same thing as in the second part. The input, the output and the logical flow are the same, but the file is different in many details.

We start with the definition of the occ_t...

## A tutorial on Burrows-Wheeler indexing methods (2)

This post is part of a series of tutorials on indexing methods based on the Burrows-Wheeler transform. The first part describes the theoretical background, this part shows a naive C implementation of the example followed through in the first part and the third part shows a more advanced implementation with compression.

It makes little sense to implement a Burrows-Wheeler index in a high level language such as Python or JavaScript because we need tight control of the basic data structures. This is why I chose C. The purpose of this post is not to show how Burrows-Wheeler indexes should be implemented, but to help the reader understand how it works in practice. I tried to make the code as clear as possible, without regard for optimization. It is only a plain, vanilla, implementation.

The code runs, but I doubt that it can be used for anything else than demonstrations. First, it is very naive and hard to scale up. Second, it does not use any compression nor down-sampling, which are the mainsprings of Burrows-Wheeler indexes.

The code is available for download as a Github gist. It is interesting for beginners to play with...

## A tutorial on Burrows-Wheeler indexing methods (1)

This post is part of a series of tutorials on indexing methods based on the Burrows-Wheeler transform. This part describes the theoretical background, the second part shows a naive C implementation of the example below, and the third part shows a more advanced implementation with compression.

There are many resources explaining how the Burrows-Wheeler transform works, but so far I have not found anything explaining what makes it so awesome for indexing and why it is so widely used for short read mapping. I figured I would write such a tutorial for those who are not afraid of the detail.

### The problem

Say we have a sequencing run with over 100 million reads. After processing, the reads are between 20 and 25 nucleotide long. We would like to know if these sequences are in the human genome, and if so where.

The first idea would be to use grep to find out. On my computer, looking for a 20-mer such as ACGTGTGACGTGATCTGAGC takes about 10 seconds. Nice, but querying 100 million sequences would take more than 30 years. Not using any search index, grep needs to scan the whole human genome, and this takes time...

## Once upon a BLAST

The story of this post begins a few weeks ago when I received a surprising email. I have never read a scientific article giving a credible account of a research process. Only the successful hypotheses and the successful experiments are mentioned in the text — a small minority — and the painful intellectual labor behind discoveries is omitted altogether. Time is precious, and who wants to read endless failure stories? Point well taken. But this unspoken academic pact has sealed what I call the curse of research. In simple words, the curse is that by putting all the emphasis on the results, researchers become blind to the research process because they never discuss it. How to carry out good research? How to discover things? These are the questions that nobody raises (well, almost nobody).

Where did I leave off? Oh, yes... in my mailbox lies an email from David Lipman. For those who don’t know him, David Lipman is the director of the NCBI (the bio-informatics spearhead of the NIH), of which PubMed and GenBank are the most famous children. Incidentally, David is also the creator of BLAST. After a brief exchange on the topic of my previous...

## Focus on: multiple testing

With this post I inaugurate the focus on series, where I go much more in depth than usual. I could as well have called it the gory details, but focus on sounds more elegant. You might be in for a shock if you are more into easy reading, so the focus on is also here as a warning sign so that you can skip the post altogether if you are not interested in the detail. For those who are, this way please...

In my previous post I exposed the multiple testing problem. Every null hypothesis, true or false, has at least a 5% chance of being rejected (assuming you work at 95% confidence level). By testing the same hypothesis several times, you increase the chances that it will be rejected at least once, which introduces a bias because this one time is much more likely to be noticed, and then published. However, being aware of the illusion does not dissipate it. For this you need insight and statistical tools.

### Fail-safe $n$ to measure publication bias

Suppose $n$ independent research teams test the same null hypothesis, which happens to be true — so not interesting. This means that the...