The Grand Locus / Life for statistical sciences

the Blog

## A tutorial on Burrows-Wheeler indexing methods (3)

This post is part of a series of tutorials on indexing methods based on the Burrows-Wheeler transform. The first part describes the theoretical background, the second part shows a naive C implementation, and this part shows a more advanced implementation with compression.

The code is written in a very naive style, so you should not use it as a reference for good C code. Once again, the purpose is to highlight the mechanisms of the algorithm, disregarding all other considerations. That said, the code runs so it may be used as a skeleton for your own projects.

The code is available for download as a Github gist. As in the second part, I recommend playing with the variables, and debugging it with gdb to see what happens step by step.

### Constructing the suffix array

First you should get familiar with the first two parts of the tutorial in order to follow the logic of the code below. The file learn_bwt_indexing_compression.c does the same thing as in the second part. The input, the output and the logical flow are the same, but the file is different in many details.

We start with the definition of the occ_t...

## A tutorial on Burrows-Wheeler indexing methods (2)

This post is part of a series of tutorials on indexing methods based on the Burrows-Wheeler transform. The first part describes the theoretical background, this part shows a naive C implementation of the example followed through in the first part and the third part shows a more advanced implementation with compression.

It makes little sense to implement a Burrows-Wheeler index in a high level language such as Python or JavaScript because we need tight control of the basic data structures. This is why I chose C. The purpose of this post is not to show how Burrows-Wheeler indexes should be implemented, but to help the reader understand how it works in practice. I tried to make the code as clear as possible, without regard for optimization. It is only a plain, vanilla, implementation.

The code runs, but I doubt that it can be used for anything else than demonstrations. First, it is very naive and hard to scale up. Second, it does not use any compression nor down-sampling, which are the mainsprings of Burrows-Wheeler indexes.

The code is available for download as a Github gist. It is interesting for beginners to play with...

## A tutorial on Burrows-Wheeler indexing methods (1)

This post is part of a series of tutorials on indexing methods based on the Burrows-Wheeler transform. This part describes the theoretical background, the second part shows a naive C implementation of the example below, and the third part shows a more advanced implementation with compression.

There are many resources explaining how the Burrows-Wheeler transform works and how it can be used to index a text, but so far I have not found anything explaining what makes it so awesome. I figured I would write a tutorial like this for those who are not afraid of the detail.

### The problem

Say we have a sequencing run with over 100 million reads. After processing, the reads are between 20 and 25 nucleotide long. We would like to know if these sequences are in the human genome, and if so where.

The first idea would be to use grep to find out. On my computer, looking for a 20-mer such as ACGTGTGACGTGATCTGAGC takes about 10 seconds. Nice, but querying 100 million sequences would take more than 30 years. Not using any search index, grep needs to scan the whole human genome, and this takes time when the...

## Stick breaking and DNA alignment

A couple of months ago, I posted an approximate formula for the longest match in the problem of DNA alignment. I recently used it to calibrate a seeding heuristic to map Illumina reads and I was surprised to see that it was not just bad, but epic bad. Upon closer inspection, I realized that the main assumption does not hold when the error rate is small (which is typically the case for Illumina reads). The formula was based on longest runs in Bernoulli trials. This time I present more accurate results with an approach based on a stick breaking process.

### Stick breaking (spacings)

Inserting $k$ mutations at random in a sequencing read will produce $k+1$ (possibly empty) subsequences without errors. The process is analogous to inserting $k$ breaks at random in a stick of length 1, and we can approximate the distribution of the longest subsequence without error by that of the longest fragment when breaking the stick.

The example above illustrates this concept graphically. A sequencing read of 60 nucleotides contains 2 mutations highlighted in red and the longest error-free stretch is the central subsequence of 28 nucleotides. If mutations occur uniformly on the read...

A brief note published a few weeks ago initiated a discussion on the blogosphere about who is a bioinformatician and who is not. According to Wikipedia

bioinformatics combines computer science, statistics, mathematics, and engineering to study and process biological data.

To see how the community defines itself, I downloaded the abstracts of Bioinformatics from 2014 (for a total around 1000 articles, extracted the most relevant keywords and put the top 100 terms in a word cloud where a word size shows its frequency*.

Obviously, bioinformatics is about data, mostly gene/protein sequences and expression. It is also good to know that bioinformatics likes genomes and networks, and that it has more affinities with structural biology than with evolution.

The favourite organism of bioinformaticians is Homo sapiens, actually it is the only one mentioned in the word cloud, and when bioinformaticians work on a disease, it is cancer.

When bioinformaticians describe their work, it is “novel” and “new”, and what they talk about is “biological”, “different”, “multiple” and “single” (the last two are usually followed by “sequence alignment” and “nucleotide polymorphism”). It is also “functional” and “available”, but somewhat less. I expected it to be “fast” and “accurate”, but...

## Why do bioinformatics?

I never planned to do bioinformatics. It just happened because I liked the time in front of my computer. Still, as every sane individual, I sometimes think that I could do something else with my life, and I wonder whether I am doing the right thing. On this topic, I recently came across the famous farewell to bioinformatics by Frederick J. Ross, which is worth reading, and of which the most emblematic quote is definitely the following.

My attitude towards the subject after all my work in it can probably be best summarized thus: Fuck you, bioinformatics. Eat shit and die.

There is nothing to agree or disagree in this quote, but Frederick gives further detail about his point of view in the post. In short, bioinformaticians are bad programmers, and community-level obfuscation maintains the illusion.

By making the tools unusable, by inventing file format after file format, by seeking out the most brittle techniques and the slowest languages, by not publishing their algorithms and making their results impossible to replicate, the field managed to reduce its productivity by at least 90%, probably closer to 99%.

There are indeed many issues in the bioinformatics community and I...

## If cars were made by bioinformaticians...

### 1. Cars would have nice names

Here is what an abstract describing a car would look like.

Transporting people to defined locations of interest is a challenge of significant economic importance. To achieve this goal, people usually use cars or public transport. However, these solutions are suboptimal in several conditions. For instance, when people are extremely close to their target location, both cars and public transport are inappropriate, which limits their practical use. Here we present CaЯ (vehiCle for chAnging geo-cooЯdinates), a fast and accurate tool as an alternative to existing vehicles.

### 2. Cars would be fast and accurate

Bioinformaticians develop fast and accurate software. Their cars would be just the same. Here is what a typical benchmark sections would look like.

To show that CaЯ is faster and more accurate than existing alternatives, we benchmarked CaЯ against Volvo XC90 and Ferrari F430. In the first series of tests, we measured the time to lower the front windows of the vehicles. The average run duration was 2.3 seconds for CaЯ, 3.1 seconds for Volvo XC90 and 3.9 seconds for Ferrari F430, which demonstrates that CaЯ is substantially faster than Volvo XC90 and Ferrari...

## Longest runs and DNA alignments

Update: The key assumption of the approximation below is that $nq$ is a relatively big number. This happens if the read is very long ($n$ is large) and/or the eror rate is high ($q$ is large). While these assumptions are reasonable with PacBio or Oxford Nanopore technologies, they are not for Illumina reads. In this case, I recommend using the formula based on stick-breaking from this post (it also holds for PacBio and Oxford Nanopore by the way).

The problem of sequence alignment gets a lot of attention from bioinformaticians (the list of alignment software counts more than 200 entries). Yet, the statistical aspect of the problem is often neglected. In the post Once upon a BLAST, David Lipman explained that the breakthrough of BLAST was not a new algorithm, but the careful calibration of a heuristic by a sound statistical framework.

Inspired by this idea, I wanted to work out the probability of identifying best hits in the problem of long read alignments. Since this is a fairly general result and that it may be useful for many similar applications, I post it here for reference.

## Bioinformatics without Excel

About a year after setting up my laboratory, an observation suddenly hit me. All the job applicants were biologists who wanted to do bioinformatics. I was myself trained as an experimental biologist and started bioinformatics during my post-doc. They saw in my laboratory the opportunity to do the same. Indeed, “how did you become a bioinformatician?” is a question that I hear very often.

For lack of a better plan, most people grab a book about Linux or sign up for a Coursera class, try to do a bit every day and... well, just learn bioinformatics. I have seen extremely few people succeed this way. The content inevitably becomes too difficult, motivation decreases and other commitments take over. I will not lie, self-learning bioinformatcs is hard and it is frustrating... but it can be fun if you know how to do it. And most importantly, if you understand your worst enemy: yourself.

Here is a small digest of how it happened for me. I do not mean that this is the only way. I simply hope that this will be useful to those who seriously want to dive into bioinformatics.

## Once upon a BLAST

The story of this post begins a few weeks ago when I received a surprising email. I have never read a scientific article giving a credible account of a research process. Only the successful hypotheses and the successful experiments are mentioned in the text — a small minority — and the painful intellectual labor behind discoveries is omitted altogether. Time is precious, and who wants to read endless failure stories? Point well taken. But this unspoken academic pact has sealed what I call the curse of research. In simple words, the curse is that by putting all the emphasis on the results, researchers become blind to the research process because they never discuss it. How to carry out good research? How to discover things? These are the questions that nobody raises (well, almost nobody).

Where did I leave off? Oh, yes... in my mailbox lies an email from David Lipman. For those who don’t know him, David Lipman is the director of the NCBI (the bio-informatics spearhead of the NIH), of which PubMed and GenBank are the most famous children. Incidentally, David is also the creator of BLAST. After a brief exchange on the topic of my previous...