The Grand Locus / Life for statistical sciences

the Blog

## A tutorial on Burrows-Wheeler indexing methods (3)

This post is part of a series of tutorials on indexing methods based on the Burrows-Wheeler transform. The first part describes the theoretical background, the second part shows a naive C implementation, and this part shows a more advanced implementation with compression.

The code is written in a very naive style, so you should not use it as a reference for good C code. Once again, the purpose is to highlight the mechanisms of the algorithm, disregarding all other considerations. That said, the code runs so it may be used as a skeleton for your own projects.

The code is available for download as a Github gist. As in the second part, I recommend playing with the variables, and debugging it with gdb to see what happens step by step.

### Constructing the suffix array

First you should get familiar with the first two parts of the tutorial in order to follow the logic of the code below. The file learn_bwt_indexing_compression.c does the same thing as in the second part. The input, the output and the logical flow are the same, but the file is different in many details.

We start with the definition of the occ_t...

## A tutorial on Burrows-Wheeler indexing methods (2)

This post is part of a series of tutorials on indexing methods based on the Burrows-Wheeler transform. The first part describes the theoretical background, this part shows a naive C implementation of the example followed through in the first part and the third part shows a more advanced implementation with compression.

It makes little sense to implement a Burrows-Wheeler index in a high level language such as Python or JavaScript because we need tight control of the basic data structures. This is why I chose C. The purpose of this post is not to show how Burrows-Wheeler indexes should be implemented, but to help the reader understand how it works in practice. I tried to make the code as clear as possible, without regard for optimization. It is only a plain, vanilla, implementation.

The code runs, but I doubt that it can be used for anything else than demonstrations. First, it is very naive and hard to scale up. Second, it does not use any compression nor down-sampling, which are the mainsprings of Burrows-Wheeler indexes.

The code is available for download as a Github gist. It is interesting for beginners to play with...

## When scientific models fail

Scientific models are more of an art than a science. It is much easier to recognize a good scientific model than to make one of our own. Like for an art, the best way to learn is to look at the work from the masters and take inspiration from them. One of the crown jewels of modern science is undoubtedly Darwin’s Theory of Evolution. I recently realized that I had no idea how Darwin stood against creationism and how he defended his view in regard of the doxa of his time. Digging into this topic turned out to be one the most important lessons I learned about the scientific method... and the lack of it.

Darwin touches vividly upon creationism at the end of “On the Origin of Species”. In his own words, he claims that

It is so easy to hide our ignorance under such expressions as the ‘plan of creation’, ‘unity of design’, etc, and to think that we give an explanation when we only restate a fact.

What strikes me here is that he does not accuse the ‘plan of creation’ and the ‘unity of design’ of not being proper scientific concepts. The real...

## Against recruitment panel interviews

In my first years as a group leader, I had the chance to interview PhD candidates in panels at international calls for students. I quickly stopped interviewing students, but back then I was very surprised that the top candidates often proved less productive than those we had ranked mediocre. How was this possible at all? Panels are unbiased, they combine multiple expertises, they allow for critical discussion, so they should be able to pick the best candidates... right?

It turns out to be less surprising than I thought. Now a little more familiar with the dangers of panel interviews, I decided to see what our colleagues from the academia have to say about it. This is of course where I should have started before interviewing PhD students, but better late than never. If you haven’t met them, let me introduce you to the flaws of the mighty recruitment panel interview...

### 1. The unproved efficiency of panels

A good place to start. Despite forty years of research, the benefit of recruitment panels over single evaluators is still debated. According to a review published in 2009

Findings to date suggest that panel interviews might not necessarily provide the psychometric...

## Three lessons from peer coaching

For a little more than a year, my colleagues and me have been organizing peer coaching sessions for junior group leaders. A typical session consists of four to six of us, and we meet for one morning to discuss the most pressing issues. After a start-up training and some trial and error, we settled for a group coaching method that gave the best result. To give an idea, the “coachee” tells the chairman what he/she wants to solve, then follows a discussion where he/she explains the facts to the coaches who ask as many question as possible. Then the coaches analyze the situation, suggest solutions and make comments meanwhile the coachee has to remain silent and listen. Finally, the coachee summarizes what he/she heard and what steps he/she will take.

With this exercise, we learned a great deal about how to organize such peer coaching sessions in the academia and how to make the best of them, but this is not what this post is about. Instead, I would like to share more important lessons I have learned about working together and using the group as support and source of motivation.

If you...

## One or two tails?

Here is a discussion that I recently had with my colleague John. He approached me with the following request:

“I sent a manuscript to Nature and it is going quite well. Actually the reviewers are rather positive, but one of them asks us to justify better why we used a one-tailed t test to find the main result. What should I write in the methods section?”

“It depends. Why did you use a one-tailed t test?”

“Well, we first tried the standard t test, but it was borderline significant. My student realized that if we used the one-tailed t test, the result was significant so we settled for this variant. We specified this clearly in the text, and I am now surprised that I have to justify it. Isn’t it just an accepted variant of the t test?”

“To be honest, I understand your confusion. The guidelines are rather ill-defined. Actually, Nature journals make it worse by requesting this information for every test, even for those that are only one-tailed like the chi-square.”

“OK, but what should I do now? For instance, how do you justify using a one-tailed t...

## A tutorial on Burrows-Wheeler indexing methods (1)

This post is part of a series of tutorials on indexing methods based on the Burrows-Wheeler transform. This part describes the theoretical background, the second part shows a naive C implementation of the example below, and the third part shows a more advanced implementation with compression.

There are many resources explaining how the Burrows-Wheeler transform works, but so far I have not found anything explaining what makes it so awesome for indexing and why it is so widely used for short read mapping. I figured I would write such a tutorial for those who are not afraid of the detail.

### The problem

Say we have a sequencing run with over 100 million reads. After processing, the reads are between 20 and 25 nucleotide long. We would like to know if these sequences are in the human genome, and if so where.

The first idea would be to use grep to find out. On my computer, looking for a 20-mer such as ACGTGTGACGTGATCTGAGC takes about 10 seconds. Nice, but querying 100 million sequences would take more than 30 years. Not using any search index, grep needs to scan the whole human genome, and this takes time...

## Did Mendel fake his results?

You went to high school and you learned genetics. You heard about a certain Gregor Mendel who crossed peas and came up with the idea that there is a dominant and a recessive allele. You did not particularly like the guy because there would always be a question about peas with recessive and dominant alleles at the exam. But you grew up, became wiser and just as you started to like him, you heard from someone that he faked his data. You felt disoriented for a while, why annoy you with this stuff at school if it is wrong? But then you came to the conclusion that he just got lucky and that he was right for the wrong reasons. After all, he was just a monk on gardening duties, why would you expect him to understand anything about real science?

### Gregor Mendel

Gregor Mendel was a monk, but he was also a trained scientist. He studied assiduously for twelve years (including about seven years on physics and mathematics), to then become a teacher of physics and natural sciences at the gymnasium of Brno. He prepared his most famous experiment for two years, meticulously checking and choosing his...

## How to teach research?

One of my students once asked me how to become a better researcher. “Great question!” I thought. Instead of a straight and clear answer, I heard myself babbling, as happens when you have no idea what you are talking about. Somewhat disappointed by my non answer, I turned to my colleagues and asked how they teach research. The answer ranged from embarrassed silence to politely switching topic. Strange, I thought, it is our responsibility as principal investigators to educate the junior researchers of our group, so we’d better have some idea of what we are doing.

We have a fairly good idea of how to teach skills, like mathematics, molecular biology etc. But I would argue that research is not a skill. I would argue that it is an attitude. So the real question is how to teach an attitude. There is no foolproof recipe, but here goes a subjective view that I gathered from reading, from discussions and from personal experience.

### 1. Motivate the change

Most students and junior scientists enter the lab with an open mind, ready to learn as many new things as possible. But most of them focus on learning technical skills (at...

## ENCODE data, Principal Components and racism

“Thinking is classifying” wrote Georges Clémenceau*. This tells, in simple words, everything about the obsession of the human mind to keep things tidy. No surprise we ask computers a little help here and there. Is this email spam? Is this online user human? Is this text written by that author? Training machines to put things into the boxes created by our human mind is called supervised learning and it can be very lucrative. But what about the more philosophical cases where machines make their own boxes? Can we reverse the process and put things in boxes created by computers? Unsupervised learning, as it is called, creates a lot of interesting problems where we, humans, are left wondering whether the boxes make any sense.

The mother of all classification techniques is undisputedly Principal Component Analysis (PCA). But let me reassure those who hate PCA and those who never heard of it: I will just touch the surface, and then very briefly. PCA automatically arranges similar items close to each other on a plane. The rest is up to you. Similarity, in particular, depends on a bunch of arbitrary features, size, height, number of legs... In a classical introductory...

« | »