•
the Blog
A tutorial on Burrows-Wheeler indexing methods (3)
By Guillaume Filion, filed under
series: focus on,
suffix array,
Burrows-Wheeler transform,
full text indexing,
bioinformatics.
• 14 May 2017 •
The code is written in a very naive style, so you should not use it as a reference for good C code. Once again, the purpose is to highlight the mechanisms of the algorithm, disregarding all other considerations. That said, the code runs so it may be used as a skeleton for your own projects.
The code is available for download as a Github gist. As in the second part, I recommend playing with the variables, and debugging it with gdb to see what happens step by step.
Constructing the suffix array
First you should get familiar with the first two parts of the tutorial in order to follow the logic of the code below. The file learn_bwt_indexing_compression.c
does the same thing as in the second part. The input, the output and the logical flow are the same, but the file is different in many details.
We start with the definition of the occ_t...
A tutorial on Burrows-Wheeler indexing methods (2)
By Guillaume Filion, filed under
series: focus on,
suffix array,
Burrows-Wheeler transform,
full text indexing,
bioinformatics.
• 07 May 2017 •
It makes little sense to implement a Burrows-Wheeler index in a high level language such as Python or JavaScript because we need tight control of the basic data structures. This is why I chose C. The purpose of this post is not to show how Burrows-Wheeler indexes should be implemented, but to help the reader understand how it works in practice. I tried to make the code as clear as possible, without regard for optimization. It is only a plain, vanilla, implementation.
The code runs, but I doubt that it can be used for anything else than demonstrations. First, it is very naive and hard to scale up. Second, it does not use any compression nor down-sampling, which are the mainsprings of Burrows-Wheeler indexes.
The code is available for download as a Github gist. It is interesting for beginners to play with...
A tutorial on Burrows-Wheeler indexing methods (1)
By Guillaume Filion, filed under
Burrows-Wheeler transform,
suffix array,
series: focus on,
full text indexing,
bioinformatics.
• 04 July 2016 •
There are many resources explaining how the Burrows-Wheeler transform works, but so far I have not found anything explaining what makes it so awesome for indexing and why it is so widely used for short read mapping. I figured I would write such a tutorial for those who are not afraid of the detail.
The problem
Say we have a sequencing run with over 100 million reads. After processing, the reads are between 20 and 25 nucleotide long. We would like to know if these sequences are in the human genome, and if so where.
The first idea would be to use grep
to find out. On my computer, looking for a 20-mer such as ACGTGTGACGTGATCTGAGC
takes about 10 seconds. Nice, but querying 100 million sequences would take more than 30 years. Not using any search index, grep
needs to scan the whole human genome, and this takes time...