There are many resources explaining how the Burrows-Wheeler transform works and how it can be used to index a text, but so far I have not found anything explaining what makes it so awesome. I figured I would write a tutorial like this for those who are not afraid of the detail.
Say we have a sequencing run with over 100 million reads. After processing, the reads are between 20 and 25 nucleotide long. We would like to know if these sequences are in the human genome, and if so where.
The first idea would be to use
grep to find out. On my computer, looking for a 20-mer such as
ACGTGTGACGTGATCTGAGC takes about 10 seconds. Nice, but querying 100 million sequences would take more than 30 years. Not using any search index,
grep needs to scan the whole human genome, and this takes time when the file is large.
We could build an index to speed up the search. The simplest would be a dictionary that associates each 20 to 25-mer to its locations in the human genome. The nice thing about dictionaries is that the access time is fast and does...