The Grand Locus / Life for statistical sciences

## A tutorial on Burrows-Wheeler indexing methods

There are many resources explaining how the Burrows-Wheeler transform works and how it can be used to index a text, but so far I have not found anything explaining what makes it so awesome. I figured I would write a tutorial like this for those who are not afraid of the detail.

### The problem

Say we have a sequencing run with over 100 million reads. After processing, the reads are between 20 and 25 nucleotide long. We would like to know if these sequences are in the human genome, and if so where.

The first idea would be to use grep to find out. On my computer, looking for a 20-mer such as ACGTGTGACGTGATCTGAGC takes about 10 seconds. Nice, but querying 100 million sequences would take more than 30 years. Not using any search index, grep needs to scan the whole human genome, and this takes time when the file is large.

We could build an index to speed up the search. The simplest would be a dictionary that associates each 20 to 25-mer to its locations in the human genome. The nice thing about dictionaries is that the access time is fast and does not depend on the size of the text.

The issue is space. Counting 2 bits per nucleotide, 20 to 25-mers take 40 to 50 bits of storage. The human genome contains over 3.2 billion nucleotides, so we need at least 108 GB (in reality many 20 to 25-mers are repeated so this number would be lower). To this we should add the storage required for the locations and the overhead for the dictionary, for a total size well over 200 GB.

So it seems that we are between a rock and a hard place: either we run out of time, or we run out of memory.

### The suffix array

An important step towards modern algorithms was the invention of a data structure called the suffix array of a text. A suffix is the end of a text from a given position. For instance, the 1st suffix of GATGCGAGAGATG is GATGCGAGAGATG itself, and the 10th is GATG (even though the 0-based convention is more popular in this field, I will use the 1-based convention throughout, so that 1st, 2nd etc. translate directly into indices). The suffix array stores the positions of the suffixes sorted in alphabetical order.

Let us construct the suffix array of GATGCGAGAGATG for demonstration. We add a terminator character $, which has lower lexical order than all other characters (this avoids confusion when comparing strings of different lengths). As a consequence, the first suffix in lexical order is always $. Below are the suffixes in sorted order (written vertically) and the suffix array of GATGCGAGAGATG.

How can we use the suffix array of the human genome to solve the query problem above? Since the suffixes are sorted, we can proceed by bisection. We lookup the middle entry of the suffix array, which points to a particular position of the human genome. We compare the query to the sequence at that position, and depending on the result we continue bisecting either on the left half or on the right half of the suffix array.

Let us use this technique to find how many occurrences of GAGA are in GATGCGAGAGATG. The suffix array has 14 entries, so we lookup the entry at position 7, which points to the suffix G$. Since GAGA > G$, we continue bisecting in the positions of the suffix array between 8 and 14 included. The middle entry is at position 11 and points to the suffix GATGCGAGAGATG$. Since GAGA < GATGCGAGAGATG$, we continue bisecting between positions 8 and 10 included. The middle entry is at position 9 and points to the suffix GAGATG$. This time we have a hit because GAGATG$ starts with GAGA. By continuing the bisection, we would find that suffixes at position 8 and 9 start with GAGA, so the query is present two times in the text.

Why is this an improvement? The suffix array of the human genome has $N$ = 3.2 billion entries, so we need at most $\lfloor\log_2(N)\rfloor +1 = 32$ steps to find out whether any query is present or not. We would need a few extra steps to find out the number of occurrences in case it is present. Each step consists of 2 memory accesses, one in the suffix array, one in the human genome. Counting approximately 100 ns per memory access and ignoring the time for string comparison, this brings us to 6-7 us per query. Now what about the space requirements? We can encode every position of the genome with a 4 byte integer, and we need to store 3.2 billion entries, so we need 12.8 GB. Still a lot, but notice that this approach solves all the practical difficulties associated with dictionaries. We can easily look for sequences of any length in the suffix array.

Exercises

1. What is the first value of a suffix array?
2. What is the suffix array of CTGTGATGTCGTAG?
3. What justifies to ignore the time to compare strings?
4. Adding the reverse complement of a genome to the suffix array allows to query both strands. How big is this suffix array for the human genome?

1. The length of the text +1 (or the length of the text in the 0-based convention).
2. [15,13,6,10,1,14,5,11,8,3,12,9,4,7,2]
3. The first memory access to the genome will usually be a last level cache miss (approx. 100 ns). The following nucleotides are contiguous and can be prefetched, so comparing them to the query will be much faster (2-3 ns). No more than a few nucleotides need to be compared, so the time is dominated by the initial cache miss.
4. Adding the reverse complement brings the size of the text to 6.4 billion characters. Since 33 bits are now required to store the largest value, the total size is 26.4 GB (not 25.6 GB).

### The Burrows-Wheeler transform

The Burrows-Wheeler transform of a text is a permutation of this text. To construct it, we need to sort all the suffixes, but we replace the whole suffix by the preceding letter. For the suffix equal to the text itself, we write the terminator $ instead. Using the previous sketch, we obtain the Burrows-Wheeler transform of GATGCGAGAGATG as GGGGGGTCAA$TAA, as shown below.

Before explaining how this will help, let us highlight a fundamental property of the Burrows-Wheeler transformed text. Note that in the sketch above, the nucleotides appear in sorted order in the row immediately below the Burrows-Wheeler transform. See that the first G in the Burrows-Wheeler transformed text is the one preceding $. It is also the first G to appear in the sorted text. The second G in the Burrows-Wheeler transformed text is the one preceding AGAGATG$. It is also the second G in the sorted text. The first T in the Burrows-Wheeler transformed text is the one preceding G$. It is also the first T in the sorted text. More generally, the letters of the Burrows-Wheeler transformed text are in the same “relative” order as in the sorted text. Exercises 1. Why is this the case? 2. What is the Burrows-Wheeler transform of CTGTGATGTCGTAG? Answers 1. Take all the suffixes that start with G, say. Their relative lexicographic order is the same with or without this G. The order with the G is that of the suffix array, and the order without the G is that of the Burrows-Wheeler transformed text. The same argument holds for the other characters of the alphabet. 2. GTGT$ATCTTGGGAC

### The backward search

Let us illustrate how the Burrows-Wheeler transformed text can be used to look for GAGA in the text. All the occurrences of GAGA in the text end with A. Each A is also the first letter of some suffix of the text. Because such suffixes all start with A, they are stored next to each other in the suffix array, namely between positions 2 and 5. However, only those suffixes preceded by a G can potentially contain the query. This is exactly the information encoded by the Burrows-Wheeler transformed text. In this concrete example, all the As are preceded by a G, so the text contains 4 suffixes that start with GA.

Since those 4 suffixes start in the same way, they are stored next to each other in the suffix array, but where are they exactly? The Gs preceding the As are the 2nd, 3rd, 4th and 5th Gs in the order of the Burrow-Wheeler transformed text, so they are also the 2nd, 3rd, 4th and 5th Gs in the sorted text. Knowing that the position of the first G is 7, the suffixes that start with GA are thus stored in positions 8 to 11 of the suffix array.

The process continues as we read the query backwards. The target suffixes must be preceded by an A. According to the Burrows-Wheeler transformed text, two suffixes are preceded by an A, and since their relative order is the same as in the sorted text, we know that the suffixes starting with AGA are at positions 2 and 3 of the suffix array. Finally, to complete the suffix GAGA, the target suffixes must be preceded by a G. According to the Burrows-Wheeler transformed text, both suffixes at positions 2 and 3 are preceded by G and, again, since their relative order is the same as in the sorted text, we know that the suffixes starting with GAGA are at positions 8 and 9 of the suffix array.

### The Occ table

Real world implementations of the backward search are based on a data structure called the Occ table (Occ stands for occurrence). The table contains the cumulative number of occurrences of each character in the Burrows-Wheeler transformed text. It also comes together with an array called C that stores the position of the first occurrence of each character in the sorted text.

With these data structures, the backward search of GAGA goes as follows: The suffixes starting with A are stored in the suffix array between positions C[A] = 2 and C[C]-1 = 5. C appears in the expression of the right bound because it is the character after A in lexicographical order. To know how many of those suffixes are preceded by a G, we use the Occ table. The number of Gs appearing in the Burrows-Wheeler transformed text before position 2 is Occ(G,1) = 1, and the number appearing up until position 5 is Occ(G,5) = 5, so the number of Gs preceding the suffixes between positions 2 and 5 is 5-1 = 4. There are thus 4 suffixes starting with GA, corresponding to the 2nd till 5th G in the order of the sorted text, i.e. between positions C[G] + Occ(G,1) = 7+1 = 8 and C[G] + Occ(G,5)-1 = 7+5-1 = 11. To continue the process, we find the positions of the suffixes starting with AGA between positions C[A] + Occ(A,7) = 2+0 = 2 and C[A] + Occ(A,11)-1 = 2+2-1=3. Finally, we find the positions of the suffixes starting with GAGA between positions C[G] + Occ(G,1) = 7+1 = 8 and C[G] + Occ(G,3)-1 = 7+3-1 = 9.

How does this algorithm perform? For a query sequence of length $k$, there are at most $k$ steps to perform, each of which with two memory accesses (the queries to Occ) for a total of $2k$ memory accesses. This number is independent of the size of the text, which is an extraordinary achievement.

What about the memory requirements? The Occ table contains one row per character of the alphabet and one column per character in the text, for a total of $\sigma N$ entries, where $N$ is the size of the text and $\sigma$ is the size of the alphabet. Each entry must be able to encode a number potentially as high as $N$, which requires $\lfloor\log_2(N)\rfloor+1$ bits, so the total size of the Occ table is $\sigma N (\lfloor\log_2(N)\rfloor + 1)$ bits. For the human genome, this represents 51.2 GB. Considering that we also need the 12.8 GB suffix array, the benefits of the backward search seem doubtful. But wait until you read the next section.

Exercises

1. The King James Authorized Bible has 3,116,480 characters using a 76 letter alphabet. What is the size of the Occ table for this text?
2. On average, how long is the backward search (in number of steps) if the query is absent from the human genome?
3. Say that at step $i$ of the backward search, the candidate suffixes are stored between positions $b_i$ and $e_i$ of the suffix array. If the next nucleotide of the query is $q_{i+1}$, what are $b_{i+1}$ and $e_{i+1}$?

1. We need $\lfloor\log_2(3116480)\rfloor +1 = 22$ bits per entry. Since this Occ table has 3,116,480 columns and 76 rows, we need a total of 411,375,360 bits or 51 MB.
2. Assuming that the nucleotides of the human genome are random, the average number of occurrences of a 16-mer is approximately 0.75 < 1. So the backward search will typically stop after 16-17 steps, even if the query is longer.
3. $b_{i+1}$ = C[$q_{i+1}$] + Occ[$q_{i+1}$, $b_{i+1}$ - 1], $e_{i+1}$ = C[$q_{i+1}$] + Occ[$q_{i+1}$, $e_{i+1}$] - 1.

### Compression

What is truly awesome about Burrows-Wheeler indexing is that the search can be performed on compressed indexes. I will illustrate the simplest of many available options, in which the Occ table and the suffix array are down-sampled.

The Occ table stores the cumulative frequencies of the characters in the Burrows-Wheeler transformed text. If, instead, we stored their actual occurrence with a 0/1 encoding, we could still perform the backward search by counting the 1s until the given position of the text. To compress Occ, we keep one column out of 32 and we use the binary table to compute the missing values on demand. The picture above illustrates how Occ(G, 3275) is computed in a compressed table. The largest multiple of 32 before 3275 is 3264. Looking up Occ(G, 3264), we find 830. We still need to add the Gs between positions 3276 and 3264, which we do by counting the 1s in the binary table between these positions (this is called the popcount operation). The result is 2, so Occ(G, 3275) = 830+2 = 832.

The size of the binary table is $\sigma N$ bits. For the human genome, it represents 1.6 GB, so the total size of the Occ table is 1.6 + 51.2/32 = 3.2 GB. It is possible to store the down-sampled values of the Occ table together with the next 32 values of the binary table in a 64 bit integer, so that a single memory access is necessary for each query in the Occ table. The popcount can be computed much faster than a memory access, so the backward search runs at the same speed, but with a memory footprint of 3.2 GB instead of 51.2 GB.

We also compress the suffix array by keeping one value out of 32. The task is now to compute the missing values on demand. To show how this is done, let us illustrate a fundamental connection between the Burrows-Wheeler transformed text, the suffix array and the Occ table using once again the text GATGCGAGAGATG.

The suffix \$ is stored at position 1 of the suffix array and contains the value 14. The Burrows-Wheeler transformed text at this position is G. Notice that C[G]+Occ(G,1)-1 = 7+1-1 = 7, which is the position of the suffix array that stores 13. At this new position, the Burrows-Wheeler transformed text is T. Notice that C[T]+Occ(T,7)-1 = 13+1-1 = 13, which is the position of the suffix array that stores 12. At this new position, the Burrows-Wheeler transformed text is A. Notice that C[A]+Occ(A,13)-1 = 2+3-1 = 4, which is the position of the suffix array that stores 11. More generally, if $x$ is stored at position $m$ of the suffix array, $x-1$ is stored at position C[.] + Occ(.,$m$)-1, where . is the value of the Burrows-Wheeler transformed text at position $m$.

Exercises

1. Why is this the case?
2. Where is $x-k$ stored in the suffix array $(k > 0)$?

1. If value $x$ is stored at position $m$ of the suffix array, it means that the suffix starting at position $x$ of the text is the $m$-th in lexicographical order. The Burrows-Wheeler transformed text stores the preceding nucleotide at position $m$, which corresponds to the suffix starting at position $x-1$ of the text. We have seen previously that this suffix is stored in the suffix array at position C[.] + Occ(.,$m$)-1.
2. To find out, iterate the process above $k$ times.

This property allows us to find where the previous suffix is stored in the suffix array from the Burrows-Wheeler transformed text and the Occ table. To compute a value of the suffix array on demand, we just have to iterate this procedure, computing the position of the previous suffix at each step until this position is a multiple of 32. At that point the value of the suffix array is known, and so is the position of the query in the text. Each step of this procedure takes two memory accesses, one in the Burrows-Wheeler transformed text and one in the Occ table. Since we need 32 attempts on average to find a position that is a multiple of 32, we need on average 64 memory accesses to compute the values of the suffix array (each step consists of one memory access in the Occ table and one memory access in the Burrows-Wheeler transformed text). For the human genome, the size of the down-sampled suffix array is 400 MB and that of the Burrows-Wheeler transformed text is 800 MB.

The total size of this index is 4.4 GB. Obviously, it is possible to further down-sample the Occ table and the suffix array, at the cost of increasing memory accesses. Compressing the binary representation of the Occ table is also possible, but it requires more advanced knowledge of succinct data structures. With the index above, counting the occurrences of a query of length $k$ requires $2k$ memory accesses and knowing the position of each of them requires 64 memory accesses. For a unique 25-mer, this is about a hundred memory accesses or approximately 10 µs (in practice much less when the cache is hot). With this method, we can process millions of queries. The memory footprint is rather high, but not for the kind of workstation that is routinely found in a bioinformatics laboratory.

### Epilogue

The most natural application of Burrows-Wheeler indexing is to perform the seeding step of the DNA alignment problem. For instance, the popular mapper BWA uses the compression methods presented above. It is interesting to note that direct bisection on the suffix array has become a competitive algorithm as computers gained memory (the STAR mapper takes this approach). It costs only 64 memory accesses to find all the occurrences of a query in the human genome, vs that many per occurrence in the algorithm with compressed indexes. The weakness of the bisection method is that it still takes 64 accesses to know that a sequence is absent vs about 32 with compressed indexes. Other issues to consider are that the backward search has a good cache performance and that some recent developments such as the bi-directional Burrows-Wheeler transform give an edge to compression-based methods.

It is also worth mentioning that the Burrows-Wheeler transform is exceptionally well adapted to indexing the human genome. It gives simple algorithms for sub-string queries, and compressing the Occ table is most efficient on small alphabets. For proteins or natural languages, one would use indexes adapted to bigger alphabet, such as wavelet trees, or word-based indexes.