A tutorial on Burrows-Wheeler indexing methods (2)

By Guillaume Filion, filed under series: focus on, suffix array, Burrows-Wheeler transform, full text indexing, bioinformatics.

• 7 May 2017 •

This post is part of a series of tutorials on indexing methods based on the Burrows-Wheeler transform. The first part describes the theoretical background, this part shows a naive C implementation of the example followed through in the first part and the third part shows a more advanced implementation with compression.

It makes little sense to implement a Burrows-Wheeler index in a high level language such as Python or JavaScript because we need tight control of the basic data structures. This is why I chose C. The purpose of this post is not to show how Burrows-Wheeler indexes should be implemented, but to help the reader understand how it works in practice. I tried to make the code as clear as possible, without regard for optimization. It is only a plain, vanilla, implementation.

The code runs, but I doubt that it can be used for anything else than demonstrations. First, it is very naive and hard to scale up. Second, it does not use any compression nor down-sampling, which are the mainsprings of Burrows-Wheeler indexes.

The code is available for download as a Github gist. It is interesting for beginners to play with, see what happens if you change something and run it step by step with gdb.

Constructing the suffix array

The file learn_bwt_indexing_vanilla.c runs the example developed in the first part. You should first get familiar with the material presented there and I recommend that you juggle between the two posts until you get a good intuition of the principles and their implementation on a real machine.

The C memory management is an unnecessary complication. I declared all the variables as global arrays to move this issue out of the way. That also allowed me to simplify function calls in order to highlight what happens inside.

The text of the example is hard coded in the variable TXT. I added the special terminator $ for a total length of 14 characters. All the main does is compute the suffix array of the text, the Occ table and the C array, and it then searches the string GAGA with the backward search.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define L 14 // Length of the text.

// Global variables.
char TXT[L]     = "GATGCGAGAGATG$";
int  SA[L]      = {0,1,2,3,4,5,6,7,8,9,10,11,12,13};
char BWT[L]     = {0};
int  C[4]       = {0};
int  OCC[4][L]  = {0};

int main(void) {
   qsort(SA, L, sizeof(int), compare_suffixes);
   construct_BWT();
   construct_C_and_OCC();
   backward_search("GAGA");
}

The suffix array initially contained the integers from 0 to 13. Computing the suffix array of the text is only a matter of sorting these numbers in the lexicographical order of the corresponding suffixes. Using the function qsort of the C standard library, we just have to write an auxiliary function to compare suffixes.

To compare values a and b of the suffix array, we need to compare the suffixes at positions a and b in the text with the function strcmp of the C standard library. In the code below &TXT[*(int *)a] is the address of the substring of the text staring at position a (and the same goes for b).

int compare_suffixes (const void * a, const void * b) {
   const char * suff_a = &TXT[*(int *)a];
   const char * suff_b = &TXT[*(int *)b];
   return strcmp(suff_a, suff_b);
}

Here, the technical detail is taken care of by the standard C library. At the end of the operation, the array SA has been reordered in place. It so happens that the terminator $ has lower lexicographical order than the English letters in ASCII, so we do not need to give it a special treatment: it will naturally find its way to the first position of the suffix array.

Constructing the Burrows-Wheeler transformed text

The next step is to construct the Burrows-Wheeler transformed text BWT from the suffix array SA. By definition, BWT[m] is the character preceding the suffix stored in SA[m]. There is no character preceding the first, so if SA[m] is 0, we set BWT[m] to the special terminator $. Since this character does not precede any suffix, it will occur only once in the Burrows-Wheeler transformed text.

void construct_BWT (void) {
   for (int i = 0 ; i < L ; i++) {
      if (SA[i] > 0) BWT[i] = TXT[SA[i]-1];
      else           BWT[i] = '$';
   }
}

Below are some complementary exercises about the Burrows-Wheeler transform.

__Exercises__ 1. Say that the first character of the Burrows-Wheeler transformed text is `X`. Where is `X` in the original text? 1. Can the first character of the Burrows-Wheeler transformed text be `$`?

__Answers__ 1. The first entry of the suffix array points to the position of `$` in the text. The first entry of the Burrows-Wheeler transformed text is the preceding character, i.e. the last character of the text (excluding `$`). 1. The answer of the previous question shows that it cannot.

Constructing C and Occ

With the Burrows-Wheeler transformed text, we can construct the ancillary data structures C and Occ. First we need to recode the characters of the alphabet to integers between 0 and 3. The array NUM defined below will come in handy. It is designed so that NUM['A'] is 0, NUM['C'] is 1 etc.

int NUM[256] = {
  ['A'] = 0, ['C'] = 1, ['G'] = 2, ['T'] = 3,
  ['a'] = 0, ['c'] = 1, ['g'] = 2, ['t'] = 3,
};

We first construct OCC. By definition, Occ(X,i) is the number of occurrences of the character X in the Burrows-Wheeler transformed text up until position i. We use the global array C as temporary storage for the cumulative occurrences of the 4 letters of the alphabet. Going through the characters of the Burrows-Wheeler transformed text one at a time, we extract their numeric code (NUM[BWT[i]]) and update the cumulative occurrences in C. If the character is $, none of the 4 letters was used so no update is required. At each step, we copy the current value of C to the columns of OCC.

At the end of the process, we need to write the correct values in the array C. By definition, C[X] is the position of the first X in the suffix array. C[A] is always equal to 1, but the others can be anything. The variable tmp receives the cumulative occurrences of the letters in lexicographical order, and the code below ensures that C[X] will be the position of the first X in the suffix array.

void construct_C_and_OCC (void) {
   for (int i = 0 ; i < L ; i++) {
      if (BWT[i] != '$') C[NUM[BWT[i]]]++;
      for (int j = 0 ; j < 4 ; j++) OCC[j][i] = C[j];
   }
   int tmp = 1;
   for (int j = 0 ; j < 4 ; j++) {
      tmp += C[j];
      C[j] = tmp - C[j];
   }
}

Implementing the backward search

The backward search is described in detail in the first part of the tutorial. We go through the characters of the query one at a time in reverse order and extract their numeric code (c = NUM[query[pos]]). Like for a bisection, we initialize the search interval as the whole suffix array (minus the first position, which cannot be a match). The bottom of the interval is updated by the formula bot = C[c] + OCC[c][bot-1] and the top by the formula top = C[c] + OCC[c][top]-1. If the top of the interval becomes lower than the bottom, the query is not in the text and the search can stop. At the end we just print the positions of the text containing the query.

void backward_search (char * query) {
   int bot = 0;
   int top = L-1;
   for (int pos = strlen(query)-1 ; pos > -1 ; pos--) {
      int c = NUM[query[pos]];
      bot = C[c] + (bot > 0 ? OCC[c][bot-1] : 0);
      top = C[c] + OCC[c][top]-1;
      if (top < bot) break;
   }
   for (int i = bot ; i <= top ; i++) {
      printf("%s: %d\n", query, SA[i]);
   }   
}

The backward search is the most advanced part of the code. Below are some exercises that I recommend solving if you want to understand the algorithm in depth.

__Exercises__ 1. Why test `bot > 0` when updating `bot` in the loop? 1. Show that the update formula is correct when `bot` is 0. 1. Why is `top` lower than `bot` when there is not hit?

__Answers__ 1. The term `bot-1` in the update formula would point outside the domain of definition of `OCC`. This is an error and the most likely outcome is that we would retrieve a junk value. 1. `bot` is 0 at the first iteration, the range of a single character `c` in the suffix array starts at `C[c]` by construction. After the first iteration, `bot` is non-zero because `c` cannot be the terminator `$`. 1. If the hit is not present in the text the character `c` never precedes the suffixes in the range from `bot` to `top` in the suffix array. In this range of the Burrows-Wheeler transformed text, the total count of the character `c` is thus 0, so `OCC[c][top]` equals `OCC[c][bot-1]` and `top` is equal to `bot-1`.

Epilogue

The code of learn_bwt_indexing_vanilla.c is surprisingly small. With 60 lines of code we can compute the suffix array of a text, together with the Occ table and the C array, implement the backward search function and run it on an given example. This is all straightforward because we did not have to down-sample or compress the index. Things will become slightly more involved when we need to implement less naive but more useful data structures.