A tutorial on Burrows-Wheeler indexing methods (3)

By Guillaume Filion, filed under series: focus on, suffix array, Burrows-Wheeler transform, full text indexing, bioinformatics.

• 14 May 2017 •

This post is part of a series of tutorials on indexing methods based on the Burrows-Wheeler transform. The first part describes the theoretical background, the second part shows a naive C implementation, and this part shows a more advanced implementation with compression.

The code is written in a very naive style, so you should not use it as a reference for good C code. Once again, the purpose is to highlight the mechanisms of the algorithm, disregarding all other considerations. That said, the code runs so it may be used as a skeleton for your own projects.

The code is available for download as a Github gist. As in the second part, I recommend playing with the variables, and debugging it with gdb to see what happens step by step.

Constructing the suffix array

First you should get familiar with the first two parts of the tutorial in order to follow the logic of the code below. The file learn_bwt_indexing_compression.c does the same thing as in the second part. The input, the output and the logical flow are the same, but the file is different in many details.

We start with the definition of the occ_t type that will be the basic building block of the Occ table. The purpose of this definition will be explained at the relevant section. Next, we define a couple of macros (L4, L16 and L32) that will come in handy to manipulate the down-sampled arrays.

Then come the declarations of the variables of interest. Declaring all the variables as global is bad programming practice, but here it allows me to simplify the code and to avoid the discussion of memory management in C. Compared to the second part, BWT and OCC now have fewer entries because of they will be down-sampled. We also declare two new variables: CSA and po$. The array CSA will hold the compressed suffix array, and the integer po$ will hold the position of the terminator $ in the Burrows-Wheeler transformed text, which we need to record for reasons that will be explained below.

Other than that, the only change to the main is the call to compress_SA(), which will compress the suffix array. This is the last step of index construction because we need an intact suffix array to construct the Burrows-Wheeler transformed text, the Occ table and the C array.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct occ_t { uint32_t bits; uint32_t smpl; };
typedef struct occ_t occ_t;

#define L 14               // Length of the text.
#define L4  ((L+3)  / 4 )  // Length down-sampled 4 times.
#define L16 ((L+15) / 16)  // Length down-sampled 16 times.
#define L32 ((L+31) / 32)  // Length down-sampled 32 times.

// Global variables.
char  TXT[L]      = "GATGCGAGAGATG$";
int   SA[L]       = {0,1,2,3,4,5,6,7,8,9,10,11,12,13};
int   CSA[L16]    = {0};
char  BWT[L4]     = {0};
int   C[4]        = {0};
occ_t OCC[4][L32] = {0};
int   po$; // Position of the terminator in BWT.

int main(void) {
   qsort(SA, L, sizeof(int), compare_suffixes);
   construct_BWT();
   construct_C_and_OCC();
   compress_SA(); // Can delete 'SA' from here.
   backward_search("GAGA");
}

The construction of the suffix array is carried out exactly as in the previous part. Not a single character was changed, but I reproduce the function below for completeness.

int compare_suffixes (const void * a, const void * b) {
   const char * suff_a = &TXT[*(int *)a];
   const char * suff_b = &TXT[*(int *)b];
   return strcmp(suff_a, suff_b);
}

Constructing the Burrows-Wheeler transformed text

The first way to compress the index is to realize that DNA has only four letters and that we need only two bits of storage per character. Since a byte is 8 bits, we can store 4 DNA letters per byte. The NUM array gives a numeric code between 0 (binary 0b00) and 3 (binary 0b11) to each DNA letter. Instead of storing the original characters, we will store the numeric codes.

The unit of memory is the byte, so we need to do some bit twiddling. The operator << shifts bits to the left within a given byte or word. For instance, the result of 0b10 << 2 is 0b1000. To “add” a character to a byte, we use the operator |=, which performs an OR and stores the result in the initial variable. For instance, the result of 0b11 |= 0b10 << 2 is 0b1011.

The last bit of syntax that we need to clarify is the symbol % that means “rest of the division by”. For instance 6 % 4 is equal to 2. In the code below NUM[TXT[SA[i]-1]] is the numeric code of the character of the Burrows-Wheeler transformed text at position i. The operation << (2*(i%4)) shifts this code by 0, 2, 4 or 6 bits to the left, depending on i. The values are stored in tmp and every 4 characters (i % 4 == 3) the value of tmp is stored in BWT, and tmp is reset. The remaining characters in tmp are stored in BWT after the for-loop.

The encoding leaves us with a dilemma: what should we do with the terminator $? All the two-bit symbols are already taken. Because this character appears only once, the easiest option is to store this information in a separate variable (po$) and use it ad hoc when needed. If SA[i] is 0, the corresponding character of the Burrows-Wheeler transformed text is $ so we set po$ to i and do not update tmp. The corresponding value of BWT will be 0, so we must be careful not to confuse it with A.

int NUM[256] = { 
  ['A'] = 0, ['C'] = 1, ['G'] = 2, ['T'] = 3,
  ['a'] = 0, ['c'] = 1, ['g'] = 2, ['t'] = 3,
};

void construct_BWT (void) {
   char tmp = 0;
   for (int i = 0 ; i < L ; i++) {
      if (SA[i] > 0) tmp |= (NUM[TXT[SA[i]-1]] << (2*(i%4)));
      else po$ = i;
      if (i % 4 == 3) {
         BWT[i/4] = tmp; // Write 4 characters per byte.
         tmp = 0;
      }
   }
   if (L % 4 != 3) BWT[L4] = tmp;
}

Retrieving the characters of the Burrows-Wheeler transformed text on demand is now more challenging than in the second part. We write a function get_BWT for this purpose.

The argument pos is the position we want to query. If pos is po$, we have to return the code for the terminator $, but since there is none we return -1. In reality, this value is not important because the position of $ is never queried in this implementation.

For the other positions, we first access the byte where the character is stored (BWT[pos/4]), shift the bits to the right (>> (2 * (pos % 4))), and “erase” the higher bits belonging to other characters (& 0b11). The last operation sets the 6 higher bits to 0 and keeps the lower two as they are. For instance, the sequence of characters CAGT is encoded as 0b11100001 (from right to left). We retrieve the G by shifting the bits 4 positions to the right (0b11100001 >> 4 is 0b00001110) and by keeping only the lower two bits (0b00001110 & 0b11 is 0b00000010).

int get_BWT (int pos) {
   if (pos == po$) return -1;
   return (BWT[pos/4] >> (2 * (pos % 4))) & 0b11;
}

Constructing C and Occ

In this version of the code, the Occ table OCC has been declared as a double array of occ_t (see above). Each variable of type occ_t consists of two consecutive unsigned integers of 32 bits, the first called bits, the second called smpl. The second line only declares a shortcut allowing us to write occ_t for struct occ_t.

The first member bits is a bit field of size 32 where a bit is set to 1 if and only if the Burrows-Wheeler transformed text contains the given character at the given position (refer to the first part of the tutorial for detail). The second member smpl contains the value of the Occ table for the given character at the given position, but it is computed only every 32 characters of BWT.

To construct OCC, we define entry, an array of type occ_t to accumulate the counts of 32 consecutive positions of the Burrows-Wheeler transformed text. Each element of the array corresponds to a nucleotide and encodes the local bit field and sampled value for this nucleotide.

The characters of the Burrows-Wheeler transformed text are retrieved one at a time (get_BWT(i)) and the corresponding item of entry is updated. The statement entry[get_BWT(i)].smpl++ increments the count of the decoded nucleotide (stored in the member smpl). The statement entry[get_BWT(i)].bits |= (1 << (31 - (i % 32))) shifts 0b1 up to 31 positions to the left (depending on the value of i) and updates the member bits with the OR operator described in the previous section. If the character of the Burrows-Wheeler transformed text is $, these operations are not performed.

Every 32 positions (i % 32 == 31), the variable entry is stored in OCC and the bit field is reset (but the member smpl is not because it holds the total cumulative occurrences of the characters). The remaining characters of entry are stored in OCC after the for-loop.

The last part of the function fills the C array. As in the previous part of the tutorial, tmp is an accumulator. Note that after scanning the whole Burrows-Wheeler transformed text, entry[j].smpl contains the total count of nucleotide j (numerical encoding), thus tmp contains the cumulative occurrences of each letter (plus 1).

void construct_C_and_OCC (void) {

   occ_t entry[4] = {0};

   for (int i = 0 ; i < L ; i++) {
      if (po$ != i) {
         entry[get_BWT(i)].smpl++;
         entry[get_BWT(i)].bits |= (1 << (31 - (i % 32)));
      }
      if (i % 32 == 31) {
         for (int j = 0 ; j < 4 ; j++) {
            OCC[j][i/32] = entry[j]; // Write entry to 'OCC'.
            entry[j].bits = 0;
         }
      }
   }

   if (L % 32 != 31) { // Write the last entry if needed.
      for (int j = 0 ; j < 4 ; j++) OCC[j][L32-1] = entry[j];
   }

   int tmp = 1;
   for (int j = 0 ; j < 4 ; j++) {
      tmp += entry[j].smpl;
      C[j] = tmp;
   }

}

In the rest of the code, all the queries to OCC are preceded by a query to C, so we lump them together in a single query function get_rank.

The values of bits contain all the necessary information to query the Occ table, but smpl allows us to speed it up. The query consists of a look ahead at the next sampled value (OCC[c][pos/32].smpl), followed by a short backtrack of at most 31 positions.

The backtrack is performed by OCC[c][pos/32].bits << (1+(pos%32)), which shifts the bit field preceding the sampled value up to 32 positions to the left (erasing the upper bits). The number of 1s remaining is computed by the popcount function, here implemented as __builtin_popcount.

Finally, the number of 1s is subtracted from the sampled value and the total is returned.

int get_rank (int c, int pos) {
   return C[c] + OCC[c][pos/32].smpl -
     __builtin_popcount(OCC[c][pos/32].bits << (1+(pos%32)));
}

Compressing the suffix array

The last step of the index construction is to compress the suffix array. In this implementation, I just down-sample it by a factor 16. We could do better by giving each entry the exact number of bits it requires, but this would mean working with unaligned memory (i.e. variables split between consecutive words). To maintain the difficulty at an appropriate level, I prefer to leave out this optimization.

The variable CSA contains every entry of SA out of 16. We simply jump over SA and update CSA on the fly. In a real implementation, we would free the memory allocated to SA after constructing CSA (in fact we would use a single variable SA and resize it after down-sampling).

void compress_SA (void) {
   for (int i = 0 ; i <= L ; i += 16) {
      CSA[i/16] = SA[i];
   }
}

To query CSA, we need to reconstruct the missing values on demand. The first part of the tutorial explains how this is done using the Burrows-Wheeler transformed text recursively.

If the position we need to access is a multiple of 16, we can directly return the value. If the query is the position of $ in the Burrows-Wheeler transformed text, we know that the value of the suffix array is 0. In other cases, we need to find the position of the preceding suffix in the text, which is get_rank(get_BWT(pos),pos-1). With that information, we perform another query and continue the process until we hit a position that is a multiple of 16 (or the position of $). Each time, we increment the return value by 1 to compensate for the fact that the next suffix is one position on the left of the query.

int query_CSA (int pos) {
   if (pos % 16 == 0)   return CSA[pos/16];
   if (pos == po$) return 0;
   return 1 + query_CSA(get_rank(get_BWT(pos),pos-1));
}

Implementing the backward search

Once all these modifications of the index are in place, the backward search is surprisingly similar to the basic implementation showed in the previous part of the tutorial. The only changes are the calls to get_rank and get_CSA that take care of querying the compressed Occ table and the compressed suffix array, respectively.

void backward_search (char * query) {
   int bot = 0;
   int top = L-1;
   for (int pos = strlen(query)-1 ; pos > -1 ; pos--) {
      int c = NUM[query[pos]];
      bot = bot > 0 ? get_rank(c, bot-1) : C[c];
      top = get_rank(c, top)-1;
      if (top < bot) break;
   }
   for (int i = bot ; i <= top ; i++) {
      printf("%s:%d\n", query, query_CSA(i));
   }
}

Epilogue

Compression makes the code substantially more complex than the vanilla implementation. The main reason is that we have to care about the representation of the data. Such optimizations are not possible on high level languages. They are hard to understand and master at first, but they constitute the essence of this indexing method and are worth delving into.