A tutorial on Burrows-Wheeler indexing methods (3)
By Guillaume Filion, filed under
series: focus on,
suffix array,
Burrows-Wheeler transform,
full text indexing,
bioinformatics.
• 14 May 2017 •
The code is written in a very naive style, so you should not use it as a reference for good C code. Once again, the purpose is to highlight the mechanisms of the algorithm, disregarding all other considerations. That said, the code runs so it may be used as a skeleton for your own projects.
The code is available for download as a Github gist. As in the second part, I recommend playing with the variables, and debugging it with gdb to see what happens step by step.
Constructing the suffix array
First you should get familiar with the first two parts of the tutorial in order to follow the logic of the code below. The file learn_bwt_indexing_compression.c
does the same thing as in the second part. The input, the output and the logical flow are the same, but the file is different in many details.
We start with the definition of the occ_t
type that will be the basic building block of the Occ table. The purpose of this definition will be explained at the relevant section. Next, we define a couple of macros (L4
, L16
and L32
) that will come in handy to manipulate the down-sampled arrays.
Then come the declarations of the variables of interest. Declaring all the variables as global is bad programming practice, but here it allows me to simplify the code and to avoid the discussion of memory management in C. Compared to the second part, BWT
and OCC
now have fewer entries because of they will be down-sampled. We also declare two new variables: CSA
and po$
. The array CSA
will hold the compressed suffix array, and the integer po$
will hold the position of the terminator $
in the Burrows-Wheeler transformed text, which we need to record for reasons that will be explained below.
Other than that, the only change to the main
is the call to compress_SA()
, which will compress the suffix array. This is the last step of index construction because we need an intact suffix array to construct the Burrows-Wheeler transformed text, the Occ table and the C array.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
struct occ_t { uint32_t bits; uint32_t smpl; };
typedef struct occ_t occ_t;
#define L 14 // Length of the text.
#define L4 ((L+3) / 4 ) // Length down-sampled 4 times.
#define L16 ((L+15) / 16) // Length down-sampled 16 times.
#define L32 ((L+31) / 32) // Length down-sampled 32 times.
// Global variables.
char TXT[L] = "GATGCGAGAGATG$";
int SA[L] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13};
int CSA[L16] = {0};
char BWT[L4] = {0};
int C[4] = {0};
occ_t OCC[4][L32] = {0};
int po$; // Position of the terminator in BWT.
int main(void) {
qsort(SA, L, sizeof(int), compare_suffixes);
construct_BWT();
construct_C_and_OCC();
compress_SA(); // Can delete 'SA' from here.
backward_search("GAGA");
}
The construction of the suffix array is carried out exactly as in the previous part. Not a single character was changed, but I reproduce the function below for completeness.
int compare_suffixes (const void * a, const void * b) {
const char * suff_a = &TXT[*(int *)a];
const char * suff_b = &TXT[*(int *)b];
return strcmp(suff_a, suff_b);
}
Constructing the Burrows-Wheeler transformed text
The first way to compress the index is to realize that DNA has only four letters and that we need only two bits of storage per character. Since a byte is 8 bits, we can store 4 DNA letters per byte. The NUM
array gives a numeric code between 0
(binary 0b00
) and 3
(binary 0b11
) to each DNA letter. Instead of storing the original characters, we will store the numeric codes.
The unit of memory is the byte, so we need to do some bit twiddling. The operator <<
shifts bits to the left within a given byte or word. For instance, the result of 0b10 << 2
is 0b1000
. To “add” a character to a byte, we use the operator |=
, which performs an OR and stores the result in the initial variable. For instance, the result of 0b11 |= 0b10 << 2
is 0b1011
.
The last bit of syntax that we need to clarify is the symbol %
that means “rest of the division by”. For instance 6 % 4
is equal to 2
. In the code below NUM[TXT[SA[i]-1]]
is the numeric code of the character of the Burrows-Wheeler transformed text at position i
. The operation << (2*(i%4))
shifts this code by 0, 2, 4 or 6 bits to the left, depending on i
. The values are stored in tmp
and every 4 characters (i % 4 == 3
) the value of tmp
is stored in BWT
, and tmp
is reset. The remaining characters in tmp
are stored in BWT
after the for-loop.
The encoding leaves us with a dilemma: what should we do with the terminator $
? All the two-bit symbols are already taken. Because this character appears only once, the easiest option is to store this information in a separate variable (po$
) and use it ad hoc when needed. If SA[i]
is 0
, the corresponding character of the Burrows-Wheeler transformed text is $
so we set po$
to i
and do not update tmp
. The corresponding value of BWT
will be 0, so we must be careful not to confuse it with A
.
int NUM[256] = {
['A'] = 0, ['C'] = 1, ['G'] = 2, ['T'] = 3,
['a'] = 0, ['c'] = 1, ['g'] = 2, ['t'] = 3,
};
void construct_BWT (void) {
char tmp = 0;
for (int i = 0 ; i < L ; i++) {
if (SA[i] > 0) tmp |= (NUM[TXT[SA[i]-1]] << (2*(i%4)));
else po$ = i;
if (i % 4 == 3) {
BWT[i/4] = tmp; // Write 4 characters per byte.
tmp = 0;
}
}
if (L % 4 != 3) BWT[L4] = tmp;
}
Retrieving the characters of the Burrows-Wheeler transformed text on demand is now more challenging than in the second part. We write a function get_BWT
for this purpose.
The argument pos
is the position we want to query. If pos
is po$
, we have to return the code for the terminator $
, but since there is none we return -1
. In reality, this value is not important because the position of $
is never queried in this implementation.
For the other positions, we first access the byte where the character is stored (BWT[pos/4]
), shift the bits to the right (>> (2 * (pos % 4))
), and “erase” the higher bits belonging to other characters (& 0b11
). The last operation sets the 6 higher bits to 0 and keeps the lower two as they are. For instance, the sequence of characters CAGT
is encoded as 0b11100001
(from right to left). We retrieve the G
by shifting the bits 4 positions to the right (0b11100001 >> 4
is 0b00001110
) and by keeping only the lower two bits (0b00001110 & 0b11
is 0b00000010
).
int get_BWT (int pos) {
if (pos == po$) return -1;
return (BWT[pos/4] >> (2 * (pos % 4))) & 0b11;
}
Constructing C and Occ
In this version of the code, the Occ table OCC
has been declared as a double array of occ_t
(see above). Each variable of type occ_t
consists of two consecutive unsigned integers of 32 bits, the first called bits
, the second called smpl
. The second line only declares a shortcut allowing us to write occ_t
for struct occ_t
.
The first member bits
is a bit field of size 32 where a bit is set to 1 if and only if the Burrows-Wheeler transformed text contains the given character at the given position (refer to the first part of the tutorial for detail). The second member smpl
contains the value of the Occ table for the given character at the given position, but it is computed only every 32 characters of BWT
.
To construct OCC
, we define entry
, an array of type occ_t
to accumulate the counts of 32 consecutive positions of the Burrows-Wheeler transformed text. Each element of the array corresponds to a nucleotide and encodes the local bit field and sampled value for this nucleotide.
The characters of the Burrows-Wheeler transformed text are retrieved one at a time (get_BWT(i)
) and the corresponding item of entry
is updated. The statement entry[get_BWT(i)].smpl++
increments the count of the decoded nucleotide (stored in the member smpl
). The statement entry[get_BWT(i)].bits |= (1 << (31 - (i % 32)))
shifts 0b1
up to 31 positions to the left (depending on the value of i
) and updates the member bits
with the OR operator described in the previous section. If the character of the Burrows-Wheeler transformed text is $
, these operations are not performed.
Every 32 positions (i % 32 == 31
), the variable entry
is stored in OCC
and the bit field is reset (but the member smpl
is not because it holds the total cumulative occurrences of the characters). The remaining characters of entry
are stored in OCC
after the for-loop.
The last part of the function fills the C
array. As in the previous part of the tutorial, tmp
is an accumulator. Note that after scanning the whole Burrows-Wheeler transformed text, entry[j].smpl
contains the total count of nucleotide j
(numerical encoding), thus tmp
contains the cumulative occurrences of each letter (plus 1).
void construct_C_and_OCC (void) {
occ_t entry[4] = {0};
for (int i = 0 ; i < L ; i++) {
if (po$ != i) {
entry[get_BWT(i)].smpl++;
entry[get_BWT(i)].bits |= (1 << (31 - (i % 32)));
}
if (i % 32 == 31) {
for (int j = 0 ; j < 4 ; j++) {
OCC[j][i/32] = entry[j]; // Write entry to 'OCC'.
entry[j].bits = 0;
}
}
}
if (L % 32 != 31) { // Write the last entry if needed.
for (int j = 0 ; j < 4 ; j++) OCC[j][L32-1] = entry[j];
}
int tmp = 1;
for (int j = 0 ; j < 4 ; j++) {
tmp += entry[j].smpl;
C[j] = tmp;
}
}
In the rest of the code, all the queries to OCC
are preceded by a query to C
, so we lump them together in a single query function get_rank
.
The values of bits
contain all the necessary information to query the Occ table, but smpl
allows us to speed it up. The query consists of a look ahead at the next sampled value (OCC[c][pos/32].smpl
), followed by a short backtrack of at most 31 positions.
The backtrack is performed by OCC[c][pos/32].bits << (1+(pos%32))
, which shifts the bit field preceding the sampled value up to 32 positions to the left (erasing the upper bits). The number of 1s remaining is computed by the popcount function, here implemented as __builtin_popcount
.
Finally, the number of 1s is subtracted from the sampled value and the total is returned.
int get_rank (int c, int pos) {
return C[c] + OCC[c][pos/32].smpl -
__builtin_popcount(OCC[c][pos/32].bits << (1+(pos%32)));
}
Compressing the suffix array
The last step of the index construction is to compress the suffix array. In this implementation, I just down-sample it by a factor 16. We could do better by giving each entry the exact number of bits it requires, but this would mean working with unaligned memory (i.e. variables split between consecutive words). To maintain the difficulty at an appropriate level, I prefer to leave out this optimization.
The variable CSA
contains every entry of SA
out of 16. We simply jump over SA
and update CSA
on the fly. In a real implementation, we would free the memory allocated to SA
after constructing CSA
(in fact we would use a single variable SA
and resize it after down-sampling).
void compress_SA (void) {
for (int i = 0 ; i <= L ; i += 16) {
CSA[i/16] = SA[i];
}
}
To query CSA
, we need to reconstruct the missing values on demand. The first part of the tutorial explains how this is done using the Burrows-Wheeler transformed text recursively.
If the position we need to access is a multiple of 16, we can directly return the value. If the query is the position of $
in the Burrows-Wheeler transformed text, we know that the value of the suffix array is 0. In other cases, we need to find the position of the preceding suffix in the text, which is get_rank(get_BWT(pos),pos-1)
. With that information, we perform another query and continue the process until we hit a position that is a multiple of 16 (or the position of $
). Each time, we increment the return value by 1 to compensate for the fact that the next suffix is one position on the left of the query.
int query_CSA (int pos) {
if (pos % 16 == 0) return CSA[pos/16];
if (pos == po$) return 0;
return 1 + query_CSA(get_rank(get_BWT(pos),pos-1));
}
Implementing the backward search
Once all these modifications of the index are in place, the backward search is surprisingly similar to the basic implementation showed in the previous part of the tutorial. The only changes are the calls to get_rank
and get_CSA
that take care of querying the compressed Occ table and the compressed suffix array, respectively.
void backward_search (char * query) {
int bot = 0;
int top = L-1;
for (int pos = strlen(query)-1 ; pos > -1 ; pos--) {
int c = NUM[query[pos]];
bot = bot > 0 ? get_rank(c, bot-1) : C[c];
top = get_rank(c, top)-1;
if (top < bot) break;
}
for (int i = bot ; i <= top ; i++) {
printf("%s:%d\n", query, query_CSA(i));
}
}
Epilogue
Compression makes the code substantially more complex than the vanilla implementation. The main reason is that we have to care about the representation of the data. Such optimizations are not possible on high level languages. They are hard to understand and master at first, but they constitute the essence of this indexing method and are worth delving into.
« Previous Post | Next Post »
blog comments powered by Disqus