The Grand Locus / Life for statistical sciences

the Blog

## Stick breaking and DNA alignment

A couple of months ago, I posted an approximate formula for the longest match in the problem of DNA alignment. I recently used it to calibrate a seeding heuristic to map Illumina reads and I was surprised to see that it was not just bad, but epic bad. Upon closer inspection, I realized that the main assumption does not hold when the error rate is small (which is typically the case for Illumina reads). The formula was based on longest runs in Bernoulli trials. This time I present more accurate results with an approach based on a stick breaking process.

### Stick breaking (spacings)

Inserting $k$ mutations at random in a sequencing read will produce $k+1$ (possibly empty) subsequences without errors. The process is analogous to inserting $k$ breaks at random in a stick of length 1, and we can approximate the distribution of the longest subsequence without error by that of the longest fragment when breaking the stick.

The example above illustrates this concept graphically. A sequencing read of 60 nucleotides contains 2 mutations highlighted in red and the longest error-free stretch is the central subsequence of 28 nucleotides. If mutations occur uniformly on the read...

## Longest runs and DNA alignments

Update: The key assumption of the approximation below is that $nq$ is a relatively big number. This happens if the read is very long ($n$ is large) and/or the eror rate is high ($q$ is large). While these assumptions are reasonable with PacBio or Oxford Nanopore technologies, they are not for Illumina reads. In this case, I recommend using the formula based on stick-breaking from this post (it also holds for PacBio and Oxford Nanopore by the way).

The problem of sequence alignment gets a lot of attention from bioinformaticians (the list of alignment software counts more than 200 entries). Yet, the statistical aspect of the problem is often neglected. In the post Once upon a BLAST, David Lipman explained that the breakthrough of BLAST was not a new algorithm, but the careful calibration of a heuristic by a sound statistical framework.

Inspired by this idea, I wanted to work out the probability of identifying best hits in the problem of long read alignments. Since this is a fairly general result and that it may be useful for many similar applications, I post it here for reference.