The Grand Locus / Life for statistical sciences

the Blog

## Longest runs and DNA alignments

Update: The key assumption of the approximation below is that $nq$ is a relatively big number. This happens if the read is very long ($n$ is large) and/or the eror rate is high ($q$ is large). While these assumptions are reasonable with PacBio or Oxford Nanopore technologies, they are not for Illumina reads. In this case, I recommend using the formula based on stick-breaking from this post (it also holds for PacBio and Oxford Nanopore by the way).

The problem of sequence alignment gets a lot of attention from bioinformaticians (the list of alignment software counts more than 200 entries). Yet, the statistical aspect of the problem is often neglected. In the post Once upon a BLAST, David Lipman explained that the breakthrough of BLAST was not a new algorithm, but the careful calibration of a heuristic by a sound statistical framework.

Inspired by this idea, I wanted to work out the probability of identifying best hits in the problem of long read alignments. Since this is a fairly general result and that it may be useful for many similar applications, I post it here for reference.