A couple of months ago, I posted an approximate formula for the longest match in the problem of DNA alignment. I recently used it to calibrate a seeding heuristic to map Illumina reads and I was surprised to see that it was not just bad, but epic bad. Upon closer inspection, I realized that the main assumption does not hold when the error rate is small (which is typically the case for Illumina reads). The formula was based on longest runs in Bernoulli trials. This time I present more accurate results with an approach based on a stick breaking process.
Stick breaking (spacings)
Inserting $(k)$ mutations at random in a sequencing read will produce $(k+1)$ (possibly empty) subsequences without errors. The process is analogous to inserting $(k)$ breaks at random in a stick of length 1, and we can approximate the distribution of the longest subsequence without error by that of the longest fragment when breaking the stick.
The example above illustrates this concept graphically. A sequencing read of 60 nucleotides contains 2 mutations highlighted in red and the longest error-free stretch is the central subsequence of 28 nucleotides. If mutations occur uniformly on the read...