The Grand Locus / Life for statistical sciences

Why do bioinformatics?

I never planned to do bioinformatics. It just happened because I liked spending time in front of my computer and my boss was OK with it. Still, like every sane individual, I sometimes think that I should do something else with my life, and I wonder whether I am doing the right thing. On this topic, I recently came across the famous farewell to bioinformatics by Frederick J. Ross, which is worth reading, and whose most emblematic quote is the now celebrated aphorism

Fuck you, bioinformatics. Eat shit and die.

There is nothing to agree or disagree with in this quote, but Frederick gives further detail about his point of view in the post. In short, bioinformaticians are bad programmers, and community-level obfuscation maintains the illusion.

By making the tools unusable, by inventing file format after file format, by seeking out the most brittle techniques and the slowest languages, by not publishing their algorithms and making their results impossible to replicate, the field managed to reduce its productivity by at least 90%, probably closer to 99%.

There are indeed many issues in the bioinformatics community and I am on Frederick’s side regarding file formats. For instance, I have huge respect for the maintainers of the BAM/SAM format, but here is a quote, straight from the documentation*.

   Structure for core alignment information.
   typedef struct { 
       int32_t tid; 
       int32_t pos; 
       uint32_t bin:16, qual:8, l_qname:8; 
       uint32_t flag:16, n_cigar:16; 
       int32_t l_qseq; 
       int32_t mtid; 
       int32_t mpos; 
       int32_t isize; 
   } bam1_core_t;
   
   Fields
   tid
      chromosome ID, defined by bam_header_t
   pos
      0-based leftmost coordinate
   strand
      strand; 0 for forward and 1 otherwise
   bin
      bin calculated by bam_reg2bin()
   qual
      mapping quality
   l_qname
      length of the query name
   flag
      bitwise flag
   n_cigar
      number of CIGAR operations
   l_qseq
      length of the query sequence (read)
   

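You do not need to know anything about C to notice that the description does not match the declaration: it documents a strand field that the structure does not contain, and it says nothing about mtid, mpos and isize. At some point, the core storage format of BAM changed (just that!) and the old documentation got mixed up with the new one. So much for a planetary standard.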

But no discussion of bioinformatics nonsense would be complete without a benchmark section. In our last software article, we were asked to run our benchmark against an all-pairs algorithm called slidesort. The original benchmark of slidesort concealed two minor details: that it takes months to return, and that it is not an all-pairs algorithm. Since the maintainers’ email address was obsolete, we had to put some effort into finding the authors and asking for an explanation. The answer was that it was probably a bug. But “bug” is too polite; “software pollution” is more appropriate.

... so why do bioinformatics?

The answer is simple: because it matters. Even though I deeply agree with Frederick, not everything boils down to working with skillful people. The impact of bioinformatics is unacknowledged but visible. How many discoveries started with a BLAST search? How many experiments were possible only because the human genome has been sequenced? Besides, not every problem in bioinformatics is about memory footprint and CPU cycles; in some cases there are lives at stake. Choosing a treatment for cancer patients, deciding upon an abortion based on genotype data, initiating a vaccination campaign... and so much more.

Bioinformatics is biology, and it matters.

Notes:
* The text has since been updated.

