The problem of sequence alignment gets a lot of attention from bioinformaticians (the list of alignment software runs to more than 200 entries). Yet, the statistical aspect of the problem is often neglected. In the post Once upon a BLAST, David Lipman explained that the breakthrough of BLAST was not a new algorithm, but the careful calibration of a heuristic by a sound statistical framework.
Inspired by this idea, I wanted to work out the probability of identifying best hits in the problem of long read alignment. Since this is a fairly general result that may be useful for many similar applications, I post it here for reference.
Longest runs of 1s
I start with generalities on series of 0s and 1s and focus on the distribution of the longest run of 1s. This general problem has many applications, and I will explain why it is important for sequence alignment in the next section.
Assume that we have a Bernoulli sequence of length $n$ such that the probability of a 1 is $p$ and the probability of a 0 is $1-p = q$. Let $X_0$ be the length of the longest run of 1s in the sequence. As $n$ increases...
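The setup above is easy to explore empirically. Here is a minimal sketch (the function names are mine, not from the derivation): it generates Bernoulli sequences and records the longest run of 1s in each, which is the quantity $X_0$ whose distribution the post studies.

```python
import random

def longest_run(bits):
    """Length of the longest run of 1s in a 0/1 sequence."""
    best = cur = 0
    for b in bits:
        cur = cur + 1 if b == 1 else 0
        best = max(best, cur)
    return best

def simulate_longest_runs(n, p, trials=1000, seed=0):
    """Empirical sample of the longest run of 1s over many
    Bernoulli sequences of length n with success probability p."""
    rng = random.Random(seed)
    return [
        longest_run([1 if rng.random() < p else 0 for _ in range(n)])
        for _ in range(trials)
    ]
```

A histogram of `simulate_longest_runs(10_000, 0.25)` shows the concentration of the longest run around a slowly growing value as $n$ increases, which is the behavior the derivation makes precise.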
Before you rush to the comments and express your opinion about this title, let me make something clear. I do not expect anybody to erase their operating system and install Linux after reading this post. Actually, I do not care whether you use Linux or another operating system. All I want is to share what I have learned by using Linux, and why it made me a better scientist.
The mouse that infects your brain
About a year after I started to use Linux, I was surprised to realize how uncomfortable working on Mac and Windows had become. I could not quite pinpoint the problem, but I had the vague feeling that something was missing. Yet everything seemed to be there. When I could finally formulate it, I realized that what was bugging me was that all the possible options had been preconceived for me. I could click on option A, I could click on option B, and if I liked neither of those, there was no option C.
But most surprising was that I had never realized this before, because I had no idea of all the things you can do with your computer. I...
A recent Nature editorial entitled “Code share” discusses an update in Nature’s policy regarding the use of software. Interestingly, the subtitle is
Papers in Nature journals should make computer code accessible where possible.
Yes, finally! The last decade was a transition period which, in the history of bioinformatics, will probably be known as the “bioinformatics revolution”. Following the completion of the first genome projects, the demand for bioinformatics rose steadily, to the detriment of biochemistry and genetics, which have now fallen from grace. Something this traumatic cannot happen in a day, and it cannot happen without pain. Actually, the transition is still ongoing, and this regularly causes difficulties of all kinds in biology.
One of the most perverse effects of the massive popularization of bioinformatics is that senior scientists were not properly trained for it. This led to an implicit view that bioinformatics is a tool, somewhat like a microscope or a FACS. This explains why the materials and methods sections of the first papers using bioinformatics were often reduced to something like “all the bioinformatics analyses were performed using R”. In other words, “we got some bioinformatics software and asked a qualified technician to use it...
About a year after setting up my laboratory, an observation suddenly hit me. All the job applicants were biologists who wanted to do bioinformatics. I was myself trained as an experimental biologist and started bioinformatics during my post-doc. They saw in my laboratory the opportunity to do the same. Indeed, “how did you become a bioinformatician?” is a question that I hear very often.
For lack of a better plan, most people grab a book about Linux or sign up for a Coursera class, try to do a bit every day and... well, just learn bioinformatics. I have seen very few people succeed this way. The content inevitably becomes too difficult, motivation decreases and other commitments take over. I will not lie, self-learning bioinformatics is hard and it is frustrating... but it can be fun if you know how to do it. And most importantly, if you understand your worst enemy: yourself.
Here is a small digest of how it happened for me. I do not mean that this is the only way. I simply hope that this will be useful to those who seriously want to dive into bioinformatics.
Step 1. Get out of your...
It started with a search for trends on PubMed. I am not sure what I expected to find, but it was nothing like the “CISCOM meta-analyses”. Here is the story of how my colleague Lucas Carey (from Universitat Pompeu Fabra) and I discovered a collection of disturbingly similar scientific papers, and how we got to the bottom of it.
CISCOM is the medical publication database of the Research Council for Complementary Medicine. Available since 1995, it used to be mentioned in 2 to 3 papers per year, until February 2014, when the number of hits started to skyrocket. Since then, “CISCOM” surfs a tsunami of one new hit per week.
But this is not what drew my attention, as such waves are not unheard of on PubMed. For instance, the progression of CRISPR/Cas9 is more impressive. It was the titles of the hits that convinced me that something fishy was going on: all of them follow the model “something and something else: a meta-analysis”.
The strange pattern caught my attention, but I somehow missed its significance and put this in the back of my mind. It was only later that Lucas convinced me...
A colleague of mine (let’s call him John) recently put me in a difficult situation. John is a very good immunologist who, like nearly everybody in the field, had to embrace the “omics” revolution. Spirited and curious, he has taken the time to look more closely into statistics, and he now has an understanding of the most popular parametric and nonparametric tests. One day, he came to me with the following situation.
“I have this gene expression data, you see... I know that a gene is up-regulated, but it is just not significant.
— What do you mean not significant? What test did you use?
— I used the Wilcoxon test. I have five replicates in each condition, and without proof that the distribution is approximately Gaussian, I have been told to use a nonparametric test.
— I agree, that’s probably safer. Well John, that sucks, it means that you have to do the experiment again.
— But I can’t. This was patient material. This is all the data I have and I cannot get more. Is there a way to boost the significance of the test?
— I see... I already told you about p-hacking, right? You...
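John’s predicament has a quantitative side worth spelling out (my own illustration, not from the post; the function name is mine). With five replicates per condition and no ties, a rank-based test such as the Wilcoxon rank-sum has an exact, discrete null distribution over the $\binom{10}{5} = 252$ equally likely rank assignments, so the attainable p-values form a coarse grid and cannot go below $2/252 \approx 0.0079$ two-sided.

```python
from itertools import combinations

def exact_ranksum_null(n=5, m=5):
    """Exact null distribution of the Wilcoxon rank-sum statistic
    for two groups of sizes n and m, assuming no ties: enumerate
    every way the first group's ranks can fall among 1..n+m."""
    ranks = range(1, n + m + 1)
    counts = {}  # rank-sum value -> number of arrangements
    total = 0
    for group in combinations(ranks, n):
        w = sum(group)
        counts[w] = counts.get(w, 0) + 1
        total += 1
    return counts, total

counts, total = exact_ranksum_null()
# total == 252 arrangements; the most extreme rank-sum (1+2+3+4+5 = 15)
# occurs exactly once, so the smallest one-sided p-value is 1/252.
```

This discreteness is why a genuinely up-regulated gene can land just above the significance threshold: the test simply has no p-value available between neighboring grid points.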
In Jorge Luis Borges’s short fiction Tlön, Uqbar, Orbis Tertius, the narrator discovers by accident the existence of a secret encyclopaedia. In this story more than in the others, Borges added many details that are actually true, in such a way that it is hard to tell the reality from the fiction. Needless to say, most would think that the secret encyclopaedia is pure fiction... and they would be wrong. Secret encyclopaedias do exist and they go by the name of “shadow libraries”.
In the course of developing PubCron (a personalized academic literature watch), I learned that PubMed references more than 2,000 new articles daily. For the vast majority of those papers, the authors pay about $1,000 as publication fees. In the bio-medical field alone, this is a $2 million gift to the publishing industry. Every day.
Gift is not the proper term. This money goes to scientific editors, and in such amounts, it should be sufficient to sustain about 20,000 professional editors. An editor working full time would publish 3 papers per month... not an unreasonable estimate considering that only a fraction of the manuscripts are published. This means...
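The back-of-the-envelope numbers in this excerpt can be checked in a few lines (my own arithmetic, using the same round figures as the post):

```python
# Figures quoted in the post.
papers_per_day = 2_000
fee_per_paper = 1_000            # USD, publication fee per article
editors = 20_000
papers_per_editor_per_month = 3

# Daily flow to the publishing industry: 2,000 * $1,000 = $2M per day.
daily_flow = papers_per_day * fee_per_paper

# Capacity vs. output, per year: 20,000 editors at 3 papers/month
# handle 720,000 papers, against roughly 730,000 published.
yearly_capacity = editors * papers_per_editor_per_month * 12
yearly_output = papers_per_day * 365
```

So the editor workforce implied by the fees matches the publication volume almost exactly, which is the point the post goes on to make.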
On June 28, 1914, Archduke Franz Ferdinand was assassinated in Sarajevo. One month later, Austria-Hungary declared war on Serbia, to which Russia responded by declaring war on Austria-Hungary, forcing its allies France and Great Britain into the war. In the aftermath, Germany honoured its defensive pact with Austria-Hungary and declared war on France, plunging Europe into a chaos that nobody had predicted.
Cliodynamics, the mathematical approach to History, still has a long way to go to reach the accuracy of Isaac Asimov’s fictional psychohistory. Its closest non-science-fiction relative, culturomics, relies on the idea that historical trends are accessible through the digital literature. As Jean-Baptiste Michel explains on TED, the course of History leaves a strong mark on the things we write about, and on the way we write about them.
But historical events are not the only thing we write about. The digital record covers just about anything we find interesting. Knowing what is being talked about is not science fiction; it is actually fairly easy. More challenging is to know whether a topic is currently on the rise or merely fluctuating, which is a changepoint detection problem. Research on changepoint problems...
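The simplest instance of the changepoint problem, a single shift in the mean of a series, can be sketched in a few lines (my own illustration, not from the post): try every split point and keep the one where modelling each side by its own mean best fits the data.

```python
def best_changepoint(xs):
    """Index where the second regime starts, chosen to minimize the
    total squared error of a two-mean (one changepoint) model."""
    def sse(seg):
        # Sum of squared deviations from the segment's own mean.
        if not seg:
            return 0.0
        mu = sum(seg) / len(seg)
        return sum((x - mu) ** 2 for x in seg)

    best_k, best_cost = None, float("inf")
    for k in range(1, len(xs)):
        cost = sse(xs[:k]) + sse(xs[k:])
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k

# Hypothetical yearly hit counts that jump from ~2 to ~50 per year:
series = [2, 3, 1, 2, 3, 50, 52, 48, 51]
# best_changepoint(series) returns 5, the index where the jump begins.
```

Real topic trends are noisier than a clean mean shift, which is why this is an active research area rather than a solved exercise; but the least-squares split captures the core idea.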
The story of this post begins a few weeks ago, when I received a surprising email.

I have never read a scientific article giving a credible account of a research process. Only the successful hypotheses and the successful experiments are mentioned in the text — a small minority — and the painful intellectual labor behind discoveries is omitted altogether. Time is precious, and who wants to read endless failure stories? Point well taken. But this unspoken academic pact has sealed what I call the curse of research. In simple words, the curse is that by putting all the emphasis on the results, researchers become blind to the research process because they never discuss it. How to carry out good research? How to discover things? These are the questions that nobody raises (well, almost nobody).
Where did I leave off? Oh, yes... in my mailbox lies an email from David Lipman. For those who don’t know him, David Lipman is the director of the NCBI (the bioinformatics spearhead of the NIH), of which PubMed and GenBank are the most famous children. Incidentally, David is also the creator of BLAST. After a brief exchange on the topic of my previous...
In July 1982, paleontologist Stephen Jay Gould was diagnosed with cancer. Facing a median survival prognosis of only 8 months, he used his knowledge of statistics to prepare for the future. As he explains in The Median Isn’t the Message, if half of the patients died of this rare form of mesothelioma within 8 months, those who did not had much better survival prospects. Evaluating his own chances of being in the “survivor” group as high, he planned for long-term survival and opted out of the standard treatment. He died 20 years later, from an unrelated disease.
If not the median, then what is the message? Statistics put a disproportionate emphasis on the typical or average behavior, when what matters sometimes lies in the extremes. This general blindness to the extremes is responsible for a dreadful lot of confusion in the bio-medical field. One of my all-time favorite traps is the extreme value fallacy. Nothing better than an example will explain what it is about.
June babies and anorexia