ENCODE data, Principal Components and racism

“Thinking is classifying” wrote Georges Clémenceau*. This says, in simple words, everything about the human mind's obsession with keeping things tidy. No surprise we ask computers for a little help here and there. Is this email spam? Is this online user human? Was this text written by that author? Training machines to put things into the boxes created by our human mind is called supervised learning, and it can be very lucrative. But what about the more philosophical cases where machines make their own boxes? Can we reverse the process and put things in boxes created by computers? Unsupervised learning, as it is called, creates a lot of interesting problems where we humans are left wondering whether the boxes make any sense.

The mother of all classification techniques is undisputedly Principal Component Analysis (PCA). But let me reassure those who hate PCA and those who have never heard of it: I will just scratch the surface, and only very briefly. PCA automatically arranges similar items close to each other on a plane. The rest is up to you. Similarity, in particular, depends on a bunch of arbitrary features: size, height, number of legs... In a classical introductory...

100% non functional

Panglossian genomics

Like most French students of my generation, I had to study Candide, a short philosophical novella written by Voltaire. Back then, I was convinced that Voltaire was an arrogant prick, and I never imagined that his dumb criticism of Leibniz’s theory of pre-established harmony, which he barely understood, would ever echo in my work as a biologist.

But here we are: years have passed, I have made peace with Voltaire, and the ENCODE consortium has issued its major and controversial statement that they find “biochemical functions for 80% of the genome”. As the arguments and the comments flow on the blogs and in the academic press, I cannot help thinking about the words of Dr. Pangloss, the incarnation of narrow optimism.

Observe, for instance, the nose is formed for spectacles, therefore we wear spectacles. The legs are visibly designed for stockings, accordingly we wear stockings.

What I will call the Panglossian reading of the “80% functional” statement above is the idea that 80% of the genome is meant to be the way it is. The architecture of a given locus is somehow designed to produce what happens there (transcription, transcription enhancing, transcription factor binding, etc.). Notice...

