The Grand Locus / Life for statistical sciences



Detecting trends in culture

On June 28, 1914, Archduke Franz Ferdinand was assassinated in Sarajevo. One month later, Austria-Hungary declared war on Serbia. Russia mobilized in support of Serbia, and Germany, honouring its alliance with Austria-Hungary, declared war on Russia and then on France. The German invasion of Belgium brought Great Britain into the war, plunging Europe into a chaos that nobody had predicted.

Cliodynamics, the mathematical approach to History, still has a long way to go to reach the accuracy of Isaac Asimov’s fictional psychohistory. Its closest non-science-fiction relative, culturomics, relies on the idea that historical trends are accessible through the digital literature. As Jean-Baptiste Michel explains on TED, the course of History leaves a strong mark on the things we write about, and on the way we write about them.

But historical events are not the only thing we write about. The digital records are mostly about anything we find interesting. Knowing what is talked about is not science fiction; it is actually fairly easy. More challenging is to know whether a topic is currently on the rise or merely fluctuating, which is a changepoint detection problem. Research on changepoint problems holds few exact results because they are notoriously difficult to obtain. However, I recently came across a surprisingly good approximation using a random process known as the Brownian bridge. Enthusiastic about my discovery, I decided to test Gombay-Horvath asymptotics on a corpus that I am familiar with.
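To give a feel for how a Brownian bridge enters a changepoint problem, here is a minimal sketch of the classical CUSUM test for a shift in the mean. It is not the Gombay-Horvath procedure itself, only a simpler relative: under the hypothesis of no change, the scaled partial sums of the centered series behave approximately like a Brownian bridge, so the maximum can be compared to the Kolmogorov distribution. The function names and the monthly counts are made up for illustration, and the sketch assumes independent observations.

```python
import math


def kolmogorov_sf(x, terms=100):
    """Approximate P(sup |B(t)| > x) for a standard Brownian bridge B."""
    if x < 0.2:
        # The probability is within ~1e-12 of 1 here, and the
        # alternating series below converges slowly for small x.
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * x * x)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


def cusum_changepoint(series):
    """Return (k, statistic, p_value) for a single shift in the mean.

    k is the number of observations before the estimated change.
    """
    n = len(series)
    mean = sum(series) / n
    # Simplification: variance estimated from the whole sample.
    var = sum((x - mean) ** 2 for x in series) / n
    if var == 0:
        return None, 0.0, 1.0
    partial, best, best_k = 0.0, 0.0, 0
    for k, x in enumerate(series, start=1):
        partial += x - mean
        if abs(partial) > best:
            best, best_k = abs(partial), k
    stat = best / math.sqrt(n * var)
    return best_k, stat, kolmogorov_sf(stat)


# Hypothetical monthly counts of a term: flat, then on the rise.
before = [5, 6, 5, 4, 6, 5, 5, 4, 6, 5, 6, 5]
after = [15, 16, 14, 17, 15, 16, 15, 14, 16, 15, 17, 16]
k, stat, p_value = cusum_changepoint(before + after)
print(k, round(stat, 2), p_value)
```

A small p-value suggests that the term is not merely fluctuating, and `k` points to the month where the regime changed. Estimating the variance from the whole sample makes the test conservative when there is a real shift, which is acceptable for a sketch but not for a serious analysis.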

What are the buzzes in bio-medical research?

I downloaded the 1.98 million PubMed abstracts published since 2012, collected the terms with a Natural Language Processing pipeline described in this post and extracted the terms on the rise. Here is a whimsical anthology of PubMed’s 3674 buzzes.


Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) are a form of immune system present in bacteria and especially archaea. The mechanism was recently elucidated, and understanding the role of the protein Cas9 immediately led to a technological revolution in genome editing. CRISPR/Cas9 makes the difficult and expensive process of gene knockout a breeze. The plot on the left shows how briskly the term was picked up in PubMed abstracts around April 2013. In reality the first reports were published from January 2013, but the term CRISPR/Cas9 became the standard way to refer to this technology only later.


Long non-coding RNAs (lncRNAs) become more popular every day. The central dogma of molecular biology has it that biological information flows from DNA to RNA to proteins. However, encoding proteins does not seem to be the main worry of mammalian genomes. One of the lessons learned from the work of large consortia such as ENCODE and FANTOM is that most of the transcripts are not meant to produce proteins. The first blow to the central dogma was given as early as 2005, but the interest in this apparently wasteful transcription spiked only recently. There are about as many lncRNAs as protein-coding genes. The question is: what do they do?


Staring at my results, I wondered how there could possibly be a buzz around the term “torrent”. Looking at the context of the hits clarified what this is about. Ion Torrent is a new ion semiconductor sequencing technology still in its testing phase. The promotional video below explains the principle and the advantage over the current standards.

Dabrafenib and Ibrutinib

Some people did their homework in cancer research and others are talking about it. Two anti-cancer drugs were recently approved by the FDA. Dabrafenib is a small molecule inhibitor of the kinase BRAF, which is frequently mutated in melanoma. Inhibiting BRAF blocks the MAP-kinase pathway and causes cancer cell death. Dabrafenib seems to have fewer adverse effects than the similar inhibitor Vemurafenib.

Ibrutinib is an inhibitor of Bruton’s tyrosine kinase used for the treatment of lymphoma and leukemia. Bruton’s tyrosine kinase regulates the proliferation and survival of B cells, which is why it was chosen as a target for the development of small molecule inhibitors. Yet the mode of action of Ibrutinib is not fully understood.


The weirdest buzz on PubMed is definitely around the term “CISCOM”, which is the medical literature database of the Research Council for Complementary Medicine. The database has been available since 1995, but the CISCOM frenzy started only in February 2014. All the articles are meta-analyses published by Chinese groups that bear no apparent relationship with each other... and yet the papers are disturbingly similar. Below is a side-by-side comparison of Diagnostic accuracy of contrast-enhanced ultrasound for renal cell carcinoma: a meta-analysis (left) and Role of p16 gene promoter methylation in gastric carcinogenesis: a meta-analysis (right). The papers are published by teams working in different cities.

Arguably, meta-analyses follow a strict template and such coincidence of graphical style can happen. But how to explain the equally disturbing coincidence of the text? I reproduce below a passage from the first paper.

Firstly, our results had lacked sufficient statistical power to assess the accuracy of CEUS due to relatively small sample size and low-quality included studies.

And here is a passage from the second paper, taken at the exact same location in the text (second sentence of the second-to-last paragraph).

Firstly, our results had lacked sufficient statistical power to assess the exact roles of p16 promoter methylation in gastric carcinogenesis due to relatively small sample size.
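To put a rough number on this coincidence, one can compare the two sentences word by word. The sketch below uses Python's standard `difflib` module. It is only an illustration and not the way the buzzes were detected; the helper names are made up for the example.

```python
import re
from difflib import SequenceMatcher

first = ("Firstly, our results had lacked sufficient statistical power "
         "to assess the accuracy of CEUS due to relatively small sample "
         "size and low-quality included studies.")
second = ("Firstly, our results had lacked sufficient statistical power "
          "to assess the exact roles of p16 promoter methylation in "
          "gastric carcinogenesis due to relatively small sample size.")


def words(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"\w+", text.lower())


def word_similarity(a, b):
    """Share of words in common and in the same order (0 to 1)."""
    return SequenceMatcher(None, words(a), words(b)).ratio()


similarity = word_similarity(first, second)
print(f"word-level similarity: {similarity:.2f}")
```

Two sentences written independently about different topics would be expected to score much lower, but a single pair proves nothing by itself. The score only becomes informative when compared with a baseline computed on unrelated meta-analyses.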

Something strange is definitely happening with the CISCOM copycats, and this is the story that I will tell in more detail in a future post.


These are cute examples of how much information lies in analyses from the digital literature. However, the process is far from fully automated discovery. I had to validate the hits manually, and for many of them I still do not have a good understanding of the topic they refer to, i.e. what the buzz is actually about. Computers will detect the trend, but the meaning... that’s another story.
