This post is the third part of a tutorial on t-SNE. The first part introduces dimensionality reduction and presents the main ideas of t-SNE. The second part introduces the notion of perplexity. The present post covers the details of the nonlinear embedding.
On the origins of t-SNE
If you are following the field of artificial intelligence, the name Geoffrey Hinton should sound familiar. As it turns out, the “Godfather of Deep Learning” is the author of both t-SNE and its ancestor SNE. This explains why t-SNE has a strong flavor of neural networks. If you already know gradient-descent and variational learning, then you should feel at home. Otherwise no worries: we will keep it relatively simple and we will take the time to explain what happens under the hood.
We have seen previously that t-SNE aims to preserve a relationship between the points, and that this relationship can be thought of as the probability of hopping from one point to the other in a random walk. The focus of this post is to explain what t-SNE does to preserve this relationship in a space of lower dimension.
The story of the Kullback-Leibler divergence starts in a top secret research facility. In 1951, right after the war, Solomon Kullback and Richard Leibler were working as cryptanalysts for what would soon become the National Security Agency. Three years earlier, Claude Shannon had shaken the academic world by formulating the modern theory of information. Kullback and Leibler immediately saw how this could be useful in statistics and they came up with the concept of information for discrimination, now known as relative entropy or Kullback-Leibler divergence.
The concept was introduced in an oringinal article, and later expanded by Kullback in the book Information Theory and Statistics. It has now found applications in most aspects of information technologies, and most prominently artificial neural networks. In this post, I want to give an advanced introduction on this concept, hoping to make it intuitive.
The original motivation given by Kullback and Leibler is still the best way to expose the main idea, so let us follow their rationale. Suppose that we hesitate between two competing hypotheses $(H_1)$ and $(H_2)$. To make things more concrete, say that we have an encrypted message $(x)$ that may come from two possible...
In this tutorial, I would like to explain the basic ideas behind t-distributed Stochastic Neighbor Embedding, better known as t-SNE. There are tons of excellent material out there explaining how t-SNE works. Here, I would like to focus on why it works and what makes t-SNE special among data visualization techniques.
If you are not comfortable with formulas, you should still be able to understand this post, which is intended to be a gentle introduction to t-SNE. The next post will peek under the hood and delve into the mathematics and the technical detail.
One thing we all agree on is that we each have a unique personality. And yet it seems that five character traits are sufficient to sketch the psychological portrait of almost everyone. Surely, such portraits are incomplete, but they capture the most important features to describe someone.
The so-called five factor model is a prime example of dimensionality reduction. It represents diverse and complex data with a handful of numbers. The reduced personality model can be used to compare different individuals, give a quick description of someone, find compatible personalities, predict possible behaviors etc. In many...
The code is written in a very naive style, so you should not use it as a reference for good C code. Once again, the purpose is to highlight the mechanisms of the algorithm, disregarding all other considerations. That said, the code runs so it may be used as a skeleton for your own projects.
The code is available for download as a Github gist. As in the second part, I recommend playing with the variables, and debugging it with gdb to see what happens step by step.
Constructing the suffix array
First you should get familiar with the first two parts of the tutorial in order to follow the logic of the code below. The file
learn_bwt_indexing_compression.c does the same thing as in the second part. The input, the output and the logical flow are the same, but the file is different in many details.
We start with the definition of the
The code runs, but I doubt that it can be used for anything else than demonstrations. First, it is very naive and hard to scale up. Second, it does not use any compression nor down-sampling, which are the mainsprings of Burrows-Wheeler indexes.
The code is available for download as a Github gist. It is interesting for beginners to play with...
There are many resources explaining how the Burrows-Wheeler transform works, but so far I have not found anything explaining what makes it so awesome for indexing and why it is so widely used for short read mapping. I figured I would write such a tutorial for those who are not afraid of the detail.
Say we have a sequencing run with over 100 million reads. After processing, the reads are between 20 and 25 nucleotide long. We would like to know if these sequences are in the human genome, and if so where.
The first idea would be to use
grep to find out. On my computer, looking for a 20-mer such as
ACGTGTGACGTGATCTGAGC takes about 10 seconds. Nice, but querying 100 million sequences would take more than 30 years. Not using any search index,
grep needs to scan the whole human genome, and this takes time...
The story of this post begins a few weeks ago when I received a surprising email. I have never read a scientific article giving a credible account of a research process. Only the successful hypotheses and the successful experiments are mentioned in the text — a small minority — and the painful intellectual labor behind discoveries is omitted altogether. Time is precious, and who wants to read endless failure stories? Point well taken. But this unspoken academic pact has sealed what I call the curse of research. In simple words, the curse is that by putting all the emphasis on the results, researchers become blind to the research process because they never discuss it. How to carry out good research? How to discover things? These are the questions that nobody raises (well, almost nobody).
Where did I leave off? Oh, yes... in my mailbox lies an email from David Lipman. For those who don’t know him, David Lipman is the director of the NCBI (the bio-informatics spearhead of the NIH), of which PubMed and GenBank are the most famous children. Incidentally, David is also the creator of BLAST. After a brief exchange on the topic of my previous...
With this post I inaugurate the focus on series, where I go much more in depth than usual. I could as well have called it the gory details, but focus on sounds more elegant. You might be in for a shock if you are more into easy reading, so the focus on is also here as a warning sign so that you can skip the post altogether if you are not interested in the detail. For those who are, this way please...
In my previous post I exposed the multiple testing problem. Every null hypothesis, true or false, has at least a 5% chance of being rejected (assuming you work at 95% confidence level). By testing the same hypothesis several times, you increase the chances that it will be rejected at least once, which introduces a bias because this one time is much more likely to be noticed, and then published. However, being aware of the illusion does not dissipate it. For this you need insight and statistical tools.
Fail-safe $(n)$ to measure publication bias
Suppose $(n)$ independent research teams test the same null hypothesis, which happens to be true — so not interesting. This means that the...