In this tutorial, I would like to explain the basic ideas behind t-distributed Stochastic Neighbor Embedding, better known as t-SNE. There are tons of excellent material out there explaining how t-SNE works. Here, I would like to focus on why it works and what makes t-SNE special among data visualization techniques.
If you are not comfortable with formulas, you should still be able to understand this post, which is intended to be a gentle introduction to t-SNE. The next post will peek under the hood and delve into the mathematics and the technical detail.
One thing we all agree on is that we each have a unique personality. And yet it seems that five character traits are sufficient to sketch the psychological portrait of almost everyone. Surely, such portraits are incomplete, but they capture the most important features to describe someone.
The so-called five factor model is a prime example of dimensionality reduction. It represents diverse and complex data with a handful of numbers. The reduced personality model can be used to compare different individuals, give a quick description of someone, find compatible personalities, predict possible behaviors etc. In many...
The code is written in a very naive style, so you should not use it as a reference for good C code. Once again, the purpose is to highlight the mechanisms of the algorithm, disregarding all other considerations. That said, the code runs so it may be used as a skeleton for your own projects.
The code is available for download as a Github gist. As in the second part, I recommend playing with the variables, and debugging it with gdb to see what happens step by step.
Constructing the suffix array
First you should get familiar with the first two parts of the tutorial in order to follow the logic of the code below. The file
learn_bwt_indexing_compression.c does the same thing as in the second part. The input, the output and the logical flow are the same, but the file is different in many details.
We start with the definition of the
The code runs, but I doubt that it can be used for anything else than demonstrations. First, it is very naive and hard to scale up. Second, it does not use any compression nor down-sampling, which are the mainsprings of Burrows-Wheeler indexes.
The code is available for download as a Github gist. It is interesting for beginners to play with...
There are many resources explaining how the Burrows-Wheeler transform works, but so far I have not found anything explaining what makes it so awesome for indexing and why it is so widely used for short read mapping. I figured I would write such a tutorial for those who are not afraid of the detail.
Say we have a sequencing run with over 100 million reads. After processing, the reads are between 20 and 25 nucleotide long. We would like to know if these sequences are in the human genome, and if so where.
The first idea would be to use
grep to find out. On my computer, looking for a 20-mer such as
ACGTGTGACGTGATCTGAGC takes about 10 seconds. Nice, but querying 100 million sequences would take more than 30 years. Not using any search index,
grep needs to scan the whole human genome, and this takes time...
The story of this post begins a few weeks ago when I received a surprising email. I have never read a scientific article giving a credible account of a research process. Only the successful hypotheses and the successful experiments are mentioned in the text — a small minority — and the painful intellectual labor behind discoveries is omitted altogether. Time is precious, and who wants to read endless failure stories? Point well taken. But this unspoken academic pact has sealed what I call the curse of research. In simple words, the curse is that by putting all the emphasis on the results, researchers become blind to the research process because they never discuss it. How to carry out good research? How to discover things? These are the questions that nobody raises (well, almost nobody).
Where did I leave off? Oh, yes... in my mailbox lies an email from David Lipman. For those who don’t know him, David Lipman is the director of the NCBI (the bio-informatics spearhead of the NIH), of which PubMed and GenBank are the most famous children. Incidentally, David is also the creator of BLAST. After a brief exchange on the topic of my previous...
With this post I inaugurate the focus on series, where I go much more in depth than usual. I could as well have called it the gory details, but focus on sounds more elegant. You might be in for a shock if you are more into easy reading, so the focus on is also here as a warning sign so that you can skip the post altogether if you are not interested in the detail. For those who are, this way please...
In my previous post I exposed the multiple testing problem. Every null hypothesis, true or false, has at least a 5% chance of being rejected (assuming you work at 95% confidence level). By testing the same hypothesis several times, you increase the chances that it will be rejected at least once, which introduces a bias because this one time is much more likely to be noticed, and then published. However, being aware of the illusion does not dissipate it. For this you need insight and statistical tools.
Fail-safe $(n)$ to measure publication bias
Suppose $(n)$ independent research teams test the same null hypothesis, which happens to be true — so not interesting. This means that the...