The story of the Kullback-Leibler divergence starts in a top secret research facility. In 1951, right after the war, Solomon Kullback and Richard Leibler were working as cryptanalysts for what would soon become the National Security Agency. Three years earlier, Claude Shannon had shaken the academic world by formulating the modern theory of information. Kullback and Leibler immediately saw how this could be useful in statistics and they came up with the concept of information for discrimination, now known as relative entropy or Kullback-Leibler divergence.
The concept was introduced in an original article, and later expanded by Kullback in the book Information Theory and Statistics. It has now found applications in most aspects of information technologies, and most prominently artificial neural networks. In this post, I want to give an advanced introduction on this concept, hoping to make it intuitive.
The original motivation given by Kullback and Leibler is still the best way to expose the main idea, so let us follow their rationale. Suppose that we hesitate between two competing hypotheses $(H_1)$ and $(H_2)$. To make things more concrete, say that we have an encrypted message $(x)$ that may come from two possible...