## Focus on: the Kullback-Leibler divergence

By Guillaume Filion, filed under
Kullback-Leibler divergence,
series: focus on.

• 23 June 2019 •

The story of the Kullback-Leibler divergence starts in a top secret research facility. In 1951, right after the war, Solomon Kullback and Richard Leibler were working as cryptanalysts for what would soon become the National Security Agency. Three years earlier, Claude Shannon had shaken the academic world by formulating the modern theory of information. Kullback and Leibler immediately saw how this could be useful in statistics and they came up with the concept of *information for discrimination*, now known as *relative entropy* or *Kullback-Leibler divergence*.

The concept was introduced in an oringinal article, and later expanded by Kullback in the book Information Theory and Statistics. It has now found applications in most aspects of information technologies, and most prominently artificial neural networks. In this post, I want to give an advanced introduction on this concept, hoping to make it intuitive.

### Discriminating information

The original motivation given by Kullback and Leibler is still the best way to expose the main idea, so let us follow their rationale. Suppose that we hesitate between two competing hypotheses $(H_1)$ and $(H_2)$. To make things more concrete, say that we have an encrypted message $(x)$ that may come from two possible sources $(H_1)$ or $(H_2)$. Say we cannot decrypt either source, but from previous messages we learned the typical frequencies of each symbol in the encryption protocols. Now, we can use Bayes’ formula to see how our belief is updated from $(P(H_i))$ to $(P(H_i|x))$. Taking the log ratios of the two hypotheses, we obtain: $$\log \frac{P(H_1|x)}{P(H_2|x)}- \log \frac{P(H_1)}{P(H_2)} = \log \frac{P(x|H_1)}{P(x|H_2)}.$$

In other words, the quantity
$$\log \frac{P(x|H_1)}{P(x|H_2)}$$
measures how much the message $(x)$ changes our belief, and as such it deserves to be called the *discriminating information* of the message $(x)$. A message can speak in favor of $(H_1)$, in favor of $(H_2)$, or be uninformative if it has the same probability of occurrence under both sources. From there arises the question of how much discriminating information we can expect from future messages. Will we be able to identify the source? And if so, with what confidence?

The answer depends which is the actual source of the message. If it comes from $(H_1)$, the expected discriminating information is $$\text{KL}(H_1||H_2) = \int P(x|H_1) \log \frac{P(x|H_1)}{P(x|H_2)} dx = -\int P(x|H_1) \log \frac{P(x|H_2)}{P(x|H_1)} dx.$$

If it comes from $(H_2)$, the symbols $(H_1)$ and $(H_2)$ are swapped in the formula above. This is the definition of the Kullback-Leibler divergence between $(H_1)$ and $(H_2)$, or more accurately between the probability distributions $(P(\cdot|H_1))$ and $(P(cdot|H_2))$.

In summary, $(\text{KL}(H_1||H_2))$ quantifies the _expected change of belief in $(H_1)$_ when observing a random event from $(H_1)$. If it is equal to $(\log(10))$, say, events from $(H_1)$ will increase the confidence in $(H_1)$ by a factor 10 on average.

### The origin of asymmetry

The conceptual difficulty with the Kullback-Leibler divergence is that in general $(\text{KL}(H_1||H_2))$ is not the same as $(\text{KL}(H_2||H_1))$. Is this asymmetry important, and if so how should we interpret it?

The asymmetry is indeed important because the Kullback-Leibler divergence is event-centric. As counter-intuitive as it seems, one of the hypotheses may be easier to validate than the other. This is the case when there are more events that can are compatible only with this hypothesis.

For instance, imagine that the German nazis design an encryption protocol where the letter ß may be in the cipher text. It turns out to be inconvenient for their Italian allies because their typewriters do not have this letter. So the Germans decide to use a different encryption protocol when sending messages to their Italian allies. Can we know who the message is for just by looking at the cipher text.

The answer is asymmetric: if the cipher text contains the letter ß, then the message is obviously intended for the Germans; but if it does not, then there is a doubt. This is reflected in the Kullback-Leibler divergence. Assuming that the symbols have an independent uniform distribution (one of 26 letters under $(H_1)$ and one of 27 letters under $(H_2)$), the Kullback-Leibler divergence of $(H_1)$ versus $(H_2)$ is $$\text{KL}(H_1||H_2) = \sum_{i=1}^{26} \frac{1}{26} \log \frac{27}{26} \approx 0.0377.$$

In other words, if the message is intended for the Italians, every letter of the cipher text increases my confidence that this is the case by approximately 3.8% – because $(\exp(0.0377) \approx 1.038)$. Intuitively, if the message is short, it will be difficult to ascertain whom it is intended for. Now, the Kullback-Leibler divergence of $(H_2)$ versus $(H_1)$ is $$\text{KL}(H_2||H_1) = \sum_{i=1}^{26} \frac{1}{27} \log \frac{26}{27} + \frac{1}{27} \log \frac{1}{0} \frac{1}{17} = +\infty.$$

This is somewhat counter-intuitive: a message intended for the Germans does not always contain the letter ß, so why is the Kullback-Leibler divergence infinite? This is a property of infinite expectations; it only means that the discriminating information will fluctuate without bounds as we read the letters of the cipher text. The confidence for $(H_2)$ decreases until we a ß is read, when suddenly there is no longer any doubt – we have infinite information.

Remember that the the Kullback-Leibler divergence is the change of belief expected for events from one hypothesis. So, if $(\text{KL}(H_1||H_2))$ is higher than $(\text{KL}(H_2|H_1))$, then we know that we will need more observations to validate $(H_2)$ than to validate $(H_1)$ – *i.e.* to reach the same degree of confidence in $(H_2)$ than in $(H_1)$.

### Staying positive

Are there cases where the average discriminating information of events from $(H_1)$ is actually in favor of $(H_2)$? Adequately, the answer is no. This is one of the properties that make the Kullback-Leibler divergence most useful, as we will see below. For now, let us prove this statement, which is easy to do using Jensen’s inequality. Recalling that the logarithm is a convex function, we find $$\text{KL}(H_1||H_2) = -\int P(x|H_1) \log \frac{P(x|H_2)}{P(x|H_1)} dx \geq -\log \int P(x|H_1) \frac{P(x|H_2)}{P(x|H_1)} = 0.$$

This is an essential property of the Kullback-Leibler divergence. It means that every probability distribution brings about events that will allow it to be discriminated from every other probability distribution (if we observe a sufficient amount of events).

### Not knowing the truth

Reality is in general unknown and inaccessible. All we have are models with variable degrees of precision. For instance, in the examples above the true frequencies of the letters in a given encryption scheme are unknown: the analysts have to use empirical estimates. Does this make the Kullback-Leibler divergence useless?

On the contrary! In a general sense, the Kullback-Leibler divergence justifies using models that only approximate reality. To see how, let us consider the case that message $(x)$ comes from an unknown source $(H_0)$. There are several candidate sources $(H_j)$ but none of them is identical to $(H_0)$. So we will be wrong whatever we decide, and since we know nothing about $(H_0)$ (not even that it is not in our candidate set) it seems we will never learn anything useful.

And yet, the most straightforward approach proves very fertile. Let us just pick the model $(H_j)$ for which message $(x)$ is the most likely – this is known as the maximum likelihood method. In the large sample limit, the law of large number applies and empirical averages become increasingly close to expected values (assuming they are finite). Specifically, for large $(n)$ we have $$\log P(x_1, \ldots, x_n |H_j) = \sum_{i=1}^n \log P(x_i |H_j) \approx n \int P(x|H_0) \log P(x |H_j) dx.$$

The $(H_j)$ that maximizes $(\log P(x_1, \ldots, x_n |H_j))$ is the same as the $(H_j)$ that maximizes $$\int P(x|H_0) \log P(x |H_j) dx - \int P(x|H_0) \log P(x |H_0) dx = -\text{KL}(H_0||H_j).$$

So in the large sample limit, choosing the most likely hypothesis $(H_j)$ is the same as choosing the hypothesis for which the truth is the *least discriminating*.

### Variational learning

We just saw that minimizing $(\text{KL}(H_0||\cdot))$ is conceptually equivalent to an estimation problem. But what about minimizing $(\text{KL}(\cdot||H_0))$? In this case, we search the distribution $(P(\cdot|H_j))$ that minimizes $$\int P(x|H_j) \log \frac{1}{P(x |H_0)} dx - \int P(x|H_j) \log \frac{1}{P(x |H_j)} dx.$$

The left term is known as the cross entropy between $(P(\cdot|H_j))$ and $(P(\cdot|H_0))$; the right term is known as the entropy of $(P(\cdot|H_j))$. We are thus looking for a distribution that is a good approximation of the truth (left term low) and that has a high entropy (right term high). This is a case of penalized estimation, where the entropy term favours simple over complex models.

For a concrete example, imagine that the Americans have secured the plain text associated with a cipher text encrypted by the Germans. The Russians ask this information, but the Americans do not want to share their finding. They decide to give the plain text along with a fake cipher text that will send cryptanalysts on the wrong tracks. But how can they generate a realistic German cipher text if they do not know how the encryption work?

Imagine that the Americans found a way to know the approximate number of space characters in the plain texts. They suppose the Russians figured it out too, so they want to produce a fake that would give the correct number of space characters. They opt for a simple generator where the characters are independent of each other, but how to know their frequencies?

Let $(P(x|m))$ be the probability that the cipher text is $(x)$ given that the plain text contains $(m)$ space characters. This distribution is unknown and impossible to compute. The approximate distribution is $(Q(x))$. It seems absurd to minimize the Kullback-Leibler divergence between $(Q(x))$ and a distribution that cannot be computed, but fortune favors the fools so we try anyway. We observe that $$\text{KL}(Q(\cdot)||P(\cdot|m)) = \log P(m) - \int Q(x) \log \frac{1}{Q(x)} dx - \int Q(x) \log P(x;m) dx.$$

Surprisingly, we can minimize this expression with respect to $(Q(x))$: the first term is a constant, the second is the entropy of the distribution $(Q(x))$ and it can be computed analytically, and the third term can be expressed as $$\int Q(x) \log \Big(P(m|x)P(x)\Big) dx \approx \frac{1}{N}sum_{i=1}^N \log P(x_i) + \log P(m|x_i),$$ where $(x_i)$ is a random cipher text sampled from the distribution $(Q(\cdot))$ – the sum of the last two terms is called the variational or evidence lower bound, often abbreviated ELBO.

The approximation holds because of the law of large numbers. $(P(x_i))$ can be computed from the frequency of the characters in the available cipher texts and $(P(m|x_i))$ is the discovered relationship with the number of space characters. Finding the optimum of this expression is not straightforward, but since it can be computed, this can be done (using gradient descent for instance).

In conclusion, it is possible to approximate $(P(x|m))$ if we are able to compute $(P(m|x))$, in the sense that we can find $(Q(\cdot))$ with smallest Kullback-Leibler divergence to $(P(\cdot|m))$ within a certain class of distributions. By definition, this is the distribution that produces the smallest amount of events that would allow someone to figure out that the sample is not actually drawn from $(P(x|m))$. This method is the basis of the variational approach, most highlighted for its success in artificial neural networks (see for instance variational autoencoders).

### Conclusion

The Kullback-Leibler divergence is now part of the standard machine learning toolbox. It is often thought of as an asymmetric distance between probability distributions. The asymmetry turns out to be fertile: it distinguishes a *sampling* distribution (to generate events) and an *appraisal* distribution (to measure their probability). It also highlights a key difference between *prediction* and *discrimination*. Minimizing the Kullback-Leibler divergence with respect to the second distribution amounts to predicting the frequencies of the events as best we can. Minimizing it with respect to the first distribution amounts to finding a distribution that least discriminates the second.

In variational approaches, the Kullback-Leibler divergence is used primarily because it yields tractable equations. It is still interesting to understand the interpretation and why it often gives good results in practice.

blog comments powered by Disqus