Fisher information (with a cat)
By Guillaume Filion, filed under
Immanuel,
bias–variance trade-off,
Fisher information,
dialogue.
• 13 December 2022 •
It is still summer but the days are getting shorter (p < 0.05). Edgar and Sofia are playing chess; Immanuel purrs on a sofa next to them. Edgar has been holding his head for a while, thinking about his next move. Sofia starts:
“Something bothers me, Immanuel. In the last post, you told us that Fisher information could be defined as a variance, but that is not what I remember from my mathematical statistics classes.”
“What do you remember, Sofia?”
“Our teacher said it was the curvature of the log-likelihood function around the maximum. More specifically, consider a parametric model $(f(X;\theta))$ where $(X)$ is a random variable and $(\theta)$ is a parameter. Say that the true (but unknown) value of the parameter is $(\theta^*)$. The first terms of the Taylor expansion of the log-likelihood $(\log f(X;\theta))$ around $(\theta^*)$ are
$$\log f(X;\theta^*) + (\theta - \theta^*) \cdot \frac{\partial}{\partial \theta} \log f(X;\theta^*) + \frac{1}{2}(\theta - \theta^*)^2 \cdot \frac{\partial^2}{\partial \theta^2} \log f(X;\theta^*).$$
Now compute the expected value and obtain the approximation below. We call it $(\varphi(\theta))$ to emphasize that it is an explicit function of $(\theta)$. The true value $(\theta^*)$ appears only implicitly through the unknown distribution of $(X)$.
$$\varphi(\theta) = E\left( \log f(X; \theta) \right) = -H(\theta^*) - \frac{1}{2} (\theta - \theta^*)^2 \cdot I(\theta^*) + \ldots $$
We cannot compute $(\varphi(\cdot))$ because it depends on $(\theta^*)$, but we know that $(\varphi(\cdot))$ looks like a downward-facing parabola. More importantly, $(\varphi(\cdot))$ reaches its maximum at $(\theta^*)$, the value at the peak is the Shannon information of the distribution (up to a minus sign), and the “peakiness” is the Fisher information about the parameter.”
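A quick numerical sketch can make the parabola concrete (the Gaussian model and the numbers below are illustrative assumptions, not part of the dialogue): for $(X \sim N(\theta^*, 1))$ the Fisher information is $(I(\theta^*) = 1)$, and a Monte Carlo estimate of $(\varphi(\theta))$ traces a downward parabola peaking at $(\theta^*)$.

```python
# Minimal sketch (illustrative assumption: X ~ N(theta*, 1), so I(theta*) = 1).
# Approximate phi(theta) = E[log f(X; theta)] by Monte Carlo and compare it
# with the exact downward parabola peaking at theta*.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
theta_star = 2.0
x = rng.normal(theta_star, 1.0, size=200_000)     # draws from the true model

thetas = np.linspace(0.5, 3.5, 7)
phi_mc = np.array([norm.logpdf(x, loc=t, scale=1.0).mean() for t in thetas])
phi_exact = -0.5 * np.log(2 * np.pi) - 0.5 * (1.0 + (thetas - theta_star) ** 2)

print(np.round(phi_mc, 3))      # maximum at theta = 2.0
print(np.round(phi_exact, 3))   # same parabola, curvature -I(theta*) = -1
```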
“Care to elaborate?” asks Edgar. He stopped paying attention to the chess game a while ago to listen to Sofia.
“Sure. Taking the expectation of the first term in the Taylor series, we simply get
$$ \int \log f(x;\theta^*) \cdot f(x;\theta^*)dx. $$
Up to a minus sign, this is the information content, or Shannon information, of the distribution, even though the term usually refers to discrete distributions. It is also known as the entropy (here, the differential entropy): it quantifies how much a distribution spreads over its domain of definition, or conversely concentrates on just a few values.”
“Right. And why did you neglect the second term in the Taylor series?”
“I did not. It is strictly zero. I will use the evaluation bar, as in $(\partial f(x;\theta) / \partial \theta \,\rvert_{\theta=\theta^*})$, to signify that I first take the derivative and then evaluate it at $(\theta^*)$. That is to avoid confusion with evaluating at $(\theta^*)$ first and then taking the derivative. The expected value of the second term then reads:
$$ \int \frac{\partial \log f(x;\theta)}{\partial \theta} \bigg\rvert_{\theta=\theta^*} \cdot f(x;\theta^*)dx = \int \frac{\partial f(x;\theta)} {\partial \theta} \bigg\rvert_{\theta=\theta^*} \cdot \frac{f(x;\theta^*)}{f(x;\theta^*)}dx. $$
The terms $(f(x;\theta^*))$ in the numerator and in the denominator cancel out and you are left with
$$ \int \frac{\partial f(x;\theta)}{\partial \theta}\bigg\rvert_{\theta=\theta^*} dx = \frac{\partial}{\partial \theta} \int f(x;\theta)dx \bigg\rvert_{\theta=\theta^*} = \frac{\partial}{\partial \theta}1 \bigg\rvert_{\theta=\theta^*} = 0. $$
By construction this term is 0. This is consistent with the fact that $(\theta^*)$ is a maximum of the expected log-likelihood.”
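As a sanity check (same illustrative Gaussian model as above, not part of the dialogue), the score at $(\theta^*)$ is $(x - \theta^*)$ and its sample mean is indeed close to zero.

```python
# Minimal sketch (same illustrative Gaussian model, X ~ N(theta*, 1)): the
# score at theta* is S(x) = x - theta*, and its sample mean is close to 0.
import numpy as np

rng = np.random.default_rng(1)
theta_star = 2.0
x = rng.normal(theta_star, 1.0, size=200_000)
score = x - theta_star             # d/dtheta log f(x; theta) evaluated at theta*
print(score.mean())                # ~0, as the argument predicts
```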
“Indeed. And the third term?”
“Well, this is the point. My teacher defined the Fisher information about the parameter $(\theta)$ as
$$ I(\theta^*) = - \int \frac{\partial^2 \log f(x;\theta)}{\partial \theta^2}\bigg\rvert_{\theta=\theta^*} \cdot f(x;\theta^*)dx. $$
Up to a minus sign, this is the expected value of the second derivative of the log-likelihood. But Immanuel defined Fisher information as the variance of the first derivative. Apparently, the definitions are equivalent, but my approach is much more intuitive. First you get the Shannon and the Fisher information in the same equation. Second, if you compute $(\varphi(\theta^*)-\varphi(\theta))$ you obtain the Kullback-Leibler divergence between $(f(\cdot;\theta^*))$ and $(f(\cdot;\theta))$ and you see that
$$ \text{KL} \left( \; f(\cdot;\theta^*) \; || \; f(\cdot;\theta) \; \right) \approx \frac{1}{2} (\theta - \theta^*)^2 \cdot I(\theta^*). $$
We have seen that the Kullback-Leibler divergence quantifies the capacity to distinguish two models and there is your answer: the Fisher information tells you how fast this capacity rises as $(\theta)$ moves away from $(\theta^*)$.”
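To make the approximation tangible, here is a small numerical check under an illustrative Poisson model with rate $(\lambda^*)$ (my choice, not part of the dialogue), where the exact divergence is $(\lambda^* \log(\lambda^*/\lambda) + \lambda - \lambda^*)$ and $(I(\lambda^*) = 1/\lambda^*)$.

```python
# Minimal sketch (illustrative Poisson model with rate lambda*): compare the
# exact KL(lambda* || lambda) = lambda*·log(lambda*/lambda) + lambda - lambda*
# with the quadratic approximation 0.5·(lambda - lambda*)^2·I(lambda*),
# where I(lambda*) = 1/lambda*.
import numpy as np

lam_star = 4.0
lams = np.array([3.6, 3.8, 4.0, 4.2, 4.4])

kl_exact = lam_star * np.log(lam_star / lams) + lams - lam_star
kl_approx = 0.5 * (lams - lam_star) ** 2 / lam_star

print(np.round(kl_exact, 5))
print(np.round(kl_approx, 5))   # close for lambda near lambda*
```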
“This makes perfect sense! In retrospect, Immanuel's definition seems unintuitive. Maybe his definition makes more sense for cats?”
The score as an information optimum
Immanuel finally speaks.
“Consider the variable called the score. For a random variable $(X)$ with distribution $(f(X;\theta))$, it is defined as
$$ S_\theta(X) = \frac{\partial \log f(X;\theta)} {\partial \theta}. $$
Beware that in machine learning, the derivative of the log-likelihood with respect to $(X)$ is often called the score, or the Stein score, which is confusing. Here I use the standard definition in statistics. The score is not a proper statistic in the sense that its value depends on $(\theta)$, which is unknown. It is more useful to see it as a random function: only after $(X)$ is observed do you have an explicit description of the function of $(\theta)$. Before $(X)$ is observed, all you can do is compute its expectation.”
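As a concrete example (an illustrative model, not part of the dialogue), for a Gaussian model with known unit variance the score is linear in $(x)$:

$$ \log f(x;\theta) = -\tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}(x-\theta)^2 \quad\Longrightarrow\quad S_\theta(x) = \frac{\partial \log f(x;\theta)}{\partial \theta} = x - \theta. $$

Observing $(X = x)$ turns the score into an explicit function of $(\theta)$, namely the line $(\theta \mapsto x - \theta)$; before that, it is a random function, as described above.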
“What is it used for, if it is not a proper statistic?” asks Edgar.
“Patience, Master. Consider a statistic $(g)$ that can be computed. If it can be computed, then it depends only on the observed sample $(X)$, so you can write it $(g(X))$. Now let us assume that $(g(X))$ is uncorrelated with the score $(S_\theta(X))$ and evaluate how the expected value of $(g)$ depends on $(\theta)$. Naturally, we compute the derivative of the expected value with respect to $(\theta)$. $$\frac{\partial}{\partial \theta} E[g(X)] = \int g(x) \cdot \frac{\partial f(x;\theta)}{\partial \theta}\,dx = \int g(x) \cdot \frac{\partial \log f(x;\theta)}{\partial \theta} \cdot f(x;\theta)\,dx.$$
As soon as we differentiate with respect to $(\theta)$, we obtain an expression that depends on the score. More specifically $$\frac{\partial}{\partial \theta} E[g(X)] = E [ g(X) \cdot S_\theta(X) ].$$
And now...”
“I see it!” interrupts Sofia. “Since $(g(X))$ is uncorrelated with $(S_\theta(X))$, the expectation of the product is the product of the expectations, $(E[g(X)]\cdot E[S_\theta(X)])$, which is equal to...”
“... zero,” finishes Edgar. “$(E[S_\theta(X)])$ is the same expectation as in the second term of the Taylor expansion we were talking about in the previous section, and we showed it is zero. So basically, Immanuel is saying that the expected value of $(g(X))$ is a constant that does not depend on $(\theta)$. Brilliant! Since this expectation does not depend on $(\theta)$, it contains no information whatsoever about the parameter.”
“Right,” continues Sofia, “and the upshot is that any estimator whose expected value depends on $(\theta)$ must be correlated with the score. That certainly makes the score special.”
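The identity above is easy to check numerically (illustrative Poisson model with rate $(\lambda)$ and statistic $(g(X) = X)$, my choices, not from the post): since $(E[S_\lambda(X)] = 0)$, the quantity $(E[g(X) \cdot S_\lambda(X)])$ is the covariance with the score, and it should match $(\frac{\partial}{\partial \lambda} E[X] = 1)$.

```python
# Minimal sketch (illustrative Poisson model with rate lam): check numerically
# that d/dlambda E[g(X)] equals E[g(X)·S_lambda(X)] for g(X) = X.
# Here S_lambda(x) = x/lambda - 1 and E[X] = lambda, so both sides should be ~1.
import numpy as np

rng = np.random.default_rng(2)
lam = 4.0
x = rng.poisson(lam, size=500_000)

score = x / lam - 1.0
lhs = 1.0                          # d/dlambda E[X] = d/dlambda lambda = 1
rhs = np.mean(x * score)           # Monte Carlo estimate of E[X·S_lambda(X)]
print(lhs, round(rhs, 3))          # both ~1
```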
“Or does it?” objects Edgar. “Let me take some pure noise $(\varepsilon)$ and define a new variable $(T_\theta(X) = S_\theta(X) + \varepsilon)$. Every estimator $(g(X))$ must be correlated with $(T_\theta(X))$, so what is special about $(S_\theta(X))$?”
Immanuel seems amused by the turn of the discussion.
“To follow my counter-argument, you will need to know the concept of conditional expectation. Decompose $(g(X))$ as $(E[g(X) | S_\theta(X)] + g(X) - E[g(X) | S_\theta(X)])$.
The second term $(g(X) - E[g(X) | S_\theta(X)])$ has expected value 0 because the average of the conditional expectation is $(E[g(X)])$. So the expected value of this term does not depend on $(\theta)$. Ergo, if the expected value of $(g(X))$ depends on $(\theta)$, the dependence comes entirely from the first term $(E[g(X) | S_\theta(X)])$, which is itself a function of $(S_\theta(X))$.”
Edgar scratches his head, as if to say that he does not see where Immanuel is going.
“Now, to answer your question, note that $(g(X) - E[g(X) | S_\theta(X)])$ is a random variable. As such, it has a variance, say $(V_2)$. If we call $(V_1)$ the variance of $(E[g(X) | S_\theta(X)])$, then the variance of the estimator $(g(X))$ is $(V_1 + V_2)$. It should be clear by now that $(V_2)$ only makes the estimator $(g(X))$ less reliable, because it is the variance of a random variable that carries no information about $(\theta)$. In summary, you can represent an estimator as a function of $(S_\theta(X))$ plus some irrelevant noise, i.e., $(g(X) = h\left(S_\theta(X)\right) + \varepsilon)$. The only relevant part is the function of the score. This is what makes the score special.”
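A small simulation can illustrate the decomposition (the model and the statistic are illustrative assumptions, not part of the dialogue): take $(X \sim N(0, \theta))$ with $(\theta)$ the variance, so the score depends on $(x)$ only through $(x^2)$; for $(g(X) = X + X^2)$, the conditional expectation $(E[g(X) | S_\theta(X)])$ is $(X^2)$ and the residual is $(X)$.

```python
# Minimal sketch (illustrative model: X ~ N(0, theta) with theta the variance).
# The score depends on x only through x^2, so for g(X) = X + X^2 the
# conditional expectation E[g | S] is X^2 and the residual g - E[g | S] is X.
# Check the decomposition Var(g) = V1 + V2, with V1 = Var(X^2), V2 = Var(X).
import numpy as np

rng = np.random.default_rng(3)
theta = 2.0                                   # true variance
x = rng.normal(0.0, np.sqrt(theta), size=500_000)

g = x + x ** 2
v1 = np.var(x ** 2)                           # variance of E[g | S]  (~2*theta^2 = 8)
v2 = np.var(x)                                # variance of the residual (~theta = 2)
print(round(np.var(g), 2), round(v1 + v2, 2))   # close to each other, ~10
```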
“I get it now,” says Edgar.
“Wait a minute!” objects Sofia. “You said that $(S_\theta(X))$ depends on $(\theta)$, so a function of $(S_\theta(X))$ also depends on $(\theta)$, and then $(g(X))$ cannot be a proper statistic because it depends on $(\theta)$.”
“Incorrect. The function $(h(\cdot))$ may itself depend on $(\theta)$ in ways that cancel any appearance of $(\theta)$ in $(g(X))$. Consider a Poisson variable with parameter $(\lambda)$; the score is $(-1 + X/\lambda)$. The statistic $(g(X) = \lambda (S_\lambda(X)+1) = X)$ is a deterministic function of $(S_\lambda(X))$ and it does not depend on $(\lambda)$.”
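For the record, here is the computation behind Immanuel's Poisson example:

$$ \log f(x;\lambda) = x \log \lambda - \lambda - \log x! \quad\Longrightarrow\quad S_\lambda(x) = \frac{x}{\lambda} - 1, \qquad \lambda\left(S_\lambda(x) + 1\right) = x. $$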
Information as a variance
“I see that the score is a very special function, in the sense that it contains all the information about $(\theta)$,” says Edgar, “but there is still something I do not get. Generally speaking, more variance means less information. Now, if you define the Fisher information as the variance of the score, then more variance means more information. Does that make any sense?”
“Your view of variance is narrow, Master. Think of principal component analysis where more variance means more information.”
“That’s a good point.”
“Now, consider what happens when the variance of the score is 0.”
“Let me see... The expected value of the score is 0, so $(S_\theta(x) = 0)$ for every $(x)$, which means that for every $(x)$
$$\frac{\partial f(x;\theta)} {\partial \theta} \cdot \frac{1}{f(x;\theta)} = 0, \text{ and thus } \frac{\partial f(x;\theta)} {\partial \theta} = 0.$$
The distribution does not depend on $(\theta)$ so it brings no information about it.”
“Exactly. Conversely, if the variance of the score is large, then $(S_\theta(x))$ is strongly positive for some values of $(x)$ and strongly negative for others. This means that for some values of $(x)$, the absolute value of $(\partial f(x;\theta) / \partial \theta)$ is relatively large, so the distribution changes substantially around $(x)$ when $(\theta)$ changes. The observed frequency of such values of $(x)$ is thus informative about $(\theta)$, and for this reason it can be said that a lot of information is available about the value of $(\theta)$.”
“Now I see it: when the variance of the score is small, the terms
$$ \frac{\partial f(x;\theta)}{\partial \theta} $$ are all very small and the distribution is practically constant when $(\theta)$ changes.”
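The equivalence of the two definitions, mentioned earlier, can also be checked numerically (illustrative Poisson model with rate $(\lambda)$, my choice, not from the post): the variance of the score and minus the expected second derivative of the log-likelihood should both be close to $(1/\lambda)$.

```python
# Minimal sketch (illustrative Poisson model with rate lam): check numerically
# that the two definitions of Fisher information agree, i.e., that the variance
# of the score equals minus the expected second derivative of the log-likelihood.
# For Poisson, S_lambda(x) = x/lambda - 1 and d2/dlambda2 log f = -x/lambda^2,
# so both quantities should be close to 1/lambda.
import numpy as np

rng = np.random.default_rng(4)
lam = 4.0
x = rng.poisson(lam, size=500_000)

var_score = np.var(x / lam - 1.0)            # variance of the score
neg_exp_curv = np.mean(x / lam ** 2)         # -E[d2 log f / dlambda2]
print(round(var_score, 4), round(neg_exp_curv, 4), 1 / lam)
```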
“It goes even further. To understand why Fisher information is a variance, you can return to our post on the Cramér-Rao lower bound. For the expected value of an estimator to depend on $(\theta)$, the estimator must correlate with the score.”
“Yes, this is what you showed earlier.”
“And for an estimator $(g(X))$ of $(\theta)$ to be unbiased, its expected value must increase by exactly the right amount when $(\theta)$ increases. This forces $(g(x))$ to be high where $(f(x;\theta))$ will increase and low where $(f(x;\theta))$ will decrease. But $(f(x;\theta))$ increases where $(S_\theta(x))$ is positive and decreases where $(S_\theta(x))$ is negative. As we have seen in the previous post, this constrains the covariance between $(g(X))$ and $(S_\theta(X))$ to be exactly equal to 1.”
“Right. But what does the variance of $(S_\theta(X))$ have to do with this?”
“An estimator is good only if its covariance with the score respects that constraint. Remember that the covariance is the correlation times the product of the standard deviations, so the squared covariance is at most the product of the variances. Ergo, if the score has a high variance, estimators with a low variance can still respect the constraint. Conversely, if the variance of the score is low, only estimators with a high variance can meet the covariance requirement.”
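In the illustrative Poisson model again (my choice, not from the post), the unbiased estimator $(g(X) = X)$ makes this concrete: its covariance with the score is 1 and its variance equals the Cramér-Rao bound $(1/I(\lambda) = \lambda)$.

```python
# Minimal sketch (illustrative Poisson model with rate lam): for the unbiased
# estimator g(X) = X, the covariance with the score is ~1, and its variance
# (~lambda) attains the Cramer-Rao lower bound 1/I(lambda) = lambda.
import numpy as np

rng = np.random.default_rng(5)
lam = 4.0
x = rng.poisson(lam, size=500_000)
score = x / lam - 1.0

print(round(np.cov(x, score)[0, 1], 3))   # covariance with the score, ~1
print(round(np.var(x), 3), lam)           # variance of g(X) = X vs bound 1/I = lambda
```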
“I would like to say that it makes sense, but I still find it complicated. At least it shows the bias–variance trade-off in an interesting light: relaxing the constraint on unbiasedness relaxes the constraint on the covariance with the score. With a relaxed constraint, you can have estimators with a lower variance, but their expectation cannot vary fast enough when you change $(\theta)$. Quite insightful, actually!”