Entropy, Relative Entropy, Mutual Information

Tags: Basics, EE 276

Proof tips

Big chart of conditions, functions, etc

|  | $H(X)$ | $H(X, Y)$ | $H(X \mid Y)$ | $I(X; Y)$ |
| --- | --- | --- | --- | --- |
| Apply $f$ | $H(f(X)) \leq H(X)$. Equality if injective | $H(f(X, Y)) \leq H(X, Y)$. Equality if injective | $H(f(X) \mid Y) \leq H(X \mid Y)$. But $H(X \mid f(Y)) \geq H(X \mid Y)$ | $I(f(X); Y) \leq I(X; Y)$ |
| Conditional on $Z$ | $H(X \mid Z) \leq H(X)$ | $H(X, Y \mid Z) \leq H(X, Y)$ | $H(X \mid Y, Z) \leq H(X \mid Y)$ | $I(X; Y \mid Z)$ is ambiguous |
| Concavity | Concave in $X$ | Concave in $X, Y$ | Linear in $Y$, concave in $X$ | Concave in $X$, convex in $Y \mid X$ |

Entropy

Definition 🐧

Mathematically, the entropy of a discrete random variable $X$ is defined as

$$H(X) = \sum_x p(x) \log \frac{1}{p(x)} = E_{x\sim X}\left[\log \frac{1}{p(x)}\right]$$

We write $H_b(X)$ for the entropy with respect to base $b$, although by the change-of-base formula this only rescales the value by a constant. Entropy is measured in bits if you use base 2 (the most common choice).

The logarithm function is actually not completely necessary; it just gives nice mathematical properties and is also relevant to communication.

The key intuition

You can understand entropy as the average information gained from seeing a sample of a random variable, where "information" is basically the "surprise" you get from seeing the sample. Another way of understanding entropy is as uncertainty.

You can also understand $H(X)$ as the average number of bits needed to represent $X$.

Remember that entropy is the average surprise, not the maximum that you might observe by seeing something rare.
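As a sanity check on the definition, here is a minimal NumPy sketch (the helper name `entropy` and the coin examples are just illustrative):

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy of a discrete pmf p, treating 0 log 0 as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # zero-probability outcomes contribute nothing
    return float(-np.sum(p * np.log(p)) / np.log(base))

print(entropy([0.5, 0.5]))    # fair coin: 1 bit of surprise per flip
print(entropy([0.99, 0.01]))  # biased coin: ~0.08 bits -- rarely surprising
```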

Key properties 🚀

From first principles

There is a common question of why entropy is actually written in this form. It turns out that it was defined under 3 constraints.

  1. If we have a uniform distribution, $H$ is monotonic in $n$, the number of categories
  2. If we split the distribution into successive choices, the total entropy should be the weighted sum of the entropies of the pieces (see the worked example below)
  3. $H$ should be continuous in $p$

There are other functions that work, but the log is the cleanest.
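As a concrete instance of the splitting property (this is Shannon's own example): a choice among probabilities $\{\tfrac12, \tfrac13, \tfrac16\}$ can be made by first flipping a fair coin and then, half the time, making a second $\tfrac23$ vs. $\tfrac13$ choice, so

$$H\left(\tfrac{1}{2}, \tfrac{1}{3}, \tfrac{1}{6}\right) = H\left(\tfrac{1}{2}, \tfrac{1}{2}\right) + \tfrac{1}{2}\, H\left(\tfrac{2}{3}, \tfrac{1}{3}\right)$$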

Aside: Cross Entropy

Cross entropy is the same as entropy, except that we take the expectation under a different distribution

$$H(P, Q) = -E_{x\sim P}[\log Q(x)] = -\int p(x)\log Q(x)\,dx$$

Intuitively, going back to our bits example, the cross entropy is the average number of bits you would use to represent samples from $P$ if you used the optimal encoding scheme designed for $Q$.
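A small sketch of that mismatch cost, assuming discrete distributions (the example pmfs are arbitrary):

```python
import numpy as np

def cross_entropy(p, q, base=2):
    """H(P, Q) = -sum_x p(x) log q(x)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                       # terms with p(x) = 0 contribute nothing
    return float(-np.sum(p[mask] * np.log(q[mask])) / np.log(base))

p = [0.7, 0.2, 0.1]                    # samples actually come from P
q = [0.1, 0.2, 0.7]                    # but the code is designed for Q
print(cross_entropy(p, p))             # = H(P), the optimal average length
print(cross_entropy(p, q))             # larger: H(P) + D(P || Q) bits on average
```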

Joint Entropy

The definition 🐧

We define the joint entropy as just the entropy of multiple variables taken together:

$$H(X, Y) = \sum_{x, y} p(x, y) \log \frac{1}{p(x, y)} = E\left[\log \frac{1}{p(X, Y)}\right]$$

This is nothing special, as you can treat $(X, Y)$ as one random variable. And indeed, you can see very easily that if $X, Y$ are independent, then $H(X, Y) = H(X) + H(Y)$.

Independence bound on Entropy 🚀

The entropy of a joint distribution is at most the sum of the entropies of the marginals:

$$H(X, Y) \leq H(X) + H(Y)$$

with equality if and only if $X$ and $Y$ are independent. In other words, for fixed marginals, independence yields the highest joint entropy.
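A quick numerical check of the independence bound (the joint pmf below is an arbitrary correlated example):

```python
import numpy as np

def H(p, base=2):
    """Entropy of a pmf given as an array of any shape; 0 log 0 = 0."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)) / np.log(base))

pxy = np.array([[0.4, 0.1],            # rows index X, columns index Y
                [0.1, 0.4]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

print(H(pxy))                          # H(X, Y) ~ 1.72 bits
print(H(px) + H(py))                   # H(X) + H(Y) = 2 bits, never smaller
```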

Some properties of joint entropy

Conditional Entropy

The definition 🐧

We define the conditional entropy as the entropy of one random variable conditioned on another. That is, $H(Y \mid X)$ gives the average entropy of $P(Y \mid X = x)$, so it's basically "what's the entropy in $Y$ that remains after I observe $X$?"

We can derive conditional entropy from the joint entropy. Writing $p(x, y) = p(x)\,p(y \mid x)$, we have

$$H(X, Y) = E\left[\log \frac{1}{p(x)}\right] + E\left[\log \frac{1}{p(y \mid x)}\right]$$

The first term is just $H(X)$, but the second term is a bit foreign. We can rewrite it as

$$H(Y \mid X) = E_{x \sim p(x)}\left[E_{y\sim p(y\mid x)}\left[\log \frac{1}{p(y \mid x)}\right]\right]$$

The tricky part here is that we separated the two expectations using the probability chain rule. The inner expectation is just the entropy of $P(Y \mid X = x)$, and we take the expectation of that entropy over $x$.

Again, by construction, if $X, Y$ are independent, then $H(Y \mid X) = H(Y)$, etc.

Chain rule of entropy 🚀⭐

This result is actually already obvious from how we defined conditional entropy.

$$H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)$$

This is true under any further conditioning as well, for example

$$H(X, Y \mid Z) = H(X \mid Z) + H(Y \mid X, Z)$$

And this is also true for any arbitrary number of variables. Intuitively, you can just recursively collapse the variables and use the same chain rule.
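Written out in general:

$$H(X_1, X_2, \dots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \dots, X_{i-1})$$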

Conditioning Reduces Entropy 🚀

We have H(XY)H(X)H(X | Y) \leq H(X), which is intuitive

Here's a small catch: this is true on average, but it is not always true that $H(X \mid Y = y) \leq H(X)$. Sometimes, observation can increase uncertainty.
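Here is a small made-up example where a particular observation increases uncertainty even though conditioning helps on average (the joint pmf is chosen purely for illustration):

```python
import numpy as np

def H(p, base=2):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)) / np.log(base))

# Y = 0 makes X certain; the rare event Y = 1 makes X a fair coin flip.
pxy = np.array([[0.90, 0.05],          # X = 0; columns are Y = 0, Y = 1
                [0.00, 0.05]])         # X = 1
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

print(H(px))                                               # H(X)     ~ 0.29 bits
print(H(pxy[:, 1] / py[1]))                                # H(X|Y=1) = 1 bit, larger!
print(sum(py[y] * H(pxy[:, y] / py[y]) for y in range(2))) # H(X|Y)   = 0.10 bits, smaller
```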

Some properties of conditional entropy

Relative Entropy 🐧

The relative entropy is intuitively a distance between distributions, which is why it's sometimes called the KL distance (more commonly, the KL divergence).
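Concretely, for discrete distributions $p$ and $q$,

$$D(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = E_{x \sim p}\left[\log \frac{p(x)}{q(x)}\right]$$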

The KL divergence is not a true metric: it is not symmetric and does not satisfy the triangle inequality. That's why we also generally call $D(p \,\|\, q)$ the relative entropy of $p$ relative to $q$.

Generally, we assume that the support of $q$ contains the support of $p$; otherwise $D(p \,\|\, q)$ is infinite.

A conditional relative entropy has the same idea as $H(X \mid Y)$: you also average over the conditioning variable (i.e., you take two expectations).
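Spelled out:

$$D\big(p(y \mid x) \,\big\|\, q(y \mid x)\big) = E_{x \sim p(x)}\left[E_{y \sim p(y \mid x)}\left[\log \frac{p(y \mid x)}{q(y \mid x)}\right]\right] = \sum_x p(x) \sum_y p(y \mid x) \log \frac{p(y \mid x)}{q(y \mid x)}$$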

Relative entropy is non-negative 🚀

We assert that $D(p \,\|\, q) \geq 0$, with equality if and only if $p = q$.
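The standard proof is one application of Jensen's inequality to the concave $\log$:

$$-D(p \,\|\, q) = \sum_x p(x) \log \frac{q(x)}{p(x)} \leq \log \sum_x p(x)\, \frac{q(x)}{p(x)} = \log \sum_{x:\, p(x) > 0} q(x) \leq \log 1 = 0$$

with equality throughout if and only if $q(x)/p(x)$ is constant on the support of $p$ and that support carries all of $q$'s mass, i.e. $p = q$.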

Relative Entropy is Convex 🚀

We assert that $D(p \,\|\, q)$ is jointly convex in the pair $(p, q)$. In other words, if we take convex combinations of distributions, the following holds for $\lambda \in [0, 1]$:

$$D\big(\lambda p_1 + (1-\lambda) p_2 \,\big\|\, \lambda q_1 + (1-\lambda) q_2\big) \leq \lambda\, D(p_1 \,\|\, q_1) + (1-\lambda)\, D(p_2 \,\|\, q_2)$$

This does have some implications. For example, it means that KL divergence is a convex objective, which makes it amenable to convex optimization methods.
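A numerical spot-check of the joint convexity on random distributions (a sketch, not a proof):

```python
import numpy as np

def kl(p, q):
    """D(p || q) in nats; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

rng = np.random.default_rng(0)
def rand_pmf(k=4):
    x = rng.random(k)
    return x / x.sum()

for _ in range(1000):
    p1, p2, q1, q2 = (rand_pmf() for _ in range(4))
    lam = rng.random()
    lhs = kl(lam * p1 + (1 - lam) * p2, lam * q1 + (1 - lam) * q2)
    rhs = lam * kl(p1, q1) + (1 - lam) * kl(p2, q2)
    assert lhs <= rhs + 1e-12          # the convex combination never does worse
print("joint convexity held on 1000 random instances")
```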

Chain rule of relative entropy 🚀

The chain rule applies for relative entropy as well:

$$D\big(p(x, y) \,\big\|\, q(x, y)\big) = D\big(p(x) \,\big\|\, q(x)\big) + D\big(p(y \mid x) \,\big\|\, q(y \mid x)\big)$$

Mutual Information 🐧

We can understand the mutual information of two random variables as the relative entropy between the joint distribution and the product of the marginals:

$$I(X; Y) = D\big(p(x, y) \,\big\|\, p(x)\,p(y)\big) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}$$

MI in terms of Entropy 🚀

Intuitively, MI is a measure of how much information is shared between two variables. Therefore, it is only logical that this is another way to write mutual information:

$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$$

The first term is how much entropy the variable naturally has, and the second term is how much entropy remains after conditioning it on the second variable. The difference is how much information is shared by X and Y.
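A tiny check that the entropy form agrees with the KL form, on an arbitrary joint pmf:

```python
import numpy as np

def H(p, base=2):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)) / np.log(base))

pxy = np.array([[0.3, 0.1],            # rows index X, columns index Y
                [0.1, 0.5]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

mi_entropy = H(px) + H(py) - H(pxy)    # equals H(X) - H(X|Y) by the chain rule
mi_kl = sum(pxy[i, j] * np.log2(pxy[i, j] / (px[i] * py[j]))
            for i in range(2) for j in range(2))
print(mi_entropy, mi_kl)               # both ~0.256 bits
```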

MI properties

Here are some other versions and properties of the identity
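The ones that come up most often:

$$I(X; Y) = I(Y; X), \qquad I(X; X) = H(X), \qquad I(X; Y) = H(X) + H(Y) - H(X, Y)$$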

Does the location of the semicolon matter?

The location of the semicolon really does matter. The expression $I(X_1, X_2; Y_1, Y_2)$ represents the information shared between the larger $X$ vector and the larger $Y$ vector. It could be the case that $X_1, X_2$ are highly dependent (they may even be the same variable), and likewise for $Y_1, Y_2$. But if every $X$ is independent of every $Y$, then $I(X_1, X_2; Y_1, Y_2) = 0$. Moving the semicolon changes which distributions you're comparing.

Chain rule of MI 🚀

And the general chain rule applies, which you can show through the entropy definitions:

$$I(X_1, \dots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \mid X_1, \dots, X_{i-1})$$

Intuitively, this is just the shared information between the joint $(X_1, \dots, X_n)$ and $Y$, accumulated one variable at a time.

Because MI is symmetric, we can use the chain rule on either side of the semicolon! This is because the conditional isn’t bound to one side or the other.

$$I(X; Y, Z) = I(X; Y \mid Z) + I(X; Z)$$

$$I(X, Z; Y) = I(X; Y \mid Z) + I(Z; Y)$$

Note how we can “push” the Z into either side by just adding the appropriate additional information term.

MI is non-negative, and 0 if and only if independent 🚀

Both are immediate from the KL definition: $I(X; Y) = D\big(p(x, y) \,\|\, p(x)p(y)\big) \geq 0$, with equality if and only if $p(x, y) = p(x)p(y)$, i.e. if and only if $X$ and $Y$ are independent.

Concave-Convexity of MI 🚀

This is a tad hard to follow, but if $(X, Y) \sim p(x)\,p(y \mid x)$, then $I(X; Y)$ is concave in $p(x)$ for fixed $p(y \mid x)$, and convex in $p(y \mid x)$ for fixed $p(x)$.

Great Diagram

This diagram sums things up. Something is "inside" a circle if it provides information. For example, if we condition on $Y$, the remaining uncertainty $H(X \mid Y)$ sits outside the $Y$ circle, because $Y$ no longer provides surprise.

Inequalities to keep in mind

Pinsker’s Inequality

If $P, Q$ are two distributions, then (with $D$ measured in nats; the constant changes with the log base)

$$\delta(P, Q) \leq \sqrt{\tfrac{1}{2}\, D(P \,\|\, Q)}$$

where

$$\delta(P, Q) = \sup_{A} |P(A) - Q(A)| = \tfrac{1}{2} \sum_x |P(x) - Q(x)|$$

is the total variation distance.

tl;dr: the largest possible disagreement between $P$ and $Q$ on the probability of any event is upper bounded by a quantity you can compute from the KL divergence.
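A numerical spot-check of the inequality on random distributions over five symbols (KL measured in nats):

```python
import numpy as np

def kl_nats(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def tv(p, q):
    """Total variation distance: the largest gap in probability of any event."""
    return 0.5 * float(np.abs(p - q).sum())

rng = np.random.default_rng(1)
for _ in range(1000):
    p, q = rng.random(5), rng.random(5)
    p, q = p / p.sum(), q / q.sum()
    assert tv(p, q) <= np.sqrt(0.5 * kl_nats(p, q)) + 1e-12
print("Pinsker's inequality held on 1000 random pairs")
```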