Continuous RVs

Tags: Continuous, EE 276

Differential Entropy

Is there real entropy?

To establish the relationship to discrete entropy, let’s try to quantize a continuous random variable. We can do this through the notion of Riemann integrability. If we split the distribution into bins of width $\Delta$, then the probability mass of each bin is

$$p_i = \int_{i\Delta}^{(i+1)\Delta} f(x)\,dx \approx f(x_i)\Delta$$

We can let $x_i$ be the center of the interval.

Let’s make a RV using this quantized version, $X^\Delta$. The entropy can be computed as follows:
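
(A sketch of the standard computation, using $p_i \approx f(x_i)\Delta$ and $\sum_i f(x_i)\Delta \approx 1$.)

$$H(X^\Delta) = -\sum_i p_i \log p_i \approx -\sum_i f(x_i)\Delta \log\big(f(x_i)\Delta\big) = -\sum_i \Delta f(x_i)\log f(x_i) - \log\Delta$$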

Hmmmm. So this is a problem. To get $X^\Delta$ to approach $X$, we need to have $\Delta \to 0$. But as this happens, we see that $H(X^\Delta) \to \infty$. So, the entropy of a continuous random variable is infinite.

Intuitively, this makes sense: $X$ can take on any real value, so it would take an infinite number of bits to describe it to perfect precision.
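
A quick numerical check of this divergence (a sketch, assuming a standard normal for $X$; the bin width, truncation range, and variable names are my own choices):

```python
import numpy as np
from scipy.stats import norm

# Quantize a standard normal into bins of width delta and compute the discrete
# entropy H(X^Delta); it tracks h(X) - log(delta) and blows up as delta -> 0.
h_true = 0.5 * np.log(2 * np.pi * np.e)      # differential entropy of N(0,1), in nats
for delta in [1.0, 0.1, 0.01]:
    edges = np.arange(-10.0, 10.0 + delta, delta)
    p = np.diff(norm.cdf(edges))             # bin probabilities p_i
    p = p[p > 0]
    H = -np.sum(p * np.log(p))               # discrete entropy of the quantized RV
    print(delta, H, h_true - np.log(delta))
```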

“Fake” differential entropy

So we notice that $H(X^\Delta) \rightarrow \infty$, but it is composed of two terms: the first term, $-\sum \Delta f(x_i)\log f(x_i)$, stays finite, while the second term, $-\log\Delta$, goes to infinity. But the second term is the same for all continuous RV’s; the total is just this divergent term shifted by the first, distribution-dependent term. So instead of trying to compute the (infinite) entropy, could we just keep track of this first term?

We define differential entropy as the limit of this first term, $h(X) = \lim_{\Delta\to 0}\left(-\sum f(x_i)\Delta \log f(x_i)\right)$. If the density function is Riemann integrable, we have

$$h(X) = -\int f(x)\log f(x)\,dx$$

This isn’t true entropy, but it has a similar form.

If you see continuous RV’s, you must use this form.

Important properties

Differential entropy can be negative. For example, a uniform distribution on an interval of length less than 1 has $h(X) = \log(\text{length}) < 0$.

Furthermore, because scaling $X$ to $aX$ stretches the support (intuitively), we shift the entropy: $h(aX) = h(X) + \log|a|$.
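
A quick change-of-variables check (for $a \neq 0$, the density of $aX$ is $\frac{1}{|a|}f\!\left(\frac{y}{a}\right)$; substitute $x = y/a$):

$$h(aX) = -\int \frac{1}{|a|}f\!\left(\tfrac{y}{a}\right)\log\!\left[\frac{1}{|a|}f\!\left(\tfrac{y}{a}\right)\right]dy = -\int f(x)\log f(x)\,dx + \log|a| = h(X) + \log|a|$$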

From results that you can show with KL divergence, we have $h(X) \leq h(U)$, where $U$ is the uniform on the same (bounded) support. But there’s some nuance to maximizing entropy of continuous RV’s. Stay tuned!

Results from Differential Entropy

Joint and Conditional Entropy

Joint entropy is just a multi-dimensional integral.

Conditional entropy is formulated exactly like the discrete version:
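
Concretely, with $f(x,y)$ the joint density (the standard definitions):

$$h(X,Y) = -\int\!\!\int f(x,y)\log f(x,y)\,dx\,dy, \qquad h(X \mid Y) = -\int\!\!\int f(x,y)\log f(x \mid y)\,dx\,dy$$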

And we also know that $h(X \mid Y) = h(X,Y) - h(Y)$, which can be helpful. The chain rule applies for continuous spaces.

So it turns out that $h(X \mid Y)$ is not invariant to relabeling (invertible transformations) of $X$, just as we saw above with scaling. But it is invariant to relabeling of $Y$, because $Y$ only enters through the outer expectation; the stretching of the density only matters inside the logarithm.

Mutual Information

Does mutual information exist for continuous RVs? So actually, yes! We can begin with our discretized $X^\Delta, Y^\Delta$. We have

$$I(X^\Delta; Y^\Delta) = H(X^\Delta) - H(X^\Delta \mid Y^\Delta) \approx \big(h(X) - \log \Delta\big) - \big(h(X \mid Y) - \log \Delta\big)$$

and you see how this cancels into

$$I(X; Y) = h(X) - h(X \mid Y)$$

So this is different from the “fake” entropy. This is REAL mutual information defined in terms of the “fake” entropy.
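
A small sketch of this in action (the jointly Gaussian example and its parameters are my own, not from the notes): for $(X,Y)$ jointly Gaussian with correlation $\rho$, two “fake” entropies combine into real mutual information.

```python
import numpy as np

# X, Y jointly Gaussian, each N(0, sigma^2), with correlation rho (all in nats).
rho, sigma = 0.8, 1.0
h_X = 0.5 * np.log(2 * np.pi * np.e * sigma**2)                          # h(X)
h_X_given_Y = 0.5 * np.log(2 * np.pi * np.e * sigma**2 * (1 - rho**2))   # h(X | Y)
I = h_X - h_X_given_Y
print(I, -0.5 * np.log(1 - rho**2))   # both equal -0.5 * ln(1 - rho^2)
```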

Relative entropy

Does KL divergence exist for continuous RV’s? Well, we might start with the formulation

$$D(f \| g) = \sum \Delta f(x_i)\log \frac{\Delta f(x_i)}{\Delta g(x_i)} = \sum \Delta f(x_i)\log \frac{f(x_i)}{g(x_i)} \rightarrow \int f(x)\log \frac{f(x)}{g(x)}\,dx$$

And you see that the $\Delta$’s cancel!
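
A quick numerical sketch of this integral for two Gaussians (the particular means, variances, and the closed-form comparison are my additions):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# D(f || g) for f = N(0, 1) and g = N(1, 4), computed from the integral above
# and checked against the known Gaussian closed form (both in nats).
f, g = norm(loc=0.0, scale=1.0), norm(loc=1.0, scale=2.0)
D_numeric, _ = quad(lambda x: f.pdf(x) * np.log(f.pdf(x) / g.pdf(x)), -20, 20)

mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0
D_closed = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5
print(D_numeric, D_closed)
```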

Important Examples

Normal distribution

Let’s say that we have a normal distribution $\phi(x)$ with variance $\sigma^2$ and $\mu = 0$.

You can just run the derivation as follows. We will start with natural log.
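
A sketch of that derivation (in nats), writing $\phi(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-x^2/2\sigma^2}$:

$$h(\phi) = -\int \phi(x)\ln\phi(x)\,dx = \int \phi(x)\left[\frac{x^2}{2\sigma^2} + \frac{1}{2}\ln(2\pi\sigma^2)\right]dx = \frac{E[X^2]}{2\sigma^2} + \frac{1}{2}\ln(2\pi\sigma^2) = \frac{1}{2} + \frac{1}{2}\ln(2\pi\sigma^2)$$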

This last step is possible because $E[X] = 0$, so we can subtract $E[X]^2$ from the numerator, and this yields the variance $\sigma^2$.

Folding the leftover $\frac{1}{2}$ into the logarithm as a factor of $e$ (and switching to a base-2 logarithm if you want bits), you get

$$h(\phi) = \frac{1}{2}\log 2\pi e\sigma^2$$

which will be useful later on.
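
As a quick check (scipy’s entropy() for a continuous distribution returns differential entropy in nats):

```python
import numpy as np
from scipy.stats import norm

sigma = 3.0
print(norm(scale=sigma).entropy())                # differential entropy from scipy, in nats
print(0.5 * np.log(2 * np.pi * np.e * sigma**2))  # (1/2) ln(2*pi*e*sigma^2)
```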

Some other interesting examples
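
For instance (these particular distributions are my own picks, not necessarily the ones intended here), the uniform and exponential distributions have simple closed forms:

```python
import numpy as np
from scipy.stats import expon, uniform

# Uniform on [0, a]: h = log(a).  Exponential with rate lam: h = 1 - log(lam).  (nats)
a, lam = 4.0, 2.0
print(uniform(loc=0, scale=a).entropy(), np.log(a))
print(expon(scale=1 / lam).entropy(), 1 - np.log(lam))
```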

Maximizing Entropy

If you’re trying to maximize differential entropy over a bounded support, you just make it uniform. You can prove this through the KL trick: in this special case of the uniform, $D(p\|u) = h(u) - h(p) \geq 0$, which gets you $h(p) \leq h(u)$.
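
Spelled out, with $u(x) = 1/|S|$ on a support $S$ of finite volume $|S|$:

$$0 \leq D(p\|u) = \int_S p(x)\log\frac{p(x)}{1/|S|}\,dx = \log|S| - h(p) = h(u) - h(p)$$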

With second moment constraint

But with continuous random variables, you might want to impose another constraint. Say that you wanted $E[X^2] \leq \alpha$? This is a reasonable constraint, as it’s related to variance. Essentially, you’re asking how to maximize the entropy subject to a given variance.

Well, you can do the math out (it’s very ugly), but you can show that entropy is maximized by a Gaussian $\phi$ such that $E_\phi[X^2] = \alpha$. The proof is the same: you use the KL divergence.
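
A sketch of that KL argument (in nats), with $\phi$ the $N(0,\alpha)$ density; the key is that $-\int f\ln\phi$ depends on $f$ only through $E_f[X^2] \leq \alpha$:

$$0 \leq D(f\|\phi) = -h(f) - \int f(x)\ln\phi(x)\,dx = -h(f) + \frac{E_f[X^2]}{2\alpha} + \frac{1}{2}\ln(2\pi\alpha) \leq -h(f) + \frac{1}{2}\ln(2\pi e\alpha) = h(\phi) - h(f)$$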

AEP For Differential Entropy

The AEP remains mostly unchanged. We just use the notion of differential entropy instead.
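
Concretely, for $X_1, \dots, X_n$ i.i.d. with density $f$, the statement becomes

$$-\frac{1}{n}\log f(X_1, \dots, X_n) \to h(X) \quad \text{in probability},$$

and the typical set is $A_\epsilon^{(n)} = \left\{ x^n : \left| -\tfrac{1}{n}\log f(x^n) - h(X) \right| \leq \epsilon \right\}$.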

Volumes

However, there does need to be one big change. Previously we talked about the cardinality of the typical set; that doesn’t make sense here because we are in a continuous space. Instead, we talk about the volume of the set. This is defined as

$$\mathrm{Vol}(A) = \int_A dx_1 \cdots dx_n$$

Properties of the AEP

These are exactly the same as for the discrete AEP, and actually, the proofs are really the same; just replace cardinality with volume (and sums with integrals).
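
For reference, the standard statements (for $\epsilon > 0$ and $n$ sufficiently large):

$$P\!\left(A_\epsilon^{(n)}\right) > 1 - \epsilon, \qquad \mathrm{Vol}\!\left(A_\epsilon^{(n)}\right) \leq 2^{n(h(X)+\epsilon)}, \qquad \mathrm{Vol}\!\left(A_\epsilon^{(n)}\right) \geq (1-\epsilon)\,2^{n(h(X)-\epsilon)}$$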