KL Divergence: A Deeper Look

Tags: Basics, CS 228

KL Divergence

A common confusion here is that P and Q are NOT random variables. Rather, they are distributions, i.e., functions of x. They are not random.

KL divergence is closely related to cross-entropy: it is the cross-entropy minus the entropy of P.

D_{KL}(P || Q) = E_{x \sim p(x)}\left[\log \frac{p(x)}{q(x)}\right] = H(P, Q) - H(P)

Intuitively, D_{KL}(P || Q) is the "loss of encoding efficiency" you incur by encoding samples from P with a code optimized for Q instead of P. The more different the two distributions are, the larger the KL divergence.
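As a quick sanity check (my own sketch, not part of the original notes), here is a minimal Python snippet that computes D_{KL}(P || Q) for two small, made-up discrete distributions, both directly from the definition and via the H(P, Q) - H(P) identity above.

```python
import numpy as np

# Two made-up discrete distributions over the same 4-element support.
p = np.array([0.1, 0.4, 0.4, 0.1])
q = np.array([0.25, 0.25, 0.25, 0.25])

# Direct definition: D_KL(P || Q) = sum_x p(x) * log(p(x) / q(x))
kl_direct = np.sum(p * np.log(p / q))

# Via the identity: D_KL(P || Q) = H(P, Q) - H(P)
cross_entropy = -np.sum(p * np.log(q))  # H(P, Q)
entropy = -np.sum(p * np.log(p))        # H(P)
kl_identity = cross_entropy - entropy

print(kl_direct, kl_identity)  # both print the same value (~0.193 nats)
```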

DKL identities

We can prove that DKL is always non-negative. It requires a little trick: apply Jensen's inequality to the (concave) log, and use the fact that the ratio q(x)/p(x), in expectation under p, sums to 1.
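For reference, the standard argument written out (not in the original notes):

-D_{KL}(P || Q) = E_{x \sim p(x)}\left[\log \frac{q(x)}{p(x)}\right] \le \log E_{x \sim p(x)}\left[\frac{q(x)}{p(x)}\right] = \log \sum_x p(x) \frac{q(x)}{p(x)} = \log \sum_x q(x) = \log 1 = 0

so D_{KL}(P || Q) \ge 0.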

We can prove that DKL is 0 if and only if the two distributions are identical (meaning that their PDFs agree almost everywhere).

We can also prove that DKL has a "chain rule", which follows fairly directly once you see how the log splits the joint into a product of conditionals.
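Concretely, the chain rule states (standard form, written out here for reference):

D_{KL}\big(P(X, Y) \,||\, Q(X, Y)\big) = D_{KL}\big(P(X) \,||\, Q(X)\big) + D_{KL}\big(P(Y \mid X) \,||\, Q(Y \mid X)\big)

where the conditional term is the expectation, over x \sim p(x), of D_{KL}\big(P(Y \mid x) \,||\, Q(Y \mid x)\big). Taking the log of \frac{p(x)\,p(y \mid x)}{q(x)\,q(y \mid x)} splits it into exactly these two sums.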

Observations about DKL

This section is purely intuition: it discusses what DKL actually does. For a more rigorous treatment, see my 228 notes.

For instructional purposes, say that we are fitting q to a target distribution p by minimizing D_{KL}(Q || P) (more on this in 228). As shown below, this objective splits into two terms.
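Writing the objective out for a discrete support (a reference decomposition, not in the original notes):

D_{KL}(Q || P) = \sum_x q(x) \log \frac{q(x)}{p(x)} = \underbrace{\sum_x q(x) \log q(x)}_{-H(Q)} \; - \; \underbrace{\sum_x q(x) \log p(x)}_{E_{x \sim q}[\log p(x)]}

The next two parts look at each term separately.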

Maximizing the entropy

The first component is \sum q \log q, which is the negative entropy of q. By minimizing DKL, you are pushing q to be as entropic (spread out) as possible.

Maximizing log likelihood

The second component is -\sum q \log p. Minimizing it means maximizing the expected log likelihood of p under samples from q.

Overall effect

Together, these terms mean that q locks onto a mode of p (from the likelihood term) and then "fattens" out around it to fill as much of p as possible (from the entropy term).

Asymmetry of DKL

D_{KL}(Q || P) cares about matching q to p wherever q places mass. Fundamentally, it takes expectations with respect to Q, so if p is multi-modal, minimizing D_{KL}(Q || P) with a single-mode q will pick out a single mode of p (mode-seeking). However, if you instead minimize D_{KL}(P || Q), you are taking expectations with respect to p, so q is penalized anywhere p has mass but q does not. In this case, the best single-mode q is a large bump spread between the two modes of p (mass-covering).
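A small numerical illustration of this asymmetry (my own sketch, not from the notes): take a bimodal p, an equal mixture of two Gaussians, discretized on a grid, and compare a narrow single-mode q against a wide q spanning both modes under each direction of the KL.

```python
import numpy as np
from scipy.stats import norm

# Discretize everything on a grid so the KL integrals become sums.
x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

def normalize(density):
    return density / (density.sum() * dx)

def kl(a, b):
    """Discretized D_KL(a || b) = sum_x a(x) log(a(x)/b(x)) dx."""
    return np.sum(a * np.log(a / b)) * dx

# Bimodal target p: equal mixture of Gaussians centered at -3 and +3.
p = normalize(0.5 * norm.pdf(x, -3, 1) + 0.5 * norm.pdf(x, 3, 1))

# Two candidate approximations q.
q_narrow = normalize(norm.pdf(x, 3, 1))   # hugs a single mode of p
q_wide = normalize(norm.pdf(x, 0, 3.2))   # one broad bump covering both modes

for name, q in [("narrow q", q_narrow), ("wide q", q_wide)]:
    print(f"{name}: KL(q||p) = {kl(q, p):.2f}   KL(p||q) = {kl(p, q):.2f}")

# The narrow q scores better under KL(q||p) (mode-seeking),
# while the wide q scores better under KL(p||q) (mass-covering).
```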

More information can be found in my section on PGMs / variational inference. These two directions are known as the M-projection (minimizing D_{KL}(P || Q)) and the I-projection (minimizing D_{KL}(Q || P)).