Exponential Families

Exponential Family Distributions

Recall that when we learned about MRFs, our probability distribution was

$$p(x) = \frac{1}{Z} \exp\big(\theta^\top f(x)\big)$$

where $f$ was just a concatenation of indicator variables and $\theta_c$ was the log factor for each configuration.

We can generalize this!

Definitions

We define an exponential family as

$$p(x \mid w) = h(x) \exp\big(w^\top f(x) - A(w)\big)$$

where

  • $w$ is the natural parameter
  • $f(x)$ is the sufficient statistic
  • $A(w) = \log \int h(x) \exp\big(w^\top f(x)\big)\, dx$ is the log normalizer (log partition function)
  • $h(x)$ is the base measure

Making exponential families

Start by taking $\exp \log$ of the expression, which reduces it down into a form that is closest to the exponential family. Take a look at the coefficient in front of $x$. This is your $w$. You might get an $A(\theta)$, where $\theta$ is the parameter of your original distribution. Your goal is to get $A(w)$, so solve for $\theta$ in terms of $w$ and plug that in.

Example: Bernoulli

Here, we see the formula condense down:

$$p(x \mid \pi) = \pi^x (1 - \pi)^{1 - x} = \exp\big(x \log \pi + (1 - x)\log(1 - \pi)\big) = \exp\left(x \log \frac{\pi}{1 - \pi} + \log(1 - \pi)\right)$$

(note that the middle step of the reduction isn’t necessary; you can just consolidate like terms after you take the log.)

You immediately get the following:

$$f(x) = x, \qquad w = \log \frac{\pi}{1 - \pi}, \qquad A(w) = -\log(1 - \pi)$$

and after solving for $\pi$ in terms of $w$, you get

$$\pi = \frac{1}{1 + e^{-w}}, \qquad A(w) = \log(1 + e^w)$$
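As a quick sanity check, here is a small numerical sketch (the helper names are ours, not from the notes) verifying that the exponential-family form reproduces the Bernoulli pmf:

```python
import math

def bernoulli_pmf(x, pi):
    # Direct form: pi^x * (1 - pi)^(1 - x)
    return pi**x * (1 - pi)**(1 - x)

def bernoulli_expfam(x, w):
    # Exponential-family form: exp(w*x - A(w)), with f(x) = x,
    # A(w) = log(1 + e^w), and base measure h(x) = 1
    A = math.log(1 + math.exp(w))
    return math.exp(w * x - A)

pi = 0.3
w = math.log(pi / (1 - pi))  # natural parameter w = log(pi / (1 - pi))
for x in (0, 1):
    assert abs(bernoulli_pmf(x, pi) - bernoulli_expfam(x, w)) < 1e-12
print("forms agree")
```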

Example: Gaussian

The key for a Gaussian is to push the $\sigma$ inside the exponent, because it is a parameter of the distribution:

$$p(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) = \exp\left(\frac{\mu}{\sigma^2} x - \frac{1}{2\sigma^2} x^2 - \frac{\mu^2}{2\sigma^2} - \log \sigma - \frac{1}{2}\log 2\pi\right)$$

Therefore, we can define

$$f(x) = \begin{bmatrix} x \\ x^2 \end{bmatrix}, \qquad w = \begin{bmatrix} \mu / \sigma^2 \\ -1/(2\sigma^2) \end{bmatrix}, \qquad A(w) = \frac{\mu^2}{2\sigma^2} + \log \sigma + \frac{1}{2}\log 2\pi$$

We can solve for $\sigma$ and $\mu$ in terms of the natural parameters, which gets us

$$\mu = -\frac{w_1}{2 w_2}, \qquad \sigma^2 = -\frac{1}{2 w_2}, \qquad A(w) = -\frac{w_1^2}{4 w_2} - \frac{1}{2}\log(-2 w_2) + \frac{1}{2}\log 2\pi$$
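A similar check for the Gaussian, mapping $(\mu, \sigma^2)$ to natural parameters and back. This is a sketch, not from the notes; here we fold the $\frac{1}{2}\log 2\pi$ constant into $A(w)$ rather than into $h(x)$:

```python
import math

mu, sigma2 = 1.5, 0.8

# Natural parameters for the sufficient statistic f(x) = [x, x^2]
w1 = mu / sigma2
w2 = -1.0 / (2 * sigma2)

# Invert: recover the usual parameters from (w1, w2)
mu_rec = -w1 / (2 * w2)
sigma2_rec = -1.0 / (2 * w2)
assert abs(mu_rec - mu) < 1e-12 and abs(sigma2_rec - sigma2) < 1e-12

# Density check at one point: exp-family form vs the standard formula
x = 0.4
A = mu**2 / (2 * sigma2) + 0.5 * math.log(2 * math.pi * sigma2)  # log normalizer (h(x) = 1)
p_expfam = math.exp(w1 * x + w2 * x**2 - A)
p_direct = math.exp(-(x - mu)**2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
assert abs(p_expfam - p_direct) < 1e-12
print("round trip ok")
```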

Other examples
  • any log-linear model, like an MRF
  • multinomial
  • Poisson
  • exponential
  • gamma
  • Weibull
  • chi-squared
  • Dirichlet
  • geometric

Sufficient Statistic

The sufficient statistic actually has a meaning: it fully “summarizes” the data. This is because if you had multiple data points in a dataset $D$ and you wanted to compute the MLE of $w$, you get

$$\log p(D \mid w) = \sum_m \log h(x_m) + w^\top \sum_m f(x_m) - M A(w)$$

and you see that the likelihood’s dependence on the data is only through $\sum_m f(x_m)$. A similar result holds for the posterior $p(w \mid D)$.
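A small illustration of this summarizing property (hypothetical datasets): two datasets of the same size with the same $\sum_m f(x_m)$ induce the exact same Bernoulli likelihood at every $w$:

```python
import math

def log_likelihood(data, w):
    # Exponential-family Bernoulli log-likelihood:
    # sum_m [w * x_m - A(w)], with A(w) = log(1 + e^w)
    A = math.log(1 + math.exp(w))
    return sum(w * x - A for x in data)

# Two different datasets with the same sufficient statistic sum(x) = 3 and M = 6
D1 = [1, 1, 1, 0, 0, 0]
D2 = [0, 1, 0, 1, 0, 1]

for w in (-1.0, 0.0, 0.7, 2.0):
    assert abs(log_likelihood(D1, w) - log_likelihood(D2, w)) < 1e-12
print("likelihood depends on the data only through sum(f(x)) and M")
```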

Statistics, generalized

We define a statistic of some random sample $X_1, \dots, X_m$ as $T(X_1, \dots, X_m)$. We might recognize the classics, like max, median, mode, mean, etc.

We define a statistic as sufficient for parameter $w$ if the conditional distribution of $w$ given the statistic $T$ is totally independent of $X_1, \dots, X_m$. More intuitively, this statistic “boils down” all the important information for us.

Formally, we say that

$$p(w \mid T(X_1, \dots, X_m), X_1, \dots, X_m) = p(w \mid T(X_1, \dots, X_m))$$

Sufficient statistics are generally hard to come by, but they are easy to find for exponential families.

Proving that ff is sufficient

We want to show that $x \perp w \mid f(x)$. Our approach: compute $p(x \mid f(x) = t, w)$ and show that the expression doesn’t depend on $w$:

$$p(x \mid f(x) = t, w) = \frac{h(x) \exp\big(w^\top t - A(w)\big)\, \mathbb{1}[f(x) = t]}{\sum_{x'} h(x') \exp\big(w^\top t - A(w)\big)\, \mathbb{1}[f(x') = t]} = \frac{h(x)\, \mathbb{1}[f(x) = t]}{\sum_{x'} h(x')\, \mathbb{1}[f(x') = t]}$$

Essentially, if $f(x)$ matches $t$, we have the normal probability; otherwise, we have $0$. We realize that the expression has one $w$ term (the $\exp$), but it cancels out.

Derivatives and convexity

Let’s look at some nice properties of exponential family distributions.

Derivative of log normalizer

If you take the derivative of the log normalizer, you can push the derivative into the integral to get

$$\nabla_w A(w) = \frac{\int f(x)\, h(x) \exp\big(w^\top f(x)\big)\, dx}{\int h(x) \exp\big(w^\top f(x)\big)\, dx} = \int f(x)\, h(x) \exp\big(w^\top f(x) - A(w)\big)\, dx = E_{p(x \mid w)}[f(x)]$$

The second step comes from the denominator being just the exponential of $A$. So the key upshot is that the derivative of the log normalizer is the moment of the sufficient statistic. In other words, it is the “average” value of $f(x)$ under the distribution parameterized by $w$.

Convexity of log normalizer

You can show (do this as an exercise!) that the Hessian of the log normalizer is just the covariance of $f(x)$:

$$\nabla^2_w A(w) = \operatorname{Cov}_{p(x \mid w)}[f(x)] \succeq 0$$

which means that $A(w)$ is a convex function of $w$ (it has a minimum). In the log-probability $w^\top f(x) - A(w) + \log h(x)$, we are essentially adding a linear function to the concave function $-A(w)$, which makes the log-probability concave in $w$.
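The same numerical trick checks the second derivative for the Bernoulli family, where $A''(w)$ should equal the variance $\pi(1 - \pi)$ (a small sketch with an arbitrary $w$):

```python
import math

# Bernoulli log normalizer A(w) = log(1 + e^w)
A = lambda w: math.log(1 + math.exp(w))

w = 0.4
eps = 1e-4
# Numerical second derivative of A
d2A = (A(w + eps) - 2 * A(w) + A(w - eps)) / eps**2

# Variance of f(x) = x under Bernoulli(pi), where pi = sigmoid(w)
pi = 1 / (1 + math.exp(-w))
var = pi * (1 - pi)

assert abs(d2A - var) < 1e-6
assert d2A > 0  # positive curvature everywhere: A(w) is convex
print("A''(w) equals Var[f(x)] > 0")
```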

What does this mean?

To sum things up, the $w^\top f$ term is linear and the $A(w)$ term is convex. Therefore, the negative log likelihood is convex.

Minimal representations

Ideally, we want a sufficient statistic whose components are linearly independent. In other words, there is no nonzero $a$ such that $a \cdot f(x) = c$ for all $x$. When this holds, we call the representation minimal.

If this condition is not satisfied, we call the exponential family overcomplete, because it would mean that $(w + a)^\top f(x) = w^\top f(x) + c$, or in other words,

$$p(x \mid w + a) \propto p(x \mid w)$$

We want a unique $w$ associated with each distribution, but when the family is overcomplete, this is not the case. A good example is the two forms of the Bernoulli representation you might come up with:

$$f(x) = \begin{bmatrix} x \\ 1 - x \end{bmatrix} \quad \text{vs.} \quad f(x) = x$$

The first one has two components in $f(x)$, but $[1, 1] \cdot f(x) = 1$ for all $x$, which implies that it is overcomplete.
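A quick numerical illustration of the overcomplete case (parameter values chosen arbitrarily): shifting $w$ along $a = [1, 1]$ leaves the normalized distribution unchanged:

```python
import math

def pmf(w, x):
    # Overcomplete Bernoulli: f(x) = [x, 1 - x], normalized over x in {0, 1}
    def score(y):
        return math.exp(w[0] * y + w[1] * (1 - y))
    return score(x) / (score(0) + score(1))

w = (0.3, -1.2)
c = 5.0
w_shift = (w[0] + c, w[1] + c)  # shift along a = [1, 1], since [1,1] . f(x) = 1 for all x

for x in (0, 1):
    assert abs(pmf(w, x) - pmf(w_shift, x)) < 1e-12
print("w and w + a define the same distribution")
```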

Mapping between moments and distributions

Here, we want to show something very neat about moments and distributions.

The moment of a distribution is the expectation of something. In the case of the exponential family, we derived that the gradient of the log-partition function is the moment of the sufficient statistic:

$$\nabla_w A(w) = E_{p(x \mid w)}[f(x)]$$

If the exponential family is minimal, we can solve for $w$ from the moment. This is actually a very roundabout way of saying something that should be relatively obvious: the parameters of a distribution are typically related to some sort of mean.

This observation can be generalized. We will actually show that one direction of this mapping is inference, and the other direction is learning.

Moments as inference

Remember that the moment is just the gradient of the log partition function. However, to compute this expectation, we need to solve an inference problem!

For example, let’s look at the gradient in the case of an MRF. To compute the expectation, we need to look at each individual component of $f$, which corresponds to a marginalization. For example, if we let a component of $f(x)$ be a selection (indicator) function, its expectation is exactly a marginal:

$$E_{p(x \mid w)}\big[\mathbb{1}[x_c = \bar{x}_c]\big] = p(x_c = \bar{x}_c \mid w)$$
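Here is a tiny brute-force sketch of this idea (a made-up 2-node binary MRF): the expectation of the indicator statistic is exactly a marginal probability:

```python
import itertools
import math

# A tiny 2-node binary MRF with one pairwise log-potential theta
theta = {(0, 0): 0.5, (0, 1): -0.3, (1, 0): 0.2, (1, 1): 1.0}

def unnorm(x):
    return math.exp(theta[x])

Z = sum(unnorm(x) for x in itertools.product((0, 1), repeat=2))

# Marginal p(x1 = 1) computed as the expectation of a selection
# (indicator) statistic f(x) = 1[x1 = 1]
marginal = sum((x[0] == 1) * unnorm(x) / Z for x in itertools.product((0, 1), repeat=2))

# Same thing by direct summation over x2
direct = sum(unnorm((1, x2)) for x2 in (0, 1)) / Z
assert abs(marginal - direct) < 1e-12
print("E[1[x1=1]] equals the marginal p(x1=1)")
```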

Reverse moments as learning

Suppose that we had an exponential family $Q$ and some distribution $p(x)$. To fit $Q$ to $p$, we have the following objective:

$$q^* = \arg\min_{q \in Q} D(p \,\|\, q)$$

We call this the moment projection, or M-projection. It can be shown that if you can find some $w^*$ such that $E_{q(x; w^*)}[f(x)] = E_p[f(x)]$, then the M-projection of $p$ onto $Q$ is just $q(x; w^*)$.

We can prove this through the following idea: calculate $D(p \| q_w)$ for an arbitrary $w$ and also $D(p \| q_{w^*})$, and show that the latter is always smaller, so it has got to be the M-projection. We do this with some algebraic hardship:

$$D(p \| q_w) = \sum_x p(x) \log p(x) - \sum_x p(x) \big(\log h(x) + w^\top f(x) - A(w)\big)$$

Note that the second summation applies to everything after it, not just $p(x) f(x)$.

Now, we realize that the summation over $p(x) f(x)$ is just $E_p[f(x)]$. When we moment-matched, we set

$$E_p[f(x)] = E_{q(x; w^*)}[f(x)]$$

so we can replace that expectation with its counterpart under $q(x; w^*)$. (Note the abuse of notation, with the $;$ symbolizing parameterization.)

We arrive at the last equation by adding and subtracting a $\log \exp$ term: essentially, we rebuild the same KL expression, but with $q(x; w^*)$ in place of $p(x)$ where the moments appear. This gives

$$D(p \| q_w) = D(p \| q_{w^*}) + D(q_{w^*} \| q_w) \ge D(p \| q_{w^*})$$

with equality exactly when $w = w^*$. This all relied on the knowledge of moment matching.

This shows that the minimizer of the KL divergence is found by moment matching.
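We can also see moment matching produce the M-projection numerically (a sketch on a made-up three-point family with $f(x) = x$, solving the moment constraint by bisection):

```python
import math

support = (0, 1, 2)
p = (0.2, 0.5, 0.3)  # an arbitrary target distribution on {0, 1, 2}

def q(w):
    # Exponential family on {0, 1, 2} with f(x) = x and h(x) = 1
    scores = [math.exp(w * x) for x in support]
    Z = sum(scores)
    return [s / Z for s in scores]

def mean(dist):
    return sum(x * px for x, px in zip(support, dist))

def kl(a, b):
    return sum(ax * math.log(ax / bx) for ax, bx in zip(a, b) if ax > 0)

# Moment matching: find w* with E_q[x] = E_p[x] by bisection
# (the mean is monotonically increasing in w)
target = mean(p)
lo, hi = -20.0, 20.0
for _ in range(200):
    mid = (lo + hi) / 2
    if mean(q(mid)) < target:
        lo = mid
    else:
        hi = mid
w_star = (lo + hi) / 2

# q(x; w*) should beat every other member of the family
for w in (-2.0, -0.5, 0.0, 0.9, 3.0):
    assert kl(p, q(w_star)) <= kl(p, q(w)) + 1e-9
print("moment matching gives the M-projection")
```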

Zooming out

So we saw that deriving the optimal parameters is an optimization problem, and what’s more, it can be done by matching moments. So we have the reverse mapping, from moments to parameters.

Again, recall that the forward mapping is from parameters to moments, which is done through inference-type operations. You take the gradient of the log partition function, which is an expectation of $f(x)$ under $p(x \mid w)$, and computing it requires inference in (or sampling from) $p(x \mid w)$, which depends on $w$.

Learning and maximum likelihood

If $p(x \mid w)$ is in the exponential family, then the KL divergence between the empirical distribution $\hat p$ and $p(x \mid w)$ is minimized when

$$E_{p(x \mid w)}[f(x)] = E_{\hat p}[f(x)] = \frac{1}{M} \sum_m f(x_m)$$

(take a second to think about this). The key upshot is that the MLE is doing moment matching!

You can also show that by directly optimizing the log likelihood, you get the same thing:

$$\nabla_w \log p(D \mid w) = \sum_m f(x_m) - M\, \nabla_w A(w) = 0 \;\Longrightarrow\; E_{p(x \mid w)}[f(x)] = \frac{1}{M} \sum_m f(x_m)$$
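A quick numerical check with the Poisson family (made-up count data; a grid search stands in for a proper optimizer): the likelihood-maximizing $\lambda$ is the sample mean of $f(x) = x$:

```python
import math

data = [2, 0, 3, 1, 4, 2, 1, 3]  # made-up counts

def log_lik(lam):
    # Poisson log-likelihood, up to the constant -sum(log x!)
    return sum(x * math.log(lam) - lam for x in data)

# Grid search for the maximum-likelihood lambda
grid = [i / 1000 for i in range(1, 10000)]
lam_mle = max(grid, key=log_lik)

# Moment matching says the MLE should be the sample mean of f(x) = x
sample_mean = sum(data) / len(data)
assert abs(lam_mle - sample_mean) < 1e-3
print("MLE matches the sample mean")
```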

Maximum Entropy

We can also show that exponential family distributions maximize the entropy $H(p)$ over all distributions whose moments match:

$$\max_p H(p) \quad \text{s.t.} \quad E_p[f(x)] = \alpha$$

For example, the Gaussian is the most entropic distribution whose mean is $\mu$ and whose variance is $\sigma^2$. The big idea is that we make everything as chaotic as possible, given that it satisfies the moments we desire.

We can show this by optimizing the entropy subject to the moment constraints, using Lagrange multipliers.
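Sketching the constrained optimization (a standard Lagrange-multiplier argument; $\lambda$ and $\nu$ are the multipliers we introduce for the moment and normalization constraints):

```latex
% Maximize entropy subject to moment and normalization constraints
\mathcal{L}(p, \lambda, \nu)
  = -\sum_x p(x) \log p(x)
  + \lambda^\top \Big( \sum_x p(x) f(x) - \alpha \Big)
  + \nu \Big( \sum_x p(x) - 1 \Big)

% Stationarity with respect to each p(x):
\frac{\partial \mathcal{L}}{\partial p(x)}
  = -\log p(x) - 1 + \lambda^\top f(x) + \nu = 0
\quad \Longrightarrow \quad
p(x) = \exp\big(\lambda^\top f(x) + \nu - 1\big) \propto \exp\big(\lambda^\top f(x)\big)
```

So the maximizer has exactly the exponential family form, with the multipliers $\lambda$ playing the role of the natural parameters $w$.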