Variational Inference (practicals) ⭐
Tags: CS 330
What do we want?
Given a dataset $\mathcal{D} = \{x_1, \ldots, x_N\}$, can we model $p(x)$, the distribution that $\mathcal{D}$ was drawn from?
We can do this with maximum likelihood, i.e. $\max_\theta \sum_i \log p_\theta(x_i)$, and this is pretty easy for things like Gaussians or categoricals. In fact, MSE loss and cross-entropy loss are both instances of exactly this.
But what about more complicated distributions?
- We want to generate images and other complicated data
- We want to represent prediction uncertainty
- We might even want to represent function uncertainty! For example, in few-shot learning, the support set might be ambiguous. If you have measures of uncertainty, then you might even be able to ask the dataset for more examples, creating an active learning paradigm.
- In these cases, a simple distribution just will not work.
Latent variable models
One thing that helps us is the latent variable model
$$p(x) = \int p(x|z)\,p(z)\,dz$$
in which you have two simple models, $p(z)$ and $p(x|z)$, but when you compose them, you get a complicated distribution $p(x)$.
GMMs
A simple example is a Gaussian Mixture Model, in which $p(z)$ is a categorical distribution and $p(x|z)$ is a Gaussian. In this case, $z$ is a discrete variable, and it allows you to get some degree of coverage with $p(x) = \sum_z p(x|z)\,p(z)$.
You can even program a neural network to do GMM regression (a mixture density network), which amounts to
$$p(y|x) = \sum_k w_k(x)\,\mathcal{N}\big(y;\ \mu_k(x),\ \Sigma_k(x)\big)$$
and the model outputs the means, covariances, and mixture weights.
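As a rough sketch of how this might look in PyTorch (the names `MDNHead` and `mdn_nll` are illustrative, not from the lecture):

```python
import torch
import torch.nn as nn

class MDNHead(nn.Module):
    """Predicts a K-component Gaussian mixture over a scalar y, given x."""
    def __init__(self, in_dim, k=5, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.logits = nn.Linear(hidden, k)    # mixture weights w_k(x) (as logits)
        self.means = nn.Linear(hidden, k)     # component means mu_k(x)
        self.log_stds = nn.Linear(hidden, k)  # component log-stds sigma_k(x)

    def forward(self, x):
        h = self.body(x)
        return self.logits(h), self.means(h), self.log_stds(h)

def mdn_nll(logits, means, log_stds, y):
    """Negative log-likelihood of y under the predicted mixture."""
    comp = torch.distributions.Normal(means, log_stds.exp())
    log_probs = comp.log_prob(y.unsqueeze(-1))   # log N(y; mu_k, sigma_k), shape (batch, k)
    log_w = torch.log_softmax(logits, dim=-1)    # log w_k(x), shape (batch, k)
    return -torch.logsumexp(log_w + log_probs, dim=-1).mean()
```

Minimizing `mdn_nll` is exactly maximum likelihood on the mixture density.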
Moving to general models
Now, GMMs are good, but they aren’t complete. What if we wanted to fit an arbitrary distribution? Well, suppose that you have $p(z) = \mathcal{N}(0, I)$, and $p(x|z) = \mathcal{N}\big(\mu_\theta(z), \sigma_\theta(z)\big)$?
The two distributions are simple, but the functions $\mu_\theta$ and $\sigma_\theta$ (e.g., neural networks) are arbitrarily complicated. You can actually show that this composition yields essentially any arbitrary distribution.
A good intuition is that for each value of $z$, we can create a “stamp” of a Gaussian in the graph at any location and any width. You repeat these “stamps” infinitely many times, and you have infinite resolution.
Once trained, you generate a sample from $p(x)$ by sampling $z \sim p(z)$, running it through the network, and then sampling from $p(x|z)$.
To evaluate the likelihood of a given sample, it’s a little more difficult. You can approximate the integral with a Monte Carlo estimate: $p(x) = \mathbb{E}_{z\sim p(z)}[p(x|z)] \approx \frac{1}{N}\sum_{i=1}^N p(x|z_i)$, with $z_i \sim p(z)$.
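A minimal sketch of this estimator, assuming a hypothetical `decoder(z)` that returns a `torch.distributions` object for $p(x|z)$ with one log-density per sample:

```python
import math
import torch

def naive_log_likelihood(x, decoder, n_samples=1000, z_dim=8):
    """Monte Carlo estimate of log p(x) = log E_{z ~ p(z)}[p(x|z)]."""
    z = torch.randn(n_samples, z_dim)         # z_i ~ p(z) = N(0, I)
    log_px_given_z = decoder(z).log_prob(x)   # log p(x|z_i), shape (n_samples,)
    # log(1/N * sum_i p(x|z_i)), computed stably in log space
    return torch.logsumexp(log_px_given_z, dim=0) - math.log(n_samples)
```

As the next section notes, this needs a huge `n_samples` before any $z_i$ lands where $p(x|z_i)$ is non-negligible.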
Training latent variable models: what you can’t do
You might try direct maximum likelihood on the marginal:
$$\theta \leftarrow \arg\max_\theta \frac{1}{N}\sum_i \log\left(\int p_\theta(x_i|z)\,p(z)\,dz\right)$$
but the integral is intractable. You could try the same Monte Carlo estimate $\frac{1}{N}\sum_j p_\theta(x_i|z_j)$, but this is very sample inefficient.
So…what can we do? Well, we can propose a lower bound to the likelihood and optimize that. As it turns out, it has some really nice theoretical properties. In the next section, we will see how this comes to be!
Variational Inference
Importance sampling
We can start by setting things up as importance sampling. The key problem with estimating the integral as $\mathbb{E}_{z\sim p(z)}[p(x_i|z)]$ is that $p(x_i|z)$ may be very small for a lot of values of $z$, or it may have a very weird coverage that is hard to get at. It’s the classic problem of trying to hit a bullseye while essentially throwing random darts.
Now, if we can sample with respect to a distribution that models the most likely $z$ given $x_i$ (i.e. $p(z|x_i)$), now we’re talking! It’s like we’re using a very calibrated dart-throwing method, which allows for greater sample efficiency.
What this looks like is the following. We start with a (bad) assumption that we have a variational approximation $q_i(z) \approx p(z|x_i)$ for every data point. We will see how to improve on this later.
$$p(x_i) = \int p(x_i|z)\,p(z)\,dz = \mathbb{E}_{z\sim q_i(z)}\left[\frac{p(x_i|z)\,p(z)}{q_i(z)}\right]$$
Now, part of the variational approximation is that $q_i(z) \approx p(z|x_i)$, which is not necessarily exact. But again, you can think of $q_i$ as the dart thrower. You don’t need to have a professional dart thrower. You just need to have someone good enough to hit the bullseye once in a while.
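Here is a sketch of the importance-sampled version of the earlier estimator, under the same assumed `decoder` interface; `q` stands in for the per-datapoint $q_i(z)$:

```python
import math
import torch

def importance_log_likelihood(x, decoder, q, n_samples=100):
    """Estimate log p(x) with z drawn from q(z) instead of the prior."""
    prior = torch.distributions.Normal(torch.zeros_like(q.mean), 1.0)
    z = q.sample((n_samples,))               # z_i ~ q(z)
    # log [ p(x|z_i) * p(z_i) / q(z_i) ] for each sample
    log_w = (decoder(z).log_prob(x)
             + prior.log_prob(z).sum(-1)
             - q.log_prob(z).sum(-1))
    return torch.logsumexp(log_w, dim=0) - math.log(n_samples)
```

If `q` concentrates where the posterior does, far fewer samples are wasted than when sampling from the prior.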
Creation of ELBO
By using Jensen’s inequality, we know that $\log \mathbb{E}[y] \ge \mathbb{E}[\log y]$, which means that we get the following lower bound on the main objective
$$\log p(x_i) = \log \mathbb{E}_{z\sim q_i(z)}\left[\frac{p(x_i|z)\,p(z)}{q_i(z)}\right] \ge \mathbb{E}_{z\sim q_i(z)}\left[\log \frac{p(x_i|z)\,p(z)}{q_i(z)}\right]$$
which actually becomes
$$\mathcal{L}_i = \mathbb{E}_{z\sim q_i(z)}\big[\log p(x_i|z) + \log p(z)\big] + \mathcal{H}(q_i)$$
and the last term is the entropy. This lower bound $\mathcal{L}_i$ is the ELBO (evidence lower bound). Now, you can more or less use this directly, but we want to understand what exactly this ELBO is doing!
The first part pushes $q_i$ to place its mass where the joint likelihood $p(x_i|z)\,p(z)$ is high. On its own, this yields a degenerate solution: you can make $q_i$ as narrow as possible, centered around the mode. The second term (the entropy) makes sure that the sampling distribution stays as wide as possible.
(figure: a narrow $q_i$ collapsed onto a single mode vs. the wider $q_i$ encouraged by the entropy term)
So, by maximizing the ELBO objective, you can understand it as jointly optimizing the likelihood of the data and making the likelihood estimator as correct as possible.
At this point, you can also start thinking of it as an approximate EM algorithm, where $\theta$ parameterizes the model $p_\theta(x|z)$ and $\{q_i\}$ is the collection of variational approximators. The E step is fitting each $q_i$ to be wide and centered around the appropriate $p(z|x_i)$, and the M step is maximizing the combined term over $\theta$ using the $q_i$ as the variational approximators. Just like the standard EM algorithm, it is a process of coordinate ascent.
Tightness of the lower bound
Again, while we can use this loss out of the box, we want to investigate a little more into how much bang we’re getting for our buck, and also to continue drawing the EM parallel.
We started this analysis by claiming that we want $q_i(z)$ to be as similar to $p(z|x_i)$ as possible. Now, bear in mind that because it’s importance sampling, $q_i$ could technically be anything. We claim now that we have already encoded this restriction in the ELBO. To show this, let’s compute $D_{KL}\big(q_i(z)\,\|\,p(z|x_i)\big)$.
$$D_{KL}\big(q_i(z)\,\|\,p(z|x_i)\big) = \mathbb{E}_{z\sim q_i(z)}\left[\log \frac{q_i(z)}{p(z|x_i)}\right] = \mathbb{E}_{z\sim q_i(z)}\left[\log \frac{q_i(z)\,p(x_i)}{p(x_i|z)\,p(z)}\right] = -\mathcal{L}_i + \log p(x_i)$$
So, we get that
$$\log p(x_i) = \mathcal{L}_i + D_{KL}\big(q_i(z)\,\|\,p(z|x_i)\big)$$
which means two things
- If $q_i(z) = p(z|x_i)$, then the bound is tight
- Yet again, because KL divergence is non-negative, this shows $\mathcal{L}_i$ is a lower bound (same result, different story).
This is an entirely different derivation, but it highlights the bounding. Furthermore, we can use this objective to highlight the EM-esque style of variational inference. We can rewrite the equation as
$$\mathcal{L}_i = \log p(x_i) - D_{KL}\big(q_i(z)\,\|\,p(z|x_i)\big)$$
which means that when you’re optimizing the $q_i$ of the ELBO, you’re just minimizing the KL divergence between the variational distribution and the posterior, which is the “E” step. When you’re optimizing for the $\theta$ in the ELBO, you’re optimizing for $\log p_\theta(x_i)$, dragged behind by some KL divergence. Because the posterior $p_\theta(z|x_i)$ is changing, even if you had a very tight bound at the beginning of the M step, the gap will grow during it. This is the same for the EM algorithm.
Amortized Variational Inference
So far, there is one problem. The complexity of our model grows with the number of data points, because we need to keep track of a distribution $q_i$ for every point. As it turns out, the solution is very simple! Just use a network in place of all the individual distributions:
$$q_\phi(z|x) = \mathcal{N}\big(\mu_\phi(x), \sigma_\phi(x)\big)$$
This is very easy for optimizing $\theta$ (the M step), as you just sample $z \sim q_\phi(z|x_i)$ and take the gradient. For the $\phi$ step, we run into a problem:
$$\nabla_\phi\, \mathbb{E}_{z\sim q_\phi(z|x_i)}\big[\log p_\theta(x_i|z) + \log p(z)\big]$$
The parameters $\phi$ are inside the sampling distribution! Uh oh…that doesn’t look good.
Reparameterization trick
As it turns out, we approximated $q_\phi$ as a Gaussian for a reason. For Gaussians, we have
$$z \sim \mathcal{N}\big(\mu_\phi(x), \sigma_\phi(x)\big) \iff z = \mu_\phi(x) + \epsilon\,\sigma_\phi(x), \quad \epsilon \sim \mathcal{N}(0, I)$$
which means that we can pull the $\phi$-dependence out of the sampling distribution and into the expectation!
$$\nabla_\phi\, \mathbb{E}_{\epsilon\sim \mathcal{N}(0,I)}\big[\log p_\theta(x_i\,|\,z_\epsilon) + \log p(z_\epsilon)\big], \quad z_\epsilon = \mu_\phi(x_i) + \epsilon\,\sigma_\phi(x_i)$$
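In code, the trick is just a couple of lines (toy shapes, purely illustrative):

```python
import torch
import torch.nn as nn

encoder = nn.Linear(10, 2 * 4)             # toy encoder: 10-dim x -> (mu, log_std) for 4-dim z
x = torch.randn(32, 10)                    # a batch of inputs

mu, log_std = encoder(x).chunk(2, dim=-1)  # mu_phi(x), log sigma_phi(x)
eps = torch.randn_like(mu)                 # eps ~ N(0, I): randomness independent of phi
z = mu + eps * log_std.exp()               # z ~ N(mu_phi(x), sigma_phi(x)), differentiable w.r.t. phi
```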
There are other methods of dealing with the sampler issue. If we don’t want to use a Gaussian, we could use things like REINFORCE, which handles a similar issue for arbitrary sampler functions.
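For reference, the score-function identity that REINFORCE is built on is
$$\nabla_\phi\, \mathbb{E}_{z\sim q_\phi}[f(z)] = \mathbb{E}_{z\sim q_\phi}\big[f(z)\,\nabla_\phi \log q_\phi(z)\big]$$
which only requires that you can evaluate $\log q_\phi(z)$, at the cost of a much higher-variance gradient estimate than the reparameterization trick.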
Another way of understanding the bound
So, once again, we are totally done with the derivation. But we can also look at things a different way, which helps motivate the variational autoencoder. You can massage the original ELBO loss as follows:
$$\mathcal{L}_i = \mathbb{E}_{z\sim q_\phi(z|x_i)}\big[\log p_\theta(x_i|z)\big] + \underbrace{\mathbb{E}_{z\sim q_\phi(z|x_i)}\big[\log p(z)\big] + \mathcal{H}\big(q_\phi(z|x_i)\big)}_{-\,D_{KL}(q_\phi(z|x_i)\,\|\,p(z))}$$
what this means is that the objective can be written as
$$\max_{\theta,\phi}\ \frac{1}{N}\sum_i \mathbb{E}_{z\sim q_\phi(z|x_i)}\big[\log p_\theta(x_i|z)\big] - D_{KL}\big(q_\phi(z|x_i)\,\|\,p(z)\big)$$
The first term you can think of as a reconstruction objective. You take the input $x$, encode it through a sample $z \sim q_\phi(z|x)$, and then try to maximize the decoding likelihood $p_\theta(x|z)$.
The second term you can think of as a regularizer. The distribution $q_\phi(z|x)$ should be as similar as possible to the non-information-bearing prior $p(z)$, which is a Gaussian. So, put together, we have our variational autoencoder!
Don’t let this scare you! It’s literally just a neural network autoencoder with a sampling procedure added at the bottleneck and a special regularizer.
(figure: the variational autoencoder: an encoder network $q_\phi(z|x)$, a sampled bottleneck $z$, and a decoder network $p_\theta(x|z)$)
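To make that concrete, here is a minimal sketch of a VAE in PyTorch, assuming a Gaussian encoder and a unit-variance Gaussian decoder (sizes and names are illustrative, not the lecture’s exact model):

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * z_dim))  # -> (mu, log_std)
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))      # -> mean of p(x|z)

    def forward(self, x):
        mu, log_std = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * log_std.exp()  # reparameterized bottleneck sample
        return self.dec(z), mu, log_std

def neg_elbo(model, x):
    recon, mu, log_std = model(x)
    # Reconstruction: -E_q[log p(x|z)] for a unit-variance Gaussian decoder (up to a constant)
    recon_loss = 0.5 * ((x - recon) ** 2).sum(-1)
    # KL(q(z|x) || N(0, I)), closed form for diagonal Gaussians
    kl = 0.5 * (mu ** 2 + (2 * log_std).exp() - 2 * log_std - 1).sum(-1)
    return (recon_loss + kl).mean()
```

Training is then just minimizing `neg_elbo` with any optimizer; the sampling at the bottleneck and the KL regularizer are the only differences from a plain autoencoder.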