MASTER SUMMARY

Tags: CS 236
💡 TO DEBRIEF

- Representing Models
  - Explicit
  - Implicit
  - Classifier vs Generator
- Training Models
  - Divergences
  - Surrogate Objective

We will talk about how these are used for each specific method

What you want in a model

Usually it’s easier to sample from a model than to evaluate it. In fact, some models can’t be evaluated at all but can still be sampled from.

Autoregressive

For a lot of these deep models, we are trying to model a very complicated joint distribution. However, if there is good structure in your data, you can factor the joint into simple conditionals and optimize it using log-likelihood. It’s simple to optimize because there are no latent distributions.

This represents one type of model structure (structure, not learning method). Every output is conditioned on the past:

$p(x_1, \dots, x_n) = p(x_1)\, p(x_2 \mid x_1) \cdots p(x_n \mid x_{n-1}, \dots, x_1)$
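As a minimal sketch of this factorization (assuming a hypothetical `model` that maps a prefix $x_{<i}$ to a `torch.distributions` object over $x_i$), the joint log-likelihood is just a sum of conditional log-probs:

```python
import torch

def autoregressive_log_prob(model, x):
    """Chain rule: log p(x) = sum_i log p(x_i | x_{<i}).

    `model(prefix)` is a hypothetical interface assumed to return a
    torch.distributions object over the next element given the prefix.
    """
    log_prob = 0.0
    for i in range(x.shape[-1]):
        dist = model(x[..., :i])               # p(x_i | x_{<i})
        log_prob = log_prob + dist.log_prob(x[..., i])
    return log_prob  # differentiable, so MLE works directly
```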

Types of autoregressive models

Autoregressive models act like masked autoencoders, masked such that each output depends only on earlier inputs (otherwise, we lose the clean chain-rule structure).

We add history to autoregressive models, but this doesn’t make life more difficult; we are still ultimately parameterizing a simple, computable, differentiable distribution.

Maximum Likelihood

Derived from KL objective and estimated with Monte Carlo
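As a quick sketch of that derivation: the data entropy term is constant in $\theta$, so minimizing the KL is the same as maximizing expected log-likelihood, which we estimate from samples:

$D_{KL}(p_{data} \,\|\, p_\theta) = \underbrace{E_{x \sim p_{data}}[\log p_{data}(x)]}_{\text{const. in } \theta} - E_{x \sim p_{data}}[\log p_\theta(x)]$

$\arg\max_\theta E_{x \sim p_{data}}[\log p_\theta(x)] \approx \arg\max_\theta \frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x^{(i)}), \quad x^{(i)} \sim p_{data}$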

VAE

You want to parameterize $p(x, z)$, but you want to do MLE on just $x$. What do you do?

Now, from Jensen’s inequality, we can lower-bound the log-prob as

$\log p(x) = \log E_{z \sim q(z)}\!\left[\frac{p(x, z)}{q(z)}\right] \ge E_{z \sim q(z)}\!\left[\log \frac{p(x, z)}{q(z)}\right] = \text{ELBO}$

And this gets you a differentiable lower bound on $\log p(x)$, which can be used for optimization.

Note that any $q$ works, although some work better (the bound is tight when $q(z) = p(z \mid x)$, which is intractable).

Further analysis

If you start from the KL divergence, you get a different identity that outlines the tightness of the bound:

$\log p(x) = \text{ELBO} + D_{KL}(q(z) \,\|\, p(z \mid x; \theta))$

Practical Stuff

Generator gradient: propagate through the expectation, using a Monte Carlo estimate over samples from $q$ to get

$\nabla_\theta E_{q(z)}[\log p(x, z; \theta)] \approx \frac{1}{K} \sum_{k=1}^{K} \nabla_\theta \log p(x, z^{(k)}; \theta), \quad z^{(k)} \sim q$

Note that you need some notion of the prior, i.e. you might factor $p(x, z) = p(x \mid z)\, p(z)$. This is the most common approach.

Posterior gradient: use the reparameterization trick on a distribution that can be shifted and scaled (e.g. a Gaussian: $z = \mu + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, so gradients flow through $\mu$ and $\sigma$).

We usually amortize $q$ by making it a function of $x$, i.e. $q(z \mid x)$.

Autoencoder interpretation

You can rearrange the ELBO into a reconstruction objective and a KL regularizer:

$\text{ELBO} = E_{q(z \mid x)}[\log p(x \mid z)] - D_{KL}(q(z \mid x) \,\|\, p(z))$

which shows us that this is just a regularized autoencoder! This is typically the objective that you use. It also shows us that if we aren’t careful, we may deal with posterior collapse, where we learn to ignore $z$.
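A minimal PyTorch sketch of this objective, assuming hypothetical `enc`/`dec` networks where the encoder returns a diagonal-Gaussian mean and log-variance and the decoder is a unit-variance Gaussian:

```python
import torch

def vae_loss(enc, dec, x):
    # Encoder gives q(z|x) = N(mu, diag(exp(logvar)))
    mu, logvar = enc(x)
    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * logvar) * eps
    # Reconstruction term E_q[log p(x|z)], up to an additive constant
    x_hat = dec(z)
    recon = -0.5 * ((x - x_hat) ** 2).sum(dim=-1)
    # KL(q(z|x) || N(0, I)) in closed form for diagonal Gaussians
    kl = 0.5 * (mu ** 2 + logvar.exp() - 1.0 - logvar).sum(dim=-1)
    return -(recon - kl).mean()  # negative ELBO, to minimize
```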

PROS AND CONS

Normalizing Flow

Instead of computing $p(x \mid z)$, we can compute some invertible mapping. We have a complicated $x$ and a simple $z$. We can map from $z$ to $x$ through $f$ (sampling) and from $x$ to $z$ through $f^{-1}$ (evaluation).

You can train a normalizing flow directly through a maximum likelihood objective, via the change-of-variables formula:

$\log p(x) = \log p_z\!\left(f^{-1}(x)\right) + \log \left| \det \frac{\partial f^{-1}(x)}{\partial x} \right|$

The problem is often computing the determinant; we want a Jacobian whose determinant can be computed quickly. Often, this means triangular Jacobians, whose determinant is just the product of the diagonal.

NICE

The idea here is to create $f$ such that the inverse is very easy to compute. We do this by changing only one half of the variables:

$y_{1:d} = x_{1:d}, \qquad y_{d+1:n} = x_{d+1:n} + m_\theta(x_{1:d})$

The Jacobian is therefore easily defined,

$\frac{\partial y}{\partial x} = \begin{pmatrix} I_d & 0 \\ \frac{\partial m_\theta(x_{1:d})}{\partial x_{1:d}} & I_{n-d} \end{pmatrix}$

and its determinant evaluates to 1. If we use different partitions and compose the $f$’s together, we get good mixing.

Real NVP

We make a simple modification: we also scale one of the components by a function,

$y_{1:d} = x_{1:d}, \qquad y_{d+1:n} = x_{d+1:n} \odot \exp\!\left(s_\theta(x_{1:d})\right) + t_\theta(x_{1:d})$

which also yields a Jacobian that is quite easy to compute: it is still triangular, so $\log |\det| = \sum_i s_\theta(x_{1:d})_i$.
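A sketch of one affine coupling layer, where `s_net` and `t_net` are hypothetical networks that only ever see the frozen half:

```python
import torch

def coupling_forward(x1, x2, s_net, t_net):
    """Real NVP affine coupling: x1 passes through, x2 is scaled and shifted."""
    s = s_net(x1)                      # log-scale, depends only on x1
    t = t_net(x1)
    y1 = x1
    y2 = x2 * torch.exp(s) + t
    log_det = s.sum(dim=-1)            # triangular Jacobian: log|det| = sum(s)
    return y1, y2, log_det

def coupling_inverse(y1, y2, s_net, t_net):
    """The inverse is just as cheap: undo the shift, then the scale."""
    s = s_net(y1)
    t = t_net(y1)
    return y1, (y2 - t) * torch.exp(-s)
```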

Autoregression and flow

If you have an autoregressive model that outputs parameterized Gaussians, the reparameterization trick tells us that you’re actually computing a flow from a set of $n$ i.i.d. Gaussians to the $x$ output space. The generation is autoregressive, so it entangles these independent noise elements.

Mapping $z \to x$ is slow because it is autoregressive. On the other hand, given all of $x$, you can easily compute all the $z$’s in parallel, which helps with training.
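A sketch of the two directions, with hypothetical conditioners `mu_fn(prefix, i)` and `sigma_fn(prefix, i)` standing in for the autoregressive network:

```python
import torch

def sample(z, mu_fn, sigma_fn):
    # z -> x must be sequential: each x_i needs the already-generated x_{<i}.
    x = torch.zeros_like(z)
    for i in range(z.shape[-1]):
        x[..., i] = mu_fn(x[..., :i], i) + sigma_fn(x[..., :i], i) * z[..., i]
    return x

def evaluate(x, mu_fn, sigma_fn):
    # x -> z: every conditioner sees the real x, so all z_i are independent
    # computations (one pass of a masked network in practice; a loop here
    # only for clarity).
    z = torch.zeros_like(x)
    for i in range(x.shape[-1]):
        z[..., i] = (x[..., i] - mu_fn(x[..., :i], i)) / sigma_fn(x[..., :i], i)
    return z
```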

You can speed up sampling by inverting this construction: what if we made the autoregressive computation happen on the $z$’s? By using a sliding window over the $z$’s, we still get an autoregressive dependence structure in $x$. However, because we know all the $z$’s ahead of time, we can compute the per-step parameters in parallel.

Intuitively, this works because of the data processing inequality: if we generated all $x$ from the $z$’s, then the $z$’s carry at least as much information as the $x$’s, so it’s fine to keep using the $z$’s (instead of the $x$’s) to generate more $x$’s.

Of course, no free lunch: the inverse mapping is now autoregressive.

Note: there’s nothing special about using the Gaussian, other than the interpretation of means and variances; we can use any distribution.

Other normalizing flow models

PROS AND CONS

GAN

All the models so far have been optimizing a maximum likelihood objective (i.e. the model should assign high likelihood to the data). What if we start to consider different divergences?

To do this, we just play a min-max game between a generator and a discriminator:

$\min_G \max_D \; E_{x \sim p_{data}}[\log D(x)] + E_{z \sim p(z)}[\log(1 - D(G(z)))]$

The optimal discriminator will satisfy

$D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_G(x)}$

The proof comes from calculus. Ultimately, if this optimal discriminator is reached, the outer objective can be rewritten as the JS divergence (up to constants), which means that the GAN is secretly optimizing a different divergence.

However, this is only a theoretical guarantee. In reality, we can’t reach this optimal discriminator, so we just alternate gradient updates on the two networks.
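A sketch of those alternating updates (hypothetical `G`, `D`, and optimizers; `D` is assumed to output one logit per example, and the generator uses the common non-saturating loss rather than the minimax one):

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_g, opt_d, x_real, z_dim):
    n = x_real.shape[0]
    z = torch.randn(n, z_dim)
    ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

    # Discriminator step: push D(real) -> 1 and D(fake) -> 0.
    d_loss = F.binary_cross_entropy_with_logits(D(x_real), ones) \
           + F.binary_cross_entropy_with_logits(D(G(z).detach()), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: push D(fake) -> 1 (non-saturating loss).
    g_loss = F.binary_cross_entropy_with_logits(D(G(z)), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```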

f-GAN

We can generalize the GAN objective such that there exists a GAN for every $f$-divergence. It turns out we can derive (see real notes) a variational lower bound for the divergence,

$D_f(p \,\|\, q) \ge \sup_T \; E_{x \sim p}[T(x)] - E_{x \sim q}[f^*(T(x))]$

which means that you can construct a GAN objective,

$\min_\theta \max_\phi \; E_{x \sim p_{data}}[T_\phi(x)] - E_{x \sim p_\theta}[f^*(T_\phi(x))]$

where $f^*$ is the Fenchel conjugate. Intuitively, the inner objective makes the lower bound close to the true divergence, and the outer objective minimizes it. Of course, we might not get there, and that’s fine. It’s similar to the VAE, where you might be optimizing a slightly different objective.

Wasserstein GAN

The Wasserstein distance is defined as

$W(p, q) = \inf_{\gamma \in \Pi(p, q)} E_{(x, y) \sim \gamma}[\|x - y\|]$

where $\Pi(p, q)$ is the set of couplings with marginals $p$ and $q$ (intuitively, the cheapest way to transport mass from $p$ to $q$).

And it’s often a smoother distance function between $p, q$. Via the Kantorovich–Rubinstein duality, we can create a similar GAN objective:

$\min_\theta \max_{\phi:\, D_\phi \text{ 1-Lipschitz}} \; E_{x \sim p_{data}}[D_\phi(x)] - E_{x \sim p_\theta}[D_\phi(x)]$

And in fact, it looks suspiciously like the vanilla GAN. Indeed, the main difference is that we need to maintain the Lipschitz property of the discriminator, e.g. by clipping its weights.
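A sketch of the critic update with weight clipping (`critic`, `G`, and the optimizers are hypothetical; the clip value is illustrative):

```python
import torch

def critic_step(critic, G, opt_c, x_real, z_dim, clip=0.01):
    z = torch.randn(x_real.shape[0], z_dim)
    # Maximize E[D(real)] - E[D(fake)]  <=>  minimize its negation.
    loss = -(critic(x_real).mean() - critic(G(z).detach()).mean())
    opt_c.zero_grad(); loss.backward(); opt_c.step()
    # Crude Lipschitz enforcement: clip every weight into [-clip, clip].
    with torch.no_grad():
        for p in critic.parameters():
            p.clamp_(-clip, clip)
```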

Making Latent representations

We can make latent representations by adding an explicit objective that encodes the data.

In general, we can start to have different domains that we transfer between

PROS AND CONS

Energy Models

In all previous setups, we had to model an explicit probability distribution. We did this by parameterizing a distribution or using a categorical distribution. However, this can be limiting. Can we implicitly learn through an energy function?

We define the energy-based model through an energy function $f_\theta$, where

$p_\theta(x) = \frac{1}{Z(\theta)} \exp(f_\theta(x))$

Contrastive Divergence

We can try to directly optimize the likelihood, which gets us this gradient objective:

$\nabla_\theta \log p_\theta(x) = \nabla_\theta f_\theta(x) - E_{x' \sim p_\theta}[\nabla_\theta f_\theta(x')]$

Intuitively, we’re trying to maximize the likelihood of the data and contrast this with the model’s own performance.
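As a loss, this is essentially two lines (a sketch; `x_model` is assumed to come from some MCMC sampler, like the one sketched below):

```python
def contrastive_divergence_loss(f, x_data, x_model):
    # Maximize f on data, minimize f on the model's own samples;
    # the intractable log Z(theta) cancels out of the gradient.
    return -(f(x_data).mean() - f(x_model).mean())
```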

Sampling

To sample from an EBM, you need to use MCMC, which essentially takes noisy steps towards higher likelihood. You can also compute the score function of an EBM directly, which allows you to do Langevin MCMC.
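A sketch of Langevin MCMC for an EBM $f_\theta$ (step size and step count are illustrative, not tuned):

```python
import torch

def langevin_sample(f, x_init, n_steps=100, step_size=0.01):
    x = x_init.clone()
    for _ in range(n_steps):
        x = x.detach().requires_grad_(True)
        # Score of the EBM: grad_x log p(x) = grad_x f(x), since log Z(theta)
        # does not depend on x.
        score = torch.autograd.grad(f(x).sum(), x)[0]
        # Noisy step uphill in log-likelihood.
        x = x + 0.5 * step_size * score \
              + (step_size ** 0.5) * torch.randn_like(x)
    return x.detach()
```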

Score Matching

Instead of using contrastive divergence, we can just take the Fisher Divergence between the score of the data and the score of the model.

$D_F(p, q) = \frac{1}{2} E_{x \sim p}\!\left[\|\nabla_x \log p(x) - \nabla_x \log q(x)\|_2^2\right]$

After a derivation (integration by parts), you can get

$D_F(p, q) = E_{x \sim p}\!\left[\frac{1}{2}\|\nabla_x \log q(x)\|_2^2 + \mathrm{tr}\!\left(\nabla_x^2 \log q(x)\right)\right] + \text{const}$

which allows you to optimize without needing the score of the data distribution. More importantly, this allows you to optimize the model without sampling from it.
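A sketch of that objective with autograd; the per-dimension Hessian-trace loop is fine for small dimensionality (in practice one often switches to sliced or denoising variants):

```python
import torch

def score_matching_loss(f, x):
    """Hyvarinen objective: E[ 1/2 ||s(x)||^2 + tr(grad_x s(x)) ],
    where s(x) = grad_x f(x) is the model's score."""
    x = x.detach().requires_grad_(True)
    score = torch.autograd.grad(f(x).sum(), x, create_graph=True)[0]  # (B, D)
    trace = 0.0
    for i in range(x.shape[1]):
        # d score_i / d x_i, one dimension at a time
        trace = trace + torch.autograd.grad(
            score[:, i].sum(), x, create_graph=True)[0][:, i]
    return (0.5 * (score ** 2).sum(dim=1) + trace).mean()
```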

Noise Contrastive Estimation

Here, you can frame the EBM as a noise discrimination problem, and if you parameterize the discriminator $D$ in a special way (in terms of the model density and the noise density), you will implicitly train the EBM and the partition function. This all depends on the data samples, not the EBM samples. We can make this better by creating a better noise distribution for contrastive learning (which can be learned adversarially).

Variational Paradigm

We can also estimate a lower bound of the log-likelihood and get a VAE-style paradigm.

PROS AND CONS