Discrete Models

Tags: CS 236

Caring about discreteness

There’s a lot of discreteness around us, and a lot of structures like graphs are also discrete. How do we model such structures?

Refresher: likelihoods on discrete variables

For continuous variables, we know how to get the likelihood $p(x)$ by just evaluating the model's density. Therefore, optimizing the likelihood amounts to maximizing $\log p_\theta(x)$.

But what about discrete variables? A discrete model is just a categorical distribution, so instead of evaluating a density you look up the value of $p(x)$ at $x$ in a table. Maximizing the likelihood is then equivalent to minimizing a divergence (e.g., cross-entropy) between the model and a ground-truth one-hot distribution. In fact, you're doing the same thing when you do MLE in the continuous case.
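
As a minimal sketch of this table-lookup view (NumPy; the numbers here are made up for illustration):

```python
import numpy as np

# A categorical model over k = 4 outcomes stored as an explicit probability table.
theta = np.array([0.1, 0.2, 0.3, 0.4])   # model probabilities, sum to 1
data = np.array([3, 1, 3, 2])            # observed discrete samples

# Log-likelihood is just a table lookup at each observed value...
log_lik = np.log(theta[data]).sum()

# ...which equals minus the cross-entropy between the one-hot
# "ground truth" vectors and the model distribution.
one_hot = np.eye(len(theta))[data]
cross_entropy = -(one_hot * np.log(theta)).sum()
assert np.isclose(log_lik, -cross_entropy)
```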

We can represent a discrete distribution as a giant table of probabilities, but other, more compact parameterizations also exist.

We can also think of a sample from a discrete distribution as a $k$-dimensional one-hot vector. Contrast this with a $k$-dimensional sample from a continuous distribution, which can take many more values. In both cases, each vector has an explicit probability (or density) encoded by the distribution. This interpretation is a good unifying view of continuous and discrete distributions.

Derivatives Through Expectation

The problem with discrete latents

So previously, when we dealt with latent variable models, we often needed to compute gradients of expectations of the form $\nabla_\phi \, \mathbb{E}_{q_\phi(z)}[f(z)]$.

For example, in a VAE, we needed to compute an expectation over $q_\phi(z \mid x)$. For that, we used the reparameterization trick to compute the gradient. But what if $q_\phi$ can't be reparameterized as easily as a Gaussian? One of the most problematic cases is discrete distributions, which are generally hard to reparameterize.

Log Derivative Trick

There are two failure cases for the reparameterization trick: either $q_\phi$ can't be reparameterized, or $f(z)$ isn't differentiable. The latter case arises when $f(z)$ is some sort of reward function. However, in RL there exist policy gradient methods that solve a similar objective, and these policy gradient methods use the log-derivative trick.

This trick takes advantage of the fact that $\frac{d}{dx}\log x = \frac{1}{x}$, together with a multiply-by-one trick.
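
Concretely, writing the expectation as a sum over $z$ (integrals work the same way) and multiplying and dividing by $q_\phi(z)$:

$$\nabla_\phi \, \mathbb{E}_{q_\phi(z)}[f(z)] = \sum_z f(z)\,\nabla_\phi q_\phi(z) = \sum_z q_\phi(z)\, f(z)\,\nabla_\phi \log q_\phi(z) = \mathbb{E}_{q_\phi(z)}\big[f(z)\,\nabla_\phi \log q_\phi(z)\big].$$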

This helps us because we have expressed the gradient of an expectation as an expectation itself! Now we can estimate it with a Monte Carlo sample.
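
Concretely, drawing $K$ samples $z^{(1)}, \ldots, z^{(K)} \sim q_\phi(z)$:

$$\nabla_\phi \, \mathbb{E}_{q_\phi(z)}[f(z)] \approx \frac{1}{K}\sum_{k=1}^{K} f\big(z^{(k)}\big)\,\nabla_\phi \log q_\phi\big(z^{(k)}\big).$$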

Control Variates

Because we rely on Monte Carlo estimates, the log-derivative estimator often has very high variance. We know this to be a problem in reinforcement learning as well.

To resolve this, we often subtract some baseline from $f(z)$ (we do this in RL as well); this is known as a control variate.

In practice, we choose some control variate $h(z)$ and use it to define a new $\hat{f}$.
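
A standard construction keeps the expectation unchanged while lowering the variance whenever $h$ is correlated with $f$:

$$\hat f(z) = f(z) - h(z) + \mathbb{E}_{q_\phi(z)}[h(z)], \qquad \mathbb{E}_{q_\phi(z)}[\hat f(z)] = \mathbb{E}_{q_\phi(z)}[f(z)].$$

For a constant baseline $h(z) = B$, this reduces to replacing $f(z)$ with $f(z) - B$ inside the score-function estimator; the estimator stays unbiased because $\mathbb{E}_{q_\phi(z)}[\nabla_\phi \log q_\phi(z)] = 0$.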

Belief Networks

Belief networks are latent variable models whose latent variables are discrete, and the objective is the same ELBO as before.
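
For reference, that objective is

$$\mathcal{L}(x; \theta, \phi) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x, z) - \log q_\phi(z \mid x)\big].$$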

Because $q_\phi$ is discrete, we need to use the log-derivative trick, with control variates to reduce the variance.

Neural Variational Inference and Learning (NVIL)

This is an algorithm that uses the method above. For the control variate, it uses both a constant and an input-dependent $h_\phi(x)$, yielding the objective below.
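
Roughly (a sketch in this document's notation; in the NVIL paper the input-dependent baseline has its own parameters and is trained to regress the learning signal), the resulting score-function gradient for $\phi$ subtracts both baselines from the ELBO integrand:

$$\nabla_\phi \mathcal{L} \approx \mathbb{E}_{q_\phi(z \mid x)}\Big[\big(\log p_\theta(x, z) - \log q_\phi(z \mid x) - c - h_\phi(x)\big)\,\nabla_\phi \log q_\phi(z \mid x)\Big].$$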

Continuous Relaxation

Another option for discrete latent variables is to relax them into a continuous, reparameterizable distribution.

Gumbel Distribution

Given $n$ iid samples, what is the distribution of their maximum?

We call this a Gumbel distribution, and it models extreme events. Its CDF is an exponential of an exponential, parameterized by a location $\mu$ and a scale $\beta$.
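
Concretely, its CDF is

$$F(x; \mu, \beta) = \exp\!\left(-\exp\!\left(-\frac{x - \mu}{\beta}\right)\right),$$

and inverting it gives the standard sampler: if $u \sim \text{Uniform}(0, 1)$, then $\mu - \beta \log(-\log u) \sim \text{Gumbel}(\mu, \beta)$.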

Gumbel-Max Reparameterization Trick

Let $\pi$ be a categorical distribution with (possibly unnormalized) probabilities $\{\pi_1, \pi_2, \ldots, \pi_k\}$. If you can sample from Gumbel(0, 1), draw $g_1, \ldots, g_k$ iid from Gumbel(0, 1). We assert that

$\arg \max_i (g_i + \log \pi_i)$

is distributed according to the categorical distribution with probabilities $\pi_i/\sum_{j}\pi_j$. We can represent each such sample as a one-hot vector.

This is the reparameterization trick because the computation of the arg-max is deterministic; the randomness comes from the Gumbel.
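
As a quick sanity check, here is a NumPy sketch of the trick (the class weights are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.1, 0.6, 0.3])                    # categorical weights
n = 100_000

# Gumbel(0, 1) noise via inverse-CDF: g = -log(-log(u)), u ~ Uniform(0, 1).
u = rng.random((n, len(pi)))
g = -np.log(-np.log(u))

# Arg-max of (log pi + g) is a categorical sample.
samples = np.argmax(np.log(pi) + g, axis=1)
print(np.bincount(samples) / n)                   # ≈ pi / pi.sum()
```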

Gumbel-Softmax Reparameterization

Unfortunately, we can't differentiate through an arg-max, so instead we soften the one-hot vector into a softmax.
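
Concretely, with temperature $\tau > 0$ the relaxed sample is

$$y_i = \frac{\exp\big((\log \pi_i + g_i)/\tau\big)}{\sum_{j=1}^{k} \exp\big((\log \pi_j + g_j)/\tau\big)}, \qquad g_i \sim \text{Gumbel}(0, 1),$$

which recovers the hard one-hot arg-max sample as $\tau \to 0$.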

The temperature $\tau$ controls a bias-variance tradeoff: the lower it is, the closer the samples are to true one-hot categorical samples, but the higher the variance of the gradient estimates. As $\tau$ approaches $\infty$, the samples become uniform.

Geometric Intuition

Why does this even work? Well, let's imagine a distribution over $k$ categories. If we view samples as one-hot vectors in a $k$-dimensional space, a categorical distribution places all of its mass on the vertices of the probability simplex.

Gumbel-softmax softens this by placing a distribution over the interior of this simplex.

Using the Gumbel-Softmax trick

Usually it's enough to just use it in lieu of $q_\phi(z)$, where you control $\phi = (\pi, \tau)$: $\pi$ are the categorical parameters and $\tau$ is the temperature. You usually anneal $\tau$ over time, starting with a high $\tau$.

If you really need an explicitly categorical sample in the forward pass, you can use the straight-through gradient trick: take a hard categorical sample in the forward pass and use the Gumbel-Softmax gradients in the backward pass.
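
A minimal PyTorch-style sketch of both variants (the function name is my own; PyTorch also provides this as torch.nn.functional.gumbel_softmax):

```python
import torch
import torch.nn.functional as F

def gumbel_softmax(logits, tau=1.0, hard=False):
    # Gumbel(0, 1) noise: -log(Exponential(1)) is Gumbel-distributed.
    gumbels = -torch.empty_like(logits).exponential_().log()
    # Relaxed (soft) sample: differentiable w.r.t. the logits (log pi).
    y_soft = F.softmax((logits + gumbels) / tau, dim=-1)
    if not hard:
        return y_soft
    # Straight-through: hard one-hot sample in the forward pass,
    # gradients flow through the soft sample in the backward pass.
    index = y_soft.argmax(dim=-1, keepdim=True)
    y_hard = torch.zeros_like(y_soft).scatter_(-1, index, 1.0)
    return y_hard + (y_soft - y_soft.detach())
```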