Normalizing Flows

Tags: CS 236

The big idea

For VAEs, we couldn’t compute p(x) directly, which meant that we needed the ELBO objective to optimize it. Could we find some p(x) that is complicated yet still easy to compute?

The key idea here is to find some invertible function f(z) = x so that p(x) can be recovered exactly from p(z) through a change of variables. This is easy to compute and optimize over.

Background theory: transformation of distribution

When you transform a distribution through an invertible map x = f(z), you need to rescale the density by the Jacobian determinant of the inverse map:

p_X(x) = p_Z(f^{-1}(x)) \left| \det \frac{\partial f^{-1}(x)}{\partial x} \right|

You can derive this through the CDF, but more intuitively: an invertible map locally stretches or compresses space, and probability mass must be conserved, so wherever the map expands volume the density must shrink by the same factor (and vice versa). The Jacobian determinant measures exactly that local change in volume.
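As a quick sanity check of the formula (this little example is mine, not from the course), here is a numerical verification using the invertible map x = e^z applied to a standard Gaussian, whose pushforward is exactly log-normal:

```python
# Numerical check of p_X(x) = p_Z(f^{-1}(x)) |d f^{-1}/dx| for f(z) = exp(z).
import numpy as np
from scipy.stats import norm, lognorm

x = np.linspace(0.1, 5.0, 50)
z = np.log(x)                     # f^{-1}(x)
jac = 1.0 / x                     # |d f^{-1} / dx|

p_x_flow = norm.pdf(z) * jac      # density via change of variables
p_x_true = lognorm.pdf(x, s=1.0)  # exp(N(0, 1)) is exactly log-normal

assert np.allclose(p_x_flow, p_x_true)
```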

The construction of normalizing flow

Functional Composition

If you had a bunch of invertible functions f^i, you can compose them into a more complicated but still invertible function.

And using change of variables plus the fact that the determinant of a product of Jacobians is the product of their determinants, we get

\log p(x) = \log p(z_0) + \sum_{m=1}^{M} \log \left| \det \frac{\partial (f^m)^{-1}(z_m)}{\partial z_m} \right|

where z_0 is the base sample, z_m = f^m(z_{m-1}), and x = z_M.

Concretely, this z_m would be some data vector the same size as the final output. If this were an image, you would flatten it, or you might use a special convolutional function (more on this later).
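To make the composition concrete, here is a schematic sketch (the layer interface and names are hypothetical, not from any particular library) of how stacking invertible layers turns into a sum of log-determinants in the log-likelihood:

```python
# log p(x) = log p_Z(z_0) + sum_m log|det d(f^m)^{-1}/dz_m|, accumulated layer by layer.
import numpy as np

class AffineLayer:
    """Toy invertible layer z_m = a * z_{m-1} + b, with a, b vectors the size of z."""
    def __init__(self, a, b):
        self.a, self.b = a, b

    def inverse_and_logdet(self, z_m):
        z_prev = (z_m - self.b) / self.a
        logdet = -np.sum(np.log(np.abs(self.a)))  # log|det| of the elementwise inverse map
        return z_prev, logdet

def log_prob(x, layers, log_p_z):
    """Invert the composed flow from the data side back to z_0, summing log-dets."""
    z, total_logdet = x, 0.0
    for layer in reversed(layers):
        z, logdet = layer.inverse_and_logdet(z)
        total_logdet += logdet
    return log_p_z(z) + total_logdet

# Example with two toy layers and a standard-normal base distribution.
layers = [AffineLayer(np.array([2.0, 0.5]), np.array([0.0, 1.0])),
          AffineLayer(np.array([1.5, 1.5]), np.array([-1.0, 0.0]))]
log_p_z = lambda z: -0.5 * np.sum(z**2 + np.log(2 * np.pi))
print(log_prob(np.array([0.3, 2.0]), layers, log_p_z))
```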

Learning and using the model

Because the likelihood is now exact, we learn by maximizing \log p(x) on the data directly (no ELBO needed). To sample, we draw z from the prior and push it through f; to get a latent representation of a datapoint, we apply f^{-1}.

Desiderata

We want transformations that are (1) invertible, (2) cheap to invert, so sampling is fast, (3) have a Jacobian determinant that is cheap to compute, so likelihoods are fast, and (4) still expressive. Triangular Jacobians, which the constructions below rely on, give us (3).

Implementations of Normalizing Flows

NICE (additive coupling layers)

The idea: partition z into two parts, z_{1:d} and z_{d+1:n}. Then, in the forward function, the first part passes through unchanged and the second part gets an additive shift:

x_{1:d} = z_{1:d}
x_{d+1:n} = z_{d+1:n} + m_\theta(z_{1:d})

As you can see, this follows the triangular formulation, and it is trivial to invert if you have access to m_\theta and x: first, you get z_{1:d} for free, which will easily allow you to compute z_{d+1:n} from x_{d+1:n} and m_\theta.

The Jacobian is very convenient: it is lower triangular and the diagonal terms are all ones, so the volume is preserved.

This also means that we don’t actually have to compute the determinant: \det(J) = 1.

We assemble the NICE model by stacking these coupling layers with different (arbitrary) partitions, so the dimensions mix well. The final layer applies a rescaling transformation, multiplying each dimension by a learned scaling factor.
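A minimal sketch of the additive coupling layer described above (m_theta stands in for an arbitrary neural network; the function names are mine):

```python
import numpy as np

def nice_forward(z, m_theta, d):
    x = z.copy()                                    # z_{1:d} passes through unchanged
    x[..., d:] = z[..., d:] + m_theta(z[..., :d])   # additive shift on the second block
    return x                                        # log|det J| = 0: volume preserving

def nice_inverse(x, m_theta, d):
    z = x.copy()                                    # x_{1:d} = z_{1:d}, recovered for free
    z[..., d:] = x[..., d:] - m_theta(x[..., :d])   # subtract the same shift
    return z
```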

Real-NVP

Real-NVP is only a small step up from NICE, but we get much better results. Instead of only shifting the second partition, we also scale it elementwise:

x_{1:d} = z_{1:d}
x_{d+1:n} = z_{d+1:n} \odot \exp(\alpha_\theta(z_{1:d})) + \mu_\theta(z_{1:d})

As before, we can derive the inverse by computing z_{1:d} for free, and using it to reverse the scale-and-shift operation and get z_{d+1:n}.

The Jacobian is slightly more involved, but it’s still lower triangular: the top-left block is the identity and the bottom-right block is diagonal with entries \exp(\alpha_\theta(z_{1:d})_i),

which means that the determinant is just

\det(J) = \prod_i \exp(\alpha_\theta(z_{1:d})_i) = \exp\left(\sum_i \alpha_\theta(z_{1:d})_i\right)

Unlike NICE, this is NOT a volume-preserving transformation. But it produces much better images.
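A matching sketch for the affine coupling layer (again, mu_theta and alpha_theta are stand-ins for networks producing the shift and log-scale; the names are mine):

```python
import numpy as np

def realnvp_forward(z, mu_theta, alpha_theta, d):
    x = z.copy()                                     # z_{1:d} passes through unchanged
    alpha = alpha_theta(z[..., :d])                  # log-scales for the second block
    x[..., d:] = z[..., d:] * np.exp(alpha) + mu_theta(z[..., :d])
    log_det = np.sum(alpha, axis=-1)                 # log|det J| = sum_i alpha_i
    return x, log_det

def realnvp_inverse(x, mu_theta, alpha_theta, d):
    z = x.copy()                                     # x_{1:d} = z_{1:d}
    alpha = alpha_theta(x[..., :d])                  # same parameters, since z_{1:d} = x_{1:d}
    z[..., d:] = (x[..., d:] - mu_theta(x[..., :d])) * np.exp(-alpha)
    return z
```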

Autoregression as Flow

In constructing these triangular Jacobians, we made an interesting observation: this formulation looks really close to an autoregressive model! Let’s establish this connection a bit more carefully.

Masked Autoregressive Flow (MAF)

Suppose you had an autoregressive model

p(x) = \prod_{i=1}^{n} p(x_i \mid x_{<i}),

where each p(x_i \mid x_{<i}) = N(\mu_i(x_{<i}), \sigma_i(x_{<i})). To sample, you can use the reparameterization trick. First, you sample a vector of z’s (standard Gaussian). Then, you autoregressively compute each \mu_i, \sigma_i and shift-and-scale the z_i to form x_i.

The flow interpretation is this: you take samples of z and map them to x using these invertible shift-and-scale transformations parameterized by \mu, \alpha (with \sigma = \exp(\alpha)). This is really similar to Real-NVP and NICE.

The inverse mapping, however, is really fast. If you have all of x, you can compute every \mu, \alpha in parallel, which lets you recover the z’s quickly. Hmm, this is interesting. We have…

  1. Slow forward mapping from z → x, because of the autoregressive structure
  2. Fast inverse mapping from x → z, because the operations parallelize
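Here is a toy sketch of both directions (the cond(i, prefix) interface returning (\mu_i, \alpha_i) is hypothetical; a real MAF gets all parameters from one pass of a masked network):

```python
import numpy as np

def maf_sample(cond, n, rng):
    """Slow direction: x must be generated one dimension at a time."""
    z = rng.standard_normal(n)
    x = np.zeros(n)
    for i in range(n):
        mu_i, alpha_i = cond(i, x[:i])        # depends on already-generated x_{<i}
        x[i] = mu_i + np.exp(alpha_i) * z[i]
    return x

def maf_log_prob(cond, x):
    """Fast direction: every (mu_i, alpha_i) depends only on the observed x_{<i}."""
    params = [cond(i, x[:i]) for i in range(len(x))]
    mu = np.array([m for m, _ in params])
    alpha = np.array([a for _, a in params])
    z = (x - mu) * np.exp(-alpha)             # invert the shift-and-scale
    log_p_z = -0.5 * np.sum(z**2 + np.log(2 * np.pi))
    return log_p_z - np.sum(alpha)            # change of variables: minus the log-scales
```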

Concretely, this means that likelihoods are fast to compute but sampling is slow. That’s the opposite of the best-case scenario: it’s fine if training is slower, but we want fast sampling. This brings us to an inversion of the structure.

Inverse Autoregressive Flow (IAF)

The key idea is this: in the forward flow, we created an autoregressive relationship in x, which is what made sampling slow. What if we made the autoregressive relationship in z instead? The parameters for x_i now condition on z_{<i}, so we still get an autoregressive dependence structure in x. However, because we know all of the z’s ahead of time when sampling, we can compute the parameters in parallel.

The sampling process: draw all of the z_i from the Gaussian prior at once, compute every \mu_i, \alpha_i from z_{<i} in parallel, and set x_i = \mu_i + \exp(\alpha_i) z_i.

Now, the inverse mapping is slower. There is no free lunch. To recover z_i you need \mu_i, \alpha_i, which depend on z_{<i}, so you invert one dimension at a time: compute z_i = (x_i - \mu_i)\exp(-\alpha_i), then use it to get the parameters for the next dimension. The inverse mapping process is therefore sequential.

This is fast to sample from and slow to train. Side note: because the generations come from zz, it’s easy to compute the likelihood of the model’s own generations by caching the zz.
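And the mirror-image IAF sketch, with the same hypothetical cond(i, prefix) interface but conditioning on z_{<i}:

```python
import numpy as np

def iaf_sample(cond, n, rng):
    """Fast direction: all z_i are known up front, so the parameters could be
    computed in one parallel pass; the loop here is just for clarity."""
    z = rng.standard_normal(n)
    x = np.zeros(n)
    for i in range(n):
        mu_i, alpha_i = cond(i, z[:i])        # depends only on the noise z_{<i}
        x[i] = mu_i + np.exp(alpha_i) * z[i]
    return x, z                               # cache z to score our own samples later

def iaf_invert(cond, x):
    """Slow direction: recovering z_i requires z_{<i}, so it must be sequential."""
    z = np.zeros(len(x))
    for i in range(len(x)):
        mu_i, alpha_i = cond(i, z[:i])
        z[i] = (x[i] - mu_i) * np.exp(-alpha_i)
    return z
```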

Getting the best of both worlds

We know that the forward (MAF-style) model is fast to train but slow to sample from. Can we use it as a teacher to train an inverse (IAF-style) student model, which is hard to train on data directly but fast to sample from?

You distill the teacher into the student by minimizing the KL divergence between the student and the teacher. You sample from the student, because the student can cheaply score its own samples (by caching the z’s). Scoring those samples under the teacher is also cheap, because likelihood evaluation is the teacher’s fast direction.
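One way to write this objective down (this matches probability density distillation as used for Parallel WaveNet; the subscripts s and t for student and teacher are my notation):

D_{KL}(p_s \,\|\, p_t) = E_{z \sim p(z),\; x = f_s(z)}\left[ \log p_s(x) - \log p_t(x) \right]

Both terms inside the expectation are cheap: \log p_s(x) comes for free from the cached z, and \log p_t(x) is the teacher’s fast, parallel direction.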

Diffusion models as flow, flow as score functions

You can (hand-wavily) interpret diffusion models as flow models, which also means that flow models can be viewed as approximating a score function.

Other Structures

MintNet (Song et al. 2019)

Gaussianization flows (Meng et al. 2020)

The idea is to map the data so that each marginal is Gaussian. Then we rotate the distribution; the rotation makes the marginals non-Gaussian again, exposing the remaining dependence between dimensions, and we repeat.

We do this by composing \Phi^{-1} \circ F_{data}, where \Phi^{-1} is the inverse Gaussian CDF and F_{data} is the (marginal) data CDF. This is valid because applying a random variable’s own CDF to it produces a uniform distribution, and applying the inverse Gaussian CDF to a uniform produces a Gaussian.

If we do this multiple times, we can transform any data distribution into a Gaussian, because the true Gaussian is rotationally invariant: it is a fixed point of this procedure.

This composition defines your f_\theta^{-1}, and there’s a nice trick: you can use the KL divergence directly, because

D_{KL}(p_{data} \,\|\, p_\theta) = D_{KL}(q_\theta \,\|\, N(0, I)), where q_\theta is the distribution of f_\theta^{-1}(x) for x \sim p_{data}.

And applying the same invertible function to both arguments doesn’t change the KL divergence.

This is easy to compute because there is a closed form for the KL between two Gaussians.
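A rough sketch of one Gaussianization step on multivariate data (the helper names are mine; a real Gaussianization flow learns parametric marginal CDFs and rotations by maximum likelihood, whereas this just uses empirical ranks and a random rotation to illustrate the loop):

```python
import numpy as np
from scipy.stats import norm

def gaussianize_marginals(X):
    """Apply Phi^{-1} o F_data to each coordinate, with F_data estimated by ranks."""
    n = X.shape[0]
    ranks = X.argsort(axis=0).argsort(axis=0) + 1   # ranks 1..n within each column
    u = ranks / (n + 1)                             # approximate F_data(x), in (0, 1)
    return norm.ppf(u)                              # inverse Gaussian CDF

def gaussianization_step(X, rng):
    Z = gaussianize_marginals(X)                    # marginals are now ~ N(0, 1)
    d = X.shape[1]
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return Z @ Q                                    # rotate to expose remaining dependence
```

Repeating gaussianization_step drives the joint distribution toward N(0, I), per the fixed-point argument above.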

Literature and other works