Math refresher

Tags: CS 236

Advice

Math review, proof tips

Jensen’s inequality: $E[\log(f(x))] \leq \log E[f(x)]$; the inequality goes the opposite way for convex functions (log is concave). Useful because moving the log inside the expectation usually creates a simpler expression. This is your friend.
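A quick Monte Carlo sketch of the inequality, using a hypothetical positive function $f(x) = 1 + x^2$ and a standard normal $X$ (both choices are just for illustration):

```python
import numpy as np

# Sanity check of Jensen's inequality E[log f(X)] <= log E[f(X)]
# for the hypothetical f(x) = 1 + x^2 and X ~ N(0, 1).
rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
fx = 1.0 + x ** 2

lhs = np.mean(np.log(fx))   # E[log f(X)]
rhs = np.log(np.mean(fx))   # log E[f(X)]
print(lhs, rhs)             # lhs < rhs, since log is concave
```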

Mapping between distributions: $g(Y) = f(h^{-1}(Y))\,|\nabla_Y h^{-1}(y)|$, where $f$ is potentially a simpler function, $h$ is a mapping from simple to complicated, and $Y$ is the complicated distribution.

Fenchel conjugate: the convex conjugate of a function: $f^*(t) = \sup_u (ut - f(u))$.

F-divergence: anything of the form $D_f(p, q) = E_q[f(p/q)]$, where $f$ is convex and lower semicontinuous with $f(1) = 0$. F-divergences are always nonnegative.

KL divergence: $E_p[\log(p/q)]$

Bayes’ rule (duh): $p(x \mid y) = \frac{p(y \mid x)\,p(x)}{\sum_{x'} p(y \mid x')\,p(x')}$
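A minimal discrete sketch of the rule (the prior and likelihood values below are made up):

```python
import numpy as np

# Discrete Bayes rule: p(x|y) = p(y|x) p(x) / sum_x' p(y|x') p(x').
p_x = np.array([0.7, 0.3])            # hypothetical prior over x in {0, 1}
p_y_given_x = np.array([[0.9, 0.1],   # rows: x, cols: y
                        [0.2, 0.8]])

y = 1                                  # observed value of y
joint = p_y_given_x[:, y] * p_x        # p(y|x) p(x) for each x
posterior = joint / joint.sum()        # normalize by the evidence
print(posterior)  # [0.07/0.31, 0.24/0.31] ≈ [0.2258, 0.7742]
```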

Subtract-a-KL trick: you know that KL is never negative, so if you can add or subtract a KL term, you can produce a bound.

Showing minimization, etc.: see if you can rearrange into a KL divergence or some other f-divergence. To do this, start with the difference from the optimum (i.e. the optimal value should be 0).

Distribution tricks

Expectation

Some Basics

Jensen’s inequality

Lower bound computation

Gaussian formula

Expectation properties

Divergences

The KL divergence of two same-variance Gaussians is simply the scaled squared distance between the means (taking $\epsilon$ to be the shared variance; note the factor of 2):

$D_{KL}(N(\theta, \epsilon) \,\|\, N(\theta_0, \epsilon)) = \frac{(\theta - \theta_0)^2}{2\epsilon}$
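A numerical sketch of this formula, assuming $\epsilon$ denotes the shared variance (the specific $\theta, \theta_0, \epsilon$ values are arbitrary):

```python
import numpy as np

# Check D_KL(N(theta, eps) || N(theta0, eps)) = (theta - theta0)^2 / (2 eps),
# with eps the shared *variance*, against direct numerical integration.
theta, theta0, eps = 1.5, 0.5, 2.0

x = np.linspace(-20.0, 20.0, 400_001)
dx = x[1] - x[0]
p = np.exp(-(x - theta) ** 2 / (2 * eps)) / np.sqrt(2 * np.pi * eps)
q = np.exp(-(x - theta0) ** 2 / (2 * eps)) / np.sqrt(2 * np.pi * eps)

kl_numeric = np.sum(p * np.log(p / q)) * dx   # Riemann sum of E_p[log p/q]
kl_closed = (theta - theta0) ** 2 / (2 * eps)
print(kl_numeric, kl_closed)  # both ≈ 0.25
```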

Estimators

An unbiased estimator of some parameter $A$ means that $E[\hat{A}] = A$ (plus some convergence conditions surrounding this assumption). Just because $\hat{A}$ is an unbiased estimator doesn’t mean that $f(\hat{A})$ will be an unbiased estimator of $f(A)$. It’s true for linear functions but not generally true for all functions.
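A sketch of this pitfall: the sample mean $\bar{X}$ is unbiased for $\mu$, but $\bar{X}^2$ is biased for $\mu^2$, since $E[\bar{X}^2] = \mu^2 + \sigma^2/n$ (the parameter values below are illustrative):

```python
import numpy as np

# The sample mean is unbiased for mu, but its square overshoots mu^2
# by sigma^2 / n on average.
rng = np.random.default_rng(0)
mu, sigma, n, trials = 2.0, 3.0, 10, 200_000

samples = rng.normal(mu, sigma, size=(trials, n))
xbar = samples.mean(axis=1)

print(xbar.mean())          # ≈ 2.0 (unbiased for mu)
print((xbar ** 2).mean())   # ≈ 4.9 = mu^2 + sigma^2/n, not 4.0
```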

Mapping between distributions / functions

This is a classic problem and it’s very tricky to understand. Here’s the setup: you have some known $f(x)$, some invertible mapping function $h : X \to Y$, and you want to find $g(y)$ as a function of $f(x)$.

Non-distribution setup

Without the question of distribution, this is actually really simple.

$g(y) = f(h^{-1}(y))$

You take your yy and you map it into the domain that works for ff. No biggie.

Distribution-based setup

Why doesn’t this work if $f$ is a density function? It has to do with density. Imagine $x$ and $y$ live on rubber bands: if the mapping $h$ has to stretch or shrink the rubber band, the same mass ends up spread over a different scale, which changes the integral.

More formally, if $dx \neq dy$, we’ve got a problem, which we can solve with standard change-of-variable techniques. In calculus, we used these techniques (u-substitution) to map between two domains while keeping the integral equivalent.

  1. Start with $f(x)\,dx$
  1. Find $h^{-1}(y)$
  1. Compute $g(y) = f(h^{-1}(y))\,\left|\frac{dh^{-1}(y)}{dy}\right|$ in terms of $y$, getting you the final answer

From this, we can derive a general rule (using the derivative of the inverse function general form):

$g(y) = \frac{f(h^{-1}(y))}{|h'(h^{-1}(y))|}$
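A worked sketch of this rule with the hypothetical map $h(x) = e^x$ applied to a standard normal: then $h^{-1}(y) = \log y$ and $h'(h^{-1}(y)) = y$, which gives the log-normal density.

```python
import numpy as np

# 1-D change of variables: g(y) = f(h^{-1}(y)) / |h'(h^{-1}(y))|
# for h(x) = exp(x) and f the standard normal density.
def f(x):
    return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

def g(y):
    return f(np.log(y)) / y   # the log-normal density

# Sanity check: g should still integrate to 1.
y = np.linspace(1e-6, 200.0, 2_000_001)
dy = y[1] - y[0]
print(np.sum(g(y)) * dy)  # ≈ 1.0
```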

Generalization to vector-distributions

For vector distributions, we are dealing with a multi-variable transformation. Like the one-dimensional case, this transformation must be invertible; for a linear map, the matrix must be full rank.

Here, instead of worrying about $dx$ vs. $dy$, we are worried about mapping a volume element $dv$ onto $dv'$. As we’ve learned in multivariate calculus, the convenient notion is the Jacobian determinant, i.e.

$g(Y) = f(h^{-1}(Y))\,|\nabla_Y h^{-1}(y)|$

Note how the forms are super similar; we just replace the derivative with the determinant of the Jacobian. Recall that the Jacobian describes how the unit volume stretches or shrinks under the transformation, so this is intuitive. You’re almost “paying tax” by “unshrinking” from $y$ to $x$ before you apply the function.

Intuition

Let’s jump directly to the higher dimension.

  1. $|\nabla_Y h^{-1}(y)|$ we know represents how much the space stretches as we apply the transformation from $Y \to X$.
  1. $f(h^{-1}(Y))$ is the density in X-space. If the $Y \to X$ map expands volume, everything is naturally divided by $|\nabla_Y h^{-1}(y)|$ as we go into X-land. Therefore, to reverse this process, we multiply the value by $|\nabla_Y h^{-1}(y)|$.
  1. If the $Y \to X$ map shrinks volume, the same logic applies, just with a $|\nabla_Y h^{-1}(y)|$ that is less than 1.

You can think of a transformation as naturally dividing by the “stretch” factor $|\nabla_Y h^{-1}(y)|$, and you need to compensate as you map it back.

Worked example

If we have $p_X(x)$ known and we have a nice invertible function $f: X \rightarrow Y$, then we can compute $p_Y(Y) = p_X(f^{-1}(Y))\,|(f^{-1})'(Y)|$. Often, we get $f^{-1}$ at the start, which is fine. So, you just scale your values by $|(f^{-1})'(Y)|$. If the transformation is linear, this factor is a constant.

💡
Make sure to account for DIMENSION. The derivative version is only true in one dimension. For larger dimensions, you need to consider how the volume changes, via the Jacobian determinant.
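A sketch of the multivariate case for an invertible linear map $Y = AX$ (the matrix $A$ is made up): here $p_Y(y) = p_X(A^{-1}y)\,|\det A^{-1}|$, and the Jacobian factor is a constant.

```python
import numpy as np

# Multivariate change of variables for Y = A X with X standard 2-D Gaussian:
# p_Y(y) = p_X(A^{-1} y) * |det A^{-1}|.
A = np.array([[2.0, 1.0],
              [0.0, 3.0]])
A_inv = np.linalg.inv(A)

xs = np.linspace(-15.0, 15.0, 301)
dx = xs[1] - xs[0]
grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)  # all y points

x_back = grid @ A_inv.T                                        # A^{-1} y per point
p_x = np.exp(-0.5 * np.sum(x_back ** 2, axis=1)) / (2 * np.pi)
p_y = p_x * abs(np.linalg.det(A_inv))

total = p_y.sum() * dx * dx   # should be ≈ 1 if p_Y is a valid density
print(total)
```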

Convexity

Fenchel Conjugate

Any function has a convex conjugate known as the Fenchel conjugate: $f^*(t) = \sup_u (ut - f(u))$.

This is a generalization of Lagrangian duality; intuitively, conjugating twice recovers the convex hull of the function (its largest convex lower bound).

This $f^*$ is convex and lower semicontinuous. We can take more Fenchel conjugates, i.e. you can take the second Fenchel conjugate $f^{**}$.

Properties of Fenchel Conjugate

We have $f^{**} \leq f$. If $f$ is convex and lower semicontinuous, then $f^{**} = f$.
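As a sketch, the conjugate can be approximated numerically by maximizing over a grid; here we check it against the known closed form for $f(u) = u \log u$, namely $f^*(t) = e^{t-1}$ (the grid and test points are arbitrary choices):

```python
import numpy as np

# Numerical Fenchel conjugate f*(t) = sup_u (u t - f(u)), compared with
# the closed form exp(t - 1) for f(u) = u log u on u > 0.
def conjugate(f, t, us):
    return np.max(us * t - f(us))

f = lambda u: u * np.log(u)
us = np.linspace(1e-9, 50.0, 2_000_001)  # grid over the domain of f

for t in [-1.0, 0.0, 1.0, 2.0]:
    print(t, conjugate(f, t, us), np.exp(t - 1))  # numeric ≈ closed form
```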

F-divergences

Given two densities $p$ and $q$, we define a general f-divergence as

$D_f(p, q) = E_q\!\left[f\!\left(\frac{p(x)}{q(x)}\right)\right]$

where $f$ is any convex, lower-semicontinuous function with $f(1) = 0$. Lower-semicontinuous basically means that at any point $x_0$, $f(x_0) \leq \liminf_{x \to x_0} f(x)$: the function may jump, but its value at the jump must not sit above the surrounding limit.

Properties of F-divergences

Always nonnegative: by Jensen’s inequality, $D_f(p, q) = E_q[f(p/q)] \geq f(E_q[p/q]) = f(1) = 0$.

Examples of F-divergences

There are so many types of F-divergences. A common one is KL divergence, where $f(u) = u\log u$ (careful! It’s not $\log u$, because $E_q[(p/q)\log(p/q)] = E_p[\log(p/q)]$ — we need $p \log(p/q)$, not $q \log(p/q)$). Total variation is also a common one, where $f(u) = |u - 1|$.
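A small discrete sketch checking both choices of $f$ (the distributions $p, q$ are made up):

```python
import numpy as np

# On a discrete space, f(u) = u log u recovers KL divergence,
# and f(u) = |u - 1| gives sum_x |p(x) - q(x)|.
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

f_kl = lambda u: u * np.log(u)
f_div = np.sum(q * f_kl(p / q))   # E_q[f(p/q)]
kl = np.sum(p * np.log(p / q))    # E_p[log(p/q)]
print(f_div, kl)                  # equal

f_tv = lambda u: np.abs(u - 1)
print(np.sum(q * f_tv(p / q)), np.abs(p - q).sum())  # both 0.4
```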