Tricks with marginals and complexity

TagsCS 228Inference

This is something that is a little confusing to me, so I thought I would dedicate some writing to it

Marginalizing only part of a set

Let’s say that you wanted P(X)P(X), which means that you need to marginalize out all Y’s from Y1,...YnY_1, ... Y_n. When doing this, don’t worry about where the X’s end up! Only worry about where the intermediate Y’s end up! Because here’s why: you will eventually be removing all Y1,...YNY_1, ... Y_N from the distribution. So any large cliques you make, you will have to deal with them eventually.

This idea is nice and intuitive: when marginalizing, the largest clique you form is the complexity it ends up with. and in fact, that’s all you need to know. A clique of size nn will take knk^n time to marginalize.

But let’s try to understand this at the point of maximum confusion.

Cliques vs chains

Cliques

Cliques of size nn are represented by factors with nn elements. This takes knk^n time to marginalize out.

When you want to marginalize ϕ(a,b,c)\phi(a, b, c), where each variable has kk values each, then you need k3k^3 operations. You can think about ϕ(a,b,c)\phi(a, b, c) as a table that requires k3k^3 elements to convey the whole data. It’s helpful to think that every time you make a clique, you must write out the tabular for it. (this intuition prevents you from backpropping too far)

Chains and dynamic programming

The key to understanding chains and dynamic programming is to separate the creation of the factor and the accession of the factor

Let’s look at the message passed to BB, f(B)f(B). This is a table of kk elements, but each element takes kk time (summed across AA) to make. Therefore, to construct the f(B)f(B), we need O(k2)O(k^2). But here is where the dynamic programming comes in. We cache this f(B)f(B) such that all future uses will be O(1).O(1).  When we create f(C)f(C), we need to marginalize f(B)ϕ(B,C)f(B)\phi(B, C). However, this f(B)f(B) is not the bottleneck. Rather, it’s the ϕ\phi that causes the next f(C)f(C) to also be constructed in O(k2)O(k^2) time. Memoization breaks the recursion early.

As such, these chains also obey the clique rule. The cliques are of size 2, so they must run in O(k2)O(k^2) time each.

Multiple ways of interpreting the same problem.

Special shapes

Square grids

To marginalize out a square grid of size nn, you start in the corner and move your way down an edge. However, you will realize that at the end, you end up with the next edge being an nn-clique (see this for yourself). Now, as you start plucking away this edge, you will see that this nn-clique stays the same (remove one, add one). Therefore, the complexity is O(n2kn)O(n^2 k^n), as you have to do the removing operation n2n^2 times with knk^n complexity each time.

Why every time? Well, you imagine making the new factor. This requires you to fill a table of size knk^n