Tricks with marginals and complexity

Tags: CS 228, Inference

This topic is a little confusing to me, so I thought I would dedicate some writing to it.

Marginalizing only part of a set

Let’s say that you want $P(X)$, which means that you need to marginalize out all of the Y’s, $Y_1, \dots, Y_n$. When doing this, don’t worry about where the X’s end up; only worry about where the intermediate Y’s end up. Here’s why: you will eventually remove every one of $Y_1, \dots, Y_n$ from the distribution, so any large clique you create among them will have to be dealt with eventually.

This idea is nice and intuitive: when marginalizing, the largest clique you form determines the complexity, and in fact that’s all you need to know. If each variable takes $k$ values, a clique of size $n$ takes $O(k^n)$ time to marginalize.
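To make the "largest clique" rule concrete, here is a minimal sketch that simulates elimination on a graph and reports the largest clique formed along the way. The function name `largest_clique`, the star-shaped example graph, and the variable names `Y0` to `Y4` are assumptions made for illustration; they do not come from the course material. The clique size counted here includes the variable being eliminated.

```python
from itertools import combinations

def largest_clique(edges, order):
    """Eliminate variables in `order`; return the largest clique size formed.

    Eliminating a variable forms a clique out of that variable and all of
    its current neighbours, then connects those neighbours to each other.
    """
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)

    largest = 0
    for var in order:
        scope = nbrs.pop(var)                 # neighbours at elimination time
        largest = max(largest, len(scope) + 1)
        for u in scope:
            nbrs[u].discard(var)
        for u, v in combinations(scope, 2):   # fill-in edges
            nbrs[u].add(v)
            nbrs[v].add(u)
    return largest

# Hypothetical star graph: hub Y0 connected to leaves Y1..Y4.
edges = [("Y0", f"Y{i}") for i in range(1, 5)]
leaves_first = ["Y1", "Y2", "Y3", "Y4", "Y0"]
hub_first = ["Y0", "Y1", "Y2", "Y3", "Y4"]

print(largest_clique(edges, leaves_first))  # 2
print(largest_clique(edges, hub_first))     # 5
```

Same graph, same variables removed, but removing the hub first ties all four leaves together into one factor, so the cost jumps from about $k^2$ to about $k^5$.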

But let’s try to understand this at the point of maximum confusion.

Cliques vs chains

Cliques

Cliques of size $n$ are represented by factors over $n$ variables. Summing a variable out of such a factor takes $k^n$ time.

When you want to marginalize $\phi(a, b, c)$, where each variable takes $k$ values, you need $k^3$ operations. You can think of $\phi(a, b, c)$ as a table that needs $k^3$ entries to hold all of its data. It’s helpful to think that every time you make a clique, you must write out the table for it. (This intuition prevents you from backpropping too far.)
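The table picture can be checked directly. This is a small sketch assuming NumPy and a randomly filled factor; the value `k = 4` and the names `phi`, `tau`, and `ops` are illustrative choices, not anything fixed by the notes.

```python
import numpy as np

k = 4  # hypothetical number of values per variable
rng = np.random.default_rng(0)

# phi(a, b, c) stored as a dense table with k**3 entries.
phi = rng.random((k, k, k))

# Summing out c reads every entry once and leaves tau(a, b) with k**2 entries.
tau = phi.sum(axis=2)

# The same sum as explicit loops, to make the k**3 operation count visible.
tau_loop = np.zeros((k, k))
ops = 0
for a in range(k):
    for b in range(k):
        for c in range(k):
            tau_loop[a, b] += phi[a, b, c]
            ops += 1

print(phi.size, ops, tau.shape)
```

Both the table size and the loop counter come out to $k^3$, which is the sense in which "writing out the table" and "marginalizing the clique" cost the same.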

Chains and dynamic programming

The key to understanding chains and dynamic programming is to separate the creation of a factor from the later access to it.

Consider a chain over $A$, $B$, $C$, and look at the message passed to $B$, $f(B) = \sum_A \phi(A, B)$. This is a table of $k$ elements, but each element takes $k$ time (a sum across $A$) to make. Therefore, constructing $f(B)$ takes $O(k^2)$ time.

Here is where the dynamic programming comes in. We cache $f(B)$ so that every future lookup is $O(1)$. When we create $f(C) = \sum_B f(B)\phi(B, C)$, the cached $f(B)$ is not the bottleneck. Rather, it’s the $\phi(B, C)$ table that causes $f(C)$ to also be constructed in $O(k^2)$ time. Memoization breaks the recursion early.

As such, these chains also obey the clique rule: the cliques are of size 2, so each message takes $O(k^2)$ time.
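A short sketch of the chain case, assuming NumPy and a three-variable chain over $A$, $B$, $C$ with random pairwise factors. The names `phi_ab`, `phi_bc`, `f_b`, and `f_c` are illustrative. It compares the cached-message route against brute force over the full joint table.

```python
import numpy as np

k = 3  # hypothetical number of values per variable
rng = np.random.default_rng(0)
phi_ab = rng.random((k, k))  # phi(A, B)
phi_bc = rng.random((k, k))  # phi(B, C)

# f(B) = sum_A phi(A, B): k entries, each a sum over k values of A.
f_b = phi_ab.sum(axis=0)

# f(C) = sum_B f(B) * phi(B, C): f_b is only looked up here;
# the k-by-k table phi_bc is what makes this step cost k**2.
f_c = f_b @ phi_bc

# Brute force: build the full joint table (k**3 entries), then sum out A and B.
joint = np.einsum("ab,bc->abc", phi_ab, phi_bc)
brute = joint.sum(axis=(0, 1))

print(np.allclose(f_c, brute))  # True
```

Each message touches one $k \times k$ table, so a longer chain would add more $O(k^2)$ steps rather than growing the exponent, while the brute-force joint table gains a factor of $k$ per extra variable.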

Multiple ways of interpreting the same problem.

Special shapes

Square grids

To marginalize out an $n \times n$ square grid, you start in a corner and work your way down an edge. By the time you reach the end of that edge, the next edge over has become an $n$-clique (see this for yourself). As you start plucking variables off this new edge, the $n$-clique stays the same size (remove one, add one). Therefore, the complexity is $O(n^2 k^n)$: you do the removing operation $n^2$ times, at roughly $k^n$ cost each time.

Why every time? Well, imagine making the new factor: this requires you to fill a table of size $k^n$.
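One way to "see this for yourself" is to simulate the elimination. The sketch below assumes a column-by-column order starting from a corner and records, for each removal, how many variables the newly created factor covers (so its table has $k$ to that power entries). The function name `new_factor_sizes` is made up for this illustration.

```python
from itertools import combinations

def new_factor_sizes(n):
    """Eliminate an n x n grid column by column.

    Returns, for each of the n*n removals, the number of variables in the
    factor created by that removal (the removed variable is not counted).
    """
    nbrs = {(r, c): set() for r in range(n) for c in range(n)}
    for r, c in list(nbrs):
        for dr, dc in ((1, 0), (0, 1)):       # link to the cell below and to the right
            other = (r + dr, c + dc)
            if other in nbrs:
                nbrs[(r, c)].add(other)
                nbrs[other].add((r, c))

    sizes = []
    for c in range(n):
        for r in range(n):
            scope = nbrs.pop((r, c))          # neighbours at removal time
            sizes.append(len(scope))
            for u in scope:
                nbrs[u].discard((r, c))
            for u, v in combinations(scope, 2):  # fill-in edges
                nbrs[u].add(v)
                nbrs[v].add(u)
    return sizes

sizes = new_factor_sizes(4)
print(len(sizes), max(sizes))  # 16 4
```

Under this order the new factor covers at most $n$ variables, matching the table of size $k^n$ above. Note that summing over the removed variable itself multiplies the work per step by another factor of $k$, which the $O(n^2 k^n)$ estimate glosses over.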