Sampling vs Likelihood

Tags: CS 228

What can we do?

If we assume a densely connected Bayes network where we know each CPD, there are certain things we can compute tractably and others that we can't. Let's explore that.

Let's assume that the graph is separated out into X and Y. We observe Y and try a few things.

p(x, y)

We can calculate it because the graph factorizes into conditional probability distributions that we know. Therefore, the probability value can be computed in linear time.
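As a sketch of this factorization, consider a tiny chain-structured network A → B → C with made-up CPDs (all names and numbers here are illustrative, not from the notes). Evaluating the joint is one CPD lookup per variable, hence linear time:

```python
import itertools

# Hypothetical CPDs for a chain A -> B -> C over binary variables.
p_A = {0: 0.6, 1: 0.4}                       # p(A)
p_B_given_A = {0: {0: 0.7, 1: 0.3},          # p(B | A)
               1: {0: 0.2, 1: 0.8}}
p_C_given_B = {0: {0: 0.9, 1: 0.1},          # p(C | B)
               1: {0: 0.5, 1: 0.5}}

def joint(a, b, c):
    """p(a, b, c) = p(a) * p(b | a) * p(c | b):
    one CPD lookup per variable, so linear in the number of variables."""
    return p_A[a] * p_B_given_A[a][b] * p_C_given_B[b][c]

# Sanity check: the joint sums to 1 over all 2^3 assignments.
total = sum(joint(a, b, c)
            for a, b, c in itertools.product([0, 1], repeat=3))
```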

If we are asked to sample from the whole joint distribution, this is also feasible using a topological sort and propagating the signal down the graph.
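The "topological sort and propagate" idea is ancestral (forward) sampling: visit variables in topological order and sample each one conditioned on its already-sampled parents. A minimal sketch for the same kind of chain A → B → C, with hypothetical CPDs:

```python
import random

# Hypothetical CPDs for a chain A -> B -> C over binary variables.
p_A = {0: 0.6, 1: 0.4}
p_B_given_A = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_C_given_B = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}

def sample_binary(dist, rng):
    """Draw 0 or 1 from a distribution given as {0: p0, 1: p1}."""
    return 0 if rng.random() < dist[0] else 1

def ancestral_sample(rng):
    """Sample (a, b, c) in topological order: each variable only
    needs its parents, which are already sampled."""
    a = sample_binary(p_A, rng)                # A has no parents
    b = sample_binary(p_B_given_A[a], rng)     # B depends on A
    c = sample_binary(p_C_given_B[b], rng)     # C depends on B
    return a, b, c

rng = random.Random(0)
samples = [ancestral_sample(rng) for _ in range(10000)]
frac_a0 = sum(1 for s in samples if s[0] == 0) / len(samples)  # should be near p(A=0) = 0.6
```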

However, we can't sample from it if we have already observed y. If we tried to do x \sim p(X, y), this would be the same as sampling from p(X | y), because p(X, y) = p(X | y) p(y) and the p(y) doesn't impact the x. Let's look at the next section to see what the conditional probability is like.

p(X = x | y)

We can't calculate this easily because it involves p(X = x | y) = p(X = x, y) / p(y), and p(y) can't be derived in tractable time, as it requires summing across all values of X.
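To make the bottleneck concrete: with n binary hidden variables, computing p(y) means summing the joint over all 2^n assignments of X. A brute-force sketch (the toy joint table and all names are hypothetical), feasible only for tiny n:

```python
import itertools

def marginal_y(joint, n_hidden, y):
    """p(y) = sum over all x of p(x, y): the sum has 2^n_hidden terms,
    which is why this is intractable for large networks."""
    return sum(joint(x, y)
               for x in itertools.product([0, 1], repeat=n_hidden))

def conditional(joint, n_hidden, x, y):
    """p(X = x | y) = p(x, y) / p(y)."""
    return joint(x, y) / marginal_y(joint, n_hidden, y)

# Toy joint over two hidden bits x and one observed bit y (made-up numbers
# that sum to 1):
table = {((0, 0), 0): 0.10, ((0, 1), 0): 0.20, ((1, 0), 0): 0.15,
         ((1, 1), 0): 0.05, ((0, 0), 1): 0.10, ((0, 1), 1): 0.10,
         ((1, 0), 1): 0.20, ((1, 1), 1): 0.10}
joint = lambda x, y: table[(x, y)]

p_y0 = marginal_y(joint, 2, 0)  # 0.10 + 0.20 + 0.15 + 0.05 = 0.5
```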

We can't sample easily from this either, because when we observe variables, there exist V-structures that tie previously independent variables together. Therefore, we can't simply put our thumb on y in the graph and sample the rest; we would have to recompute many CPDs. The general rule of thumb is that if you can't calculate the probability, then you can't sample.

There is one particular case in which both of these are possible. If the Bayes net is a tree (i.e., each node has at most one parent), then there can't exist any V-structures. As such, we can split the tree into multiple independent subtrees (separated by the observed variables). You can sample from these subtrees easily, and you can also easily compute likelihoods.

Monte Carlo expectation

There are a few ways of computing an expectation, and it might be confusing at first.

We have four equivalent (or approximately equivalent) forms of the same statement:

  1. \sum_x p(x)f(x)
  2. E_{x\sim p(x)}[f(x)]
  3. \frac{1}{M}\sum_{m}f(x[m]), where the x[m] are sampled from p
  4. \frac{1}{M}\sum_x M[x]f(x), where the samples come from p and M[x] counts how many samples take the value x

The first and second statements are equivalent by definition.

The second and third are equivalent by the Monte Carlo approximation. The expectation is a weighted sum of values across a distribution; that's kind of like an outsider's perspective. The Monte Carlo estimate assumes that you have samples drawn directly from the distribution; that's like an insider's perspective. From there, you can just take the average, because the weighting is implicit: it comes from the sampling.

The third and fourth are equivalent because they just express different summations. The third sums over data points, while the fourth sums over values. You can imagine grouping the data points by their value, which gets you the fourth equation.
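The four forms can be checked numerically on a small discrete distribution (the distribution and f below are hypothetical). Forms 1 and 2 are the same exact sum; forms 3 and 4 are computed from the same samples, so they agree exactly with each other and approximately with the exact value:

```python
import random
from collections import Counter

p = {0: 0.2, 1: 0.5, 2: 0.3}   # a small made-up discrete distribution
f = lambda x: x * x             # an arbitrary function of x

# Forms 1/2: the exact expectation, a p-weighted sum over values.
exact = sum(p[x] * f(x) for x in p)  # 0.2*0 + 0.5*1 + 0.3*4 = 1.7

rng = random.Random(0)
vals, weights = list(p), [p[x] for x in p]
samples = rng.choices(vals, weights=weights, k=100000)
M = len(samples)

# Form 3: plain average of f over the samples; the p-weighting is
# implicit because the samples were drawn from p.
mc = sum(f(x) for x in samples) / M

# Form 4: group identical samples and weight f(x) by the count M[x].
counts = Counter(samples)
grouped = sum(counts[x] * f(x) for x in counts) / M
```

Forms 3 and 4 are literally the same sum rearranged, so `mc` and `grouped` match to floating-point precision, while both are only close to `exact`, with error shrinking as M grows.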