Sampling vs Likelihood

Tags: CS 228

What can we do?

If we assume a densely connected Bayes network where we know each CPD, there are certain things we can compute tractably and others that we can't. Let's explore that.

Let's assume that the graph is separated out into X and Y. We observe Y and try a few things.

p(x, y)

We can calculate it because the graph factorizes into conditional probability distributions that we know. Therefore, the probability value can be computed in linear time.
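As a sketch of this factorization, consider a tiny chain-structured network A → B → C with made-up CPDs (all names and numbers here are illustrative, not from the notes). Evaluating the joint is one CPD lookup per variable, hence linear time:

```python
import itertools

# Hypothetical CPDs for a chain A -> B -> C over binary variables.
p_A = {0: 0.6, 1: 0.4}                       # p(A)
p_B_given_A = {0: {0: 0.7, 1: 0.3},          # p(B | A)
               1: {0: 0.2, 1: 0.8}}
p_C_given_B = {0: {0: 0.9, 1: 0.1},          # p(C | B)
               1: {0: 0.5, 1: 0.5}}

def joint(a, b, c):
    """p(a, b, c) = p(a) * p(b | a) * p(c | b):
    one CPD lookup per variable, so linear in the number of variables."""
    return p_A[a] * p_B_given_A[a][b] * p_C_given_B[b][c]

# Sanity check: the joint sums to 1 over all 2^3 assignments.
total = sum(joint(a, b, c)
            for a, b, c in itertools.product([0, 1], repeat=3))
```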

If we are asked to sample from the whole joint distribution, this is also feasible using a topological sort and propagating the signal down the graph.
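The "topological sort and propagate" idea is ancestral (forward) sampling: visit variables in topological order and sample each one conditioned on its already-sampled parents. A minimal sketch for the same kind of chain A → B → C, with hypothetical CPDs:

```python
import random

# Hypothetical CPDs for a chain A -> B -> C over binary variables.
p_A = {0: 0.6, 1: 0.4}
p_B_given_A = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_C_given_B = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}

def sample_binary(dist, rng):
    """Draw 0 or 1 from a distribution given as {0: p0, 1: p1}."""
    return 0 if rng.random() < dist[0] else 1

def ancestral_sample(rng):
    """Sample (a, b, c) in topological order: each variable only
    needs its parents, which are already sampled."""
    a = sample_binary(p_A, rng)                # A has no parents
    b = sample_binary(p_B_given_A[a], rng)     # B depends on A
    c = sample_binary(p_C_given_B[b], rng)     # C depends on B
    return a, b, c

rng = random.Random(0)
samples = [ancestral_sample(rng) for _ in range(10000)]
frac_a0 = sum(1 for s in samples if s[0] == 0) / len(samples)  # should be near p(A=0) = 0.6
```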

However, we can't sample from it if we have already observed y. If we tried to do x \sim p(X, y), this would be the same as sampling from p(X | y), because p(X, y) = p(X | y) p(y) and the p(y) doesn't impact the x. Let's look at the next section to see what the conditional probability is like.

p(X = x | y)

We can't calculate this easily because it involves p(X = x | y) = p(X = x, y) / p(y), and p(y) can't be derived in tractable time, as it requires summing across all values of X.
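To make the bottleneck concrete: with n binary hidden variables, computing p(y) means summing the joint over all 2^n assignments of X. A brute-force sketch (the toy joint table and all names are hypothetical), feasible only for tiny n:

```python
import itertools

def marginal_y(joint, n_hidden, y):
    """p(y) = sum over all x of p(x, y): the sum has 2^n_hidden terms,
    which is why this is intractable for large networks."""
    return sum(joint(x, y)
               for x in itertools.product([0, 1], repeat=n_hidden))

def conditional(joint, n_hidden, x, y):
    """p(X = x | y) = p(x, y) / p(y)."""
    return joint(x, y) / marginal_y(joint, n_hidden, y)

# Toy joint over two hidden bits x and one observed bit y (made-up numbers
# that sum to 1):
table = {((0, 0), 0): 0.10, ((0, 1), 0): 0.20, ((1, 0), 0): 0.15,
         ((1, 1), 0): 0.05, ((0, 0), 1): 0.10, ((0, 1), 1): 0.10,
         ((1, 0), 1): 0.20, ((1, 1), 1): 0.10}
joint = lambda x, y: table[(x, y)]

p_y0 = marginal_y(joint, 2, 0)  # 0.10 + 0.20 + 0.15 + 0.05 = 0.5
```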

We can't sample easily from this either, because when we observe variables, there exist V-structures that tie previously independent variables together. Therefore, we can't simply put our thumb on y in the graph and sample the rest; we would have to recompute many CPDs. The general rule of thumb is that if you can't calculate the probability, then you can't sample.

There is one particular case in which both of these are possible. If the Bayes net is a tree (i.e., each node has at most one parent), then there can't exist any V-structures. As such, we can split the tree into multiple independent subtrees (separated by the observed variables). You can sample from these subtrees easily, and you can also easily compute likelihoods.

Monte Carlo expectation

There are a few ways of computing an expectation, and it might be confusing at first.

We have four equivalent (or approximately equivalent) forms of the same statement:

  1. \sum_x p(x)f(x)
  2. E_{x\sim p(x)}[f(x)]
  3. \frac{1}{M}\sum_{m}f(x[m]), where the x[m] are sampled from p
  4. \frac{1}{M}\sum_x M[x]f(x), where the samples come from p and M[x] counts how many samples take the value x

The first and second statements are equivalent by definition.

The second and third are equivalent by the Monte Carlo approximation. The expectation is a weighted sum of values across a distribution; that's kind of like an outsider's perspective. The Monte Carlo estimate assumes that you have samples drawn directly from the distribution; that's like an insider's perspective. From there, you can just take the average, because the weighting is implicit: it comes from the sampling.

The third and fourth are equivalent because they just express different summations. The third sums over data points, while the fourth sums over values. You can imagine grouping the data points by their value, which gets you the fourth equation.
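The four forms can be checked numerically on a small discrete distribution (the distribution and f below are hypothetical). Forms 1 and 2 are the same exact sum; forms 3 and 4 are computed from the same samples, so they agree exactly with each other and approximately with the exact value:

```python
import random
from collections import Counter

p = {0: 0.2, 1: 0.5, 2: 0.3}   # a small made-up discrete distribution
f = lambda x: x * x             # an arbitrary function of x

# Forms 1/2: the exact expectation, a p-weighted sum over values.
exact = sum(p[x] * f(x) for x in p)  # 0.2*0 + 0.5*1 + 0.3*4 = 1.7

rng = random.Random(0)
vals, weights = list(p), [p[x] for x in p]
samples = rng.choices(vals, weights=weights, k=100000)
M = len(samples)

# Form 3: plain average of f over the samples; the p-weighting is
# implicit because the samples were drawn from p.
mc = sum(f(x) for x in samples) / M

# Form 4: group identical samples and weight f(x) by the count M[x].
counts = Counter(samples)
grouped = sum(counts[x] * f(x) for x in counts) / M
```

Forms 3 and 4 are literally the same sum rearranged, so `mc` and `grouped` match to floating-point precision, while both are only close to `exact`, with error shrinking as M grows.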