Inference

Marginalization

You can expand a distribution by marginalizing:

$$p(y) = \int p(y|x)\, p(x)\, dx$$
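In the discrete case this is just the law of total probability, with the intermediate joint written out explicitly:

$$p(y) = \sum_x p(y, x) = \sum_x p(y|x)\, p(x)$$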

Sum is always 1 🔨

Remember that $\int p(x)\, dx = 1$. Taking the derivative of $\int p(x)\, dx$ WRT any variable (say, an internal parameter) is always $0$.
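A one-line worked version of that claim, assuming the usual regularity conditions that let you swap the derivative and the integral (here $\theta$ is a stand-in for whatever internal parameter you differentiate with respect to):

$$\frac{\partial}{\partial \theta} \int p(x; \theta)\, dx = \frac{\partial}{\partial \theta}\, 1 = 0, \qquad \text{and therefore} \qquad \int \frac{\partial p(x; \theta)}{\partial \theta}\, dx = 0$$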

The art of marginalization 🔨

When you need to find something like $p(a | b, c)$, your first thought should go to Bayes rule, because it allows you to express it as

$$p(a | b, c) = \frac{p(a, b, c)}{p(b, c)}$$

And these unconditional probabilities can themselves be expressed as marginalizations of the full joint, like

$$p(a | b, c) = \frac{\sum_{d, e} p(a, b, c, d, e)}{\sum_{a, d, e} p(a, b, c, d, e)}$$
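As a minimal numeric sketch of that formula (a toy setup of my own, assuming a discrete joint stored as a numpy array with axes in the order $(a, b, c, d, e)$), the numerator and denominator are just `sum` calls over the right axes:

```python
import numpy as np

# Toy joint p(a, b, c, d, e) over binary variables, axes in the order (a, b, c, d, e).
rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2, 2, 2))
joint /= joint.sum()  # normalize so the whole table sums to 1

def conditional_a_given_bc(joint, b, c):
    """p(a | b, c) as a ratio of two marginalizations of the joint."""
    numerator = joint[:, b, c, :, :].sum(axis=(1, 2))  # sum out d, e: vector over a
    denominator = joint[:, b, c, :, :].sum()           # additionally sum out a: scalar
    return numerator / denominator

p_a = conditional_a_given_bc(joint, b=1, c=0)
print(p_a, p_a.sum())  # a valid distribution over a: it sums to 1
```

The factorizations mentioned next are exactly what lets you avoid materializing the full table like this.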

And the cool part is that you typically factorize the joint probability using a Bayesian network definition, and this allows some nice simplifications: factors that don't involve the summed-over variables pull out of the sums, and things typically collapse.

IMPORTANT: marginalization only occurs on the LEFT HAND SIDE of a conditional. To get rid of a variable on the RIGHT HAND SIDE of a conditional, you must perform a weighted sum with the probability of that thing happening (essentially using the chain rule to move it over first).
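Side by side, for a variable $b$ you want gone:

$$p(a | c) = \sum_b p(a, b | c) \qquad \text{vs.} \qquad p(a) = \sum_b p(a | b)\, p(b)$$

The first sums $b$ out of the left of the conditional; the second removes $b$ from the right by weighting each case with $p(b)$.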

Marginalize, unmarginalize 🔨

You can move things around by marginalizing. For example

$$p(y | x) = p(y | t=1, x)\, p(t=1 | x) + p(y | t=0, x)\, p(t=0 | x)$$
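Written out, this is marginalizing $t$ on the left and then using the chain rule to move it to the right:

$$p(y | x) = \sum_t p(y, t | x) = \sum_t p(y | t, x)\, p(t | x)$$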

And now that you have expanded it, you might have more options.

You can marginalize distributions, but you CAN'T marginalize something like $H(X, Y)$ down to $H(X)$. This is because the entropy already takes an expectation over the joint distribution, and the entropy operator is not linear.
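For reference, the relation that does hold between joint and single-variable entropies is the chain rule for entropy, not a marginalization:

$$H(X, Y) = H(X) + H(Y | X)$$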

Marginal vs Conditional

Marginal is squashing, while conditional is slicing.

They both reduce dimensionality, but it's important to note that they are very, very different from each other. Marginalization means "regardless of this event happening, what's the distribution of another event?". Conditioning means "given that this event happens in this configuration, what's the distribution of another event?"
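A tiny numeric illustration of squashing vs slicing (a toy 2D joint of my own, rows indexed by $x$ and columns by $y$):

```python
import numpy as np

# Toy joint p(x, y) on a 3x4 grid: rows indexed by x, columns by y. Sums to 1.
joint = np.array([[0.10, 0.05, 0.05, 0.05],
                  [0.05, 0.20, 0.10, 0.05],
                  [0.05, 0.10, 0.10, 0.10]])

# Marginal: "squash" the y axis away, p(x) = sum_y p(x, y).
p_x = joint.sum(axis=1)

# Conditional: "slice" at a particular y and renormalize that slice, p(x | y=2).
y = 2
p_x_given_y = joint[:, y] / joint[:, y].sum()

print(p_x, p_x.sum())                  # distribution over x, sums to 1
print(p_x_given_y, p_x_given_y.sum())  # also a distribution over x, sums to 1
```

Note that the slice has to be renormalized, because a single column of the joint does not sum to 1 on its own.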

Bayes rule

Bayes rule is derived from the chain rule / intersection discussion:

$$P(\alpha | \beta) = \frac{P(\beta | \alpha)\, P(\alpha)}{P(\beta)}$$

which you can generalize to

$$P(\alpha | \beta, \gamma) = \frac{P(\beta | \alpha, \gamma)\, P(\alpha | \gamma)}{P(\beta | \gamma)}$$

where $\gamma$ can be anything
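Both versions fall out of writing the joint two ways with the chain rule and dividing; for the generalized form, $\gamma$ just rides along in the conditioning:

$$P(\alpha, \beta | \gamma) = P(\alpha | \beta, \gamma)\, P(\beta | \gamma) = P(\beta | \alpha, \gamma)\, P(\alpha | \gamma) \quad\Longrightarrow\quad P(\alpha | \beta, \gamma) = \frac{P(\beta | \alpha, \gamma)\, P(\alpha | \gamma)}{P(\beta | \gamma)}$$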

Now, in PGM applications, the denominator or other parts of the equation may be intractable to calculate. This is where we need to do some approximations.

The quick trick with Bayes

When you have something like $P(C | A, B)$ and you need to flip the $A$ and the $C$, think automatically that you need the joint distribution $P(C, A | B)$.

Conversely, when you have a joint distribution $P(C, A | B)$ and you divide it by something on the left side of the conditional, like $P(C | B)$, you can imagine this "bumping" the $C$ to the right, getting $P(A | C, B)$
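Spelled out, the two moves are the same identity read in opposite directions, with everything conditioned on $B$ throughout:

$$P(C, A | B) = P(C | A, B)\, P(A | B) \qquad\text{and}\qquad P(A | C, B) = \frac{P(C, A | B)}{P(C | B)}$$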