# Expectation


# Proof tips 🔨

- Remember that expectation is an integral (or a sum). You can’t swap a non-linear function with an expectation, although Jensen’s inequality can give you a one-sided bound.

- If $X, Y$ are independent, then $E[XY] = E[X]E[Y]$.

- Expectation is linear. This also holds for a constant matrix $A$ and a random vector $V$: $AE[V] = E[AV]$. It can be easy to miss!

- You can introduce an expectation out of nothing at all: $f(x) = E_{y \sim p(y)}[f(x)]$, since $f(x)$ is constant in $y$. Look for these opportunities to turn functions into expectations.

- You can write marginalization as an expectation: $\sum_z p(x, z) = \sum_z p(z)p(x \mid z) = E_{z\sim p(z)}[p(x \mid z)]$

- Expectation is linear, like gradient and integral

- Log-probabilities turn products of factors into sums: $\log \prod_i p_i = \sum_i \log p_i$

- You can pull an expectation out of a summation whenever a distribution appears inside it. You can also use a surrogate distribution (importance sampling): $\sum_y p(x, y) = \sum_y q(y)\, p(x,y)/q(y) = E_{y \sim q}[p(x,y)/q(y)]$.

- Tower property: $E_{p(x,y)}[f(x, y)] = E_{x\sim p(x)} E_{y\sim p(y|x)}[f(x,y)]$. More formally, it’s actually $E_{p(x,y)}[f(x, y)] = E_{x\sim p(x)}[ E_{y\sim p(y|x)}[f(x,y) | x]]$, but in most cases this is implied.
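The surrogate-distribution trick above is easy to check numerically. A minimal NumPy sketch; the joint `p_xy` and surrogate `q_y` are made-up toy values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up joint p(x, y) over a small discrete grid (rows: x, cols: y).
p_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])

# Direct marginal: p(x) = sum_y p(x, y)
p_x_direct = p_xy.sum(axis=1)

# Arbitrary surrogate distribution q(y) with full support.
q_y = np.array([0.7, 0.3])

# Rewrite the sum as an expectation under q: E_q[p(x, y) / q(y)],
# estimated by Monte Carlo with y ~ q.
ys = rng.choice(2, size=200_000, p=q_y)
p_x_mc = np.array([np.mean(p_xy[x, ys] / q_y[ys]) for x in range(2)])

print(p_x_direct)  # exact marginal
print(p_x_mc)      # Monte Carlo estimate, close to exact
```

Any $q$ with full support works; the choice of $q$ only affects the variance of the estimate, which is the core idea behind importance sampling.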

# Expectations

An expectation is defined as $E[X] = \sum_x x\, P(X = x)$ in the discrete case, or $E[X] = \int x\, p(x)\, dx$ in the continuous case.

These are some important properties:

- $E[a] = a$

- $E[af(X) + g(X)] = aE[f(X)] + E[g(X)]$

- $E[1\{X = k\}] = P(X = k)$ (only for discrete)
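These properties can be checked numerically on a made-up discrete pmf; a quick NumPy sketch (all values arbitrary):

```python
import numpy as np

# A discrete random variable X over {0, 1, 2} with a made-up pmf.
support = np.array([0, 1, 2])
pmf = np.array([0.2, 0.5, 0.3])

E_X  = np.sum(support * pmf)      # E[X]
E_X2 = np.sum(support**2 * pmf)   # E[X^2]

# Linearity: E[a*f(X) + g(X)] = a*E[f(X)] + E[g(X)], with f(x)=x, g(x)=x^2
a = 3.0
lhs = np.sum((a * support + support**2) * pmf)   # E[aX + X^2] directly
rhs = a * E_X + E_X2
print(np.isclose(lhs, rhs))  # True

# Indicator property: E[1{X = k}] = P(X = k)
k = 1
indicator_expectation = np.sum((support == k) * pmf)
print(np.isclose(indicator_expectation, pmf[k]))  # True
```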

## Linearity of expectation with vectors

We know that expectation is linear. How does this translate to random vectors? Well, $E[X] = [E[X_1], E[X_2], \ldots]$. In other words, expectation applies element-wise. The same goes for $E[XX^T]$, which is a matrix.

As such, because the trace is also a linear expression, $E[tr(X)] = tr(E[X])$. This is easily proven using the summation definition of a trace.

What's a little bit more counterintuitive is that $E[AX] = AE[X]$ for a constant matrix $A$. It makes sense for scalars, but at first glance it sounds weird for matrices. Actually it's less weird than it looks. You can think of $E[(AX)_i] = E[a_i^T X] = \sum_j E[A_{ij}X_j] = \sum_j A_{ij} E[X_j] = a_i^T E[X]$, and since every element of the vector satisfies this, it is true that $E[AX] = AE[X]$.
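Both facts are easy to sanity-check empirically. A sketch with an arbitrary matrix $A$ and Gaussian samples of $X$ (all values made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# 100k samples of a 3-dimensional random vector X, and a fixed matrix A.
X = rng.normal(loc=[1.0, -2.0, 0.5], scale=1.0, size=(100_000, 3))
A = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 3.0]])

# Empirical E[AX] vs A @ (empirical E[X]): identical up to float round-off,
# since averaging commutes with the linear map A.
E_AX = (X @ A.T).mean(axis=0)
A_EX = A @ X.mean(axis=0)
print(np.allclose(E_AX, A_EX))  # True

# Trace commutes with expectation: tr(E[X X^T]) == E[tr(X X^T)]
XXt = np.einsum('ni,nj->nij', X, X)   # per-sample outer products
print(np.isclose(np.trace(XXt.mean(axis=0)),
                 np.trace(XXt, axis1=1, axis2=2).mean()))  # True
```

Note that both identities hold exactly for the empirical averages, not just in the limit, because each one is a purely linear rearrangement of the same finite sum.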

## Conditional Expectation

A conditional expectation is defined as follows: $E[X \mid Y = y] = \sum_x x\, P(X = x \mid Y = y)$.

So it's equivalent to $\sum_x x\, \frac{P(X = x, Y = y)}{P(Y = y)}$.

# Expectation Laws

## Law of conditional expectations (TOWER PROPERTY) 🔨

Note that the conditional expectation $E[X \mid Y]$ (we aren’t specifying a value of $Y$) is actually a random variable itself, a function of $Y$. This leads us to the `law of total expectation`, which states that $E[E[X \mid Y]] = E[X]$.

This is sort of the expectation equivalent of marginalization. So you might expect to see this come up in latent variable models.

This is also helpful if you want to take the expectation over something like $P(X | Y)$, because it doesn’t make much sense to sample from $x, y \sim P(X | Y)$

## Proof

$E[E[X | Y]] = \sum_y P(Y = y) E[X | Y = y]$

And then we expand the definition of $E[X | Y = y]$ as

$E[E[X | Y]] = \sum_y P(y) \sum_x P(x | Y = y) x$

And we can rearrange the summations to get

$E[E[X | Y]] = \sum_x x \sum_y P(y) P(x | Y = y) = \sum_x x\, P(x) = E[X]$

Pretty neat trick!

Of course, this rule works for any sort of expectation, including vector and matrix expectations. So $E[E[xx^T | z]] = E[xx^T]$, for example.
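The proof can be checked numerically on a small made-up discrete joint $P(X, Y)$:

```python
import numpy as np

# Made-up joint P(X, Y): rows index x values, columns index y values.
xs = np.array([0.0, 1.0, 2.0])
P = np.array([[0.10, 0.15],
              [0.20, 0.25],
              [0.10, 0.20]])   # P[x, y], sums to 1

# Direct expectation E[X] from the marginal P(x).
P_x = P.sum(axis=1)
E_X = np.sum(xs * P_x)

# Tower property: E[E[X | Y]] = sum_y P(y) * E[X | Y=y]
P_y = P.sum(axis=0)
E_X_given_y = (xs @ P) / P_y        # E[X | Y=y] for each y
E_tower = np.sum(P_y * E_X_given_y)

print(np.isclose(E_X, E_tower))  # True
```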

## Application: expanding sub-distributions 🧸

This can also be applied in reverse, and cleverly, to expand a certain distribution. For example, $E_{\tau \sim P(\tau)}[V(s_{t+1})] = E_{\tau \sim P(\tau)}E_{s_{t+1}\sim p(s_{t+1} | s_t, a_t)}[V(s_{t+1}) | s_t, a_t]$ because $s_t, a_t$ are included in $\tau$.

In general, you can do this for any $p(X)$ and $x$, where $x \subset X$.

## Expectation of products

For any independent random variables $X, Y$, the following is true: $E[XY] = E[X]E[Y]$.

However, it is not true in general that $E[X^2] = E[X]^2$, because $X$ is not independent of itself; the gap is exactly $\mathrm{Var}(X) = E[X^2] - E[X]^2$.
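A quick Monte Carlo check of both claims, with an independent Gaussian/uniform pair (made-up parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Independent X and Y: E[XY] = E[X] E[Y] holds (up to sampling noise).
X = rng.normal(2.0, 1.0, size=n)     # E[X] = 2
Y = rng.uniform(0.0, 1.0, size=n)    # E[Y] = 0.5
print(np.mean(X * Y), np.mean(X) * np.mean(Y))  # both near 1.0

# X is not independent of itself: E[X^2] != E[X]^2 in general.
# The gap is exactly Var(X), here 1.0.
print(np.mean(X**2) - np.mean(X)**2)  # near 1.0
```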

# Estimators

An `unbiased estimator` $\hat{A}$ of some parameter $A$ satisfies $E[\hat{A}] = A$ (and there's some convergence stuff surrounding this assumption). Just because $\hat{A}$ is an unbiased estimator of $A$ doesn’t mean that $f(\hat{A})$ will be an unbiased estimator of $f(A)$. It’s true for linear functions but not generally true for all functions.
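A concrete illustration: the sample mean is unbiased for $\mu$, but its square is a biased estimator of $\mu^2$, overshooting by $\mathrm{Var}(\bar{x}) = \sigma^2/n$. A sketch with made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up parameters: true mean mu, std sigma, samples per estimate n.
mu, sigma, n, trials = 2.0, 3.0, 10, 200_000

# Many independent trials of the sample-mean estimator.
samples = rng.normal(mu, sigma, size=(trials, n))
xbar = samples.mean(axis=1)

print(xbar.mean())       # near mu = 2.0 -> unbiased for mu
print((xbar**2).mean())  # near mu^2 + sigma^2/n = 4.9, NOT mu^2 = 4.0
```

The bias term $\sigma^2/n$ vanishes as $n \to \infty$, which is why $\bar{x}^2$ is still a *consistent* estimator of $\mu^2$ despite being biased.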

# Expectation over multiple variables

When you have something like $E_{x, y \sim p(x, y)}[f(x)]$,

this is actually equivalent to taking the expectation over the marginal: $E_{x \sim p(x)}[f(x)]$.

This shouldn’t be news to you, but it’s a very useful trick. Intuitively, the function on the inside ignores everything else, so you’re practically marginalizing those variables away.
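A quick check on a made-up discrete joint: when $f$ depends only on $x$, its expectation under $p(x, y)$ matches its expectation under the marginal $p(x)$:

```python
import numpy as np

# Made-up joint p(x, y) over a discrete grid; f depends only on x.
xs = np.array([0.0, 1.0, 2.0])
p_xy = np.array([[0.05, 0.15],
                 [0.30, 0.10],
                 [0.25, 0.15]])   # p[x, y], sums to 1

f = xs**2  # f(x) = x^2

# Expectation under the full joint ...
E_joint = np.sum(p_xy * f[:, None])

# ... equals the expectation under the marginal p(x).
p_x = p_xy.sum(axis=1)
E_marginal = np.sum(p_x * f)

print(np.isclose(E_joint, E_marginal))  # True
```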