Expectation

Tags: Proof tips 🔨

Expectations

An expectation is defined as follows:

E[X] = \sum_x x P(x) = \int x P(x)\, dx

These are some important properties (a quick numerical check follows the list):

  1. E[a] = a
  2. E[af(X) + g(X)] = aE[f(X)] + E[g(X)]
  3. E[1\{X = k\}] = P(X = k) (only for discrete)
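Here is a minimal NumPy sketch of these properties, using a made-up discrete distribution and arbitrary functions `f`, `g` purely for illustration:

```python
import numpy as np

# Made-up discrete distribution over support {0, 1, 2, 3}.
x = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.2, 0.3, 0.4])

def E(values):
    """E[g(X)] = sum_x g(x) P(x) for this discrete X."""
    return np.sum(values * p)

a = 2.5
f = lambda t: t ** 2
g = lambda t: np.sin(t)

# Property 1: E[a] = a (a constant "random variable")
assert np.isclose(E(np.full(x.shape, a)), a)

# Property 2: E[a f(X) + g(X)] = a E[f(X)] + E[g(X)]
assert np.isclose(E(a * f(x) + g(x)), a * E(f(x)) + E(g(x)))

# Property 3: E[1{X = k}] = P(X = k)
k = 2
assert np.isclose(E((x == k).astype(float)), p[x == k].sum())
```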

Linearity of expectation with vectors

We know that expectation is linear. How does this translate to random vectors? Well, E[X] = [E[X_1], E[X_2], ...]. In other words, you can apply it element-wise. The same goes for E[XX^T], which is a matrix.

As such, because the trace is also a linear expression, E[tr(X)] = tr(E[X]). This is easily proven using the summation definition of the trace.
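A quick Monte Carlo sketch of this identity (the random-matrix distribution here is an arbitrary choice just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# X: a random 3x3 matrix, here a fixed base plus Gaussian noise (arbitrary choice).
base = np.arange(9.0).reshape(3, 3)
X = base + rng.normal(size=(100_000, 3, 3))

lhs = np.trace(X, axis1=1, axis2=2).mean()  # Monte Carlo estimate of E[tr(X)]
rhs = np.trace(X.mean(axis=0))              # tr(E[X]) from the averaged matrix
print(lhs, rhs)  # identical up to floating point, since the trace is linear
```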

What's a little bit more counterintuitive is that E[AX] = AE[X] for a constant matrix A. It makes sense for scalars, but at first glance it sounds weird for matrices. Actually it's less weird than it looks: E[AX]_i = E[a_i^T X] = \sum_j E[A_{i,j} X_j] = \sum_j A_{i,j} E[X_j] = a_i^T E[X], and if each element of the vector can be expressed like this, then it is true that E[AX] = AE[X].

A good rule of thumb is that anything that works with scalar multiplication has some sort of analogue in matrix multiplication.
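Here is a minimal sketch of E[AX] = AE[X], again with an arbitrary distribution for X chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

A = rng.normal(size=(2, 4))                 # fixed (non-random) matrix
X = rng.normal(loc=3.0, size=(200_000, 4))  # samples of a random 4-vector

lhs = (X @ A.T).mean(axis=0)  # Monte Carlo estimate of E[AX]; each row of X @ A.T is (A x)^T
rhs = A @ X.mean(axis=0)      # A E[X]
print(np.abs(lhs - rhs).max())  # ~0: the identity holds sample-for-sample by linearity
```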

Conditional Expectation

A conditional expectation is defined as follows:

E[X|y = k] = \int x \cdot p(x | y = k)\, dx

So it's equivalent to

E_{X|Y}[X]
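A tiny discrete analogue (the joint table is made up) that computes E[X | y = k] from a joint distribution:

```python
import numpy as np

# Made-up joint P(X = x, Y = y): rows index x in {0, 1, 2}, columns index y in {0, 1}.
joint = np.array([[0.10, 0.05],
                  [0.20, 0.15],
                  [0.30, 0.20]])
x_vals = np.array([0, 1, 2])

k = 1
p_x_given_y = joint[:, k] / joint[:, k].sum()  # p(x | y = k)
print(np.sum(x_vals * p_x_given_y))            # E[X | y = k]
```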

Expectation Laws

Law of conditional expectations (TOWER PROPERTY) 🔨

We note that the conditional expectation E[X | Y] (note that we aren’t specifying Y) is actually a random variable itself, over Y. This leads us to the law of total expectation, which states that

E[X] = E[E[X | Y]]

This is sort of the expectation equivalent of marginalization. So you might expect to see this come up in latent variable models.

This is also helpful if you want to take the expectation over something like P(X | Y), because it doesn’t make much sense to sample x, y \sim P(X | Y).

Of course, this rule works for any sort of expectation, including vector and matrix expectations. So E[E[xx^T | z]] = E[xx^T], for example.
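A minimal Monte Carlo sketch of the tower property, assuming a made-up hierarchical model (Y ~ Poisson, then X | Y ~ Normal(Y, 1)):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# Hypothetical hierarchical model: Y ~ Poisson(4), then X | Y ~ Normal(Y, 1).
y = rng.poisson(lam=4.0, size=n)
x = rng.normal(loc=y, scale=1.0)

# By construction E[X | Y] = Y, so averaging Y estimates E[E[X | Y]].
print(x.mean(), y.mean())  # both ≈ 4, illustrating E[X] = E[E[X | Y]]
```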

Application: expanding sub-distributions 🧸

This can also be applied in reverse, and cleverly, to expand a certain distribution. For example, E_{\tau \sim P(\tau)}[V(s_{t+1})] = E_{\tau \sim P(\tau)}\left[E_{s_{t+1}\sim p(s_{t+1} | s_t, a_t)}[V(s_{t+1}) \mid s_t, a_t]\right] because s_t, a_t are included in \tau.

In general, you can do this for any p(X) and x, where x \subset X.

Essentially, the tower property allows you to separate out a distribution into its constituents, which can be helpful.
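Written out in symbols (my notation: split X into the sub-collection x and the remaining variables z, so X = (x, z)), the general claim is

E_{X \sim p(X)}[f(X)] = E_{x \sim p(x)}\left[ E_{z \sim p(z \mid x)}[f(X) \mid x] \right]

which recovers the RL example above by taking x = (s_t, a_t), with s_{t+1} being the only part of z that f actually depends on.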

Expectation of products

For any independent random variables X, Y, the following is true:

E[XY] = E[X]E[Y]

However, it is not true that E[X^2] = E[X]^2, because X is not independent of itself within the same sample.
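A quick Monte Carlo sketch, with two arbitrary independent distributions chosen just for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

x = rng.exponential(scale=2.0, size=n)  # independent of y
y = rng.uniform(-1.0, 3.0, size=n)

print((x * y).mean(), x.mean() * y.mean())  # ≈ equal: E[XY] = E[X]E[Y]
print((x ** 2).mean(), x.mean() ** 2)       # differ by Var(X): E[X^2] != E[X]^2
```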

Estimators

An unbiased estimator of some parameter means that E[\hat{A}] = A (and some convergence stuff surrounding this assumption). Just because \hat{A} is an unbiased estimator doesn’t mean that f(\hat{A}) will be an unbiased estimator of f(A). It’s true for linear functions but not generally true for all functions.
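A small sketch of the nonlinearity caveat, using the sample mean (unbiased for \mu) versus its square (biased for \mu^2); the Gaussian distribution here is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma = 1.5, 2.0
n, trials = 10, 200_000

# Each row is one sample of size n; the estimator is the sample mean (unbiased for mu).
samples = rng.normal(loc=mu, scale=sigma, size=(trials, n))
mean_hat = samples.mean(axis=1)

print(mean_hat.mean(), mu)              # ≈ mu: unbiased
print((mean_hat ** 2).mean(), mu ** 2)  # biased upward by Var(mean_hat) = sigma^2 / n
```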

Expectation over multiple variables

When you have something like

E_{\tau \sim p^\pi(\tau)}[f(s_k, a_k)]

this is actually equivalent to taking the expectation over the marginal

E_{s_k, a_k \sim p^\pi(s_k, a_k)}[f(s_k, a_k)]

This shouldn’t be news to you, but it’s a very useful trick. Intuitively, the function on the inside ignores everything else, so you’re practically marginalizing those things away.
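Spelled out for the discrete case, summing out every trajectory variable other than s_k, a_k:

E_{\tau \sim p^\pi(\tau)}[f(s_k, a_k)] = \sum_\tau p^\pi(\tau) f(s_k, a_k) = \sum_{s_k, a_k} f(s_k, a_k) \sum_{\tau \setminus \{s_k, a_k\}} p^\pi(\tau) = \sum_{s_k, a_k} p^\pi(s_k, a_k) f(s_k, a_k)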