Math Tricks

Tags: Basics, EE 276

Proof tips

The Basics

Notation

Sometimes we use the $\Pr\{\}$ notation if using another $p$ would be confusing. This is common for CDFs, or if the likelihood itself is a random variable:

“what is the likelihood that the likelihood of a sample is below a certain value?”

Proof techniques

Limit behavior

Summations

Raising a summation to an exponent is the same as nesting summations. This reflects a deeper truth:

$$\Big(\sum_x f(x)\Big)\Big(\sum_y f(y)\Big) = \sum_x\sum_y f(x)f(y)$$

which is intuitively true because multiplication distributes over every pair of terms, and the nested sum ranges over exactly those pairs.
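
Quick numerical sanity check of the pairwise picture (the vector `f` below is made up purely for illustration):

```python
import numpy as np

# An arbitrary positive function on a small support, just for illustration.
f = np.array([0.3, 1.2, 0.5, 2.0])

# (sum_x f(x))^2 computed directly ...
lhs = f.sum() ** 2

# ... equals the nested double sum: sum_x sum_y f(x) f(y).
rhs = sum(f[x] * f[y] for x in range(len(f)) for y in range(len(f)))

assert np.isclose(lhs, rhs)  # both are 16.0
```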

Fractions

CDF

If you have something like $p(X > k)$, this is an integral of the PDF. If you're dealing with discrete variables, just take the summation instead (in the discrete case there is no PDF, only a PMF).
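
Spelled out, for a threshold $k$:

$$p(X > k) = \int_k^{\infty} f(x)\,dx \;\;\text{(continuous)}, \qquad p(X > k) = \sum_{x > k} p(x) \;\;\text{(discrete)}$$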

Expectations

Marginalization

You can marginalize distributions, but you CAN'T marginalize something like $H(X, Y)$. This is because we already take the joint expectation, and the entropy operator is not linear.

But… marginalization is your most common tool. If you want to calculate some $p(x, y)$ but it's easier for you to calculate $p(x, y | z)$, then do $\sum_z p(x, y | z)\, p(z)$.
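
A minimal sketch of that computation (the tables for $p(z)$ and $p(x, y | z)$ below are invented for illustration):

```python
import numpy as np

# Hypothetical tables: p(z) and p(x, y | z) for binary x, y, z.
p_z = np.array([0.4, 0.6])                  # p(z), shape (Z,)
p_xy_given_z = np.array([                   # p(x, y | z), shape (Z, X, Y)
    [[0.1, 0.2], [0.3, 0.4]],               # conditional table for z = 0
    [[0.25, 0.25], [0.25, 0.25]],           # conditional table for z = 1
])

# Marginalize out z:  p(x, y) = sum_z p(x, y | z) p(z)
p_xy = np.einsum('zxy,z->xy', p_xy_given_z, p_z)

assert np.isclose(p_xy.sum(), 1.0)          # still a valid joint distribution
```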

Chain rule

It can be easy to forget the actual chain rule. Remember: $p(x, y) = p(x | y)\, p(y)$. Each variable appears un-conditioned exactly ONCE on the right-hand side. It does not appear again, or else we run into problems.
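
For instance, with three variables, each one shows up un-conditioned exactly once:

$$p(x, y, z) = p(x | y, z)\, p(y | z)\, p(z)$$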

“p”

Here's also a point of confusion that sounds stupid but happens all the time. The $p$ is not one specific function. So $p(x | y)$ isn't the same function as $p(a | b)$, etc. This is true even when the $p$ is specialized for some application. The symbol $p$ does not name a single function.

Hard constraint inequality

If I have $X + Y = C$ and I know that $Y < B$, what can I say about $X$? Well, I know that $X = C - Y$, and this is the tricky part: we know that $-Y > -B$, so $C + (-Y) > C + (-B)$. So $X > C - B$. The inequality is flipped. This is also true by vibes. A common mistake is to move the $Y$ to the other side without flipping the inequality.

Advanced properties of RV’s

The formal definition of density

So for continuous RVs, we have the notion of a density $f(x)$. When defining the density, we always do it in terms of the CDF $F$.

$$f(x) = \frac{d}{dx}F(x)$$

so this is a great first step when you’re dealing with proofs with density. And of course, remember that

$$F(x) = p(k < x)$$

and you can apply functions inside the $p$, like $p(\phi(k) < x) \rightarrow p(k < \phi^{-1}(x))$ (assuming $\phi$ is increasing; for a decreasing $\phi$ the inequality flips).
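
For a concrete (made-up) example, take $\phi(k) = 2k$, which is increasing:

$$p(2k < x) = p(k < x/2) = F(x/2)$$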

Information lines vs dependency lines

In things like Markov chains, we draw things like $X \rightarrow Y$. In communication, we might draw something like $X -[P]- Y$. What is the difference? Is there a difference?

Densities

The density $p(s)$ means the likelihood of $s$: if you were to select a value at random, the likelihood of it being this particular $s$ is $p(s)$.

Remember that $p(s)$ is a value, as $s$ is a scalar. Therefore, $p(s | v)$ is well-defined, but NOT $p(s | V)$. The conditioning must always be deterministic. On the other hand, $p(S | v)$ is totally fine; it's just another random variable.

What is a random variable??

A random variable $X$ is nothing more than a tuple containing a distribution $p(x)$ and a range $\chi$. The $x$ in $p(x)$ indexes into $\chi$, but that's only an indexing property. So… as long as you have some set $\chi$ and a valid probability distribution with the same cardinality, you've got yourself a random variable.
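
A minimal sketch of this view in code (support and probabilities are made up; the point is that a "random variable" is literally just a set plus a matching distribution):

```python
import numpy as np

# A "random variable" as nothing more than (range, distribution) of equal size.
support = np.array(['a', 'b', 'c'])      # the range (chi)
probs   = np.array([0.5, 0.3, 0.2])      # a valid distribution: nonnegative, sums to 1

assert np.isclose(probs.sum(), 1.0) and (probs >= 0).all()

# Sampling just uses p(x) to index into the support.
rng = np.random.default_rng(0)
samples = rng.choice(support, size=5, p=probs)
print(samples)
```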

Convexity in a distribution

This is definitely a mind trip, but certain things can have convexity or concavity WRT distributions. Don't get too confused. A distribution is just a vector with nonnegative entries and L1 norm 1. You can interpolate between two such vectors, and the intermediate vectors still have nonnegative entries and L1 norm 1, so they are still distributions. This interpolation is the secant line drawn between two distributions.
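
A small numerical illustration (the distributions $p, q$ are arbitrary; entropy is used only because it's a familiar concave function of the distribution):

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log2(p))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.2, 0.3, 0.5])

lam = 0.4
mix = lam * p + (1 - lam) * q            # the point on the "secant line"

assert np.isclose(mix.sum(), 1.0)        # interpolation is still a distribution

# Entropy is concave in the distribution, so it lies above the secant line.
assert entropy(mix) >= lam * entropy(p) + (1 - lam) * entropy(q)
```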

Linear function of a distribution

When we have something like $\sum_x p(x)\, g_x$, we say that this is a linear function of $p(x)$. Again, back to our view that a distribution is a vector: this is just the dot product of the vector $p$ with the vector $g$ (a Hadamard product followed by a sum), which is linear in $p$.

Same RV, same parameters

Here's a critical distinction. When we talk about some RV $X$, it has an identity as well as parameters. If we set $Y = X$, this means that every sample of $Y$ is the same as the corresponding sample of $X$.

However, there's also the weaker notion of equality in distribution. If we have $p(Y = y) = p(X = y)$ for all $y$, then the two have the same distribution, but the identities of $X$ and $Y$ are not the same. As a consequence, if you draw a sample from $X$, it may not equal the sample drawn from $Y$.

Distributions over likelihoods

So $X$ is a random variable, and $p(X)$ is also a random variable. This is because $p$ is just a function that takes in a value and outputs a number between 0 and 1. Therefore, $p(X)$ is feeding the sample back into its own likelihood function. This is perfectly valid, and in fact, this is exactly what entropy does: $E[\log 1/p(X)]$.
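
A quick sketch (made-up pmf): draw samples of $X$, feed them back through their own pmf, and the average of $\log 1/p(X)$ approaches the entropy.

```python
import numpy as np

rng = np.random.default_rng(0)

support = np.array([0, 1, 2])
probs   = np.array([0.5, 0.25, 0.25])     # a hypothetical pmf for X

# Draw samples of X, then evaluate p(X): the likelihood of each sample is itself random.
samples = rng.choice(support, size=100_000, p=probs)
p_of_X  = probs[samples]                  # works because support[i] == i here

# E[log2 1/p(X)] estimated from samples vs. exact entropy.
empirical = np.mean(np.log2(1.0 / p_of_X))
exact     = -np.sum(probs * np.log2(probs))    # = 1.5 bits
print(empirical, exact)                         # empirical is approximately 1.5
```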

Union and intersection bounds

This is basic stuff, but it can be very useful.

Remember that "not even one occurring" is equivalent to "all of them failing to occur" (De Morgan's laws of set negation).
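
For reference, the two facts presumably being leaned on here are the union bound and the De Morgan identity behind "not one = all of them":

$$p\Big(\bigcup_i A_i\Big) \le \sum_i p(A_i), \qquad \Big(\bigcup_i A_i\Big)^c = \bigcap_i A_i^c$$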

Transforms of random vectors

This is kinda tricky. If we have an invertible matrix $G$ and a uniform random vector $u$, then $Gu$ is uniformly random. Why?

Well, if $u$ is uniform, think about the VECTOR, not the scalars. $G$ maps each $u$ to a $v = Gu$ in a one-to-one manner. So $G$ is just reshuffling the randomness.

There are also special properties when we are in $F_2$. Namely, adding independent uniform vectors gives another uniform vector, because addition in $F_2$ is equivalent to bit flipping. This is not true for real-valued vectors: there, adding uniform RVs is equivalent to a random walk, and there are some deeper theories about this.
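
A tiny check of the one-to-one argument over $F_2$ (the matrix $G$ is an arbitrary invertible example): enumerating all of $F_2^3$ shows $Gu$ just permutes the inputs, so a uniform $u$ stays uniform.

```python
import numpy as np
from itertools import product

# A hypothetical invertible matrix over F_2 (unit diagonal, upper triangular).
G = np.array([[1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]])

# Enumerate every u in F_2^3 and map it through G (arithmetic mod 2).
inputs  = [np.array(u) for u in product([0, 1], repeat=3)]
outputs = {tuple(G @ u % 2) for u in inputs}

# One-to-one: 8 distinct inputs give 8 distinct outputs, so uniform stays uniform.
assert len(outputs) == len(inputs) == 8
```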

Splitting and moving in entropy, MI, etc

Making distributions from nothing at all

When you're faced with something that feels very close to a distribution, feel free to multiply and divide by the sum. Dividing by the sum turns the thing into a distribution; multiplying by it keeps the overall expression unchanged.

Some tips for spotting wannabe distributions

  1. summations with indices (just divide by the summation to get a distribution). Often, you can then rewrite the summation as an expectation of something WRT that distribution.
  2. things with logs and inequalities (because with a distribution, you can use Jensen's inequality)

Why do this? Well, things like Jensen's inequality and expectations don't work without a distribution, so it's beneficial to make one.
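
One way the trick looks on paper (the $a_i$ and $b_i$ here are arbitrary nonnegative terms, purely for illustration): multiply and divide by $\sum_j a_j$, and an expectation appears.

$$\sum_i a_i b_i = \Big(\sum_j a_j\Big) \sum_i \frac{a_i}{\sum_j a_j}\, b_i = \Big(\sum_j a_j\Big)\, E_{i \sim q}[b_i], \qquad q_i = \frac{a_i}{\sum_j a_j}$$

Once the expectation is there, Jensen's inequality (e.g. $\log E_q[b] \ge E_q[\log b]$) becomes available.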

Writing distributions as charts

It’s tricky keeping track of what is what sometimes! Good labeling is always key.

Martingale

A martingale is a stochastic process where the expected next state, given the history, is equal to the current state. Or, in mathematical terms,

$$E[s_{t+1} | s_1, ..., s_t] = s_t$$

A good example of a martingale is a symmetric random walk (zero-mean steps). This is a bit beyond our paygrade, but martingales have certain interesting properties that we can use to our advantage.
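
A small simulation of the random-walk example (step distribution and horizon chosen arbitrarily): conditioned on the current state, the average next state matches it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Symmetric random walk: s_{t+1} = s_t + step, with E[step] = 0.
n_paths, t = 200_000, 10
steps = rng.choice([-1, 1], size=(n_paths, t))
s_t   = steps[:, :-1].sum(axis=1)          # state after t-1 steps
s_t1  = s_t + steps[:, -1]                 # state after one more step

# Check E[s_{t+1} | s_t = s] is approximately s for a particular value of s.
mask = (s_t == 3)
print(s_t1[mask].mean())                   # approximately 3
```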

Change of variables

Suppose that you have $y = q(x)$ where $q$ is invertible and differentiable. You eventually want to integrate along $x$, but perhaps the formula is simpler in $y$-space, so you start in $y$-space. For the sake of this problem I'll just be doing a simple integral.

$$\int f(y)\,dy = \int f(q(x))\,dq(x)$$

Oh, but this is weird. How do we integrate over $q(x)$? Well, we know that $dy = q'(x)\,dx$, so this just becomes

$$\int f(q(x))\,q'(x)\,dx$$

which hopefully is reminiscent of u-substitution. And then you just integrate over $x$.

General strategy when dealing with change of variables: know how the variables are related directly ($y = q(x)$), but also how their differentials $dx, dy$ are related, so that you can connect the two spaces.
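
A numerical sanity check with a made-up choice $y = q(x) = x^2$ on $x \in [0, 2]$ and $f(y) = e^{-y}$:

```python
import numpy as np
from scipy.integrate import quad

f  = lambda y: np.exp(-y)
q  = lambda x: x ** 2
dq = lambda x: 2 * x                 # q'(x)

# Integral in y-space over [q(0), q(2)] = [0, 4] ...
lhs, _ = quad(f, 0, 4)

# ... equals the x-space integral of f(q(x)) q'(x) over [0, 2].
rhs, _ = quad(lambda x: f(q(x)) * dq(x), 0, 2)

print(lhs, rhs)                      # both are approximately 0.9817
```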

Finding f

In the derivation above, we just assumed that $f$ existed naturally. Now, typically for distributions, you have some $p(x)$, some transform $y = \phi(x)$, and you might want to find some quantity $h(y)$ which is easier to compute in $y$-space first, but that requires you to find some $g(y)$, the equivalent density in $y$-space. This is actually not trivial. You start from first principles:
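
(Sketching the standard argument here, writing $F, G$ for the CDFs in $x$-space and $y$-space, and assuming $\phi$ is increasing so no inequalities flip; otherwise an absolute value shows up.)

$$G(y) = p(\phi(k) \le y) = p(k \le \phi^{-1}(y)) = F(\phi^{-1}(y))$$

$$g(y) = \frac{d}{dy} G(y) = p(\phi^{-1}(y)) \cdot \frac{d}{dy}\phi^{-1}(y) = \frac{p(x)}{\phi'(x)} \Big|_{x = \phi^{-1}(y)}$$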

Note how you divide by this derivative term. This is a result of the density definition.