Basic Probability

Formalism & the basics

An outcome space $\Omega$ contains all possible outcomes. So for a die, the set $\Omega$ would be the six numbers. These elements $\omega \in \Omega$ are known as outcomes. We define the event space $F$ to be a collection of subsets of $\Omega$ (for a finite $\Omega$, we can take the power set). Therefore, an event $A \in F$ is a subset of the outcome space, which can be one outcome, or a set of outcomes. We can union events, and we can intersect events.
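
A minimal sketch of the die example using Python sets (the specific events `even` and `at_least_five` are just illustrative):

```python
# Sketch: outcome space and events for a six-sided die, modeled as Python sets.
omega = {1, 2, 3, 4, 5, 6}           # outcome space: all possible outcomes
even = {2, 4, 6}                     # an event: a subset of the outcome space
at_least_five = {5, 6}               # another event

# Events can be unioned and intersected like any sets.
union = even | at_least_five         # {2, 4, 5, 6}
intersection = even & at_least_five  # {6}

assert union <= omega and intersection <= omega  # events are subsets of omega
```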

The probability measure $P(\cdot)$ is a function that satisfies these conditions (checked numerically in the sketch after the list):

  1. $P(A) \geq 0$ for all $A \in F$
  1. $P(\Omega) = 1$
  1. If $A_1, \ldots, A_n$ are disjoint (mutually exclusive), then $P(\bigcup_i A_i) = \sum_i P(A_i)$
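
A minimal sketch checking the three axioms numerically, assuming a fair six-sided die:

```python
from itertools import chain, combinations

omega = {1, 2, 3, 4, 5, 6}
p_outcome = {w: 1 / 6 for w in omega}   # assumption: a fair die

def P(event):
    """Probability measure: sum the probabilities of the outcomes in the event."""
    return sum(p_outcome[w] for w in event)

# Axiom 1: non-negativity, for every event (every subset of omega)
all_events = chain.from_iterable(combinations(omega, r) for r in range(len(omega) + 1))
assert all(P(e) >= 0 for e in all_events)

# Axiom 2: the whole outcome space has probability 1
assert abs(P(omega) - 1) < 1e-12

# Axiom 3: additivity for disjoint events
a, b = {1, 2}, {5}
assert abs(P(a | b) - (P(a) + P(b))) < 1e-12
```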

From these axioms of probability, we derive the following properties (also checked in the sketch after the list)

  1. $A \subseteq B \Rightarrow P(A) \leq P(B)$ (duh)
  1. $P(A \cap B) \leq \min(P(A), P(B))$ (another duh)
  1. $P(A \cup B) \leq P(A) + P(B)$
  1. $P(\Omega - A) = 1 - P(A)$ (law of complements)
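
The derived properties can be checked the same way (again assuming a fair die; the events `a` and `b` are arbitrary):

```python
omega = {1, 2, 3, 4, 5, 6}

def P(event):
    # Assumption: a fair die, so the measure is uniform over outcomes.
    return len(event) / len(omega)

a, b = {2, 4}, {2, 4, 6}                       # a is a subset of b
assert P(a) <= P(b)                            # property 1: monotonicity
assert P(a & b) <= min(P(a), P(b))             # property 2
assert P(a | b) <= P(a) + P(b)                 # property 3: the union bound
assert abs(P(omega - a) - (1 - P(a))) < 1e-12  # property 4: law of complements
```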

Two views of probability

The frequentist interprets probability as the long-run frequency of events. This works well if you're repeating an experiment many times, but there are cases, like weather predictions, where there will only be one outcome.

In those cases, the better interpretation is this: probability is a subjective degree of belief. If $P = 1$, then we are very sure that it will happen. If $P = 0.5$, then we are only somewhat sure.

Unions and disjoint events

Unioning events $\alpha, \beta$ touches on the sum rule (inclusion-exclusion)

$P(\alpha \cup \beta) = P(\alpha) + P(\beta) - P(\alpha \cap \beta)$

If $P(\alpha \cap \beta) = 0$, we call these two events mutually exclusive.
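
A quick numerical check of the sum rule, again assuming a fair die:

```python
omega = {1, 2, 3, 4, 5, 6}

def P(event):
    # Assumption: fair die, uniform probability over the six outcomes.
    return len(event) / len(omega)

alpha, beta = {1, 2, 3}, {3, 4}              # overlapping events
lhs = P(alpha | beta)
rhs = P(alpha) + P(beta) - P(alpha & beta)
assert abs(lhs - rhs) < 1e-12                # sum rule holds

gamma = {5, 6}                               # shares no outcomes with alpha
assert P(alpha & gamma) == 0                 # mutually exclusive
```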

Intersection and independence

Intersecting events touches on the chain rule

$P(\alpha \cap \beta) = P(\alpha)P(\beta \mid \alpha) = P(\beta)P(\alpha \mid \beta)$

We define $\alpha$ to be independent of $\beta$ if these equivalent statements are true (see the sketch after the list)

  1. $P(\beta \mid \alpha) = P(\beta)$
  1. $P(\alpha \cap \beta) = P(\alpha)P(\beta)$
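
A sketch of both definitions, assuming two independent fair coin flips as the outcome space:

```python
from itertools import product

# Outcome space for two fair coin flips, every outcome equally likely.
omega = set(product("HT", repeat=2))

def P(event):
    return len(event) / len(omega)

alpha = {w for w in omega if w[0] == "H"}   # first flip is heads
beta = {w for w in omega if w[1] == "H"}    # second flip is heads

# Definition 2: P(a ∩ b) = P(a) P(b)
assert abs(P(alpha & beta) - P(alpha) * P(beta)) < 1e-12

# Definition 1: P(b | a) = P(b), using P(b | a) = P(a ∩ b) / P(a)
assert abs(P(alpha & beta) / P(alpha) - P(beta)) < 1e-12
```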

We can generalize the two-event case into the chain rule, which states that

$P(S_1 \cap S_2 \cap \cdots \cap S_k) = P(S_1)\,P(S_2 \mid S_1)\,P(S_3 \mid S_1 \cap S_2) \cdots P(S_k \mid S_1 \cap S_2 \cap \cdots \cap S_{k-1})$
💡
It can be easy to forget the actual chain rule. Remember: $p(x, y) = p(x \mid y)\,p(y)$. Each variable appears in a non-conditional position only ONCE; it does not appear again to the left of the bar in another factor!
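
A sketch of the chain rule over three binary variables, using a made-up joint table (the numbers are arbitrary and just need to sum to 1):

```python
# Assumed joint distribution over three binary variables s1, s2, s3.
joint = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.05, (0, 1, 0): 0.20, (0, 1, 1): 0.15,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.05, (1, 1, 1): 0.30,
}

def P(**fixed):
    """Probability that the named variables (s1, s2, s3) take the given values."""
    idx = {"s1": 0, "s2": 1, "s3": 2}
    return sum(p for w, p in joint.items()
               if all(w[idx[k]] == v for k, v in fixed.items()))

# Chain rule: P(s1, s2, s3) = P(s1) P(s2 | s1) P(s3 | s1, s2)
lhs = P(s1=1, s2=1, s3=1)
rhs = (P(s1=1)
       * (P(s1=1, s2=1) / P(s1=1))
       * (P(s1=1, s2=1, s3=1) / P(s1=1, s2=1)))
assert abs(lhs - rhs) < 1e-12
```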

Conditional distributions

We can use the definition of conditional probability to get

$p_{Y \mid X}(y \mid x) = \dfrac{p_{XY}(x, y)}{p_X(x)}$

and in the continuous case it's the same thing, just a ratio of density functions.

Bayes' rule applies to both discrete PMFs and continuous PDFs, as do the rules for independence.
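
For reference, both forms of Bayes' rule follow directly from the two factorizations of the joint above:

$P(\alpha \mid \beta) = \dfrac{P(\beta \mid \alpha)\,P(\alpha)}{P(\beta)}, \qquad p_{X \mid Y}(x \mid y) = \dfrac{p_{Y \mid X}(y \mid x)\,p_X(x)}{p_Y(y)}$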

Tabular Distributions

Writing distributions as charts

It’s tricky keeping track of what is what sometimes! Good labeling is always key.

Tabular computation

From a table, you can compute things like $P(A = 0, B = 1, \ldots)$. To remove a variable, you marginalize it out by summing the joint probability over its values. To condition on a variable, you isolate all the rows matching the condition, sum their probabilities to get a normalizing factor, and then you can compute something like $P(A = 0 \mid B = 1)$. As another note: you can't compute $P(A)$ as a function, because the table is just for look-up. Tabular computation requires you to plug in values for the variables.
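
A minimal sketch of marginalizing and conditioning on a tabular joint over two binary variables (the numbers are made up):

```python
# Assumed joint table over two binary variables A and B, summing to 1.
joint = {
    (0, 0): 0.10, (0, 1): 0.30,
    (1, 0): 0.20, (1, 1): 0.40,
}

# Marginalize B out by summing the joint over its values: P(A = 0)
p_a0 = sum(p for (a, b), p in joint.items() if a == 0)   # 0.40

# Condition on B = 1: isolate the matching rows, then normalize.
rows_b1 = {k: p for k, p in joint.items() if k[1] == 1}
normalizer = sum(rows_b1.values())                       # P(B = 1) = 0.70
p_a0_given_b1 = rows_b1[(0, 1)] / normalizer             # P(A = 0 | B = 1) ≈ 0.43

print(p_a0, p_a0_given_b1)
```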

Remember that for independence, $P(A = 0 \mid B = 1) = P(A = 0 \mid B = 0) = P(A = 0)$. However, if $P(A = 0, B = 1) \neq P(A = 0, B = 0)$, this does NOT show dependence unless $P(B = 1) = P(B = 0)$, since the joint $P(A = 0, B = b) = P(A = 0 \mid B = b)\,P(B = b)$ also depends on the marginal of $B$ (see the sketch below).
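
A sketch of that caveat: in the made-up table below, the joints $P(A = 0, B = 1)$ and $P(A = 0, B = 0)$ differ, yet $A$ and $B$ are independent, because the difference comes entirely from the marginal of $B$:

```python
# Assumed joint where A and B are independent but P(B = 1) != P(B = 0).
joint = {
    (0, 0): 0.12, (0, 1): 0.28,
    (1, 0): 0.18, (1, 1): 0.42,
}

p_b0 = joint[(0, 0)] + joint[(1, 0)]   # P(B = 0) = 0.30
p_b1 = joint[(0, 1)] + joint[(1, 1)]   # P(B = 1) = 0.70
p_a0 = joint[(0, 0)] + joint[(0, 1)]   # P(A = 0) = 0.40

# The joints differ...
assert joint[(0, 1)] != joint[(0, 0)]
# ...but both conditionals equal the marginal, so A and B are independent.
assert abs(joint[(0, 1)] / p_b1 - p_a0) < 1e-9
assert abs(joint[(0, 0)] / p_b0 - p_a0) < 1e-9
```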