Concentration Inequalities ⭐

Convergence of Random Variables

What is a sequence of random variables? Well, maybe think about dice coming out of a dice factory. Each die is an RV, and there may be a pattern in how the dice behave.

There are different notions of convergence.

  1. in distribution means that the distributions themselves converge: the CDF of $X_n$ converges to the CDF of $X$ at every continuity point
  1. in probability means that for every $\epsilon > 0$, we have $P(|X_n - X| > \epsilon) \rightarrow 0$. Intuitively, an "unusual" outcome becomes rarer as the sequence progresses. Close to the idea of Hoeffding's inequality, useful for the law of large numbers, etc.
    1. with this inequality, you can say that we asymptotically bound the deviation within an epsilon ball
    1. Formally: for every $\delta$, there exists some $N(\delta, \epsilon)$ such that for all $n > N(\delta, \epsilon)$, we have $P(|X_n - X| > \epsilon) < \delta$. Note how this $N$ depends on both the delta and the epsilon.
  1. in mean square if $E[(X_n - X)^2] \rightarrow 0$ as $n \rightarrow \infty$.
  1. with probability 1 (almost surely) if $P(\lim_{n \to \infty} X_n = X) = 1$.
    1. this is different from 2) because we take the limit inside the probability.

Notation-wise, we use $X_n \overset{d}{\rightarrow} X$ for convergence in distribution, $X_n \overset{p}{\rightarrow} X$ for convergence in probability, etc.

Understanding what convergence means

Distributional convergence makes sense. But how can the other types of convergence happen if these are random variables? Well, there are two common cases:

  1. $X_n = f_n(X)$ DIRECTLY, and $f_n$ approaches the identity as $n \rightarrow \infty$. In other words, there's some sort of coupling between the RVs.
  1. $X$ is a constant. Here, we get standard things like the law of large numbers, where $X = E[X_1]$ (overloading notation), and $X_n$ is the mean of the first $n$ samples (see the simulation sketch below).
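
A minimal simulation sketch of both cases, assuming numpy; the coupling $X_n = (1 + 1/n)X$, the threshold $\epsilon = 0.1$, and the standard normal base distribution are all made-up illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.1          # illustrative epsilon
trials = 10_000    # Monte Carlo trials per n

for n in [10, 100, 1000]:
    # Case 1: coupled sequence X_n = (1 + 1/n) * X, which converges to X in probability.
    x = rng.normal(size=trials)
    x_n = (1 + 1 / n) * x
    p_coupled = np.mean(np.abs(x_n - x) > eps)

    # Case 2: X_n is the sample mean of n draws, converging to the constant E[X] = 0.
    means = rng.normal(size=(trials, n)).mean(axis=1)
    p_mean = np.mean(np.abs(means) > eps)

    print(f"n={n:5d}  coupled: {p_coupled:.4f}  sample mean: {p_mean:.4f}")
```

In both cases the estimated probability of an $\epsilon$-sized deviation shrinks as $n$ grows, which is exactly convergence in probability.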

Ways of expressing convergence constraints (another way of looking at this)

Asymptotic bounds: this guarantees that some $X \rightarrow Y$ as $n \rightarrow \infty$. You might also get a rate of $o(f(n))$, which gives a bit more structure asymptotically.

High probability bounds: this guarantees that $|X - Y| > \epsilon$ happens with probability at most $\delta$. You get an interplay between $\epsilon, \delta, n$: you can set two of them and the third one will be derived. Usually $\delta$ is written as a function of $n$, but you can easily invert the relationship, as in the sketch below.
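
As a sketch of this inversion, using the Hoeffding bound for the mean of $[0,1]$-valued samples (stated later in these notes); the specific $\epsilon$, $\delta$, and $n$ values are illustrative:

```python
import numpy as np

# Hoeffding for the mean of n IID samples in [0, 1]:
#   P(|mean - E[mean]| > eps) <= 2 * exp(-2 * n * eps^2) = delta.
def delta_from(n, eps):
    return 2 * np.exp(-2 * n * eps**2)

def n_from(eps, delta):
    return int(np.ceil(np.log(2 / delta) / (2 * eps**2)))

def eps_from(n, delta):
    return np.sqrt(np.log(2 / delta) / (2 * n))

print(n_from(eps=0.01, delta=0.05))    # samples needed for a (0.01, 0.05) guarantee
print(eps_from(n=10_000, delta=0.05))  # accuracy at 95% confidence with 10k samples
print(delta_from(n=10_000, eps=0.01))  # failure probability with 10k samples, eps = 0.01
```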

General Tail Bounds

Tail bounds are important because they limit how much probability mass sits in the tail of a distribution, which naturally has implications for learning algorithms that need to exclude or control the tail of a distribution.

Markov’s Inequality

We can show that for any non-negative random variable $X$ and $t > 0$, we have

$$P(X \geq t) \leq \frac{E[X]}{t}$$

Intuitively, the further the threshold $t$ sits above the expected value, the less likely $X$ is to exceed it. This is mathematically super helpful because it relates probabilities to expectations, and we will use it to build up to the law of large numbers.
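
A quick numerical sanity check of Markov's inequality, assuming numpy; the exponential distribution and the thresholds are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)  # non-negative samples with E[X] = 1

for t in [1, 2, 5, 10]:
    empirical = np.mean(x >= t)
    markov = x.mean() / t                       # Markov: P(X >= t) <= E[X] / t
    print(f"t={t:2d}  empirical P(X>=t)={empirical:.4f}  Markov bound={markov:.4f}")
```

The bound always holds, but notice how loose it is; the sharper bounds below exist precisely to tighten this.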

Chebyshev’s Inequality

We can show very easily that

$$P(|Y - \mu| > \epsilon) \leq \frac{\sigma^2}{\epsilon^2}$$

where $\sigma^2$ and $\mu$ are the variance and mean of $Y$, respectively. This brings us closer to the idea of Hoeffding's inequality and the law of large numbers.

Upshot: to show a (weak) law of large numbers for a stochastic process, it is sufficient to show that the variance of the sample mean goes to zero.
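
As a one-line check of this upshot, apply Chebyshev's inequality to the sample mean $\bar{X}_n$ of $n$ IID samples with mean $\mu$ and variance $\sigma^2$ (the variance calculation is the sample-variance lemma stated later):

$$P\left(\left|\bar{X}_n - \mu\right| > \epsilon\right) \leq \frac{\mathrm{Var}(\bar{X}_n)}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2} \rightarrow 0 \quad \text{as } n \rightarrow \infty$$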

General Tail Bound

Claim: if $X$ is real, then for any $t \geq 0$,

$$P(X > c) \leq E[\exp(tX - tc)]$$
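
This is just Markov's inequality applied to the non-negative variable $e^{tX}$ (for $t > 0$, since $x \mapsto e^{tx}$ is increasing):

$$P(X > c) = P\left(e^{tX} > e^{tc}\right) \leq \frac{E[e^{tX}]}{e^{tc}} = E[\exp(tX - tc)]$$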

Gaussian tail bound

Claim: if $Z$ is Gaussian with mean 0 and variance 1, then

$$P(Z > |c|) \leq \exp(-c^2/2)$$

This gets us a stronger variant of Chebyshev’s inequality, and it is food for thought.
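
A sketch of where this comes from: plug the standard Gaussian MGF $E[e^{tZ}] = e^{t^2/2}$ into the general tail bound above and minimize over $t$:

$$P(Z > c) \leq \exp\left(\frac{t^2}{2} - tc\right) \quad \text{for all } t > 0, \qquad \text{so taking the minimizer } t = c: \quad P(Z > c) \leq \exp(-c^2/2)$$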

Hoeffding’s inequality ⭐

Hoeffding's inequality in the general form is as follows: given independent $X_1, \ldots, X_n$ with $a_i \leq X_i \leq b_i$, we have

$$P\left(\left|\sum_{i=1}^n \left(X_i - E[X_i]\right)\right| \geq t\right) \leq 2\exp\left(\frac{-2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)$$

A variant replaces the $(b_i - a_i)^2$ term with the variance $\sigma_i^2$ (a Bernstein-style bound); this can be tighter, because the variance of a bounded variable is at most its squared range ($\sigma_i^2 \leq (b_i - a_i)^2 / 4$).

💡
we drop the factor of 2 outside of the exponential if there is no absolute value inside the probability

Common forms include the version for the sample mean $\bar{X}$ of IID samples bounded in $[a, b]$:

$$P\left(\left|\bar{X} - E[\bar{X}]\right| \geq t\right) \leq 2\exp\left(\frac{-2nt^2}{(b-a)^2}\right)$$

Uses of Hoeffding’s inequality

Anytime you see some sample sum or mean, you can use Hoeffding's inequality to establish bounds on how far it strays from its expectation, as in the sketch below.
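
A minimal sketch, assuming numpy; the coin bias $p = 0.3$, sample size $n = 500$, and deviation $t = 0.05$ are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, trials, t = 0.3, 500, 20_000, 0.05      # illustrative parameters

# Sample means of n Bernoulli(p) variables, which are bounded in [0, 1].
flips = rng.random((trials, n)) < p
means = flips.mean(axis=1)

empirical = np.mean(np.abs(means - p) > t)    # observed frequency of a large deviation
hoeffding = 2 * np.exp(-2 * n * t**2)         # two-sided Hoeffding bound for [0, 1]
print(f"empirical={empirical:.4f}  Hoeffding bound={hoeffding:.4f}")
```

The empirical deviation frequency should come out well below the Hoeffding bound, which is valid but not tight.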

Lemma: Sample variances

If $Z_1, \ldots, Z_n$ are independent with variance $\sigma^2$ each, then the variance of $\frac{1}{n}\sum_i Z_i$ is $\sigma^2 / n$.
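
A one-line justification, using independence so that the variances add:

$$\mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^n Z_i\right) = \frac{1}{n^2}\sum_{i=1}^n \mathrm{Var}(Z_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$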

McDiarmid’s Inequality

This states that for any independent (e.g. IID) random variables $x_1, \ldots, x_n$ and any function $f$ that has bounded differences (i.e. changing the $i$th coordinate of $f(x_1, \ldots, x_n)$ changes $f$ by at most a finite $c$), we have

$$P\left(\left|f(x_1, \ldots, x_n) - E[f(x_1, \ldots, x_n)]\right| > t\right) \leq 2\exp\left(\frac{-2t^2}{nc^2}\right)$$

As usual, the alternate (average) form is

$$P\left(\left|\bar{f}(x_1, \ldots, x_n) - E[\bar{f}(x_1, \ldots, x_n)]\right| > t\right) \leq 2\exp\left(\frac{-2nt^2}{c^2}\right)$$

Now, if $f$ is the summation, we recover Hoeffding's inequality, as sketched below.
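
Concretely, for identically bounded variables $x_i \in [a, b]$, take $f$ to be the sum; changing any one coordinate changes $f$ by at most $c = b - a$, so McDiarmid's bound becomes

$$P\left(\left|\sum_{i=1}^n x_i - E\left[\sum_{i=1}^n x_i\right]\right| > t\right) \leq 2\exp\left(\frac{-2t^2}{n(b-a)^2}\right)$$

which is exactly Hoeffding's inequality in this special case.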

The proof uses martingales, so we will omit it here. Just know that this is a powerful inequality that bounds a function of the samples around its expectation.

CLT (Central Limit Theorem)

The CLT builds on the law of large numbers and makes the claim that suitably rescaled averages of IID random variables are asymptotically Gaussian.

If $X_1, \ldots, X_n$ are IID random variables from a distribution $X$ with covariance $\Sigma$, and $\bar{X} = \frac{1}{n}\sum_i X_i$, then

  1. $\bar{X} \overset{p}{\rightarrow} E[X]$ (this is the weak law of large numbers)
  1. $\sqrt{n}(\bar{X} - E[X]) \overset{d}{\rightarrow} N(0, \Sigma)$
    1. If you were to move this around, you'd heuristically get $\bar{X} - E[X] \approx N(0, \Sigma / n)$, which makes sense: you're narrowing around the mean, consistent with the law of large numbers (see the simulation sketch after this list).
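
A minimal simulation sketch of point (2), assuming numpy; the exponential(1) distribution (mean 1, variance 1) and the sample sizes are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1_000, 10_000
mu, sigma = 1.0, 1.0                            # exponential(1): mean 1, variance 1

samples = rng.exponential(scale=1.0, size=(trials, n))
z = np.sqrt(n) * (samples.mean(axis=1) - mu)    # sqrt(n) * (X_bar - E[X])

# Should look approximately N(0, sigma^2): mean ~ 0, std ~ 1, ~95% of mass inside 1.96.
print(f"mean={z.mean():.3f}  std={z.std():.3f}  P(|Z| < 1.96)={np.mean(np.abs(z) < 1.96):.3f}")
```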

The weak law of large numbers (point 1) is derivable from Hoeffding's inequality for bounded variables; the distributional claim in point 2 needs a separate argument (e.g. characteristic functions).

Delta Method

Using point (2), we can apply the Delta Method and assert that for any function $h$ with non-zero derivative at $E[X]$, we have

$$\sqrt{n}\left(h(\bar{X}) - h(E[X])\right) \overset{d}{\rightarrow} N\left(0,\; \nabla h(E[X])^\top \,\Sigma\, \nabla h(E[X])\right)$$

This is a very helpful extension of the CLT.
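
For example, in the scalar case with $h(x) = x^2$, $\mu = E[X]$, and $\sigma^2 = \mathrm{Var}(X)$ (so $h'(\mu) = 2\mu \neq 0$), the Delta Method gives

$$\sqrt{n}\left(\bar{X}^2 - \mu^2\right) \overset{d}{\rightarrow} N\left(0,\; 4\mu^2\sigma^2\right)$$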

Law of large numbers

The laws of large numbers can be derived from Hoeffding's inequality (for bounded variables): the sample mean concentrates around the true mean.

The actual Law ⭐

Both the weak and strong forms of the law of large numbers state that the average of the samples (denoted $\bar{X}_n$) converges to $E[X]$.

The weak form states that the convergence happens in probability. Formally, this means that

$$P\left(\left|\frac{1}{n}\sum_{i=1}^n X_i - E[X]\right| > \epsilon\right) \rightarrow 0$$

The strong form states that it happens almost surely. Proof is omitted for now.