Concentration Inequalities


Ways of convergence and types of bounds

Asymptotic bounds: these guarantee that some quantity $X \rightarrow Y$ as $n \rightarrow \infty$. You might get a high-probability bound of $o(f(n))$, which gives a bit more structure asymptotically.

High probability bounds: these guarantee that the chance that $|X - Y| > \epsilon$ is at most $\delta$. You get an interplay between $\epsilon, \delta, n$: you can set two of them and the third one will be derived. Usually $\delta$ is written as a function of $n$, but you can easily invert the relationship.
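
As a concrete sketch of this interplay, assume (for illustration) a Hoeffding-style bound $\delta = 2\exp(-2n\epsilon^2)$ for a sample mean of variables bounded in $[0,1]$, a form that appears later in these notes; fixing any two of $(\epsilon, \delta, n)$ pins down the third:

```python
import math

# Assumed bound: delta = 2 * exp(-2 * n * eps^2)  (Hoeffding, values in [0, 1])

def delta_from(n, eps):
    """Failure probability when n and epsilon are fixed."""
    return 2 * math.exp(-2 * n * eps**2)

def eps_from(n, delta):
    """Accuracy when n and delta are fixed (invert the bound)."""
    return math.sqrt(math.log(2 / delta) / (2 * n))

def n_from(eps, delta):
    """Sample size needed to reach a target (epsilon, delta)."""
    return math.ceil(math.log(2 / delta) / (2 * eps**2))

print(delta_from(n=1000, eps=0.05))   # ~0.013
print(eps_from(n=1000, delta=0.05))   # ~0.043
print(n_from(eps=0.01, delta=0.05))   # 18445
```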

General Tail Bounds

Tail bounds are important because they limit how much probability mass sits in the tail of a distribution, which naturally has implications for learning algorithms: guarantees can be made to hold except on a low-probability tail event.

Markov’s Inequality

For a nonnegative random variable $X$ and any $t > 0$:

$P(X \geq t) \leq \frac{E[X]}{t}$

This shows that the tail of a nonnegative distribution is limited in size. Proof: write $E[X]$ as an integral and lower-bound it by the part where $X \geq t$, and you'll see it.
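
A quick numerical sanity check (a sketch, assuming an Exponential(1) variable, whose exact tail is $e^{-t}$ and whose mean is 1):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)  # E[X] = 1

for t in [1, 2, 5, 10]:
    empirical = (x >= t).mean()   # Monte Carlo estimate of P(X >= t)
    markov = x.mean() / t         # Markov bound E[X] / t
    print(f"t={t}: P(X >= t) ~ {empirical:.4f} <= Markov bound {markov:.4f}")
```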

Chebyshev’s Inequality

$P(|Y - \mu| > \epsilon) \leq \frac{\sigma^2}{\epsilon^2}$

Now this is very interesting, because it ties the variance directly to how fast a random variable concentrates around its mean.
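
Chebyshev follows directly from Markov's inequality applied to the nonnegative variable $(Y - \mu)^2$; a minimal derivation:

$P(|Y - \mu| > \epsilon) = P\big((Y - \mu)^2 > \epsilon^2\big) \leq \frac{E[(Y - \mu)^2]}{\epsilon^2} = \frac{\sigma^2}{\epsilon^2}$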

General Tail Bound

Claim: if $X$ is real, then for any $t \geq 0$,

$P(X > c) \leq E[\exp(tX - tc)]$
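
This follows from Markov's inequality applied to the nonnegative variable $e^{tX}$ (a sketch; optimizing the right-hand side over $t \geq 0$ afterwards gives the Chernoff bound):

$P(X > c) = P(e^{tX} > e^{tc}) \leq \frac{E[e^{tX}]}{e^{tc}} = E[\exp(tX - tc)]$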

Gaussian tail bound

Claim: if $Z$ is Gaussian with mean 0 and variance 1, then

$P(Z > |c|) \leq \exp(-c^2/2)$

This gets us a much stronger (exponential rather than polynomial) variant of Chebyshev's inequality, and it is food for thought.
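
It follows from the general tail bound above, using the standard Gaussian moment generating function $E[e^{tZ}] = e^{t^2/2}$ and then minimizing over $t$:

$P(Z > c) \leq E[e^{tZ}]\,e^{-tc} = \exp\!\left(\frac{t^2}{2} - tc\right)$, which is minimized at $t = c$, giving $P(Z > c) \leq \exp(-c^2/2)$ for $c \geq 0$.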

Hoeffding’s inequality

Hoeffding’s inequality in the general form is as follows: given independent $X_1, \dots, X_n$ with $a_i \leq X_i \leq b_i$, we have

$P\!\left(\left|\sum_{i=1}^n (X_i - E[X_i])\right| \geq t\right) \leq 2\exp\!\left(\frac{-2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)$

We sometimes see this bound stated in terms of a variance proxy instead of the $(b_i - a_i)^2$ term. Note the direction of control: for a variable bounded in $[a_i, b_i]$ the variance satisfies $\sigma_i^2 \leq (b_i - a_i)^2/4$, so the range always bounds the variance (not the other way around), and range-based bounds can be loose relative to variance-based ones.

💡
We drop the factor of 2 outside of the exponential if there is no absolute value inside the probability (a one-sided bound).

Common forms include the sample-mean version: if $X_1, \dots, X_n$ are IID in $[a, b]$ with mean $\mu$ and $\bar{X}_n = \frac{1}{n}\sum_i X_i$, then

$P(|\bar{X}_n - \mu| \geq \epsilon) \leq 2\exp\!\left(\frac{-2n\epsilon^2}{(b - a)^2}\right)$

Uses of Hoeffding’s inequality

Anytime you see a sample sum or mean of bounded variables, you can use Hoeffding's inequality to establish bounds on how far we stray from the mean, as in the sketch below.
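
A small numerical sketch (assuming Bernoulli(0.5) samples, so $a = 0$, $b = 1$, $\mu = 0.5$), comparing the empirical deviation probability against the two-sided bound:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials, eps = 200, 100_000, 0.05
p = 0.5  # Bernoulli mean; values are bounded in [0, 1]

# trials x n matrix of samples; one sample mean per trial
means = rng.binomial(1, p, size=(trials, n)).mean(axis=1)

empirical = np.mean(np.abs(means - p) > eps)          # P(|mean - mu| > eps)
hoeffding = 2 * np.exp(-2 * n * eps**2 / (1 - 0)**2)  # two-sided Hoeffding bound

print(f"empirical: {empirical:.4f}, Hoeffding bound: {hoeffding:.4f}")
```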

Law of large numbers

$\lim_{n\rightarrow \infty} P(|\bar{X}_n - \mu| > \epsilon) = 0$

The (weak) law of large numbers for bounded variables can be derived from Hoeffding's inequality: the bound $2\exp(-2n\epsilon^2/(b-a)^2)$ goes to $0$ as $n \rightarrow \infty$, so the sample mean gets closer to the true mean.

McDiarmid’s Inequality

This states that for independent random variables $x_1, \dots, x_n$ and any function $f$ that has bounded differences (i.e. substituting the $i$-th coordinate of $f(x_1, \dots, x_n)$ changes $f$ by at most a finite $c$), we have the following.

$P(|f(x_1, ..., x_n) - E[f(x_1, ..., x_n)]| > t) \leq 2\exp\!\left(\frac{-2t^2}{nc^2}\right)$

More generally, if coordinate $i$ has its own bounded-difference constant $c_i$, the denominator becomes $\sum_i c_i^2$.

As usual, the alternate (averaged) form, applying the same bound to $\bar{f} = f/n$, is

$P(|\bar{f}(x_1, ..., x_n) - E[\bar{f}(x_1, ..., x_n)]| > t) \leq 2\exp\!\left(\frac{-2nt^2}{c^2}\right)$

Now, if $f$ is the summation and each $x_i \in [a_i, b_i]$, coordinate $i$ has bounded difference $c_i = b_i - a_i$, and we recover Hoeffding's inequality.

The proof comes from martingales, so we will omit it here. Just know that this is a powerful inequality that bounds how far a function of the samples strays from its expectation.
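
A small numerical sketch (assuming, for illustration, $f$ = number of distinct symbols among $n$ draws from a $k$-letter alphabet; changing any single coordinate moves $f$ by at most $c = 1$). McDiarmid is often loose for such functions, but the bound holds:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, trials = 100, 50, 50_000

# f = number of distinct symbols among n draws from {0, ..., k-1}
samples = rng.integers(0, k, size=(trials, n))
f_vals = np.array([len(np.unique(row)) for row in samples])
f_mean = f_vals.mean()  # Monte Carlo stand-in for E[f]

for t in [5, 10, 15]:
    empirical = np.mean(np.abs(f_vals - f_mean) > t)
    bound = 2 * np.exp(-2 * t**2 / (n * 1.0**2))  # 2 exp(-2 t^2 / (n c^2)), c = 1
    print(f"t={t}: empirical {empirical:.5f} <= McDiarmid bound {bound:.3f}")
```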

CLT (Central Limit Theorem)

The CLT strengthens the law of large numbers: not only does the average of IID random variables concentrate around its mean, but its rescaled fluctuations around the mean are asymptotically Gaussian.

Suppose $X_1, \dots, X_n$ are IID random variables from a distribution $X$ that has covariance $\Sigma$, and let $\bar{X} = \frac{1}{n} \sum X_i$. Then:

  1. $\bar{X} \overset{p}{\rightarrow} E[X]$ (this is the weak law of large numbers)
  2. $\sqrt{n}(\bar{X} - E[X]) \overset{d}{\rightarrow} N(0, \Sigma)$
    - If you were to move this around, you'd get, informally, $\bar{X} - E[X] \approx N(0, \Sigma / n)$, which makes sense: you're narrowing around the mean, consistent with the law of large numbers.

Point (1) is derivable from Hoeffding's inequality when the $X_i$ are bounded; the Gaussian limit in point (2) does not follow from Hoeffding and is typically proved via characteristic functions.
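
A quick numerical sketch (assuming IID Exponential(1) samples, so $E[X] = 1$ and $\Sigma = \sigma^2 = 1$): the rescaled fluctuation $\sqrt{n}(\bar{X} - 1)$ should have standard deviation close to 1 and roughly Gaussian quantiles.

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 500, 20_000

# Exponential(1): mean 1, variance 1
means = rng.exponential(1.0, size=(trials, n)).mean(axis=1)
z = np.sqrt(n) * (means - 1.0)   # should be approximately N(0, 1)

print("std of sqrt(n)(mean - mu):", z.std())   # ~1.0
print("P(z <= 1.96):", np.mean(z <= 1.96))     # ~0.975 for N(0, 1)
```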

Delta Method

Using point (2), we can apply the Delta Method and assert that for any function $h$ with a non-zero derivative at $\mu = E[X]$, we have

$\sqrt{n}\,(h(\bar{X}) - h(\mu)) \overset{d}{\rightarrow} N\!\left(0,\ \nabla h(\mu)^{\top} \Sigma\, \nabla h(\mu)\right)$

This is a very helpful extension of the CLT.
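
A numerical sketch in one dimension (assuming $X_i \sim \mathrm{Exponential}(1)$, so $\mu = 1$, $\sigma^2 = 1$, and the illustrative choice $h(x) = x^2$, giving $h'(\mu) = 2$ and a predicted limiting variance of $h'(\mu)^2 \sigma^2 = 4$):

```python
import numpy as np

rng = np.random.default_rng(4)
n, trials = 500, 20_000

means = rng.exponential(1.0, size=(trials, n)).mean(axis=1)

# h(x) = x^2, mu = 1, h(mu) = 1, h'(mu) = 2
z = np.sqrt(n) * (means**2 - 1.0)

print("empirical variance:", z.var())   # should be close to h'(mu)^2 * sigma^2 = 4
```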