Asymptotic Equipartition Property

Tags: Coding, EE 276


Sending Codes

Let's formalize the problem of sending codes and establish why we care. You want to send a message that is not known in advance (not pre-determined) across a channel. For now, let's assume that the channel has no noise. Can we send the message using as few bits as possible?

We model the message as a string of random variables whose distribution is determined by the source (think of each random variable as a character). Depending on the properties of this random string, some neat structure emerges.

In reality, the RVs depend on each other, often through a Markov structure with a potentially long history (think natural language). As a simplification, however, we can treat them as IID. This section develops the key theory under that IID assumption: how can you best compress a string of samples?

AEP Theorem 🚀

The AEP is similar to the law of large numbers. The law of large numbers says that the average of samples is close to the expectation of an RV. The AEP says that you can approximate the entropy $H(X)$ by taking samples $X_1, \ldots, X_n$ and computing $\frac{1}{n}\log \frac{1}{p(X_1, \ldots, X_n)}$.

Formally, if we have IID $X_1, \ldots, X_n$ drawn from some distribution $p(X = x)$, then

$$P\left(\left|\frac{1}{n}\sum_{i=1}^n\log \frac{1}{p(X_i)} - H(X)\right| > \epsilon\right) \rightarrow 0$$

Or, more formally, for every $\delta$, there exists some $N(\delta)$ such that for all $n > N(\delta)$, we have

$$P\left(\left|\frac{1}{n}\sum_{i=1}^n\log \frac{1}{p(X_i)} - H(X)\right| > \epsilon\right) < \delta$$

Note that we vary only $\delta$; the $\epsilon$ stays fixed, unlike in traditional epsilon-delta limits.

Note that $\log \frac{1}{p(X_i)}$ is itself a random variable: you draw a sample and feed it right back into its own distribution. Since the $X_i$ are IID, these terms are IID too, so the AEP follows from the law of large numbers applied to their average.

This is true for IID samples only! The IID assumption is what lets us factor the joint distribution into a product of individual distributions, so that $\frac{1}{n}\log\frac{1}{p(X_1, \ldots, X_n)} = \frac{1}{n}\sum_{i=1}^n \log\frac{1}{p(X_i)}$.
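To see the theorem in action, here is a minimal simulation sketch (assuming NumPy; the 3-symbol source and its probabilities are made up purely for illustration). As $n$ grows, the per-symbol log-loss of a sampled sequence concentrates around $H(X)$:

```python
import numpy as np

# Hypothetical 3-symbol source; the probabilities are arbitrary illustrative choices.
p = np.array([0.5, 0.3, 0.2])
H = -np.sum(p * np.log2(p))                          # true entropy, ~1.485 bits

rng = np.random.default_rng(0)
for n in [10, 100, 1000, 10000]:
    samples = rng.choice(len(p), size=n, p=p)        # draw X_1, ..., X_n IID
    sample_entropy = np.mean(-np.log2(p[samples]))   # (1/n) log 1/p(X_1, ..., X_n)
    print(f"n={n:6d}  sample entropy={sample_entropy:.4f}  H(X)={H:.4f}")
```

Each printed value is one realization of the random quantity in the AEP; rerunning with different seeds shows the deviations from $H(X)$ shrinking as $n$ grows.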

Generalized AEP

We will talk about entropy rates later, but you can essentially say the same thing about $H(\chi)$, where $H(\chi)$ is the entropy rate. In other words:

$$P\left(\left|\frac{1}{n}\log \frac{1}{p(X_1, \ldots, X_n)} - H(\chi)\right| > \epsilon\right) \rightarrow 0$$

Understanding AEP

To truly understand the AEP, let's look at an example: a Bernoulli sequence with probability $p$. In an IID sequence, the ordering doesn't matter; all that matters is the number $k$ of ones in the sequence. Recall that $p(x^n) = p^k(1-p)^{n-k}$.

Now, plug this into the AEP condition. Expanding the log and using $H(X) = -p\log p - (1-p)\log(1-p)$, you can say that

$$\left|\frac{1}{n}\log \frac{1}{p(x^n)} - H(X)\right| = \left|\frac{k}{n}-p\right|\cdot\left|\log\frac{1-p}{p}\right| \leq \epsilon$$

which is equivalent to

$$\left|\frac{k}{n}-p\right| \leq \frac{\epsilon}{\left|\log\frac{1-p}{p}\right|}$$

which should remind you a lot of the law of large numbers. Just set $\epsilon' = \frac{\epsilon}{\left|\log\frac{1-p}{p}\right|}$ and you're done: every sequence satisfying the AEP condition has $|k/n - p| \leq \epsilon'$. And because the AEP event has probability at least $1-\delta$, a sampled sequence lands within this epsilon ball of $p$ with confidence at least $1-\delta$.

When does this break down? It breaks down when $p = 1/2$, so that $\log\frac{1-p}{p} = 0$. In this case, there is no bound on $k$. This makes sense: the probability of a sequence no longer depends on $k$ when $p = 1/2$.
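As a concrete worked instance (the numbers are chosen only for illustration): take $p = 0.3$ and $\epsilon = 0.1$ bits. Then

$$\left|\log\frac{1-p}{p}\right| = \log_2\frac{0.7}{0.3} \approx 1.222, \qquad \left|\frac{k}{n} - 0.3\right| \leq \frac{0.1}{1.222} \approx 0.082$$

so for $n = 1000$, sequences satisfying the AEP condition have roughly $k \in [218, 382]$ ones.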

Typical Sets 🐧

We define the typical set $A_\epsilon^{(n)}$ as the set of sequences $(x_1, x_2, \ldots, x_n)$ drawn IID from $X$ whose probability satisfies

$$2^{-n(H(X)+\epsilon)} \leq p(x_1, x_2, \ldots, x_n) \leq 2^{-n(H(X)-\epsilon)}$$

Don't overthink why the bounds take this exact form; as we will see, they are chosen precisely so that we get some nice properties.

Intuitively, the typical set is the set of sequences whose surprise is about average, i.e. whose probability is close to what the entropy predicts. The typical set may be a small subset of the set of all sequences, but as we will see below, a sequence sampled from the source is highly likely to be one of its elements.

As a non-probability analogue, consider the set of samples whose sample mean is within an epsilon-ball of the true mean; that is the "typical set" for estimating means. If you're ever confused, think of means instead of entropy, and all of the properties should be intuitive.
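As a minimal sketch of the definition (assuming NumPy; the Bernoulli parameter and $\epsilon$ are made up), here is a membership test that checks whether a sampled sequence lands in $A_\epsilon^{(n)}$:

```python
import numpy as np

def is_typical(seq, p, eps):
    """Check whether a binary sequence lies in A_eps^(n) for a Bernoulli(p) source."""
    n = len(seq)
    H = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))          # binary entropy in bits
    log_prob = np.sum(np.log2(np.where(seq == 1, p, 1 - p)))  # log2 p(x_1, ..., x_n)
    return H - eps <= -log_prob / n <= H + eps

rng = np.random.default_rng(0)
seq = (rng.random(1000) < 0.3).astype(int)   # sample X_1, ..., X_1000 ~ Bernoulli(0.3)
print(is_typical(seq, p=0.3, eps=0.1))       # True for the vast majority of draws
```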

Properties of Typical Sets

  1. For all sequences inside the typical set, we have $H(X) - \epsilon \leq \frac{1}{n}\log \frac{1}{p(x_1, \ldots, x_n)} \leq H(X) + \epsilon$ for any $\epsilon > 0$ and large enough $n$. In other words, inside the typical set, our estimated entropy is within an epsilon-ball of the true entropy. Proof: take $\frac{1}{n}\log$ of the probability bounds in the typical set definition.
  2. The probability that a sequence belongs to the typical set approaches 1: $P((X_1, \ldots, X_n) \in A_\epsilon^{(n)}) > 1 - \epsilon$ for sufficiently large $n$.
    • Proof (Using AEP)

      By the delta-epsilon definition of the AEP, for any $\delta$ and all $n > N(\delta)$ we have

      $$P\left(\left|\frac{1}{n}\log \frac{1}{p(X_1, \ldots, X_n)} - H(X)\right| \leq \epsilon\right) > 1 - \delta$$

      Now observe that whenever this log-probability is inside the epsilon ball of $H(X)$, the sequence is by definition in the typical set. Therefore, setting $\delta = \epsilon$ gives $P(A_\epsilon^{(n)}) > 1 - \epsilon$, and we are done.

  3. There is an upper bound on the number of elements in the typical set: $|A_\epsilon^{(n)}| \leq 2^{n(H(X) + \epsilon)}$.
    • Proof (definitions)

      This should be pretty obvious. The probability of each element in the typical set is lower bounded by $2^{-n(H(X)+\epsilon)}$, so

      $$1 = \sum_{x^n} p(x^n) \geq \sum_{x^n \in A_\epsilon^{(n)}} p(x^n) \geq |A_\epsilon^{(n)}| \, 2^{-n(H(X)+\epsilon)}$$

      which you can rearrange to get the bound.

  4. There is a lower bound on the number of elements in the typical set: $|A_\epsilon^{(n)}| \geq (1-\epsilon)2^{n(H(X)-\epsilon)}$.
    • Proof (using property #2)

      By property #2, $P(A_\epsilon^{(n)}) > 1 - \epsilon$ for large enough $n$. Each element of the typical set has probability at most $2^{-n(H(X)-\epsilon)}$, so

      $$1 - \epsilon < P(A_\epsilon^{(n)}) \leq |A_\epsilon^{(n)}| \, 2^{-n(H(X)-\epsilon)}$$

      which rearranges to the lower bound.

Note that the spread in probability across typical sequences grows with $n$: because of the $n\epsilon$ term in the exponent, the allowed probabilities range over a factor of $2^{2n\epsilon}$. This is actually pretty interesting! The sketch below checks the size and probability bounds numerically.
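This is a minimal enumeration sketch (the Bernoulli parameter, $n$, and $\epsilon$ are arbitrary illustrative choices) that builds $A_\epsilon^{(n)}$ exhaustively and checks properties 2-4:

```python
import itertools
import math

p, n, eps = 0.3, 14, 0.2
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))   # binary entropy, ~0.881 bits

typical_mass, typical_count = 0.0, 0
for seq in itertools.product([0, 1], repeat=n):         # enumerate all 2^14 sequences
    k = sum(seq)
    prob = p**k * (1 - p)**(n - k)
    if H - eps <= -math.log2(prob) / n <= H + eps:      # typical set membership (property 1)
        typical_mass += prob
        typical_count += 1

print(f"P(A) = {typical_mass:.3f}   (property 2: approaches 1 as n grows)")
print(f"|A|  = {typical_count}  <=  2^(n(H+eps)) = {2**(n*(H+eps)):.0f}   (property 3)")
print(f"|A|  = {typical_count}  >=  (1-eps)*2^(n(H-eps)) = {(1-eps)*2**(n*(H-eps)):.0f}   (property 4)")
```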

Data Compression with AEP

So here's the big point of talking about the AEP. It seems weird to be able to compress $|\chi|^n$ possible sequences into roughly $nH$ bits on expectation, because it feels like we are shoving too much data into too little space. Information theory shows how this compression is possible: the typical set grows to capture all but $\epsilon$ of the probability mass, and its size is bounded, so we know for certain that we can indeed encode with about $H$ bits per character.

Naive Compression

Naive compression assigns an index to every possible sequence, which requires $|\chi|^n$ indices. We can rewrite this as $2^{n\log |\chi|}$, which means that it takes $\log |\chi|$ bits per sample. Can we do better?

Using Typical Sets

Because there are at most $2^{n(H + \epsilon)}$ elements in the typical set, we can index them all with $n(H + \epsilon)$ bits, which means we only use roughly $H$ bits per sample (up to the $\epsilon$ slack). Non-typical sequences are rare, and we handle them separately below.

Now, interestingly, if the distribution of $X$ is uniform, then $H(X) = \log |\chi|$, the typical set is essentially everything, and no scheme beats the naive encoding. So this establishes, through theory, what we mean by compression.
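To make this concrete, here is a minimal sketch of the two-part typical-set code (my own illustrative construction, not a standard library routine; the Bernoulli parameter and $\epsilon$ are made up): typical sequences get a flag bit plus an index into $A_\epsilon^{(n)}$, everything else gets a flag bit plus the naive $n$-bit index.

```python
import math

def expected_bits_per_symbol(p, n, eps):
    """Expected codeword length per symbol of the two-part typical-set code
    for a Bernoulli(p) source. Typicality depends only on k, the number of ones."""
    H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    def log_loss(k):  # (1/n) log2 1/p(x^n) for a sequence with k ones
        return -(k * math.log2(p) + (n - k) * math.log2(1 - p)) / n

    typical_k = {k for k in range(n + 1) if H - eps <= log_loss(k) <= H + eps}
    typical_size = sum(math.comb(n, k) for k in typical_k)

    typ_len = 1 + math.ceil(math.log2(typical_size))   # flag bit + index into A_eps^(n)
    naive_len = 1 + n                                   # flag bit + naive n-bit index

    expected = 0.0
    for k in range(n + 1):
        prob_k = math.comb(n, k) * p**k * (1 - p)**(n - k)
        expected += prob_k * (typ_len if k in typical_k else naive_len)
    return expected / n, H

for n in [20, 100, 1000]:
    rate, H = expected_bits_per_symbol(p=0.3, n=n, eps=0.05)
    print(f"n={n:5d}  expected bits/symbol = {rate:.3f}   H(X) = {H:.3f}")
```

The expected rate drops below the naive 1 bit per symbol and approaches $H + \epsilon$ (plus the small flag/rounding overhead) as $n$ grows; shrinking $\epsilon$ pushes it toward $H(X)$.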

Optimal Coding theorem 🚀

Let $X^n$ represent an $n$-length IID sequence. For any $\epsilon > 0$ and $n$ sufficiently large, there exists an invertible code with codeword lengths $\ell(X^n)$ such that

$$E\left[\frac{1}{n}\,\ell(X^n)\right] \leq H(X) + \epsilon$$

So the tl;dr is that we can represent sequences $X^n$ with about $nH(X)$ bits on average! But how can we actually achieve this? Well, we'll find out the concrete coding schemes in the next set of notes.
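As a quick sanity check of where this comes from, using the two-part typical-set code sketched above (the $+2$ terms account for the flag bit and rounding up to a whole number of bits):

$$E[\ell(X^n)] \leq P(A_\epsilon^{(n)})\big(n(H+\epsilon)+2\big) + P\big((A_\epsilon^{(n)})^c\big)\big(n\log|\chi|+2\big) \leq n(H+\epsilon) + \epsilon\, n\log|\chi| + 2(1+\epsilon)$$

Dividing by $n$ and shrinking $\epsilon$ gives an expected rate arbitrarily close to $H(X)$.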

Ultimately, why should we care?

We have established the theoretical possibility of coding at the entropy rate. In fact, we already have a primitive proof-of-concept code: assign short indices (roughly $nH$ bits) to sequences inside the typical set, and keep all other sequences essentially as-is. With this scheme, the compression rate approaches $H(X)$ bits per symbol asymptotically.

But this is super primitive and has problems.

In later sections, we will show how prefix codes can reach this theoretical limit to within one bit.

High Probability Sets

We define a high probability set $B_\delta^{(n)}$ as any set of sequences such that $P(B_\delta^{(n)}) \geq 1-\delta$. In other words, the probability of a sampled sequence being in this set is at least $1-\delta$.

We can construct the smallest such $B$ greedily by taking the most likely sequences first. As it turns out, $B_\delta^{(n)}$ still needs on the order of $2^{nH}$ elements, the same first-order exponent as $|A_\epsilon^{(n)}| \approx 2^{n(H \pm \epsilon)}$.

The takeaway is that even the smallest high probability set is about as large as the typical set, and the two sets overlap substantially; see the sketch below.
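A minimal sketch (the Bernoulli parameter, $n$, $\delta$, and $\epsilon$ are illustrative choices) that builds the smallest $B_\delta^{(n)}$ greedily and compares it to the typical set:

```python
import itertools
import math

p, n, delta, eps = 0.3, 14, 0.1, 0.2
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

seqs = list(itertools.product([0, 1], repeat=n))

def prob(s):
    k = sum(s)
    return p**k * (1 - p)**(n - k)

# Greedy B_delta: take the most likely sequences until their mass reaches 1 - delta.
B, mass = [], 0.0
for s in sorted(seqs, key=prob, reverse=True):
    if mass >= 1 - delta:
        break
    B.append(s)
    mass += prob(s)

# Typical set, for comparison.
A = [s for s in seqs if H - eps <= -math.log2(prob(s)) / n <= H + eps]

print(f"|B_delta| = {len(B)},  |A_eps| = {len(A)},  overlap = {len(set(B) & set(A))}")
```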

High Probability vs Typical sets

The tl;dr is that $B$ is constrained as a whole set (by total probability mass), while $A$ is constrained per element (each sequence's probability must fall in the right range). You can't say much about $B$ on its own, but $B$ is related to $A$ through some simple properties.

The high probability set is subtly different from a typical set. If we construct the smallest possible high probability set for a Bernoulli sequence with $p = 0.8$, it will definitely contain the all-ones sequence. However, the typical set may not contain it. The typical set contains sequences whose probability is close to $2^{-nH(X)}$, which may exclude the single most likely sequence (because it isn't surprising enough!)

However, we do know from basic probability that if $P(B_\delta^{(n)}) \geq 1-\delta$ and $P(A_\epsilon^{(n)}) \geq 1 - \epsilon$ (the latter holds for large enough $n$), then

$$P(B_\delta^{(n)} \cap A_\epsilon^{(n)}) \geq 1 - \epsilon - \delta$$

which means that there is a large intersection (in probability mass) between the high probability set and the typical set.
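This is just the union bound applied to the complements:

$$P(A_\epsilon^{(n)} \cap B_\delta^{(n)}) = 1 - P\big((A_\epsilon^{(n)})^c \cup (B_\delta^{(n)})^c\big) \geq 1 - P\big((A_\epsilon^{(n)})^c\big) - P\big((B_\delta^{(n)})^c\big) \geq 1 - \epsilon - \delta$$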

We also know that, for large enough $n$ (and $\delta \geq \epsilon$), the smallest possible $B_\delta$ is no larger than $A_\epsilon$: the typical set itself captures at least $1-\epsilon \geq 1-\delta$ of the mass, so it is one valid high probability set, and the smallest one can only be smaller. There is also the trivial bound that the smallest $B_\delta$ has at most $(1-\delta)|\chi|^n$ elements, since the $(1-\delta)|\chi|^n$ most likely sequences always carry at least $1-\delta$ of the probability.