Joint Typical Sets and Optimal Coding

Tags: EE 276, Noise

Joint Typical Sets

So we have derived a nice property, a necessary condition for a noiseless encoder. However, we have made little progress on how to actually construct such a thing. In our discussion of compression, we essentially started with the concept of a typical set, and from that typical set came the primitive compression strategy. Let’s try a similar workup here.

We need the concept of a joint typical set because we have a set of inputs X and a set of outputs Y to the coding channel.

The Joint Typical Set Definition 🐧

A pair of sequences x^n, y^n is jointly typical if three conditions are met:

  1. x^n is typical
  2. y^n is typical
  3. The pair satisfies the joint bound below
2^{-n(H(X,Y)+\epsilon)} < p(x^n, y^n) = p(y^n|x^n)\,p(x^n) < 2^{-n(H(X,Y)-\epsilon)}

We arrived at this notion of bounding p(y^n|x^n). This has a special meaning: let’s formalize the idea of a conditional typical set:

2^{-n(H(X|Y)+\epsilon)} < p(x^n|y^n) < 2^{-n(H(X|Y)-\epsilon)}

This means that every output y^n has a set of x^n sequences that are “most expected” to have caused that y^n output.
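As an assumed concrete example (the notes don’t fix a channel): take X uniform on \{0,1\} and Y = X \oplus Z with Z \sim \text{Ber}(p), i.e. a binary symmetric channel. Then

p(x^n, y^n) = 2^{-n} \, p^{\,d(x^n, y^n)} (1-p)^{\,n - d(x^n, y^n)}

where d(\cdot, \cdot) is the Hamming distance and H(X,Y) = 1 + H_2(p). The joint-typicality condition above then just says that the fraction of flipped positions d(x^n, y^n)/n must lie in a small window around p.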

Understanding the relationship

Here is a good visualization.

Properties of the Joint Typical Set

  1. Pr(A) \rightarrow 1 as n \rightarrow \infty
    • Proof

      We know that -\frac{1}{n}\log p(X^n), -\frac{1}{n}\log p(Y^n), and -\frac{1}{n}\log p(X^n, Y^n) each converge in probability to H(X), H(Y), and H(X,Y) by the law of large numbers (the AEP). Therefore, intuitively, the union of the events that violate the three typicality conditions has probability that gets smaller and smaller.

  2. |A| \leq 2^{n(H(X, Y) + \epsilon)}
    • Proof

      Same as the proof for the single-variable typical set bound.

  3. |A| \geq (1-\epsilon)2^{n(H(X, Y) - \epsilon)} for sufficiently large n
    • Proof

      The argument is actually somewhat nuanced, because there are pairs x^n, y^n that are not typical. We bypass this by applying the AEP to the joint distribution directly.

      As a back-of-the-envelope calculation, at very large n, essentially everything will be typical. The size of the typical set of X is 2^{nH(X)}, and the size of the conditional typical set is 2^{nH(Y|X)}. If we assume that the Y-typicality condition comes essentially for free at large n, then we can just multiply these together to get 2^{n(H(X) + H(Y|X))} = 2^{nH(X,Y)}.

  4. If we let \tilde{X}^n \sim p(x^n) and \tilde{Y}^n \sim p(y^n) be drawn independently (i.e. marginalize and pretend that X and Y are independent), then Pr((\tilde{X}^n, \tilde{Y}^n) \in A) \leq 2^{-n(I(X;Y)-3\epsilon)}, and for large enough n, Pr((\tilde{X}^n, \tilde{Y}^n) \in A) \geq (1-\epsilon)2^{-n(I(X;Y)+3\epsilon)}.
    • Proof

      This is true because 1) the probability of each individual pair is bounded by the marginal typicality conditions, and 2) the size of A is bounded (a worked version of the upper bound appears right after this list).

      The lower bound follows by the same argument, using the lower bound on |A| and the lower bounds on the marginal probabilities.
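Spelling out the upper bound in property 4 (a sketch using property 2 and the marginal typicality bounds):

Pr\big((\tilde{X}^n, \tilde{Y}^n) \in A\big) = \sum_{(x^n, y^n) \in A} p(x^n)\,p(y^n) \leq |A| \cdot 2^{-n(H(X)-\epsilon)} \cdot 2^{-n(H(Y)-\epsilon)} \leq 2^{n(H(X,Y)+\epsilon)} \cdot 2^{-n(H(X)-\epsilon)} \cdot 2^{-n(H(Y)-\epsilon)} = 2^{-n(I(X;Y)-3\epsilon)}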

We care about this last property because it basically says that the probability that a randomly chosen pair from the marginals is jointly typical is about 2^{-nI(X;Y)}. A received signal can be decoded unambiguously if only the true input is jointly typical with it. There may be other confounding candidates, but the probability of each such false pairing gets asymptotically smaller as n gets large.

This indicates that there are about 2^{nI(X;Y)} distinguishable signals.
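Here is a small numerical check of property 4 (a sketch with illustrative parameters, assuming a binary symmetric channel with uniform input; nothing here is prescribed by the notes). For independently drawn uniform \tilde{X}^n, \tilde{Y}^n, marginal typicality holds automatically, so joint typicality depends only on the Hamming distance between the two sequences, and the probability can be computed exactly.

```python
from math import comb, log2

def H2(q):
    # binary entropy in bits
    return -q * log2(q) - (1 - q) * log2(1 - q)

n, p, eps = 200, 0.3, 0.01     # illustrative block length, crossover probability, epsilon
HXY = 1 + H2(p)                # H(X,Y) = H(X) + H(Y|X) for uniform X through BSC(p)
I = 1 - H2(p)                  # I(X;Y) for the capacity-achieving (uniform) input

prob = 0.0
for d in range(n + 1):
    # joint probability under the true channel: p(x^n, y^n) = 2^{-n} p^d (1-p)^{n-d}
    neg_log = n + d * (-log2(p)) + (n - d) * (-log2(1 - p))
    if n * (HXY - eps) < neg_log < n * (HXY + eps):
        prob += comb(n, d) * 0.5 ** n   # P(Hamming distance = d) for independent uniform pairs

print(f"P((X~,Y~) jointly typical)      = {prob:.2e}")
print(f"upper bound 2^-n(I-3eps)        = {2 ** (-n * (I - 3 * eps)):.2e}")
print(f"lower bound (1-eps)2^-n(I+3eps) = {(1 - eps) * 2 ** (-n * (I + 3 * eps)):.2e}")
```

With these numbers the exact probability lands between the two bounds, on the order of 2^{-nI(X;Y)}, which is the point of the property.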

Channel Coding Theorem

The Setup 🐧

Let’s review some terms introduced in the previous section. We have C, the channel capacity. We have R, the rate, measured in message bits per channel use. We have m, the number of possible messages W. The diagram looks like this:

W \rightarrow X^n \rightarrow Y^n \rightarrow \hat{W}

The number of bits in W is just nR.

Sometimes, we denote the code as X^n(W), which shows how W is mapped through some sort of encoding (not compression; that’s done before W).

We assume that the channel is discrete and memoryless, which means that p(y^n | x^n) = \prod_i p(y_i|x_i). This does not imply that the X_i are independent. We do know that W is IID uniform, but the coding algorithm may not choose to keep things IID uniform.
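As an illustrative numerical example (the specific numbers are assumptions, not from the notes): with block length n = 1000 and rate R = 1/2, the message W is one of m = 2^{nR} = 2^{500} equally likely values, i.e. 500 uniform bits, and the encoder maps it to a codeword X^n(W) of 1000 channel symbols.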

The Coding Theorem ⭐

We assert that if R < C, then the probability of error can be made arbitrarily small with a smart choice of code. It is not true for an arbitrary code.

To prove this, we will actually first show that a randomly chosen code will, on average, yield a probability of error that gets arbitrarily small.

Why we care

So essentially, we want to show that there exists some coding scheme (denoted as dots in the X^n oval of the figure) where we can cleanly infer the input given an output Y^n. This is non-trivial. The red bracket is the jointly typical set given this y^n, and it contains quite a few sequences.

The proof

We start by making an important simplification: we only deal with typical sequences X^n, Y^n, because as n gets large enough, the observed sequences are typical with high probability.

So let’s start with a randomly generated code. In other words, we start by generating a bunch of random X^n and binding them to messages W. Why? Well, the X^n would be uniform, and if we have a symmetric channel (our assumption), then we would be operating at capacity. Our limit on R is based on the fact that p(x) achieves capacity.

If the channel is symmetric, then the X^n will be uniform and every X^n is in the typical set. This is the easiest case to work with. If the channel is not symmetric, you would instead draw codewords from the optimizing distribution, e.g. \text{Ber}(p). This is slightly more complicated. For now, assume that every codeword X^n(W) we bind is in the typical set of X^n.

Now, the true X^n(W) will be in the conditional typical set with high probability (by the AEP), so we don’t have to worry about that. Instead, we are concerned about some other X^n(W’) being in the conditional typical set of this Y^n, which would be a problem. Why? Well, then there’s ambiguity in the decoding!

But here’s where the random code selection comes in clutch. From the AEP properties, we know that there are around 2^{nH(X|Y)} elements in the conditional typical set, and 2^{nH(X)} in the full typical set of X^n. Therefore, the probability that another randomly drawn codeword lands in the conditional set is roughly

2^{nH(X|Y)} / 2^{nH(X)} = 2^{-n(H(X) - H(X|Y))} = 2^{-nI(X;Y)}

So this is the probability of a single X^n(W’) being in the conditional set. What about all of the other W’? Well, we can easily set up a union bound:

P(E) \leq \sum_{W' \neq W} 2^{-nI(X;Y)} \leq 2^{nR} \cdot 2^{-nI(X;Y)} = 2^{-n(I(X;Y) - R)} = 2^{-n(C - R)}

For context, remember that there are 2^{nR} different W’s. The final equality comes from the assumption that the distribution of X is chosen to meet capacity, and the original claim that C > R is what makes this bound vanish.

From here, it becomes obvious. As n \rightarrow \infty, P(E) pinches down to 0. And the further the rate sits below capacity, the faster the error bound drops.

And finally, we need to make one final conclusion. As this error probability is averaged over the randomness of the channel AND the random code (implicitly; all random codes obey the same bound), we can use the law of averages to say that there must exist at least one code that does at least as well as the average random code, so we are done.
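To make the random-coding story concrete, here is a small Monte Carlo sketch (the BSC, the parameters, and the distance-based typicality test are all illustrative assumptions, not something the notes specify). Each trial draws a fresh random codebook, transmits W = 1 through a BSC(p), and decodes by looking for the unique codeword whose normalized Hamming distance to Y^n is close to p.

```python
import numpy as np

rng = np.random.default_rng(0)

def H2(q):
    # binary entropy in bits
    return -q * np.log2(q) - (1 - q) * np.log2(1 - q)

n, p, delta = 40, 0.05, 0.075    # block length, crossover probability, typicality slack
R = 0.25                         # rate, well below C = 1 - H2(0.05) ~ 0.71
M = 2 ** int(n * R)              # 2^{nR} = 1024 messages
print(f"C = {1 - H2(p):.3f}, R = {R}, M = {M}")

errors, trials = 0, 2000
for _ in range(trials):
    codebook = rng.integers(0, 2, size=(M, n))      # random code: each X^n(W) is IID uniform
    y = codebook[0] ^ (rng.random(n) < p)           # transmit W = 1 through the BSC
    dist = (codebook ^ y).sum(axis=1) / n           # normalized Hamming distance to every codeword
    candidates = np.flatnonzero(np.abs(dist - p) <= delta)
    if len(candidates) != 1 or candidates[0] != 0:  # error: no match, ambiguous, or wrong match
        errors += 1

print(f"empirical error rate ~ {errors / trials:.3f}")
```

With these toy numbers the decoder only fails a couple percent of the time, even though the codebook is completely random; pushing n up while holding R fixed (and shrinking delta appropriately) should drive the empirical error toward zero, which is exactly the content of the theorem.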

Proof (more rigorous, different approach)

To actually prove this, we need to establish what an error is. An error can happen in two ways:

  1. The true codeword X^n(W) is not jointly typical with Y^n.
  2. Some other imposter codeword X^n(W’) is jointly typical with Y^n.

Now, let’s formalize this. Define event E_i as the following:
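One standard definition, consistent with how E_i is used below, is the joint-typicality event:

E_i = \{ (X^n(i), Y^n) \in A \}, \quad i \in \{1, 2, \dots, 2^{nR}\}

i.e. the event that the i-th codeword is jointly typical with the received sequence Y^n.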

Error 1) happens when E_i is not true at the correct W. Error 2) happens when at least one other E_i is true at an incorrect W’. Using the symmetry of the code construction, we can just assume that we are sending W = 1. This yields the following:
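A sketch of the resulting decomposition (using the standard union bound over error events):

P(E) = P\Big(E_1^c \cup \bigcup_{i=2}^{2^{nR}} E_i\Big) \leq P(E_1^c) + \sum_{i=2}^{2^{nR}} P(E_i)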

Now, we know from the AEP that the first term becomes smaller than \epsilon for large enough n. From property #4 of the joint AEP, we know a bound on the probability that two independently drawn sequences are jointly typical. This yields the following workup:
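Putting the two facts together (a sketch, using property #4 with its 3\epsilon slack):

P(E) \leq \epsilon + \sum_{i=2}^{2^{nR}} 2^{-n(I(X;Y)-3\epsilon)} \leq \epsilon + 2^{nR} \cdot 2^{-n(I(X;Y)-3\epsilon)} = \epsilon + 2^{-n(I(X;Y) - R - 3\epsilon)}

which tends to \epsilon as n \to \infty whenever R < I(X;Y) - 3\epsilon.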

And so we have shown that the probability of error can be made arbitrarily small!

The Bound on Communication

The communication limit 🚀

Here’s a key theorem: if R > C, then p_e is bounded away from zero, where p_e is the probability of error in transmission.

We will now prove this theorem by finding a lower bound on p_e. We will see that only in certain circumstances is the lower bound zero; those are the times when it is possible to have an error-free code.

This is the CONVERSE of the channel coding theorem. We prove it through the contrapositive. The converse is P_e \to 0 \implies R \leq C, and we show that if R > C, then P_e is bounded away from 0.

Proof: starting with Fano’s Inequality

Now, this is very lofty, because it seems like p_e is very hard to quantify. However, let’s think for a second. The error probability roughly maps to the entropy of the message given the output, right? Or, in other words, H(W | Y^n). If we want to have an arbitrarily small probability of error, this conditional entropy should be near zero. Otherwise, it means that there is some uncertainty about the original message given the output of the channel.

Now indeed, as we remember from Fano’s inequality, we know that

P_e \geq \frac{H(W | Y^n) - 1}{\log (m-1)}

where m is just the number of possible messages, so m = 2^{nR}. (To review, look at the notes on Fano’s inequality.)

Moving towards the bound

Let’s just consider H(W | Y^n). We know that I(W ; Y^n) = H(W) - H(W | Y^n), which means that

H(W | Y^n) = H(W) - I(W;Y^n)

Now, because W is IID uniform, we know that H(W) = nR, which means that

H(W | Y^n) = nR - I(W;Y^n)

Intuitively, this makes sense. The entropy of W contributes uncertainty, and the mutual information between the input and the output can remove some of that uncertainty, but only up to a point.

Now, because W \rightarrow X^n \rightarrow Y^n forms a Markov chain, we can use the data processing inequality to state that I(W;Y^n) \leq I(X^n; Y^n). Therefore, we have

H(W | Y^n) \geq nR - I(X^n;Y^n)

Let’s continue. We can expand the mutual information as I(X^n ; Y^n) = H(Y^n) - H(Y^n | X^n). Now, the second term is easy to handle because of the discrete memoryless channel assumption: H(Y^n | X^n) = \sum_i H(Y_i | X_i). The first term isn’t actually easy to compute, but we can bound it using independence: H(Y^n) \leq \sum_i H(Y_i).
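For completeness, here is a sketch of the memorylessness step (a standard chain-rule argument, assuming the channel is used without feedback):

H(Y^n | X^n) = \sum_i H(Y_i | Y^{i-1}, X^n) = \sum_i H(Y_i | X_i)

where the second equality holds because, for a memoryless channel without feedback, Y_i depends only on X_i.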

Putting this together, we have

I(X^n;Y^n) \leq \sum_i \big[ H(Y_i) - H(Y_i|X_i) \big] = \sum_i I(X_i; Y_i)

This last sum depends on the channel properties as well as the distribution p(x). However, we know a bound on each term! It’s just C. So I(X^n;Y^n) \leq nC.

Putting this together, we have

H(W | Y^n) \geq nR - nC = n(R - C)

Already, we are starting to see that if R > C, then this lower bound is positive, which is bad.

For the final move, we just plug into Fano:

\boxed{P_e \geq \frac{H(W | Y^n) - 1}{\log (m-1)} \geq \frac{n(R - C) - 1}{\log 2^{nR}} \rightarrow 1 - \frac{C}{R}}

asymptotically. And this is exactly what we are looking for! If R > C, then the lower bound is positive, which means that the probability of error must be bounded away from zero.
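One compact way to assemble the same steps (a sketch; it uses \log m = nR in Fano’s inequality rather than \log(m-1)):

nR = H(W) = H(W|Y^n) + I(W;Y^n) \leq (1 + P_e \, nR) + nC

\Rightarrow \; P_e \geq 1 - \frac{C}{R} - \frac{1}{nR}

which recovers the boxed bound asymptotically.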

Slight caveat on the bound

So this bound is based on the assumption that the capacity-achieving distribution p^*(x) is reached. The more general result is

p_e \geq 1 - \frac{I(X; Y)}{R}

Because sometimes you might be artificially limited to a class of p(x) that doesn’t achieve optimality. So the lesson: your R can’t exceed the CURRENT mutual information.

Simple case: zero-error codes

We can show the converse directly (without the contrapositive) if we know that the error probability is zero. The key here is that if P_e = 0, then necessarily we have H(W | Y^n) = 0.

Then, we have
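A sketch of the chain, with the step labels (a), (b), (c) referenced below (under the zero-error assumption H(W|Y^n) = 0):

nR = H(W) = H(W|Y^n) + I(W;Y^n) = I(W;Y^n) \overset{(a)}{\leq} I(X^n;Y^n) \overset{(b)}{\leq} \sum_i I(X_i;Y_i) \overset{(c)}{\leq} nC

so R \leq C.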

(a) is the data processing inequality, (b) is the independence bound (see above), and (c) is the capacity bound I(X_i;Y_i) \leq C.

This is just the stripped-down version of the proof above. In the proof above, we had to give a lower bound on H(W | Y^n), and showed that this lower bound is only zero in certain cases.

Rates and Distributions

This is where I explain something a bit confusing.

So there’s this idea of an optimal distribution p^*(x) that achieves C = \max_{p(x)} I(X;Y). In binary symmetric channels, this is always uniform. But you shouldn’t assume that it’s uniform all the time, because sometimes you might need to compromise, etc.

There’s this idea of an input distribution u_1, \dots, u_n, which we assume a priori to be IID and uniform. So the message W is uniform.

There’s this idea of an encoder G(W) which takes W and transforms it into X^n. This can and will entangle the individual bits, so X^n is no longer IID. That doesn’t matter: as long as the individual marginals fit p^*(x), we are still at capacity.

So where does the rate come in? Well, you can always decide how much information you cram into G, i.e. how large your W is. We don’t care how G is constructed just yet; we imagine that we can adjust it for different sizes of W. The ratio of the size of W to the size of X^n is your rate.

Now, if we assume that our G can output X^n distributed like p^*(x), then the rate is restricted as R \leq C.

But what if that’s not the case? Well, then the rate limit changes. In the derivation of the channel coding theorem and the converse, we always bound I(X; Y) \leq C and use C to replace I(X;Y). Suppose that there is some tighter bound C’ due to some other constraint. It might be that your G isn’t expressive enough, or you need to share a channel with something else, etc. In this case, you must replace the bound with this value C’.

Channel definition

You can define an arbitrary channel through a mutual information expression. You might have I(X; Y_1, Y_2), which means that you send out one signal and you try to decode it by looking at two outputs. Every mutual information expression can yield a channel!
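As an assumed concrete example (not from the notes): if a uniform bit X is observed through two independent BSC(p) copies Y_1 and Y_2, then

I(X; Y_1, Y_2) = H(Y_1, Y_2) - H(Y_1, Y_2 | X) = 1 + H_2\big(2p(1-p)\big) - 2H_2(p)

For p = 0.1 this is about 0.74 bits, more than the single-look value 1 - H_2(0.1) \approx 0.53, but less than twice it.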