Noisy Channels

Tags: EE 276, Noise

Proof tips

Moving to Noisy Channels

Now, we are graduating to channels that have noise, i.e. you send in some $x$ and you get a $y$ drawn from $p(y|x)$. What does that look like? Well, previously we had info → encoder → decoder → info. Now, we need info → encoder → channel encoder → noisy channel → channel decoder → decoder → approximate info.

The Assumptions

Our job is to design a meaningful channel encoder and channel decoder.

IID uniform inputs: Fortunately, we can make a key assumption about the data going into this pair of encoders. Because we assume we have a good source encoding algorithm in front, we can assume the input is a stream of binary RVs that are IID and uniform. Why IID? Well, if the variables had any sort of correlation, we could compress them further, until they are essentially independent.

By assuming that the input and output are just binary streams, we can start thinking of them as the numbers those binary strings represent. Our goal is to send one number and reconstruct it as accurately as possible.

Discrete and Memoryless Channel (DMC): we can assume that the noise is memoryless, i.e. $p(y_1, \dots, y_m \mid x_1, \dots, x_m) = p(y_1 \mid x_1) \cdots p(y_m \mid x_m)$. This is very similar to an IID assumption.

The channel goes $W \to X \to Y \to \hat{W}$. We can't assume that $X$ or $Y$ are IID.

Nomenclature

We use an arrow diagram to denote the distributions over outcomes, which can be helpful for visualizing things.

Performance Metrics

There are two things we care about

  1. Reliability: Can we have $P(W \neq \hat{W}) = p_\epsilon \ll 1$? (where $W$ is the whole message)
  2. Rate: Can we maximize the number of information bits per channel symbol?

Intuitively, these two metrics are in tension. You can easily get a very reliable code by making the signal really redundant, but this tanks your rate, and vice versa. In a later section, we will look at how these goals aren't necessarily competing if we do things correctly.

Channel Capacity

If the channel is symmetric, then naturally the maximizing input distribution is uniform (think about it for a second).

Why we can’t just take the percentage of mistakes

So the big question is: how do we quantify how good the received signal is? We can't just take the average number of mistakes, because a channel whose output is completely uncorrelated with the input still agrees with the input 50% of the time by pure chance (if you guess a binary digit, it matches the source half the time). Furthermore, you effectively lose more than the average number of mistaken bits, because you don't know where the mistakes are.

Moving to capacity

Instead, it helps to think about the correlation, or mutual information between the source and the output.

There’s something interesting about this setup. Mutual information depends on $p(x, y)$, which factors as $p(y|x)\,p(x)$. Now, $p(y|x)$ is a property of the transmission system. It's $p(x)$ that you can modify.

Therefore, we define capacity as

$$C = \max_{p(x)} I(X;Y)$$

As we will see, there's a strategy in how we design $p(x)$. We want each $x \sim p(x)$ to map to outputs $y$ that are as well separated as possible, so that the inverted $p(x \mid y)$ is as noiseless as possible. This is actually a concave maximization problem, because $I$ is concave in $p(x)$ for a fixed $p(y|x)$.

Meaning of capacity

Interestingly, this $C$ has a key meaning: at any rate below $C$, you can send information with as small an error probability as you like!

We will prove this fact in the next page of notes.

Properties 🚀

How to optimize 🔨

So this objective is not very easy to optimize directly. You can use a few tricks

  1. $I(X;Y) = H(Y) - H(Y|X)$. This is helpful because $H(Y)$ is maximized by a uniform output, and $H(Y|X)$ is often easy to derive.
  2. Now, the tricky part is (often) figuring out how to create a maximum-entropy $Y$ using $X$. You might benefit from $I(X;Y) = H(X) - H(X|Y)$, which gives you a more direct handle on the input distribution. The only problem is that $H(X|Y)$ is now convoluted.
  3. You can also brute-force it (see the sketch after this list). Start with a representation $p(x) = [p_1, p_2, \dots, p_n]$. If you have a conditional probability matrix $P$ with entries $P_{ij} = p(y_i \mid x_j)$, then $p(y) = P\,[p_1, p_2, \dots, p_n]^\top$. Using $P$, you will also be able to derive $H(Y|X)$. Usually, this sort of brute-force optimization is messy and requires some crucial insight, but it can be helpful to set it up this way.
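
The third trick lends itself to a quick numerical sanity check. Below is a minimal sketch (not part of the original notes) that sweeps over binary input distributions and evaluates $I(X;Y)$ directly; here the channel matrix is stored with one row per input symbol, i.e. the transpose of the column-stochastic $P$ above.

```python
import numpy as np

def mutual_information(p_x, P):
    """I(X;Y) in bits, where P[i, j] = p(y=j | x=i)."""
    p_xy = p_x[:, None] * P              # joint distribution p(x, y)
    p_y = p_xy.sum(axis=0)               # output marginal p(y)
    mask = p_xy > 0                      # skip zero-probability terms (0 log 0 = 0)
    return np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x[:, None] * p_y)[mask]))

def capacity_binary_input(P, grid=10001):
    """Brute-force C = max_{p(x)} I(X;Y) for a two-input channel by sweeping p(x=0)."""
    return max(mutual_information(np.array([q, 1 - q]), P)
               for q in np.linspace(0, 1, grid))

# Binary symmetric channel with crossover probability p = 0.1:
P_bsc = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
print(capacity_binary_input(P_bsc))      # ≈ 0.531 = 1 - H(0.1)
```

Because $I(X;Y)$ is concave in $p(x)$, this one-dimensional sweep has a single peak to find, so even a coarse grid gets close to the true capacity.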

What is rate? Capacity?

Rate is the number of message bits per channel bit. Capacity is also measured in the same units, and can never exceed $1$ for binary channels. This is mathematically true because for a Bernoulli RV, $H(X) \le 1$.

Rate is determined in the encoder, i.e. the transfer $W \to X^n$. The capacity is determined by the $X \to Y$ channel. The theory we develop will relate the rate to the capacity.

Example of Noisy Channels

Noiseless binary

Here, we consider a degenerate case where we have no noise at all. In this case, $\max I(X;Y) = \max\,[H(X) - H(X|Y)] = \max H(X)$ (since $H(X|Y) = 0$ when the output determines the input), which is maximized with a uniform $p(x)$, giving $C = 1$ bit per channel use.

Non-overlapping outputs

Consider the following distribution (repeated from the nomenclature diagram):

This looks noisy, but actually $p(x \mid y)$ is noiseless: each output can only have come from one input. So capacity is again achieved with a uniform $p(x)$. This is a good example of how the $H(X) - H(X|Y)$ expansion is critical.
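
As a concrete instance (a made-up channel, since the original diagram isn't reproduced here): suppose input $0$ lands uniformly on outputs $\{a, b\}$ and input $1$ lands uniformly on $\{c, d\}$. Each row looks spread out, but no output column is shared, and the brute-force sweep from the earlier sketch recovers a full bit:

```python
# Reuses capacity_binary_input from the sketch above.
P_split = np.array([[0.5, 0.5, 0.0, 0.0],    # x = 0 -> outputs {a, b}
                    [0.0, 0.0, 0.5, 0.5]])   # x = 1 -> outputs {c, d}
print(capacity_binary_input(P_split))        # ≈ 1.0 bit per channel use
```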

Binary Symmetric ⭐

Here, each input bit is complemented (flipped) with probability $p$. So everything we receive is unreliable, but that doesn't mean we can't send anything. We'll see an algorithm for this later.

We can show that the binary symmetric channel's mutual information is bounded in terms of $p$: $I(X;Y) \le 1 - H(p)$, with equality for a uniform input, so $C = 1 - H(p)$. If $p = 0.5$, the capacity is zero: it is impossible to carry any information.
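
The bound follows from the $H(Y) - H(Y|X)$ expansion (trick 1 above); a short derivation, sketched here since the notes don't spell it out:

$$
\begin{aligned}
I(X;Y) &= H(Y) - H(Y \mid X) \\
&= H(Y) - \sum_x p(x)\, H(Y \mid X = x) \\
&= H(Y) - H(p) && \text{(each } H(Y \mid X = x) = H(p)\text{)} \\
&\le 1 - H(p) && (Y \text{ is binary, so } H(Y) \le 1)
\end{aligned}
$$

Equality holds when $p(x)$ is uniform, which makes $Y$ uniform.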

Now, this has some interesting implications. At any given $p$, a fraction $1-p$ of the bits arrives uncorrupted. However, if you plot $1-p$ against $1 - H(p)$, you will find that the capacity curve lies below $1-p$ (for $0 < p \le 1/2$). This is intuitive; it means that by randomly flipping some bits, you lose more than just the proportion of bits you flipped, because you don't know where they got flipped.
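
A quick numeric check of that gap, using a small binary-entropy helper (again, just a sketch, not from the notes):

```python
import numpy as np

def binary_entropy(p):
    """H(p) in bits, clipping to avoid log(0)."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

for p in [0.01, 0.1, 0.25, 0.5]:
    print(f"p = {p:<4}  intact fraction 1-p = {1-p:.3f}  capacity 1-H(p) = {1 - binary_entropy(p):.3f}")
```

At $p = 0.1$, only about $0.53$ bits per channel use survive even though $90\%$ of the bits arrive intact.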

Binary Erasure ⭐

Here, instead of corrupting bits, we just lose (erase) some bits with probability $\alpha$.

We can actually derive exactly what $C$ is: $C = 1 - \alpha$. This is intuitive: if we lose a fraction $\alpha$ of the bits, we can only recover the remaining $1 - \alpha$ fraction.
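
One way to see this, using the $H(X) - H(X|Y)$ expansion (a sketch of the standard argument):

$$
\begin{aligned}
I(X;Y) &= H(X) - H(X \mid Y) \\
&= H(X) - \big[\alpha\, H(X) + (1-\alpha)\cdot 0\big] && \text{(erased w.p. } \alpha\text{: know nothing; otherwise } X \text{ is known exactly)} \\
&= (1-\alpha)\, H(X) \\
&\le 1 - \alpha,
\end{aligned}
$$

with equality when $X$ is uniform.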

Interestingly, this channel lets you communicate exactly the average fraction of bits that arrive untouched, unlike what we saw in the binary symmetric channel. The difference is that here the receiver knows which bits were lost.

Symmetric Channels (extra)

Consider a more general form, where we write $p(y|x)$ as a matrix.
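
For instance (an illustrative matrix, not the one from the original notes), with $P_{ij} = p(y_j \mid x_i)$:

$$
P = \begin{bmatrix} 0.5 & 0.3 & 0.2 \\ 0.2 & 0.5 & 0.3 \\ 0.3 & 0.2 & 0.5 \end{bmatrix}
$$

Every row is a permutation of $(0.5, 0.3, 0.2)$, and so is every column.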

What can we say about the channel limits now?

Symmetric definition 🐧

Now, in the example above, we notice how the rows are permutations of each other, and so are the columns. We call this a symmetric channel. Intuitively, it means that the entropy of the output given the input, $H(Y \mid X = x)$, doesn't depend on which $x$ you send, even though the conditional distribution $p(y \mid x)$ does.

If the rows are permutations of each other and all the column sums are equal, then the channel is weakly symmetric.

Channel Capacity

We can find the channel capacity by performing the following standard derivation
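
A sketch of that argument, writing $\mathbf{r}$ for (any) row of the transition matrix:

$$
\begin{aligned}
I(X;Y) &= H(Y) - H(Y \mid X) \\
&= H(Y) - \sum_x p(x)\, H(\mathbf{r}) && \text{(every row has the same entropy)} \\
&= H(Y) - H(\mathbf{r}) \\
&\le \log|\mathcal{Y}| - H(\mathbf{r}).
\end{aligned}
$$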

And we can attain the bound, $H(Y) = \log|\mathcal{Y}|$, by feeding in a uniformly distributed input: for a symmetric channel, a uniform input produces a uniform output. This gives $C = \log|\mathcal{Y}| - H(\mathbf{r})$.

And this is true for weakly symmetric channels as well: the column sums are equal, so a uniform input still produces a uniform output.