Noise Free Coding Algorithms

Tags: EE 276, Noise

Moving from Theory

So in the original formulation, we showed that a random code X^n(W) can yield an error probability that goes to 0, as long as the rate is smaller than the capacity. But a random code is very bad in practice! Encoding and decoding take exponential time (especially decoding). Can we do better?

Linear Codes

Let’s go back to a string of bits. We represent W = b_1, …, b_k and X^n = x_1, …, x_n. What if we have some matrix G over the finite field of size 2 (i.e. it’s a bunch of 1’s and 0’s, and when you do matrix multiplication, you do it mod 2), such that X^n = GW?

The encoding process you can imagine as creating a system of linear equations and sending linear combinations of the input bits. The decoding process you can imagine as solving this system. If you build in enough overdetermination, you will be able to solve the system even if the signal becomes corrupted.
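As a minimal sketch of the encoding step (my own illustration, not from the notes; the generator matrix below is a hypothetical example), all arithmetic is mod 2:

```python
import numpy as np

# Hypothetical generator matrix G (n x k) over GF(2): n = 6 coded bits, k = 3 message bits.
G = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [1, 0, 1]], dtype=np.uint8)

def encode(G, w):
    """Compute X^n = GW over GF(2): ordinary matrix multiply, then reduce mod 2."""
    return (G @ w) % 2

w = np.array([1, 0, 1], dtype=np.uint8)  # message W = b_1, ..., b_k
x = encode(G, w)                         # codeword X^n
print(x)                                 # [1 0 1 1 1 0]
```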

Linear Codes on Binary Erasure

The Binary Erasure Channel is the easiest example of how linear codes work. If we have X^n = GW and the channel erases some of the entries of X^n, we are just left with a smaller set of linear equations. As long as these equations are enough for a unique solution, we can still decode the original message.

How do you choose G? Well, you can just choose it randomly, and it’s possible to show that such a G is optimal with high probability.
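To make the “solve the surviving equations” idea concrete, here is a rough sketch (my own code, not from the notes): erase some coded bits, then recover W by Gaussian elimination mod 2 on the surviving rows of G.

```python
import numpy as np

rng = np.random.default_rng(0)

def gf2_solve(A, b):
    """Solve A w = b over GF(2) by Gaussian elimination; assumes a unique solution exists."""
    A = A.copy() % 2
    b = b.copy() % 2
    m, k = A.shape
    row = 0
    for col in range(k):
        pivots = np.nonzero(A[row:, col])[0]          # find a pivot in this column
        if len(pivots) == 0:
            raise ValueError("not enough surviving equations for a unique solution")
        p = row + pivots[0]
        A[[row, p]] = A[[p, row]]
        b[[row, p]] = b[[p, row]]
        for r in range(m):                            # eliminate this column from all other rows
            if r != row and A[r, col]:
                A[r] ^= A[row]
                b[r] ^= b[row]
        row += 1
    return b[:k]

# Random generator matrix and message (hypothetical sizes).
n, k = 12, 4
G = rng.integers(0, 2, size=(n, k), dtype=np.uint8)
w = rng.integers(0, 2, size=k, dtype=np.uint8)
x = (G @ w) % 2

# Binary erasure channel: each coded bit survives with probability 1 - p.
p = 0.3
survived = rng.random(n) > p

# Decode from the surviving equations only.
w_hat = gf2_solve(G[survived], x[survived])
print(np.array_equal(w_hat, w))  # True whenever the surviving rows have full column rank
```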

Efficiency

The encoding is very simple: it’s just the complexity of matrix multiplication. Decoding is slightly trickier: it requires inverting a matrix (i.e. solving the linear system).

Intuition for why rate matters: linear codes

This is an additional section, but it can help us understand why we care about the rate R. We know that there are nR bits in the message and n bits in the codeword. In our case, R < 1. Let’s assume a random code X^n(W) where every W is mapped randomly, and let’s assume that the channel itself is noiseless.

Proof of functionality

The tl;dr is that multiplying a random G by any fixed message is just a summation of IID column vectors, which is itself uniformly distributed. Then you can plug this argument into the random-coding argument. The slight caveat is that the W = 0 case (which always maps to the all-zero codeword G0) is not uniformly distributed. This is only one point, so it is fine.
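As a quick sanity check of that claim (my own illustration), here is a simulation: for a fixed nonzero W and a random 0/1 matrix G, the codeword GW mod 2 is empirically uniform over {0,1}^n.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)
n, k = 3, 4
w = np.array([1, 0, 1, 1], dtype=np.uint8)  # any fixed nonzero message

counts = Counter()
for _ in range(100_000):
    G = rng.integers(0, 2, size=(n, k), dtype=np.uint8)  # IID uniform entries
    x = tuple((G @ w) % 2)                               # GW is a sum of IID uniform columns
    counts[x] += 1

# Each of the 2^3 = 8 possible codewords should appear roughly 12,500 times.
print(counts)
```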

Polar Codes: The Theory

The big idea

For simplicity, let’s say we have u_1, …, u_k IID Ber(0.5), which is our message. We pass it into a coding scheme G, yielding x_1, …, x_n, and then into the channel [W] to yield y_1, …, y_n. Under the current system, the u’s get encoded into x and then get sent through one type of channel. Therefore, the noise is “distributed” evenly across the u’s.

In polar codes, we want to shift the noise so that it is asymmetrical: some bits of u see very little noise (i.e. they are very easy to figure out after transmission), while other bits essentially only act as checksums. (This is why the code is called polar: the bit channels get pushed to opposite extremes.)

The Assumptions

  1. Standard U → [G] → X → [W] → Y pipeline
  2. The coding matrix G is full rank (no loss of information)
  3. U is IID uniform, and any linear transformation of U is still uniform (because G is invertible)

The Encoder G

Let’s start with the simplest case where there are only two bits. We do this because we want to show a fundamental property that allows us to construct arbitrary sizes.

Because we assume that G is full rank, there is really only one option (and its symmetric option):

G = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}

And because we are dealing with the finite field of size 2, we can actually represent this as a diagram with an XOR gate: x_1 = u_1 ⊕ u_2 and x_2 = u_2.

In fact, any such G can be represented as a collection of these XOR gates (just think about it for a second). The capacity of the combined channel through G is 2C, where C is the capacity of the underlying channel [W].
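A one-line justification (my addition; it assumes the channel is symmetric, e.g. a BEC or BSC, so that uniform inputs achieve capacity): since G is invertible and U_1, U_2 are IID uniform, X_1, X_2 are also IID uniform, and the channel is memoryless, so

I(U_1, U_2 ; Y_1, Y_2) = I(X_1, X_2 ; Y_1, Y_2) = I(X_1 ; Y_1) + I(X_2 ; Y_2) = 2C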

Building up the Polarizer: the Chain Rule

Now, the capacity is 2C, and every bit contributes C to the capacity. Can we somehow polarize this, such that one bit has more capacity than the other?

Well, note that

I(U_1, U_2 ; Y_1, Y_2) = I(U_1 ; Y_1, Y_2) + I(U_2 ; Y_1, Y_2 | U_1)

This creates two separate channels!

The first channel

The first channel is the one from U_1 to (Y_1, Y_2).

We are trying to infer U_1 from Y_1, Y_2. In this channel, U_2 is just noise.

The second channel

The second channel is a bit trickier, because it’s not obvious what channel the conditional mutual information corresponds to. However, from the chain rule, we get

I(U_2 ; Y_1, Y_2 | U_1) + I(U_2 ; U_1) = I(U_2 ; Y_1, Y_2, U_1)

And we know that the U’s are independent, so I(U_2 ; U_1) = 0. So our second channel is the one from U_2 to (Y_1, Y_2, U_1).

Now, this is just a repetition code! We are trying to infer U_2 from (Y_1, Y_2, U_1): given U_1, both Y_1 and Y_2 carry information about U_2.

Here, we note that it might seem weird that the channel feeds back to predict U_1. In reality, we don’t care about recovering U_1 from this channel, but we do care about the Y_1, Y_2 (with U_1 available as known side information).

Intuition about the two channels

The first channel is between U_1 and (Y_1, Y_2). Intuitively, you make a prediction on U_1, and then you graduate onto the second channel, which is between U_2 and (Y_1, Y_2, U_1). This means that you’ve gone ahead and observed U_1 and let it be part of your decoding input.

The idea of these two channels is that we build up the next channel using the U’s we’ve already decoded.

More generally, it’s possible to show that

I(U_1, \dots, U_n ; Y_1, \dots, Y_n) = \sum_{i=1}^n I(U_i ; Y_1, \dots, Y_n, U_1, \dots, U_{i-1})

which shows the recurrent relationship: the i-th channel takes U_i to (Y_1, \dots, Y_n, U_1, \dots, U_{i-1}).

As an alternative intuition, you see how the less noisy channel relies on the more noisy channel almost as a checksum. Polarization is just the assignment of certain channels as checksums of the other channels.

Polarization Theory

It is possible to show that, as we keep polarizing, the capacities of the synthesized channels converge to either 0 or 1.

Or in other words, we will reach maximum polarization, at which point the proportion of non-zero (i.e. full) capacities will be equal to the capacity C.

The Polarizer on BEC: Case study

We can show polarization really easily on a binary erasure channel. Let’s say the erasure probability is p.

  • The first channel requires both Y_1, Y_2 to decode U_1 because of that XOR. So, if either one is erased, we essentially lose everything. This new channel is therefore also a BEC, with erasure probability 1 - (1-p)^2.
  • The second channel is a repetition code, so it requires both symbols to be erased to render the information irrecoverable. So the new channel is also a BEC, with erasure probability p^2.

For a binary erasure channel, C = 1 - p. You can easily tell that the first channel has capacity C_1 = (1-p)^2, which is smaller than C unless C = 0 or 1 (the trivial cases). And p^2 < p, so 1 - p^2 > 1 - p, and therefore the second channel has capacity C_2 = 1 - p^2, which is higher than C.
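As a concrete check (my own numbers), take p = 1/2:

C = 1 - p = 0.5, \qquad C_1 = (1-p)^2 = 0.25, \qquad C_2 = 1 - p^2 = 0.75

Note that C_1 + C_2 = (1-p)^2 + (1 - p^2) = 2(1-p) = 2C in general, so the polarization step conserves total capacity; it only redistributes it.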

We can continue to extend this result recursively. You can imagine the general rule (see the sketch after this list):

  • Enrich channel: take the old parameter p and create the new p’ = p^2.
  • Deplete channel: take the old parameter p and create the new p’ = 1 - (1-p)^2.
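Here is a small simulation of that recursion (my own sketch, not from the notes): it repeatedly applies the two rules and shows the synthesized erasure probabilities piling up near 0 and 1, with the fraction of nearly noiseless channels approaching C = 1 - p.

```python
import numpy as np

def polarize_bec(p, levels):
    """Recursively apply the BEC polarization rules, returning 2**levels erasure probabilities."""
    eps = np.array([p], dtype=float)
    for _ in range(levels):
        depleted = 1 - (1 - eps) ** 2   # worse ('minus') channels
        enriched = eps ** 2             # better ('plus') channels
        eps = np.concatenate([depleted, enriched])
    return eps

p = 0.4
eps = polarize_bec(p, levels=16)      # 2^16 synthesized channels
print(np.mean(eps < 1e-3))            # fraction of nearly noiseless channels -> approaches 1 - p
print(np.mean(eps > 1 - 1e-3))        # fraction of nearly useless channels  -> approaches p
print(np.mean(1 - eps))               # average capacity is conserved: exactly 1 - p at every level
```

Because each step conserves total capacity, the average capacity stays exactly 1 - p; polarization only pushes individual channels toward the extremes.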

Polar Codes: The Implementation

General polarization

We just apply the polarization algorithm to the positive W^+ and the negative W^- channels after the initial polarization. This yields W^{+-}, W^{++}, W^{-+}, W^{--}. Keep doing this until there are 2^k channels.

Encoder

We’ve already gotten a hint on how to use the polarizer. We can continue to run the polarizer on its outputs. This creates a tree of channels with different properties. But this is all in theory. How exactly is this done?

Well, to strip it all down, it’s actually just Gu, where G has a special structure.

You can construct things recursively and do the matrix multiplication directly, but this is O(n^2). Can we do better?

Yes! We can compute the matrix multiplication using recursive computation through block matrices. Consider this:

G_n\begin{bmatrix} u \\ v\end{bmatrix} = \begin{bmatrix} G_{n/2}u + G_{n/2}v \\ G_{n/2}v \end{bmatrix}

So this becomes two recursive subproblems of half the size plus a vector addition. The vector addition is O(n), so the total complexity is O(n \log n). But in general, encoding is matrix multiplication.

When you decide to fix some subset of U to 0 (the frozen bits), it’s equivalent to removing those columns of G_n. So the final encoding is just G_n with the columns that you chose to remove deleted.
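Here is a minimal sketch of that recursive encoder (my own code; the frozen positions below are illustrative, not a real construction). Frozen positions are simply fixed to 0 in u before encoding, which is equivalent to dropping those columns of G_n.

```python
import numpy as np

def polar_encode(u):
    """Apply the polar transform x = G_n u over GF(2), where n = len(u) is a power of 2.

    Uses the block recursion G_n [u; v] = [G_{n/2}u + G_{n/2}v ; G_{n/2}v],
    giving O(n log n) work instead of an O(n^2) matrix multiplication."""
    n = len(u)
    if n == 1:
        return u.copy()
    top = polar_encode(u[: n // 2])
    bottom = polar_encode(u[n // 2 :])
    return np.concatenate([(top + bottom) % 2, bottom])

# Example with n = 8: freeze some positions to 0 (hypothetical choice, for illustration only)
# and put the message bits in the remaining positions.
n = 8
frozen = {0, 1, 2, 4}
message_bits = [1, 0, 1, 1]
u = np.zeros(n, dtype=np.uint8)
u[[i for i in range(n) if i not in frozen]] = message_bits
x = polar_encode(u)
print(x)
```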

Decoder

Ultimately, you want to compute p(y^n, u_1^{i-1} | u_i). The base cases (n = 2) are

p(y_1^2, u_1 | u_2) = \frac{1}{2}\, p(y_2 | u_2)\, p(y_1 | u_1 ⊕ u_2)

and

p(y_1^2 | u_1) = \frac{1}{2} \sum_{u_2} p(y_1 | u_1 ⊕ u_2)\, p(y_2 | u_2)

If you look at the way that G is constructed, you can always find the outermost modified W’s that form a single one-step polarized channel. You only need to unpeel a certain subset of these connections, and it reduces to a self-similar problem. The exact mechanics are pretty hairy, but know that it is recursive and runs in O(n^2) time.

Visually, it’s a forward-backward process.
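To ground the base cases, here is a toy successive-cancellation decoder for the n = 2 case (my own sketch; the binary symmetric channel and its crossover probability are assumptions, not from the notes):

```python
p_flip = 0.1  # assumed BSC crossover probability

def w(y, x):
    """BSC transition probability p(y | x)."""
    return 1 - p_flip if y == x else p_flip

def decode_n2(y1, y2):
    """Successive cancellation for the 2-bit polar code: decode u1 first, then u2 given u1."""
    # First channel: p(y1, y2 | u1) = 1/2 * sum_{u2} p(y1 | u1 xor u2) p(y2 | u2)
    p_u1 = [0.5 * sum(w(y1, u1 ^ u2) * w(y2, u2) for u2 in (0, 1)) for u1 in (0, 1)]
    u1_hat = int(p_u1[1] > p_u1[0])
    # Second channel: p(y1, y2, u1 | u2) = 1/2 * p(y2 | u2) p(y1 | u1 xor u2), using the decoded u1
    p_u2 = [0.5 * w(y2, u2) * w(y1, u1_hat ^ u2) for u2 in (0, 1)]
    u2_hat = int(p_u2[1] > p_u2[0])
    return u1_hat, u2_hat

# Noiseless example: u = (1, 0) encodes to x = (u1 xor u2, u2) = (1, 0); decoding recovers (1, 0).
print(decode_n2(1, 0))
```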