Noise Free Coding Algorithms

Tags: EE 276, Noise

Moving from Theory

So in the original formulation, we showed that a random code X^n(W) can yield an error probability that goes to 0, as long as the rate is smaller than the capacity. But a random code is very bad in practice! Encoding and decoding take exponential time (especially decoding). Can we do better?

Linear Codes

Let’s go back to a string of bits. We represent W = b_1, …, b_k and X^n = x_1, …, x_n. What if we have some matrix G over the finite field of size 2 (i.e. it’s a bunch of 1’s and 0’s, and when you do matrix multiplication, you do it mod 2), such that X^n = GW?

The encoding process you can imagine as creating a system of linear equations and sending linear combinations of the input bits. The decoding process you can imagine as solving this system. If you build in enough overdetermination, you will be able to solve the system even if the signal becomes corrupted.
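As a minimal sketch of the encoding step (my own illustration, not from the notes; the generator matrix below is a hypothetical example), all arithmetic is mod 2:

```python
import numpy as np

# Hypothetical generator matrix G (n x k) over GF(2): n = 6 coded bits, k = 3 message bits.
G = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [1, 0, 1]], dtype=np.uint8)

def encode(G, w):
    """Compute X^n = GW over GF(2): ordinary matrix multiply, then reduce mod 2."""
    return (G @ w) % 2

w = np.array([1, 0, 1], dtype=np.uint8)  # message W = b_1, ..., b_k
x = encode(G, w)                         # codeword X^n
print(x)                                 # [1 0 1 1 1 0]
```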

Linear Codes on Binary Erasure

The Binary Erasure Channel is the easiest example of how linear codes work. If we have X^n = GW and the channel erases some of the entries of X^n, we are just left with a smaller set of linear equations. As long as these equations are enough for a unique solution, we can still decode the original message.

How do you choose G? Well, you can just choose it randomly, and it’s possible to show that such a G is optimal with high probability.
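To make the “solve the surviving equations” idea concrete, here is a rough sketch (my own code, not from the notes): erase some coded bits, then recover W by Gaussian elimination mod 2 on the surviving rows of G.

```python
import numpy as np

rng = np.random.default_rng(0)

def gf2_solve(A, b):
    """Solve A w = b over GF(2) by Gaussian elimination; assumes a unique solution exists."""
    A = A.copy() % 2
    b = b.copy() % 2
    m, k = A.shape
    row = 0
    for col in range(k):
        pivots = np.nonzero(A[row:, col])[0]          # find a pivot in this column
        if len(pivots) == 0:
            raise ValueError("not enough surviving equations for a unique solution")
        p = row + pivots[0]
        A[[row, p]] = A[[p, row]]
        b[[row, p]] = b[[p, row]]
        for r in range(m):                            # eliminate this column from all other rows
            if r != row and A[r, col]:
                A[r] ^= A[row]
                b[r] ^= b[row]
        row += 1
    return b[:k]

# Random generator matrix and message (hypothetical sizes).
n, k = 12, 4
G = rng.integers(0, 2, size=(n, k), dtype=np.uint8)
w = rng.integers(0, 2, size=k, dtype=np.uint8)
x = (G @ w) % 2

# Binary erasure channel: each coded bit survives with probability 1 - p.
p = 0.3
survived = rng.random(n) > p

# Decode from the surviving equations only.
w_hat = gf2_solve(G[survived], x[survived])
print(np.array_equal(w_hat, w))  # True whenever the surviving rows have full column rank
```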

Efficiency

The encoding is very simple: it’s just the complexity of matrix multiplication. Decoding is slightly trickier: it requires inverting a matrix (i.e. solving the linear system).

Intuition for why rate matters: linear codes

This is an additional section, but it can help us understand why we care about the rate R. We know that there are nR bits in the message and n bits in the codeword. In our case, R < 1. Let’s assume a random code X^n(W) where every W is mapped randomly, and let’s assume that the channel itself is noiseless.

Proof of functionality

The tl;dr is that multiplying a random G by any fixed message is just a summation of IID column vectors, which is itself uniformly distributed. Then you can plug this argument into the random-coding argument. The slight caveat is that the W = 0 case (which always maps to the all-zero codeword G0) is not uniformly distributed. This is only one point, so it is fine.
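As a quick sanity check of that claim (my own illustration), here is a simulation: for a fixed nonzero W and a random 0/1 matrix G, the codeword GW mod 2 is empirically uniform over {0,1}^n.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)
n, k = 3, 4
w = np.array([1, 0, 1, 1], dtype=np.uint8)  # any fixed nonzero message

counts = Counter()
for _ in range(100_000):
    G = rng.integers(0, 2, size=(n, k), dtype=np.uint8)  # IID uniform entries
    x = tuple((G @ w) % 2)                               # GW is a sum of IID uniform columns
    counts[x] += 1

# Each of the 2^3 = 8 possible codewords should appear roughly 12,500 times.
print(counts)
```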

Polar Codes: The Theory

The big idea

For simplicity, let’s say we have u_1, …, u_k IID Ber(0.5), which is our message. We pass it into a coding scheme G, yielding x_1, …, x_n, and then into the channel [W] to yield y_1, …, y_n. Under the current system, the u’s get encoded into x and then get sent through one type of channel. Therefore, the noise is “distributed” evenly across the u’s.

In polar codes, we want to shift the noise so that it is asymmetrical: some bits of u see very little noise (i.e. they are very easy to figure out after transmission), while other bits essentially only act as checksums. (This is why the code is called polar: the bit channels get pushed to opposite extremes.)

The Assumptions

  1. Standard U → [G] → X → [W] → Y pipeline
  2. The coding matrix G is full rank (no loss of information)
  3. U is IID uniform, and any linear transformation of U is still uniform (because G is invertible)

The Encoder G

Let’s start with the simplest case where there are only two bits. We do this because we want to show a fundamental property that allows us to construct arbitrary sizes.

Because we assume that G is full rank, there is really only one option (and its symmetric option):

G = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}

And because we are dealing with the finite field of size 2, we can actually represent this as a diagram with an XOR gate: x_1 = u_1 ⊕ u_2 and x_2 = u_2.

In fact, any such G can be represented as a collection of these XOR gates (just think about it for a second). The capacity of the combined channel through G is 2C, where C is the capacity of the underlying channel [W].
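A one-line justification (my addition; it assumes the channel is symmetric, e.g. a BEC or BSC, so that uniform inputs achieve capacity): since G is invertible and U_1, U_2 are IID uniform, X_1, X_2 are also IID uniform, and the channel is memoryless, so

I(U_1, U_2 ; Y_1, Y_2) = I(X_1, X_2 ; Y_1, Y_2) = I(X_1 ; Y_1) + I(X_2 ; Y_2) = 2C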

Building up the Polarizer: the Chain Rule

Now, the capacity is 2C, and every bit contributes C to the capacity. Can we somehow polarize this, such that one bit has more capacity than the other?

Well, note that

I(U_1, U_2 ; Y_1, Y_2) = I(U_1 ; Y_1, Y_2) + I(U_2 ; Y_1, Y_2 | U_1)

This creates two separate channels!

The first channel

The first channel is the one from U_1 to (Y_1, Y_2).

We are trying to infer U_1 from Y_1, Y_2. In this channel, U_2 is just noise.

The second channel

The second channel is a bit trickier, because it’s not obvious what channel the conditional mutual information corresponds to. However, from the chain rule, we get

I(U_2 ; Y_1, Y_2 | U_1) + I(U_2 ; U_1) = I(U_2 ; Y_1, Y_2, U_1)

And we know that the U’s are independent, so I(U_2 ; U_1) = 0. So our second channel is the one from U_2 to (Y_1, Y_2, U_1).

Now, this is just a repetition code! We are trying to infer U_2 from (Y_1, Y_2, U_1): given U_1, both Y_1 and Y_2 carry information about U_2.

Here, we note that it might seem weird that the channel feeds back to predict U_1. In reality, we don’t care about recovering U_1 from this channel, but we do care about the Y_1, Y_2 (with U_1 available as known side information).

Intuition about the two channels

The first channel is between U_1 and (Y_1, Y_2). Intuitively, you make a prediction on U_1, and then you graduate onto the second channel, which is between U_2 and (Y_1, Y_2, U_1). This means that you’ve gone ahead and observed U_1 and let it be part of your decoding input.

The idea of these two channels is that we build up the next channel using the U’s we’ve already decoded.

More generally, it’s possible to show that

I(U_1, \dots, U_n ; Y_1, \dots, Y_n) = \sum_{i=1}^n I(U_i ; Y_1, \dots, Y_n, U_1, \dots, U_{i-1})

which shows the recurrent relationship: the i-th channel takes U_i to (Y_1, \dots, Y_n, U_1, \dots, U_{i-1}).

As an alternative intuition, you see how the less noisy channel relies on the more noisy channel almost as a checksum. Polarization is just the assignment of certain channels as checksums of the other channels.

Polarization Theory

It is possible to show that, as we keep polarizing, the capacities of the synthesized channels converge to either 0 or 1.

Or in other words, we will reach maximum polarization, at which point the proportion of non-zero (i.e. full) capacities will be equal to the capacity C.

The Polarizer on BEC: Case study

We can show polarization really easily on a binary erasure channel. Let’s say the erasure probability is p.

  • The first channel requires both Y_1, Y_2 to decode U_1 because of that XOR. So, if either one is erased, we essentially lose everything. This new channel is therefore also a BEC, with erasure probability 1 - (1-p)^2.
  • The second channel is a repetition code, so it requires both symbols to be erased to render the information irrecoverable. So the new channel is also a BEC, with erasure probability p^2.

For a binary erasure channel, C = 1 - p. You can easily tell that the first channel has capacity C_1 = (1-p)^2, which is smaller than C unless C = 0 or 1 (the trivial cases). And p^2 < p, so 1 - p^2 > 1 - p, and therefore the second channel has capacity C_2 = 1 - p^2, which is higher than C.
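As a concrete check (my own numbers), take p = 1/2:

C = 1 - p = 0.5, \qquad C_1 = (1-p)^2 = 0.25, \qquad C_2 = 1 - p^2 = 0.75

Note that C_1 + C_2 = (1-p)^2 + (1 - p^2) = 2(1-p) = 2C in general, so the polarization step conserves total capacity; it only redistributes it.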

We can continue to extend this result recursively. You can imagine the general rule (see the sketch after this list):

  • Enrich channel: take the old parameter p and create the new p’ = p^2.
  • Deplete channel: take the old parameter p and create the new p’ = 1 - (1-p)^2.
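Here is a small simulation of that recursion (my own sketch, not from the notes): it repeatedly applies the two rules and shows the synthesized erasure probabilities piling up near 0 and 1, with the fraction of nearly noiseless channels approaching C = 1 - p.

```python
import numpy as np

def polarize_bec(p, levels):
    """Recursively apply the BEC polarization rules, returning 2**levels erasure probabilities."""
    eps = np.array([p], dtype=float)
    for _ in range(levels):
        depleted = 1 - (1 - eps) ** 2   # worse ('minus') channels
        enriched = eps ** 2             # better ('plus') channels
        eps = np.concatenate([depleted, enriched])
    return eps

p = 0.4
eps = polarize_bec(p, levels=16)      # 2^16 synthesized channels
print(np.mean(eps < 1e-3))            # fraction of nearly noiseless channels -> approaches 1 - p
print(np.mean(eps > 1 - 1e-3))        # fraction of nearly useless channels  -> approaches p
print(np.mean(1 - eps))               # average capacity is conserved: exactly 1 - p at every level
```

Because each step conserves total capacity, the average capacity stays exactly 1 - p; polarization only pushes individual channels toward the extremes.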

Polar Codes: The Implementation

General polarization

We just apply the polarization algorithm to the positive W^+ and the negative W^- channels after the initial polarization. This yields W^{+-}, W^{++}, W^{-+}, W^{--}. Keep doing this until there are 2^k channels.

Encoder

We’ve already gotten a hint on how to use the polarizer. We can continue to run the polarizer on its outputs. This creates a tree of channels with different properties. But this is all in theory. How exactly is this done?

Well, to strip it all down, it’s actually just Gu, where G has a special structure.

You can construct things recursively and do the matrix multiplication directly, but this is O(n^2). Can we do better?

Yes! We can compute the matrix multiplication using recursive computation through block matrices. Consider this:

G_n\begin{bmatrix} u \\ v\end{bmatrix} = \begin{bmatrix} G_{n/2}u + G_{n/2}v \\ G_{n/2}v \end{bmatrix}

So this becomes two recursive subproblems of half the size plus a vector addition. The vector addition is O(n), so the total complexity is O(n \log n). But in general, encoding is matrix multiplication.

When you decide to fix some subset of U to 0 (the frozen bits), it’s equivalent to removing those columns of G_n. So the final encoding is just G_n with the columns that you chose to remove deleted.
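Here is a minimal sketch of that recursive encoder (my own code; the frozen positions below are illustrative, not a real construction). Frozen positions are simply fixed to 0 in u before encoding, which is equivalent to dropping those columns of G_n.

```python
import numpy as np

def polar_encode(u):
    """Apply the polar transform x = G_n u over GF(2), where n = len(u) is a power of 2.

    Uses the block recursion G_n [u; v] = [G_{n/2}u + G_{n/2}v ; G_{n/2}v],
    giving O(n log n) work instead of an O(n^2) matrix multiplication."""
    n = len(u)
    if n == 1:
        return u.copy()
    top = polar_encode(u[: n // 2])
    bottom = polar_encode(u[n // 2 :])
    return np.concatenate([(top + bottom) % 2, bottom])

# Example with n = 8: freeze some positions to 0 (hypothetical choice, for illustration only)
# and put the message bits in the remaining positions.
n = 8
frozen = {0, 1, 2, 4}
message_bits = [1, 0, 1, 1]
u = np.zeros(n, dtype=np.uint8)
u[[i for i in range(n) if i not in frozen]] = message_bits
x = polar_encode(u)
print(x)
```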

Decoder

Ultimately, you want to compute p(y^n, u_1^{i-1} | u_i). The base cases (n = 2) are

p(y_1^2, u_1 | u_2) = \frac{1}{2}\, p(y_2 | u_2)\, p(y_1 | u_1 ⊕ u_2)

and

p(y_1^2 | u_1) = \frac{1}{2} \sum_{u_2} p(y_1 | u_1 ⊕ u_2)\, p(y_2 | u_2)

If you look at the way that G is constructed, you can always find the outermost modified W’s that form a single one-step polarized channel. You only need to unpeel a certain subset of these connections, and it reduces to a self-similar problem. The exact mechanics are pretty hairy, but know that it is recursive and runs in O(n^2) time.

Visually, it’s a forward-backward process.
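To ground the base cases, here is a toy successive-cancellation decoder for the n = 2 case (my own sketch; the binary symmetric channel and its crossover probability are assumptions, not from the notes):

```python
p_flip = 0.1  # assumed BSC crossover probability

def w(y, x):
    """BSC transition probability p(y | x)."""
    return 1 - p_flip if y == x else p_flip

def decode_n2(y1, y2):
    """Successive cancellation for the 2-bit polar code: decode u1 first, then u2 given u1."""
    # First channel: p(y1, y2 | u1) = 1/2 * sum_{u2} p(y1 | u1 xor u2) p(y2 | u2)
    p_u1 = [0.5 * sum(w(y1, u1 ^ u2) * w(y2, u2) for u2 in (0, 1)) for u1 in (0, 1)]
    u1_hat = int(p_u1[1] > p_u1[0])
    # Second channel: p(y1, y2, u1 | u2) = 1/2 * p(y2 | u2) p(y1 | u1 xor u2), using the decoded u1
    p_u2 = [0.5 * w(y2, u2) * w(y1, u1_hat ^ u2) for u2 in (0, 1)]
    u2_hat = int(p_u2[1] > p_u2[0])
    return u1_hat, u2_hat

# Noiseless example: u = (1, 0) encodes to x = (u1 xor u2, u2) = (1, 0); decoding recovers (1, 0).
print(decode_n2(1, 0))
```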