Data Compression

Tags: Coding, EE 276

What is a Code?

A code is a mapping from the domain of $X$ to finite-length strings of symbols. As the simplest code, you might give every element of the domain an index $i$ and use that as your code. We write $C(x)$ for the codeword and $l(x)$ for the length of $C(x)$ (it’s a bit confusing, I know). We pick the symbols from an alphabet $\mathcal{D}$ of size $D$. Typically, we use $D = 2$ for binary.

We can extend a code to strings by concatenating codewords: $C(x_1 x_2 \cdots x_n) = C(x_1)C(x_2)\cdots C(x_n)$. This creates a stream of information.

The expected length of a code is just $L = E_{x\sim p}[l(x)] = \sum_x p(x)\,l(x)$.

The rate of an encoder is the number of bits per sample. It is an average quantity, and it is bounded below by the entropy $H(X)$.

Getting a good code (what to look for)

  1. A code must map distinct inputs to distinct outputs (i.e. no two inputs can map to the same output). This is called nonsingular.
  1. A code must be decodable in sequence: the concatenated stream has only one valid parse. This is known as uniquely decodable. Now, this may require you to look at the whole string and work your way backwards. Can we do better?
  1. No codeword is a prefix of another. This is known as a prefix code (or instantaneous code), because each codeword can be decoded the moment it ends.
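To make the three properties concrete, here is a minimal sketch in Python; the toy code and the helper names are made up for illustration:

```python
from itertools import combinations

# A hypothetical binary prefix code for four symbols.
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

def is_nonsingular(code):
    """Nonsingular: no two symbols share a codeword."""
    words = list(code.values())
    return len(set(words)) == len(words)

def is_prefix_free(code):
    """Prefix (instantaneous) code: no codeword is a prefix of another."""
    return all(not w1.startswith(w2) and not w2.startswith(w1)
               for w1, w2 in combinations(code.values(), 2))

print(is_nonsingular(code), is_prefix_free(code))  # True True
```

Note that `{"a": "0", "b": "01"}` is nonsingular but not prefix-free: after reading `0` you can’t yet tell whether you’ve seen `a` or the start of `b`.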

Code philosophy

A code, and compression in general, is a way to take some distribution and turn its output into something maximally entropic (essentially uniform bits), which minimizes the number of bits needed to represent it.

Kraft Inequality

So instantaneous codes are really good, but they have a fundamental limit. In other words, if you make the codewords too short, you will create prefix collisions at a certain point. But at what point?

Kraft Inequality Definition 🐧

The Kraft inequality states that for an alphabet of size $D$ and codeword lengths $l_i$, the following must be true.

$$\sum_i D^{-l_i} \leq 1$$

Conversely, this is also true: if there is a set of lengths that satisfies this inequality, there must exist an instantaneous code with these word lengths.

Basically this is a necessary and sufficient condition for instantaneous codes.
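Both directions can be checked mechanically. Here is a small sketch (the lengths are hypothetical) that computes the Kraft sum and then runs the standard constructive argument: sort the lengths and assign codewords greedily, shortest first:

```python
def kraft_sum(lengths, D=2):
    """Left-hand side of the Kraft inequality."""
    return sum(D ** -l for l in lengths)

def prefix_code_from_lengths(lengths):
    """Build a binary prefix code from lengths satisfying Kraft,
    by assigning codewords in increasing order of length."""
    codewords, next_val, prev_len = [], 0, 0
    for l in sorted(lengths):
        next_val <<= (l - prev_len)           # descend to the new depth
        codewords.append(format(next_val, f"0{l}b"))
        next_val += 1                          # skip this subtree entirely
        prev_len = l
    return codewords

lengths = [1, 2, 3, 3]
print(kraft_sum(lengths))                      # 1.0 (tight)
print(prefix_code_from_lengths(lengths))       # ['0', '10', '110', '111']
```

Lengths like `[1, 1, 2]` give a Kraft sum of $1.25 > 1$: no prefix code exists for them.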

Binary Kraft is Tight in Optimality 🚀

Essentially, if $D = 2$, then all optimal codes have

$$\sum_i 2^{-l_i} = 1$$

This is proved through the code tree. Remember that $2^{-l_i}$ represents the proportion of the tree’s leaves that is “spawned” by codeword $i$. The inequality is strict only if some internal node is missing a child, with neither a codeword leaf nor a subtree in that spot. But then that node has a single child, forming a chain $X \to Y$, so we can “rotate” $Y$ up a level to take $X$’s place, shortening a codeword while staying prefix-free. So an optimal code has no excuse for slack.

Non-binary Kraft may not be tight in optimality

If $D > 2$, then our previous theorem doesn’t hold. It’s pretty simple: a node may have some but not all of its $D$ children filled (typically you want to put as many children as possible, but with the last few codewords you may simply not fill all of the leaves), and the “rotation” trick no longer works once $D > 2$.

Optimal Codes

Given that the Kraft inequality is necessary and sufficient, we can use it as a constraint to find optimal instantaneous codes. In other words, we want to minimize

$$\sum_i p_i l_i$$

subject to

$$\sum_i D^{-l_i} \leq 1$$

The tl;dr is that we want to find the shortest code given that we satisfy a certain property.

Implication of Constraint

So from the Kraft inequality, we established that all optimal codes will have equality in the constraint ($\sum_i D^{-l_i} = 1$), which helps. We can use a Lagrange derivation (see below) or we can do something elegant that tells us a lot. Consider

$$\sum_i p_i l_i - H(X)$$

Now, the cool part is that $q_i = D^{-l_i}$ is a distribution, as it sums to 1. In terms of $q_i$, the length is $l_i = \log (1/q_i)$ (logs taken base $D$). Therefore, we have

$$\sum_i p_i \log \frac{1}{q_i} - \sum_i p_i \log \frac{1}{p_i} = D(p \| q) \geq 0$$

As such we must conclude that $\sum_i p_i l_i \geq H(X)$, and this is a pretty significant point: among the prefix codes, the expected length can never dip below the entropy.
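A quick numeric check of this gap, using a made-up source and a deliberately suboptimal fixed-length code, shows the slack $\sum_i p_i l_i - H(X)$ landing exactly on the KL divergence:

```python
import math

# Hypothetical source distribution and a fixed-length (2-bit) code.
p = [0.5, 0.25, 0.125, 0.125]
lengths = [2, 2, 2, 2]
q = [2 ** -l for l in lengths]        # implied distribution; sums to 1

H = sum(pi * math.log2(1 / pi) for pi in p)                    # entropy
L = sum(pi * li for pi, li in zip(p, lengths))                 # expected length
kl = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))      # D(p || q)

print(H, L, L - H, kl)   # 1.75  2.0  0.25  0.25
```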

Touchback to AEP: with the typical sets, we arrived at the same bound. This is very interesting!
Lagrange derivation (extra)

As you know, we can use the Lagrange multiplier to do constrained optimization:

$$J = \sum_i p_i l_i + \lambda \sum_i D^{-l_i}$$

For every $l_i$ we get

$$\frac{\partial J}{\partial l_i} = p_i - \lambda D^{-l_i} \ln D = 0$$

which gets us the solution

$$D^{-l_i} = \frac{p_i}{\lambda \ln D}$$

which essentially means that you should spend a longer code (smaller $D^{-l_i}$) if the occurrence is more rare ($p_i$ small), which makes sense.

After finding $\lambda$ (summing over $i$ and applying the constraint $\sum_i D^{-l_i} = 1$ forces $\lambda \ln D = 1$), we get that the optimal code length is just

$$l_i = \log \frac{1}{p_i}$$

which is actually super elegant. Because if you plug it back into the expected length, we get back the entropy: $\sum_i p_i \log(1/p_i) = H(X)$!

The optimal coding scheme ⭐

But this previous result also implies something critical: the KL divergence vanishes iff $p = q$, so that is when the gap is tight. But recall the definition of $q$. In this case, $D^{-l_i} = p_i$, which means that $l_i = -\log p_i$. This is an interesting result: it means that the optimal coding scheme assigns a length to each codeword according to its probability in the distribution.

Integer Rounding

The problem is that we can’t guarantee that $-\log p_i$ is an integer. So we have to take the ceiling (rounding down can violate the Kraft constraint). How much does that cost us? Well, here’s a general fact:

$$\sum_i p_i \lceil f(x_i) \rceil < \sum_i p_i f(x_i) + 1$$

This should be pretty obvious why it’s the case: each ceiling adds strictly less than 1, and the $p_i$ sum to 1. Therefore, the optimal expected length satisfies

$$H(X) \leq L^* < H(X) + 1$$
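Here is a small sketch (with hypothetical probabilities) computing these rounded-up lengths $\lceil \log(1/p_i) \rceil$, checking that Kraft still holds after the ceiling, and confirming the expected length lands in $[H, H+1)$:

```python
import math

def shannon_lengths(p):
    """Rounded-up optimal lengths: l_i = ceil(log2(1/p_i))."""
    return [math.ceil(math.log2(1 / pi)) for pi in p]

p = [0.4, 0.3, 0.2, 0.1]              # made-up source distribution
lengths = shannon_lengths(p)           # [2, 2, 3, 4]

kraft = sum(2 ** -l for l in lengths)  # <= 1: a prefix code exists
H = sum(pi * math.log2(1 / pi) for pi in p)
L = sum(pi * li for pi, li in zip(p, lengths))

print(lengths, kraft, H, L)            # L sits between H and H + 1
```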
👉
If you are not dealing with base 2, the entropies are taken with respect to the base of the code alphabet

Supervariables ⭐

This is why encoding multiple streams is good

So this one extra bit can be bad news, especially if $H(X)$ isn’t that large. But here’s the ultimate realization: it doesn’t matter how big $H(X)$ is…the 1 stays the same! So what if we grouped a bunch of RVs $X_1, \ldots, X_n$ together into a supervariable? Then, we have

$$H(X^n) \leq nL^* \leq H(X^n) + 1$$

which, for an i.i.d. source ($H(X^n) = nH(X)$), simplifies down to

$$H(X) \leq L^* < H(X) + \frac{1}{n}$$

So we can get pretty close as long as we start encoding multiple variables together. But this isn’t a perfect solution. By putting chunks of variables together, we grow the alphabet exponentially in $n$, which means that certain algorithms won’t be able to handle the encoding efficiently (especially Huffman).
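A small experiment with a made-up two-symbol source shows the per-symbol cost of coding blocks (using the rounded $\lceil \log(1/p) \rceil$ lengths from above) shrinking toward $H(X)$ as the block length $n$ grows, while the number of blocks grows as $2^n$:

```python
import math
from itertools import product

p = {"a": 0.9, "b": 0.1}              # a very low-entropy source
H = sum(pi * math.log2(1 / pi) for pi in p.values())   # ~0.469 bits

def per_symbol_length(n):
    """Code blocks of n i.i.d. symbols with ceil(log2(1/p)) lengths;
    return expected bits per symbol."""
    L = 0.0
    for block in product(p, repeat=n):           # 2**n blocks
        p_block = math.prod(p[s] for s in block)
        L += p_block * math.ceil(math.log2(1 / p_block))
    return L / n

for n in (1, 2, 4, 8):
    print(n, per_symbol_length(n))    # decreases toward H as n grows
```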

Wrong Codes (extra)

What if you used a coding scheme that is optimal for $q$ when the true distribution is $p$? Well, it turns out that you can bound the expected length with

$$H(p) + D(p \| q) \leq E_p[l(X)] < H(p) + D(p \| q) + 1$$

which is actually pretty cool, as it gives you a pretty tight interval: the penalty for using the wrong code is the KL divergence, up to one bit.
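A quick numeric sketch of this penalty, with a made-up true distribution $p$ and a code built for a uniform $q$:

```python
import math

p = [0.5, 0.25, 0.125, 0.125]         # true source distribution
q = [0.25, 0.25, 0.25, 0.25]          # distribution the code was built for

lengths = [math.ceil(math.log2(1 / qi)) for qi in q]   # lengths optimal for q
L = sum(pi * li for pi, li in zip(p, lengths))         # cost under the truth
H = sum(pi * math.log2(1 / pi) for pi in p)            # H(p)
kl = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

print(L, H + kl)   # L sits in [H(p) + D(p||q), H(p) + D(p||q) + 1)
```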

Kraft Inequality for Uniquely Decodable Codes

So the Kraft inequality was derived for instantaneous codes (the tree structure is baked into the proof), but as it turns out, any uniquely decodable code will also satisfy Kraft (this is McMillan’s theorem). So restricting ourselves to instantaneous codes costs nothing in terms of achievable lengths.