Compression Algorithms

Tags: Coding, EE 276

Huffman Codes

The algorithm is pretty straightforward.

  1. Pick the two least likely outcomes and join them as leaves under a new parent node. One branch gets a 1, the other a 0.
  1. Combine the likelihoods of these outcomes into a super variable and add that back into the pile for consideration.
  1. Keep doing this until you run out of things to code. The tree you create is the Huffman tree, and the resulting code is optimal. (A sketch in code follows below.)
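
Here's a minimal sketch in Python. The dict-based API and names are my own choices; a heap plays the role of the "pile":

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Build a Huffman code from a {symbol: probability} dict.

    A minimal sketch of the steps above: repeatedly merge the two least
    likely nodes, then read the codewords off the finished tree.
    """
    tie = count()  # tiebreaker so the heap never compares nodes directly
    heap = [(p, next(tie), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, a = heapq.heappop(heap)   # least likely outcome
        p1, _, b = heapq.heappop(heap)   # second least likely
        # Merge into a super variable and add it back into the pile.
        heapq.heappush(heap, (p0 + p1, next(tie), (a, b)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: (0-branch, 1-branch)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                            # leaf: an original symbol
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

print(huffman_code({"a": 0.5, "b": 0.25, "c": 0.25}))
# -> {'a': '0', 'b': '10', 'c': '11'} (up to 0/1 relabeling)
```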

Deriving the Huffman code

We can derive this algorithm from first principles, where we just have access to individual probabilities, sorted in decreasing order $p_1 \geq \dots \geq p_m$. The optimal lengths must go the other way: $l^*_1 \leq \dots \leq l^*_m$. Otherwise, you could swap two lengths and get a shorter expected code length.

Furthermore, as we will be dealing with binary codes, the two longest codewords have equal length: $l^*_{m-1} = l^*_m$. If this isn't the case, say $l_m = l_{m-1} + 1$, then the codeword at depth $l_m$ doesn't have a sibling, which means it can be "rotated" up a level to form a shorter tree.

Finally, we know that $\sum_i 2^{-l_i} = 1$ (again, because this is binary: Kraft holds with equality for an optimal code). So it makes sense to start thinking recursively. We can make $x_{m-1}$ and $x_m$ share a common parent node. Call this parent's depth $\hat{l}_{m-1}$, so that $l_{m-1} = l_m = \hat{l}_{m-1} + 1$. We don't know these numbers yet, but we can rewrite our objective:

$$\sum_i p_i l_i = \sum_{i=1}^{m-2} p_i l_i + p_{m-1}(\hat{l}_{m-1}+1) + p_{m}(\hat{l}_{m-1}+1)$$

which we can refactor into

$$\sum_i p_i l_i = \left(\sum_{i=1}^{m-2} p_i l_i + (p_{m-1} + p_m)\hat{l}_{m-1}\right) + (p_{m-1} + p_m)$$

and this just looks like the old objective with one fewer term! (The term outside the parentheses is a constant, so it doesn't affect the minimization.)

The constraint is

$$\sum_{i=1}^{m-2} 2^{-l_i} + 2 \cdot 2^{-(\hat{l}_{m-1} + 1)} = \sum_{i=1}^{m-2} 2^{-l_i} + 2^{-\hat{l}_{m-1}} = 1$$

which again collapses down into the familiar form. So minimizing expected length for $m$ symbols reduces to the same problem on $m-1$ symbols, where the merged super variable has probability $p_{m-1} + p_m$: exactly the Huffman merge step.

Huffman code analysis

The complexity of the Huffman code is dominated by keeping the pile sorted after every merge. It takes $O(n \log n)$ to sort initially, and $O(\log n)$ to insert each merged value (e.g., with a heap). So, overall it's $O(n \log n)$, where $n$ is the size of the alphabet.

Ultimately, Huffman does a good job, but it's not using a lot of information: it only uses the per-character distribution and encodes each character separately. Longer sequences might have better structure to exploit, but because the complexity depends on the alphabet size, we can't just "chunk" together RVs into an alphabet of size $|\chi|^n$ (or else we blow up the complexity).

Arithmetic Codes

Huffman codes are elegant, but they aren't good at encoding long sequences (they assume everything is IID and encode single characters). Arithmetic codes are designed to encode long sequences with linear complexity, although the range of settings they handle is more limited.

The tl;dr is that arithmetic codes turn the sequence into a binary decimal representation, which yields some very nice properties.

Arithmetic Code setup

Let's suppose that you had IID $X_1, \dots, X_n$ sampled from some alphabet of size $|\chi|$. Your goal is to encode them into bits $U_1, U_2, \dots$ as optimally as possible. This means that the $U_1, U_2, \dots$ must be distributed uniformly (or else we could run them through the encoder again and get something even better, etc.).

The goal of an arithmetic code is to find a function $F$ that maps $X_1, \dots, X_n$ (with its source distribution) to a uniform distribution.

Binary Decimal Representation

The trick here is to turn the list of RVs into numbers. How? Well, we can consider binary-decimal expansions: $X_\infty = 0.X_1 X_2 \dots X_n \dots$ and $U_\infty = 0.U_1 U_2 \dots$. We assume that these are infinite-precision expansions (and we'll explain why).

The $X_\infty = 0.X_1 X_2 \dots X_n \dots$ is a random number expressed in base $|\chi|$, distributed according to the source.

The $U_\infty = 0.U_1 U_2 \dots$ is uniformly distributed, because every $U_i$ is a uniform bit (maximal entropy).

Why do we use this setup? Well, there are a few nice properties, which the next sections develop.

The Mapping Function

So you just need to find some function $g$ that maps the $X$ binary-decimal representation to the $U$ binary-decimal representation. To do this, we start with the simplest constraint: $P(g(X) \leq u) = u$, which is precisely the statement that $g(X)$ is uniform. For increasing $g$ this yields $P(X \leq g^{-1}(u)) = u$, and we know that the left-hand side is just $F$, the CDF, evaluated at $g^{-1}(u)$. So we have

$$F(g^{-1}(u)) = u$$

and a natural choice for gg is

$$g = F$$

Is this well-defined? Well, we know that $F$ is monotonically non-decreasing. We also know that every possible sequence of characters has a non-zero probability, which means that $F$ is strictly increasing and therefore invertible. And as we showed in the last section, this $F$ has some really nice properties that make it easy to compute.

Computing the Mapping Function (CDF)

We can compute the cumulative distribution function recursively:

$$A_{n+1} = F(0.x_1 \dots x_{n+1}) = F(0.x_1 \dots x_n) + P(X_1 = x_1, \dots, X_n = x_n,\ X_{n+1} < x_{n+1})$$

The first term you already have from the previous step of the recursion, and the second term is just the probability of a prefix event. This second term can often be factored using the probability structure and computed directly; for an IID source it's $p(x_1) \cdots p(x_n) \cdot P(X_{n+1} < x_{n+1})$.

For the upper bound, we realize that $F(0.x_1 \dots x_n 111\dots)$ is just a marginalization across all continuations $X_{n+1}, X_{n+2}, \dots$, which is equivalent to $F(0.x_1 \dots x_n) + p(x_1, \dots, x_n)$.
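
To make this concrete, here's a minimal sketch for a binary IID (Bernoulli) source, with exact fractions standing in for "infinite precision". The function name and API are my own:

```python
from fractions import Fraction

def interval_bounds(bits, p1):
    """Compute A_n = F(a_n) and B_n = F(b_n) for an IID source with
    P(X_i = 1) = p1, using the recursion above."""
    p1 = Fraction(p1)
    p0 = 1 - p1
    low, width = Fraction(0), Fraction(1)   # A_0 = 0, p(empty prefix) = 1
    for b in bits:
        if b == 1:
            # second term of the recursion: all continuations with
            # X_{n+1} = 0 < 1 = x_{n+1} sit below us
            low += width * p0
        width *= p1 if b == 1 else p0       # width is now p(x_1,...,x_n)
    return low, low + width                 # [A_n, B_n]

print(interval_bounds([1, 0, 1], Fraction(1, 3)))   # -> (22/27, 8/9)
```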

The actual coding

The actual coding is surprisingly easy. (A sketch in code follows the steps below.)

  1. Turn any sequence $x_1, \dots, x_n$ into a number $0.x_1 \dots x_n$ (expressed in base $|\chi|$).
  1. Create the lower bound $a_n = 0.x_1 \dots x_n$ and the upper bound $b_n = 0.x_1 \dots x_n + |\chi|^{-n}$.
  1. Recursively compute $A_n = F(a_n)$ and $B_n = F(b_n)$ using our CDF formula.
  1. On the new number line in $U$ space, consider $[A_n, B_n]$. Take the middle of the bounds, and use the entropy limit $\lceil \log 1/p \rceil$ to decide the number of binary places to send.
    1. The middle of the bounds is arbitrary: any number inside these bounds will map back to the right sequence in $X$ space.
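
A sketch of the encoder, reusing interval_bounds from the previous snippet. One detail I've added: sending one binary place beyond $\lceil \log 1/p \rceil$ guarantees that the truncated midpoint stays inside the interval.

```python
import math
from fractions import Fraction

def arithmetic_encode(bits, p1):
    """Steps 1-4 above for a Bernoulli(p1) source."""
    low, high = interval_bounds(bits, p1)    # [A_n, B_n] in U space
    width = high - low                       # = p(x_1,...,x_n)
    mid = low + width / 2                    # any point inside would do
    k = math.ceil(math.log2(1 / width)) + 1  # enough places to stay inside
    out = []
    for _ in range(k):                       # binary expansion of the midpoint
        mid *= 2
        bit = int(mid >= 1)
        out.append(bit)
        mid -= bit
    return "".join(map(str, out))

print(arithmetic_encode([1, 0, 1], Fraction(1, 3)))  # -> '11011'
```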

The actual decoding

So this is pretty interesting. When you're dealing with an unknown bit, you can split the possibilities into two intervals: $[0.x_1 \dots x_n 0,\ 0.x_1 \dots x_n 0111\dots]$ and $[0.x_1 \dots x_n 1,\ 0.x_1 \dots x_n 1111\dots]$. If you compute this, it's a clean split. And if you know which interval you're in, you know what the next bit is!

The second realization is that these intervals map very cleanly into $U$ space. You should convince yourself that the clean 50/50 split in $X$ space maps to a $(1-p)/p$ split in $U$ space. This makes sense, actually. The algorithm is therefore as follows (a sketch in code comes after the list):

  1. Split the current interval in $X$ space.
  1. Map the intervals to $U$ space.
  1. Look at your current $U$ expansion $0.u_1 \dots u_m$. This pins $U$ down to the interval $[0.u_1 \dots u_m,\ 0.u_1 \dots u_m 111\dots]$. If this interval is firmly inside one of the intervals created in step 2, then you know what bit comes next in the data. Else, read more bits $u$ until you are firmly inside an interval.
  1. Rinse, repeat.
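
And a matching decoder sketch, again for a Bernoulli(p1) source (names and the "recover n bits" API are mine). It tracks both the source interval and the interval pinned down by the bits of $U$ read so far:

```python
from fractions import Fraction

def arithmetic_decode(code, p1, n):
    """Recover n source bits from the code string, per the steps above."""
    p1 = Fraction(p1)
    p0 = 1 - p1
    a, b = Fraction(0), Fraction(1)        # current source interval in U space
    u_lo, u_w = Fraction(0), Fraction(1)   # interval pinned down by read bits
    pos, out = 0, []
    while len(out) < n:
        cut = a + (b - a) * p0             # the (1-p)/p split point
        if u_lo + u_w <= cut:              # firmly in the "next bit is 0" part
            out.append(0)
            b = cut
        elif u_lo >= cut:                  # firmly in the "next bit is 1" part
            out.append(1)
            a = cut
        else:                              # straddling: read one more U bit
            u_w /= 2
            if code[pos] == "1":
                u_lo += u_w
            pos += 1
    return out

print(arithmetic_decode("11011", Fraction(1, 3), 3))  # -> [1, 0, 1]
```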

Lempel-Ziv Coding

The idea is actually stupidly simple. Given a string, figure out how to express the next characters using a substring that has appeared before.

Sliding Window LZ Algorithm

If $W$ is your window size and you've compressed up to position $i-1$, then the next compression move is to find some $i-1-W \leq j \leq i-1$ such that $x_{j+l} = x_{i+l}$ for all $l < k$, where $k$ is as large as possible. You denote this compression with two possible tuples: $(F, P, L)$ is when you specify the start point $P$ and the length of the match $L$; $(F, C)$ is when you can't find anything and need to specify an uncompressed character $C$ (usually needed close to the beginning). The $F$ is a flag bit that tells us which type of tuple we're looking at.

Concretely, this is the algorithm

  1. Start with a sliding window of size $W$ that looks backward.
  1. For the current position, see if you can find a non-trivial matching substring in the previous window. If you can, mark where it starts and the match length $k$.
    1. If you can't, just encode one raw character.
  1. Slide the window forward $k$ steps, so that the newly encoded string is now inside the window.

To decode, you just replay the tuples to reconstruct the string. Because each tuple only references things in the past, decoding is trivial.
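
Here is a minimal sketch of both directions. The brute-force match search is just for clarity; real implementations use hash chains or suffix structures:

```python
def lz77_encode(s, W=16):
    """Sliding-window LZ: emit (1, P, L) for a match starting at P with
    length L, or (0, C) for a raw character when no match exists."""
    i, out = 0, []
    while i < len(s):
        best_j, best_k = -1, 0
        for j in range(max(0, i - W), i):   # candidate starts in the window
            k = 0
            # the match may run past position i: self-overlap is allowed
            while i + k < len(s) and s[j + k] == s[i + k]:
                k += 1
            if k > best_k:
                best_j, best_k = j, k
        if best_k >= 1:
            out.append((1, best_j, best_k))
            i += best_k
        else:
            out.append((0, s[i]))
            i += 1
    return out

def lz77_decode(tuples):
    """Decoding just replays the tuples; they only reference the past."""
    s = []
    for t in tuples:
        if t[0] == 0:
            s.append(t[1])
        else:
            _, j, k = t
            for l in range(k):
                s.append(s[j + l])
    return "".join(s)

code = lz77_encode("abababababb", W=8)
print(code)  # [(0, 'a'), (0, 'b'), (1, 0, 8), (1, 3, 1)]
print(lz77_decode(code) == "abababababb")  # True
```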

Proof of optimality

Here's the general sketch: as $n \to \infty$ (the size of the coding chunk), as long as the window $W$ stays large enough, the probability that a typical chunk appears inside $W$ approaches 1. Why is this important? Well, it means we are arbitrarily likely to find the match we need within $W$.

What exactly is $n$? Well, that's the ideal chunk size we move the window by. It might be that we use a shorter chunk in practice, but the proof shows that a chunk of size $n$ can be found with arbitrarily high probability.

Tree-Structured LZ Algorithm

We parse the source sequence into the shortest strings that have not appeared so far, so we'll be building up longer and longer phrases. Crucially, for every phrase in this parsing, all of its prefixes must have occurred earlier: if you cut off the last character of a phrase, the shortened string must have appeared before, or else we would have stopped parsing there instead.

Therefore, we can send the parsed strings by giving the location of the prefix and the value of the last symbol. More on this later
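
A sketch of the parsing (LZ78-style; the dictionary index stands in for "the location of the prefix"):

```python
def lz78_parse(s):
    """Parse s into the shortest phrases not seen before, emitting
    (prefix_index, last_char) pairs. Index 0 is the empty phrase."""
    dictionary = {"": 0}
    out, phrase = [], ""
    for c in s:
        if phrase + c in dictionary:    # keep extending until the phrase is new
            phrase += c
        else:
            out.append((dictionary[phrase], c))
            dictionary[phrase + c] = len(dictionary)
            phrase = ""
    if phrase:  # leftover phrase at the end (it has been seen before)
        out.append((dictionary[phrase[:-1]], phrase[-1]))
    return out

print(lz78_parse("abaababb"))
# -> [(0, 'a'), (0, 'b'), (1, 'a'), (2, 'a'), (2, 'b')]
```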

Which coding algorithm should I use?

When coding, you need to consider three potentially relevant steps

  1. Learn statistics of the coding data
  1. Construct code
  1. Use the code to encode the data

So let’s compare the three previous methods

All are asymptotically optimal. The difference is where the work happens: Huffman learns the statistics and constructs an explicit code before encoding anything; arithmetic coding also needs the statistics, but folds code construction into the encoding itself; and Lempel-Ziv is universal, learning the statistics on the fly as it encodes.