## 2022-06-25

### Information theory 2: source coding

6.9k words, including equations (~36min)

In the previous post, we saw the basic information theory model:

If we have no noise in the channel, we don't need channel coding. Therefore the above model simplifies to

and the goal is to minimise $$n$$ - that is, minimise the number of symbols we need to send - without needing to worry about being robust to any errors.

Here's one question to get started: imagine we're working with a compression function $$f_e$$ that acts on length-$$n$$ strings (that is, sequences of symbols) over some arbitrary alphabet of size $$A$$ (that is, $$A$$ different types of symbols). Is it possible to build an encoding function $$f_e$$ that compresses every possible input? Clearly not; imagine that it took every length-$$n$$ string to a length-$$m$$ string over the same alphabet, with $$m < n$$. Then we'd have $$A^m$$ available codewords that would need to code for $$A^n > A^m$$ different messages. By the pigeonhole principle, at least one codeword must code for more than one message. But that means that if we see this codeword, we can't be sure what it codes for, so we can't recover the original with certainty.
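The counting argument is easy to see concretely. Here's a small sketch (with a made-up truncation encoder, just for illustration) showing that any map from length-3 binary strings to length-2 binary strings must send two different messages to the same codeword:

```python
from itertools import product

# Pigeonhole check: 8 length-3 binary messages, but only 4 length-2 codewords.
A = ["0", "1"]  # alphabet of size 2
messages = ["".join(s) for s in product(A, repeat=3)]   # 2^3 = 8 messages
codewords = ["".join(s) for s in product(A, repeat=2)]  # 2^2 = 4 codewords

# No matter how f_e is built, its image has at most len(codewords) elements,
# so it cannot be injective on a larger message set.
assert len(messages) > len(codewords)

# Example encoder (hypothetical): truncate to the first two symbols.
f_e = lambda s: s[:2]
encoded = [f_e(m) for m in messages]
assert len(set(encoded)) < len(messages)  # collision: not losslessly decodable
```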

Therefore, we have a choice. Either:

• do lossy compression, where every message shrinks in size but we can't recover information perfectly; or
• do lossless compression, and hope that more messages shrink in size than expand in size.

This is obvious with lossless compression, but it applies to both: to do either well, you generally need a probability model of what your data looks like, or at least something that approximates one.

## Terminology

When we talk about a "code", we just mean something that maps messages (the $$Z$$ in the above diagram) to a sequence of symbols. A code is nonsingular if it associates every message with a unique codeword.

A symbol code is a code where each symbol in the message maps to a codeword, and the code of a message is the concatenation of the codewords of the symbols that it is made of.

A prefix code is a code where no codeword is a prefix of another codeword. Prefix codes are also called instantaneous codes because, when decoding, you can emit a symbol the moment the symbols read so far form a codeword, without needing to look ahead.
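To make the "instantaneous" property concrete, here's a minimal sketch of encoding and decoding with a small made-up binary prefix code (the particular codewords are just an example):

```python
# A prefix-free symbol code: no codeword is a prefix of another.
code = {"a": "0", "b": "10", "c": "110", "d": "111"}
decode_table = {cw: sym for sym, cw in code.items()}

def encode(message: str) -> str:
    # Symbol code: concatenate the codewords of the message's symbols.
    return "".join(code[sym] for sym in message)

def decode(bits: str) -> str:
    out, buf = [], ""
    for bit in bits:
        buf += bit
        # Prefix property: as soon as the buffer matches a codeword,
        # the match is unambiguous and we can emit a symbol immediately.
        if buf in decode_table:
            out.append(decode_table[buf])
            buf = ""
    assert buf == "", "bit string ended mid-codeword"
    return "".join(out)

assert decode(encode("badcab")) == "badcab"
```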

## Useful basic results in lossless compression

### Kraft's inequality

Kraft's inequality states that a prefix code with an alphabet of size $$D$$ and codewords of lengths $$l_1, l_2, \ldots, l_n$$ satisfies

$$\sum_{i=1}^n D^{-l_i} \leq 1,$$

and conversely that if a set of lengths $$\{l_1, \ldots, l_n\}$$ satisfies the above inequality, there exists a prefix code with those codeword lengths.

We will only prove the first direction: that all prefix codes satisfy the above inequality. Let $$l = \max_i l_i$$ and consider the tree with branching factor $$D$$ and depth $$l$$. This tree has $$D^l$$ nodes on the bottom level. Each codeword $$x_1 x_2 \ldots x_c$$ corresponds to the node in this tree that you reach by choosing the $$d_i$$th branch on the $$i$$th level, where $$d_i$$ is the index of symbol $$x_i$$ in the alphabet. Since the code is a prefix code, no descendant of a codeword's node can itself be a codeword. We can define our "budget" as the $$D^l$$ nodes on the bottom level of the tree, and define the "cost" of each codeword as the number of bottom-level nodes that are descendants of its node. A codeword of length $$l$$ has cost 1, and in general a codeword of length $$l_i$$ has cost $$D^{l - l_i}$$. From this, and the prefix-freeness (which makes the codewords' subtrees disjoint), we get

$$\sum_i D^{l - l_i} \leq D^l,$$

which becomes the inequality when you divide both sides by $$D^l$$.
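Both directions can be illustrated in a few lines. The sketch below checks the Kraft sum for a set of lengths, and (for the converse) builds a binary prefix code from any lengths satisfying the inequality by assigning codewords greedily in order of increasing length; the helper names are my own, not standard:

```python
def kraft_sum(lengths, D=2):
    """Left-hand side of Kraft's inequality for codeword lengths."""
    return sum(D ** -l for l in lengths)

def build_prefix_code(lengths):
    """Greedily assign binary codewords with the given lengths, shortest first.

    Each assigned codeword 'uses up' its whole subtree, so incrementing the
    counter before the next assignment keeps the code prefix-free.
    """
    assert kraft_sum(lengths) <= 1, "lengths violate Kraft's inequality"
    codewords, next_code, prev_len = [], 0, 0
    for l in sorted(lengths):
        next_code <<= (l - prev_len)           # descend to the new depth
        codewords.append(format(next_code, f"0{l}b"))
        next_code += 1                          # skip this codeword's subtree
        prev_len = l
    return codewords

print(kraft_sum([1, 2, 3, 3]))        # 1/2 + 1/4 + 1/8 + 1/8 = 1.0
print(build_prefix_code([1, 2, 3, 3]))  # ['0', '10', '110', '111']
```

With lengths summing to exactly 1 in the Kraft sum, the code is complete: every leaf of the tree sits under some codeword.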

### Gibbs' inequality

Gibbs' inequality states that for any two probability distributions $$p$$ and $$q$$,

$$-\sum_i p_i \log p_i \leq - \sum_i p_i \log q_i,$$

which can be written using the relative entropy $$D$$ (also known as the KL distance/divergence) as

$$\sum_i p_i \log \frac{p_i}{q_i} = D(p||q) \geq 0.$$

This can be proved using the log sum inequality. The proof is boring.
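In place of the boring proof, a quick numerical sanity check: the KL divergence between random distributions is never negative, and is exactly zero when $$p = q$$ (a spot check, not a proof):

```python
import math
import random

def kl(p, q):
    """Relative entropy D(p||q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def random_dist(n, rng):
    w = [rng.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(0)
for _ in range(1000):
    p, q = random_dist(5, rng), random_dist(5, rng)
    assert kl(p, q) >= 0        # Gibbs' inequality holds every time

assert kl(p, p) == 0.0          # equality exactly when p == q
```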