# 3F7 Information theory Tripos Revision

Posted by Jingbiao on April 10, 2021, Reading time: 3 minutes.

$$\require{amstext} \require{amsmath} \require{amssymb} \require{amsfonts}$$

Full notes can be viewed over here: rendered version and LaTex Source Code

### Entropy

• Calculation of entropy, joint entropy, conditional entropy, differential entropy for continuous pdf, relative entropy.
• Remember the entropy and the mutual information non-negative
• Usage of the chain rule for entropy.
• Entropy is the minimum/limit of the expected length of a uniquely decodable code. “Fundamental Limit of Compression”
• Fact: Conditionally Independent If X,Y conditionally independent given Z, then X and Y only independent if given Z. i.e. $$H(X|Y,Z) = H(X|Z) \qquad H(Y|X,Z) = H(Y|Z)$$ since independent variable does not reduce the uncertainty(entropy)
• Important Inequality: $\ln x \le x-1$

### Fano’s Inequality

• For any estimator $\hat{X}$ such that $X-Y-\hat{X}$, the probability of error $P_e$: $$P_e=Pr(\hat{X}\not = X) \ge \frac{H(X|Y) - 1}{\log(|\mathcal{X}|)}$$
• If we consider the error such that $\hat{X}\not = X$, therefore we can replace the $\mathcal{X}$ by $\mathcal{X} -1$, we would get a stronger version of the Fano’s inequality: $$P_e=Pr(\hat{X}\not = X) \ge \frac{H(X|Y) - 1}{\log(|\mathcal{X}| - 1)}$$

### Practical prefix-free coding

#### Arithmetic Coding

1. Start by dividing the interval $[0,1)$ (this means that 0 is inclusive and 1 is exclusive) using the probability of each symbol
• Assume the symbols with probability as following: $P(a) = 0.45$, $P(b)=0.3$, $p(c)=0.25$
• Then we would divide the interval into 3 groups.
• $[0,0.45)\quad[0.45,0.75)\quad[0.75, 1)$
• Each interval corresponds to each of the symbol
2. After dividing the interval, choose the interval according to the first symbol $X_1$, if symbol a then $[0,0.45)$ etc.
3. Then divide the interval $[0,0.45)$ into sub-intervals with the same probability distribution.
4. Repeat until finishes with the last symbol $X_n$, and keep a record of the final probability interval.
5. Find the largest dyadic interval of the form: $$[\frac{j}{2^l}, \frac{j+1}{2^l})$$ lies inside the final probability interval.
6. The codeword is the binary representation of $j$, and $l$ is the code length.
• The length of the binary codeword is at most $\lceil \log_2 \frac{1}{P(X_1,…,X_n)} \rceil + 1$. The expected code length L is therefore: $$L < H(X) + 2/N$$ where N is the length of the sequence.

#### Arithmetic vs Huffman

1. If the probability is not dyadic (integer powers of 1/2), then arithmetic uses lower number of blocks of symbols, Huffman requires longer blocks of symbols.
2. The complexity of the algorithm grows exponentially with k for Huffman, but linearly with arithmetic coding, therefore arithmetic coding is always more practical compared to Huffman.

### Channel Coding Theorem and Capacity

#### DMC - Discrete Memoryless Channel

• Channel Capacity: $$C = \max_{P_x}I(X;Y)$$
• Define the Channel Rate $R$, $R < C$ to enable probability of error $P_e^{(n)} \rightarrow 0$ as $n \rightarrow 0$.
• Use the channel $n$ times to transmit $k$ bits of message. $R = \frac{k}{n}$ bits/transmission. Total number of messages is $2^k=2^{nR}$
• Construct the codebook of length $2^{nR}$, each codeword generated i.i.d according to the $P_X$ that maximize the mutual information. Label the codebook as $\{ X^n(1),…,X^n(2^{nR})\}. • A message$i\in\{1,2,…,2^{nR}\} is transmitted as $X^n(i)$
• The receiver receives $Y^n$ and uses joint typicality decoding. It decodes the codeword $\hat{W}$ if $X^n(\hat{W})$ and $Y^n$ together is jointly typical w.r.t P_{XY}.
• If no such codeword $\hat{W}$ is found, then error arises.