Stationary and Memoryless Sources: Coding and Entropy

Author

Your Name

Published

January 28, 2025

Introduction

In our previous discussion, we introduced the concept of stationary and memoryless sources. These sources emit symbols (or letters) from a finite input alphabet, denoted by \(A\). Each symbol \(a_i \in A\) is associated with a probability \(P(a_i)\) of being emitted. When we encode these symbols using a code that maps from the input alphabet \(A\) to an output alphabet \(B\), we can define the expected length of the code.

Let \(A = \{a_1, a_2, \dots, a_K\}\) be the input alphabet, and let \(P_i\) be the probability of the symbol \(a_i\). Let \(L_i\) be the length of the codeword assigned to \(a_i\) by a given code. The expected length \(L\) of the code is the average length of the encoding of a single symbol, given by: \[L = \sum_{i=1}^{K} P_i L_i\]
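
As a small illustration (a sketch, not part of the original notes), the expected length is just a probability-weighted average of codeword lengths; the probabilities and lengths below are made-up placeholders:

```python
# A minimal sketch: expected length L = sum_i P_i * L_i.
# The probabilities and codeword lengths below are illustrative placeholders.
probs   = [0.5, 0.25, 0.25]   # P_i for each symbol a_i
lengths = [1, 2, 2]           # L_i assigned by some hypothetical code

L = sum(p * l for p, l in zip(probs, lengths))
print(L)   # 1.5 code symbols per source symbol
```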

Our primary goal is to find codes that minimize this expected length. This minimization is crucial for efficient data compression and is often referred to as minimizing the code rate.

Shannon’s Source Coding Theorem

Shannon’s Source Coding Theorem, also known as the First Shannon Theorem, establishes a fundamental lower bound on the expected length of any uniquely decodable code. It relates the expected length to the entropy of the source.

Let \(A\) be an input alphabet with a probability distribution \(P = \{P_1, P_2, \dots, P_K\}\), where \(P_i\) is the probability of the \(i\)-th symbol. Let \(B\) be an output alphabet with cardinality \(D\), and let \(L_i\) be the length of the codeword assigned to the \(i\)-th symbol by a uniquely decodable code. Then the expected length of the code, given by \(\sum_{i=1}^{K} P_i L_i\), is at least the \(D\)-ary entropy \(H_D(P)\) of the source, defined as: \[H_D(P) = -\sum_{i=1}^{K} P_i \log_D(P_i)\]
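
A short sketch of the entropy computation (not the lecture's own code); the second distribution reappears in a later example:

```python
import math

def entropy(probs, D=2):
    """D-ary entropy H_D(P) = -sum_i P_i * log_D(P_i)."""
    return -sum(p * math.log(p, D) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0  (a fair coin carries exactly one bit per symbol)
print(entropy([2/3, 1/3]))    # ~0.918, the distribution used in a later example
```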

Proof. To prove this theorem, we will use the property of the natural logarithm: \(\ln(x) \leq x - 1\) for all \(x > 0\). This inequality implies that \(-\ln(x) \geq 1 - x\).

We aim to show that the difference between the expected length and the \(D\)-ary entropy is non-negative: \[\sum_{i=1}^{K} P_i L_i - H_D(P) \geq 0\] We can rewrite this difference as: \[\begin{aligned} \sum_{i=1}^{K} P_i L_i - H_D(P) &= \sum_{i=1}^{K} P_i L_i + \sum_{i=1}^{K} P_i \log_D(P_i) \\ &= \sum_{i=1}^{K} P_i (L_i + \log_D(P_i)) \\ &= \sum_{i=1}^{K} P_i (\log_D(D^{L_i}) + \log_D(P_i)) \\ &= \sum_{i=1}^{K} P_i \log_D(D^{L_i} P_i) \\ &= \sum_{i=1}^{K} P_i \frac{\ln(D^{L_i} P_i)}{\ln(D)} \\ &= \frac{1}{\ln(D)} \sum_{i=1}^{K} P_i \ln(D^{L_i} P_i) \\ &= -\frac{1}{\ln(D)} \sum_{i=1}^{K} P_i \ln\left(\frac{1}{D^{L_i} P_i}\right) \end{aligned}\] Now, applying the inequality \(-\ln(x) \geq 1 - x\) with \(x = \frac{1}{D^{L_i} P_i}\), we get: \[\begin{aligned} -\frac{1}{\ln(D)} \sum_{i=1}^{K} P_i \ln\left(\frac{1}{D^{L_i} P_i}\right) &\geq \frac{1}{\ln(D)} \sum_{i=1}^{K} P_i \left(1 - \frac{1}{D^{L_i} P_i}\right) \\ &= \frac{1}{\ln(D)} \left(\sum_{i=1}^{K} P_i - \sum_{i=1}^{K} \frac{1}{D^{L_i}}\right) \\ &= \frac{1}{\ln(D)} \left(1 - \sum_{i=1}^{K} \frac{1}{D^{L_i}}\right) \end{aligned}\] Since the code is uniquely decodable, the Kraft-McMillan inequality states that \(\sum_{i=1}^{K} \frac{1}{D^{L_i}} \leq 1\). Therefore, \(1 - \sum_{i=1}^{K} \frac{1}{D^{L_i}} \geq 0\). Also, since \(D \geq 2\), we have \(\ln(D) > 0\). Thus, the entire expression is greater than or equal to zero, which proves the theorem. ◻

The First Shannon Theorem establishes that the expected length of any uniquely decodable code cannot be less than the entropy of the source. This theorem provides a fundamental limit on data compression.

Note: While the proof is mathematically rigorous, it may not provide an intuitive understanding of the theorem. We will explore this further with examples in subsequent sections.

Original Formulation of Shannon’s Theorem

It’s insightful to examine how Shannon originally presented his theorem in his seminal paper. This formulation emphasizes the relationship between source entropy and channel capacity, which is particularly relevant in communication systems.

Let a source have entropy \(H\) (bits per symbol) and a channel have capacity \(C\) (bits per second). Then it is possible to encode the output of the source in such a way as to transmit at an average rate \(\frac{C}{H} - \epsilon\) symbols per second over the channel where \(\epsilon\) is arbitrarily small. It is not possible to transmit at an average rate greater than \(\frac{C}{H}\).

This original statement highlights the practical implications of the theorem. It states that the maximum achievable transmission rate over a channel is limited by the ratio of the channel’s capacity to the source’s entropy. Specifically:

  • Channel Capacity (\(C\)): Represents the maximum rate at which information can be reliably transmitted over a communication channel, measured in bits per second.

  • Source Entropy (\(H\)): Quantifies the average amount of information per symbol produced by the source, measured in bits per symbol.

  • Transmission Rate: The rate at which symbols are transmitted over the channel, measured in symbols per second.

The theorem asserts that we can achieve a transmission rate arbitrarily close to \(\frac{C}{H}\), but we cannot exceed this rate. This is a crucial result for understanding the limits of data transmission and compression.
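
A toy arithmetic illustration with made-up numbers (not from the lecture): a channel capacity and a source entropy together fix the maximum symbol rate.

```python
# Hypothetical figures only: a channel of capacity C bits/s and a source of
# entropy H bits/symbol allow transmission at rates arbitrarily close to,
# but never above, C / H symbols per second.
C = 1000.0   # hypothetical channel capacity, bits per second
H = 2.5      # hypothetical source entropy, bits per symbol
print(C / H)  # 400.0 symbols per second
```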

Equivalence: While the original formulation focuses on transmission rates and channel capacity, it is fundamentally equivalent to the formulation we presented earlier in [thm:shannon_source_coding], which focuses on the expected length of a code and the source entropy. The two formulations are different perspectives on the same underlying principle. Engineers often prefer the original formulation due to its direct relevance to communication system design.

Examples and Optimality

A code is considered optimal if its expected length is the smallest possible among all uniquely decodable codes for a given source. Ideally, we would like to achieve an expected length that is equal to the entropy of the source. However, this is not always possible, particularly when the lengths of the codewords must be integers.

A code is optimal if there is no other uniquely decodable code with a smaller expected length for the given source.

Consider a source with an input alphabet \(A = \{a, b\}\) and an output alphabet \(B = \{0, 1\}\). The probabilities of the symbols are \(P(a) = \frac{2}{3}\) and \(P(b) = \frac{1}{3}\). The binary entropy of this source is: \[H_2(P) = -\left(\frac{2}{3}\log_2\left(\frac{2}{3}\right) + \frac{1}{3}\log_2\left(\frac{1}{3}\right)\right) \approx 0.918\] Since the entropy is less than 1, it might seem that we could achieve an expected length less than 1. However, the shortest possible codeword length is 1. If we assign a codeword of length 1 to each symbol (e.g., \(a \rightarrow 0\) and \(b \rightarrow 1\)), the expected length of the code is: \[L = \frac{2}{3}(1) + \frac{1}{3}(1) = 1\] This code is optimal because we cannot assign codewords shorter than length 1. However, the expected length of 1 is greater than the entropy of approximately 0.918. This example demonstrates that the entropy is a lower bound that is not always achievable.
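
A quick numerical check of this example (the two-line computation below simply reproduces the figures above):

```python
import math

P = [2/3, 1/3]                           # P(a), P(b)
H = -sum(p * math.log2(p) for p in P)    # binary entropy of the source
L = sum(p * 1 for p in P)                # both codewords have length 1

print(round(H, 3), round(L, 3))          # 0.918 1.0
```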

Even though the entropy provides a lower bound on the expected length, it is not always possible to construct a code that achieves this bound. The constraint that codeword lengths must be integers often prevents us from reaching the entropy.

Shannon Codes

Shannon codes provide a method for constructing uniquely decodable codes by assigning codeword lengths based on the probabilities of the source symbols. The core idea is to make the codeword length for a symbol approximately equal to the negative logarithm of its probability, so that rare symbols receive longer codewords.

For a source with alphabet \(A = \{a_1, a_2, \dots, a_K\}\) and corresponding probabilities \(P_1, P_2, \dots, P_K\), the length \(L_i\) of the codeword for symbol \(a_i\) in a Shannon code is given by: \[L_i = \lceil -\log_D(P_i) \rceil\] where \(D\) is the cardinality of the output alphabet \(B\), and \(\lceil x \rceil\) denotes the ceiling function (the smallest integer greater than or equal to \(x\)).
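
A minimal sketch of this length assignment (not the lecture's own code); the small epsilon is only a guard against floating-point rounding when \(-\log_D(P_i)\) is an exact integer:

```python
import math

def shannon_lengths(probs, D=2):
    """Codeword lengths L_i = ceil(-log_D(P_i)) for a Shannon code.
    The epsilon guards against floating-point rounding when -log_D(P_i)
    is an exact integer (e.g. P_i = 1/32 with D = 2)."""
    return [math.ceil(-math.log(p, D) - 1e-12) for p in probs]

print(shannon_lengths([31/32, 1/32]))   # [1, 5], as in the example below
```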

Given a source with probabilities \(P_1, P_2, \dots, P_K\), there always exists a uniquely decodable code with codeword lengths \(L_i = \lceil -\log_D(P_i) \rceil\). Moreover, this code is a prefix code.

Proof. To prove the existence of a uniquely decodable code with the specified lengths, we need to show that the Kraft-McMillan inequality holds: \[\sum_{i=1}^{K} \frac{1}{D^{L_i}} \leq 1\] Let’s express \(L_i\) as \(L_i = \lceil -\log_D(P_i) \rceil = -\log_D(P_i) + \beta_i\), where \(0 \leq \beta_i < 1\). Substituting this into the inequality, we get: \[\begin{aligned} \sum_{i=1}^{K} \frac{1}{D^{L_i}} &= \sum_{i=1}^{K} \frac{1}{D^{-\log_D(P_i) + \beta_i}} \\ &= \sum_{i=1}^{K} \frac{1}{D^{-\log_D(P_i)} D^{\beta_i}} \\ &= \sum_{i=1}^{K} \frac{1}{\frac{1}{P_i} D^{\beta_i}} \\ &= \sum_{i=1}^{K} \frac{P_i}{D^{\beta_i}} \end{aligned}\] Since \(0 \leq \beta_i < 1\), we have \(D^0 \leq D^{\beta_i} < D^1\), which means \(1 \leq D^{\beta_i} < D\). Therefore, \(\frac{1}{D} < \frac{1}{D^{\beta_i}} \leq 1\). Thus, \[\sum_{i=1}^{K} \frac{P_i}{D^{\beta_i}} \leq \sum_{i=1}^{K} P_i = 1\] The Kraft-McMillan inequality holds, which implies that a prefix code with these lengths exists. ◻

Consider an alphabet \(A = \{a, b\}\) with output alphabet \(B = \{0, 1\}\) (i.e., \(D=2\)). The probabilities are \(P(a) = 1 - \frac{1}{32} = \frac{31}{32}\) and \(P(b) = \frac{1}{32}\). The codeword lengths are: \[\begin{aligned} L_a &= \lceil -\log_2\left(\frac{31}{32}\right) \rceil = \lceil -(-0.046) \rceil = \lceil 0.046 \rceil = 1 \\ L_b &= \lceil -\log_2\left(\frac{1}{32}\right) \rceil = \lceil -(-5) \rceil = \lceil 5 \rceil = 5 \end{aligned}\] A possible Shannon code is \(a \rightarrow 0\) and \(b \rightarrow 10000\).
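
The Kraft-McMillan sum for these lengths can be checked directly (a sketch using the lengths from the example):

```python
# Kraft-McMillan check for the example above (D = 2, lengths 1 and 5).
D = 2
lengths = [1, 5]
kraft_sum = sum(D ** -l for l in lengths)
print(kraft_sum, kraft_sum <= 1)   # 0.53125 True, so a prefix code with these lengths exists
```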

Shannon codes are guaranteed to be uniquely decodable and are prefix codes. However, they may not always be optimal in terms of minimizing the expected length.

Suboptimal Codes

While Shannon codes are guaranteed to be uniquely decodable, they are not always optimal in the sense of achieving the minimum possible expected length. Codes that have an expected length greater than the entropy but within a certain bound are often referred to as suboptimal.

A code is considered suboptimal if its expected length \(L\) satisfies the following condition: \[H_D(P) \leq L < H_D(P) + 1\] where \(H_D(P)\) is the \(D\)-ary entropy of the source, and \(D\) is the cardinality of the output alphabet.

For a Shannon code, the expected length \(L\) satisfies: \[H_D(P) \leq L < H_D(P) + 1\]

Proof. From [def:shannon_code_length], we know that the length of the codeword for symbol \(a_i\) in a Shannon code is given by \(L_i = \lceil -\log_D(P_i) \rceil\). We can express this as \(L_i = -\log_D(P_i) + \beta_i\), where \(0 \leq \beta_i < 1\). The expected length \(L\) of the Shannon code is: \[\begin{aligned} L &= \sum_{i=1}^{K} P_i L_i \\ &= \sum_{i=1}^{K} P_i (-\log_D(P_i) + \beta_i) \\ &= -\sum_{i=1}^{K} P_i \log_D(P_i) + \sum_{i=1}^{K} P_i \beta_i \\ &= H_D(P) + \sum_{i=1}^{K} P_i \beta_i \end{aligned}\] Since \(0 \leq \beta_i < 1\) for all \(i\), and \(\sum_{i=1}^{K} P_i = 1\), we have: \[0 \leq \sum_{i=1}^{K} P_i \beta_i < \sum_{i=1}^{K} P_i = 1\] Therefore, \[H_D(P) \leq L < H_D(P) + 1\] This shows that Shannon codes meet the suboptimality bound: their expected length exceeds the source entropy by less than one code symbol. ◻

The suboptimality of Shannon codes means that while they are easy to construct and guarantee unique decodability, there might exist other codes with a smaller expected length. The bound \(L < H_D(P) + 1\) provides a measure of how far the expected length of a Shannon code can be from the optimal value.
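
A quick numerical check of the bound, under a hypothetical four-symbol distribution (not taken from the lecture):

```python
import math

# Hypothetical distribution; check H_2(P) <= L < H_2(P) + 1 for its Shannon code.
P = [0.45, 0.25, 0.20, 0.10]
lengths = [math.ceil(-math.log2(p) - 1e-12) for p in P]   # Shannon code lengths
H = -sum(p * math.log2(p) for p in P)                     # binary entropy
L = sum(p * l for p, l in zip(P, lengths))                # expected length

print(lengths)                                   # [2, 2, 3, 4]
print(round(H, 3), round(L, 3), H <= L < H + 1)  # 1.815 2.4 True
```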

Shannon-Fano Codes

Shannon-Fano codes offer a top-down approach to constructing variable-length codes. The method involves recursively partitioning the set of symbols based on their probabilities, aiming to create a balanced binary tree.

The Shannon-Fano coding algorithm constructs a code by recursively dividing the set of symbols into two subsets such that the sum of probabilities in each subset is as close as possible. This process is repeated for each subset until each subset contains only one symbol.

Input: alphabet \(A = \{a_1, a_2, \dots, a_K\}\) with probabilities \(P_1, P_2, \dots, P_K\) sorted in decreasing order.

  1. If \(A\) contains a single symbol, assign it the codeword prefix accumulated so far and stop.

  2. Otherwise, find the index \(J\) that minimizes \(\left| \sum_{i=1}^{J} P_i - \sum_{i=J+1}^{K} P_i \right|\).

  3. Set \(A_1 \gets \{a_1, \dots, a_J\}\) with probabilities \(\{P_1, \dots, P_J\}\), and \(A_2 \gets \{a_{J+1}, \dots, a_K\}\) with probabilities \(\{P_{J+1}, \dots, P_K\}\).

  4. Assign ‘0’ as the next prefix symbol for codewords in \(A_1\) and ‘1’ for codewords in \(A_2\).

  5. Recurse on \(A_1\) and \(A_2\).
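
The following is a minimal Python sketch of this procedure, not the lecture's own code; the four-symbol distribution in the usage line is purely illustrative:

```python
def shannon_fano(symbols, probs):
    """Recursive Shannon-Fano sketch. `symbols` and `probs` must already be
    sorted by decreasing probability; returns a dict {symbol: codeword}."""
    if len(symbols) == 1:
        return {symbols[0]: ""}   # the caller prepends the accumulated prefix
    total = sum(probs)
    # Greedy contiguous split: choose J minimizing the probability imbalance.
    best_j, best_diff, acc = 1, float("inf"), 0.0
    for j in range(1, len(symbols)):
        acc += probs[j - 1]
        diff = abs(acc - (total - acc))
        if diff < best_diff:
            best_j, best_diff = j, diff
    code = {}
    for sym, cw in shannon_fano(symbols[:best_j], probs[:best_j]).items():
        code[sym] = "0" + cw      # first subset gets prefix '0'
    for sym, cw in shannon_fano(symbols[best_j:], probs[best_j:]).items():
        code[sym] = "1" + cw      # second subset gets prefix '1'
    return code

# Illustrative usage on a hypothetical four-symbol source:
print(shannon_fano(["a", "b", "c", "d"], [0.4, 0.3, 0.2, 0.1]))
# {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
```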

Consider an alphabet \(A = \{a, b, c, d\}\) with probabilities \(P(a) = \frac{1}{2}\), \(P(b) = \frac{1}{6}\), \(P(c) = \frac{1}{6}\), and \(P(d) = \frac{1}{6}\).

  1. Initial Split: We split the alphabet into \(\{a\}\) and \(\{b, c, d\}\).

  2. Second Split: We split \(\{b, c, d\}\) into \(\{b, c\}\) and \(\{d\}\).

  3. Third Split: We split \(\{b, c\}\) into \(\{b\}\) and \(\{c\}\).

The resulting code is: \(a \rightarrow 0\), \(b \rightarrow 100\), \(c \rightarrow 101\), and \(d \rightarrow 11\).
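
A quick check of this code's expected length, reading the codeword lengths off the codewords above (not part of the original notes):

```python
from fractions import Fraction as F

# Expected length of the code above, with lengths taken from the codewords
# (a -> 0, b -> 100, c -> 101, d -> 11).
probs   = [F(1, 2), F(1, 6), F(1, 6), F(1, 6)]
lengths = [1, 3, 3, 2]
print(sum(p * l for p, l in zip(probs, lengths)))   # 11/6, i.e. about 1.83 bits per symbol
```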

Shannon-Fano code tree for Example [ex:shannon_fano_example1]

Consider an alphabet \(A = \{a, b, c, d, e, f\}\) with probabilities \(P(a) = \frac{40}{100}\), \(P(b) = \frac{18}{100}\), \(P(c) = \frac{15}{100}\), \(P(d) = \frac{13}{100}\), \(P(e) = \frac{10}{100}\), and \(P(f) = \frac{4}{100}\).

  1. Initial Split: The best split is between \(\{a, b\}\) and \(\{c, d, e, f\}\).

  2. Second Split: We split \(\{a, b\}\) into \(\{a\}\) and \(\{b\}\).

  3. Third Split: We split \(\{c, d, e, f\}\) into \(\{c\}\) and \(\{d, e, f\}\).

  4. Fourth Split: We split \(\{d, e, f\}\) into \(\{d\}\) and \(\{e, f\}\).

  5. Fifth Split: We split \(\{e, f\}\) into \(\{e\}\) and \(\{f\}\).

Reading off the splits, the resulting code is \(a \rightarrow 00\), \(b \rightarrow 01\), \(c \rightarrow 10\), \(d \rightarrow 110\), \(e \rightarrow 1110\), and \(f \rightarrow 1111\), with an expected length of 2.41.
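
This value can be verified directly from the codeword lengths implied by the splits (a quick check, not in the original notes):

```python
# Codeword lengths implied by the splits: a, b, c -> 2 bits; d -> 3 bits; e, f -> 4 bits.
probs   = [0.40, 0.18, 0.15, 0.13, 0.10, 0.04]
lengths = [2, 2, 2, 3, 4, 4]
print(round(sum(p * l for p, l in zip(probs, lengths)), 2))   # 2.41
```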

The Shannon-Fano algorithm aims to balance the probabilities in each split. However, finding the best possible partition of the symbols into two subsets of (nearly) equal probability is NP-complete (it is essentially the partition/subset-sum problem). The algorithm therefore settles for a greedy split of the probability-sorted list, which yields a reasonably good, though not necessarily optimal, division.

Huffman Codes

Huffman coding is a bottom-up approach to constructing optimal prefix codes. Unlike Shannon-Fano coding, which starts by splitting the set of symbols, Huffman coding starts by merging the two least probable symbols. This process is repeated until all symbols are merged into a single tree.

The Huffman coding algorithm constructs a code by iteratively merging the two symbols with the lowest probabilities. The merged symbols are treated as a new symbol with a probability equal to the sum of the probabilities of the merged symbols. This process continues until all symbols are merged into a single root node.

Input: alphabet \(A = \{a_1, a_2, \dots, a_K\}\) with probabilities \(P_1, P_2, \dots, P_K\).

  1. Initialize a priority queue \(Q\) with all symbols in \(A\), ordered by their probabilities.

  2. While \(Q\) contains more than one node: remove the two nodes \(x\) and \(y\) with the lowest probabilities from \(Q\); create a new node \(z\) with children \(x\) and \(y\) and probability \(P_z = P_x + P_y\); insert \(z\) into \(Q\).

  3. The remaining node in \(Q\) is the root of the Huffman tree. Assign ‘0’ to the left branch and ‘1’ to the right branch of each node in the tree.

  4. The codeword for each symbol is the sequence of bits on the path from the root to the leaf node representing that symbol.
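
Below is a minimal Python sketch of this procedure using a binary heap; it is not the lecture's own code, and its 0/1 assignments (hence the exact codewords) may differ from the hand-built tree in the example that follows, although the codeword lengths and expected length agree:

```python
import heapq
from itertools import count

def huffman(probs):
    """Minimal binary Huffman sketch: `probs` maps symbol -> probability;
    returns a dict {symbol: codeword}."""
    tick = count()   # tie-breaker so the heap never has to compare dicts
    # Each heap entry: (subtree probability, tie-breaker, {symbol: partial codeword}).
    heap = [(p, next(tick), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p_x, _, x = heapq.heappop(heap)   # two least probable subtrees
        p_y, _, y = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in x.items()}        # left branch gets '0'
        merged.update({s: "1" + c for s, c in y.items()})  # right branch gets '1'
        heapq.heappush(heap, (p_x + p_y, next(tick), merged))
    return heap[0][2]

# Distribution from the example that follows:
source = {"a": 0.40, "b": 0.18, "c": 0.15, "d": 0.13, "e": 0.10, "f": 0.04}
code = huffman(source)
print(code)
print(round(sum(p * len(code[s]) for s, p in source.items()), 2))   # 2.34
```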

Consider the alphabet \(A = \{a, b, c, d, e, f\}\) with probabilities \(P(a) = \frac{40}{100}\), \(P(b) = \frac{18}{100}\), \(P(c) = \frac{15}{100}\), \(P(d) = \frac{13}{100}\), \(P(e) = \frac{10}{100}\), and \(P(f) = \frac{4}{100}\).

  1. Merge e and f: Create a new node with probability \(\frac{14}{100}\).

  2. Merge d and (e,f): Create a new node with probability \(\frac{27}{100}\).

  3. Merge b and c: Create a new node with probability \(\frac{33}{100}\).

  4. Merge (d,e,f) and (b,c): Create a new node with probability \(\frac{60}{100}\).

  5. Merge a and (b,c,d,e,f): Create the root node.

The resulting Huffman code is: \(a \rightarrow 0\), \(b \rightarrow 100\), \(c \rightarrow 101\), \(d \rightarrow 111\), \(e \rightarrow 1101\), and \(f \rightarrow 1100\). The expected length of this code is 2.34.

Huffman code tree for Example [ex:huffman_code_example]

Huffman codes are optimal, meaning that for a given set of probabilities, there is no other uniquely decodable code with a smaller expected length. This optimality makes Huffman coding a popular choice for data compression.

Conclusion

In this lecture, we have explored fundamental concepts in source coding, focusing on the theoretical limits and practical coding techniques. We began with Shannon’s Source Coding Theorem, which establishes a lower bound on the expected length of any uniquely decodable code, linking it to the entropy of the source. We then examined several coding methods, each with its own characteristics and performance:

  • Shannon Codes: These codes are constructed by assigning codeword lengths based on the ceiling of the negative logarithm of the symbol probabilities. They are guaranteed to be uniquely decodable and are prefix codes, but they are suboptimal, with an expected length that can be up to one bit more than the source entropy.

  • Shannon-Fano Codes: These codes use a top-down approach, recursively splitting the set of symbols into subsets with approximately equal probabilities. While they are relatively simple to implement, they are not guaranteed to be optimal.

  • Huffman Codes: These codes use a bottom-up approach, iteratively merging the two least probable symbols. Huffman codes are optimal, meaning they achieve the minimum possible expected length for a given source.

Several key concepts recurred throughout the lecture:

  • Shannon’s Source Coding Theorem: Provides a fundamental lower bound on the expected length of any uniquely decodable code, namely the source entropy.

  • Entropy: A measure of the average information content of a source, which serves as a benchmark for the performance of compression algorithms.

  • Optimal Codes: Codes that achieve the minimum possible expected length for a given source.

  • Suboptimal Codes: Codes whose expected length is greater than the entropy but within a certain bound.

  • Prefix Codes: Codes in which no codeword is a prefix of any other codeword, ensuring unique decodability.

In our next lecture, we will revisit Shannon codes and discuss their asymptotic optimality, which provides a different perspective on their performance. We will also introduce another coding technique that takes a different approach to source coding. This will further expand our understanding of the various methods available for efficient data compression.