Uniquely Decodable Codes and Prefix Codes


Published

January 28, 2025

Introduction

In our previous discussion, we introduced the concepts of uniquely decodable (UD) codes and prefix codes. We consider a code as a function \(\varphi: A \to B^*\), where \(A\) represents the input alphabet with cardinality \(|A| = K\), and \(B\) denotes the output alphabet with cardinality \(|B| = D\). We established the definitions of both UD codes and prefix codes, and we noted a crucial property: every prefix code is uniquely decodable, and moreover decodable without any delay (instantaneously).

A key question arises when we seek to achieve maximal compression of information: is it essential to consider the entire family of uniquely decodable codes, or can we effectively limit our focus to prefix codes? This question forms the central theme of our current investigation.

Notation

To ensure clarity and precision in our discussion, we introduce the following notation:

  • \(\varphi(a)\): Represents the encoding of a letter \(a\) belonging to the input alphabet \(A\).

  • \(L_i\): Denotes the length of the encoding of the \(i\)-th letter, \(a_i\), in the input alphabet \(A\). We also denote this as \(\ell(a_i)\).

This notation will be consistently used throughout the document.

Kraft-McMillan Theorem (Inverse Theorem)

The Kraft-McMillan theorem, also known as the inverse theorem, establishes a fundamental constraint on the lengths of codewords in a uniquely decodable code.

If a code \(\varphi: A \to B^*\) is uniquely decodable, then \[\sum_{i=1}^K D^{-L_i} \le 1,\] where \(D = |B|\) is the cardinality of the output alphabet \(B\), and \(L_i\) is the length of the encoding of the \(i\)-th letter of the input alphabet \(A\).

Interpretation

The theorem’s inequality provides a crucial insight into the relationship between codeword lengths and unique decodability. The term \(D^{-L_i}\) decreases as the length \(L_i\) increases.

  • Large \(L_i\) values: If all \(L_i\) values are large, the terms \(D^{-L_i}\) are small, so the sum \(\sum_{i=1}^K D^{-L_i}\) easily stays at or below 1. This corresponds to using longer codewords, which is less efficient for compression but leaves ample room for unique decodability.

  • Small \(L_i\) values: Conversely, if all \(L_i\) values are very small, the terms \(D^{-L_i}\) become larger, potentially causing the sum to exceed 1. This indicates that the codewords are too short, and the code may not be uniquely decodable.

For instance, consider an input alphabet \(A\) with \(|A| = 20\) and an output alphabet \(B\) with \(|B| = 2\). If we attempt to assign codewords of length 2 to each letter in \(A\), we would only have \(2^2 = 4\) distinct codewords available. This is insufficient to create an injective function \(\varphi\), which is a necessary condition for unique decodability; indeed, the Kraft sum would be \(20 \cdot 2^{-2} = 5 > 1\). The Kraft-McMillan theorem formalizes this intuition: for a code to be uniquely decodable, the codeword lengths must be large enough to satisfy the inequality.

Implications

The Kraft-McMillan theorem has several important implications:

  • Non-UD Code Detection: The theorem can be used to prove that a code is not uniquely decodable. If \(\sum_{i=1}^K D^{-L_i} > 1\), then the code is not UD. This is often easier than directly finding two different messages with the same encoding.

  • Code Characteristics:

    • If \(\sum_{i=1}^K D^{-L_i} > 1\), the code is not UD, indicating that the codewords are too short.

    • If \(\sum_{i=1}^K D^{-L_i}\) is very small (close to 0), the code is likely redundant, meaning that it uses unnecessarily long encodings.

    • If \(\sum_{i=1}^K D^{-L_i}\) is close to 1, the code is likely a good and interesting UD code, in the sense that it is parsimonious: it uses close to the minimum number of output symbols needed for unique decodability.
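These three regimes are easy to probe numerically. Below is a minimal Python sketch (the helper name `kraft_sum` is our own, not a library function) evaluating the Kraft sum for three binary codes, including the 20-letter example above:

```python
def kraft_sum(lengths, D):
    """Kraft sum of a set of codeword lengths over a D-ary output alphabet."""
    return sum(D ** -L for L in lengths)

print(kraft_sum([2] * 20, 2))         # 5.0        -> > 1: cannot be UD
print(kraft_sum([10] * 4, 2))         # 0.00390625 -> << 1: UD but redundant
print(kraft_sum([1, 3, 3, 3, 3], 2))  # 1.0        -> parsimonious UD code
```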

Historical Note

It is noteworthy that the Kraft-McMillan theorem came after Shannon’s groundbreaking 1948 work on information theory: Kraft proved the prefix-code case in 1949, and McMillan extended it to all uniquely decodable codes in 1956. Shannon’s original proof of his source coding theorem was more involved and less transparent; the Kraft-McMillan theorem provides a simpler and more elegant route to it, which highlights its significance in the field.

Proof for Prefix Codes

We will first prove the Kraft-McMillan theorem for the special case where \(\varphi\) is a prefix code. This proof provides an intuitive understanding of the theorem’s underlying principles.

Proof. Without loss of generality, we assume that the input alphabet \(A\) is sorted such that \(L_1 \le L_2 \le \dots \le L_K\). We can represent the code \(\varphi\) as a \(D\)-ary tree, where each branch represents a symbol from the output alphabet \(B\). The root of the tree represents the empty string, and each node at depth \(l\) represents a string of length \(l\).

Consider a complete \(D\)-ary tree of depth \(L = L_K\), where \(L_K\) is the length of the longest codeword. Each codeword corresponds to a node of this tree, and since \(\varphi\) is a prefix code, no codeword’s node is an ancestor of another codeword’s node.

For example, let \(A = \{a, b, c, d\}\) and \(B = \{0, 1, 2\}\), and suppose the code \(\varphi\) is defined as follows:

  • \(\varphi(a) = 0\)

  • \(\varphi(b) = 11\)

  • \(\varphi(c) = 12\)

  • \(\varphi(d) = 211\)

The corresponding tree representation is shown below.

Figure 1: Tree representation of the prefix code above: \(a\) sits at depth 1 (below the edge labeled 0), \(b\) and \(c\) at depth 2 (below the node 1), and \(d\) at depth 3 (below the node 21).
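As a quick sanity check, the prefix property of this code can also be verified mechanically; here is a small Python sketch (the function name `is_prefix_code` is our own):

```python
def is_prefix_code(words):
    """Return True if no codeword is a (proper) prefix of another."""
    return not any(u != v and v.startswith(u) for u in words for v in words)

print(is_prefix_code(["0", "11", "12", "211"]))  # True: valid prefix code
print(is_prefix_code(["0", "01", "12", "211"]))  # False: "0" is a prefix of "01"
```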

Since the codeword for \(a_1\) has length \(L_1\), the path from the root to the node labeled \(a_1\) has length \(L_1\). Because \(\varphi\) is a prefix code, no other codewords can be descendants of the node labeled \(a_1\).

The number of leaves in the subtree rooted at the node labeled \(a_1\) is \(D^{L_K - L_1}\). These leaves are "forbidden" because of the encoding of \(a_1\).

Similarly, for each \(a_i\), the number of forbidden leaves is \(D^{L_K - L_i}\). The subtrees rooted at the codeword nodes are pairwise disjoint precisely because no codeword is a prefix of another, so the total number of forbidden leaves for the first \(K-1\) letters is: \[\sum_{i=1}^{K-1} D^{L_K - L_i}.\] The total number of leaves in the complete \(D\)-ary tree of depth \(L_K\) is \(D^{L_K}\). Since the codeword of the last letter \(a_K\) occupies at least one leaf that is not forbidden, we have: \[D^{L_K} - \sum_{i=1}^{K-1} D^{L_K - L_i} \ge 1.\] Dividing both sides by \(D^{L_K}\), we get: \[1 - \sum_{i=1}^{K-1} D^{-L_i} \ge D^{-L_K}.\] Rearranging the terms, we obtain: \[\sum_{i=1}^K D^{-L_i} \le 1.\] ◻
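The leaf count can be checked concretely on the ternary example above (lengths 1, 2, 2, 3 with \(D = 3\)); a short Python sketch:

```python
# Leaves of the complete ternary tree of depth L_K = 3, and how many are
# "forbidden" by the codewords for a, b, c (lengths 1, 2, 2):
D, lengths = 3, [1, 2, 2, 3]          # sorted codeword lengths
L_K = lengths[-1]
forbidden = sum(D ** (L_K - L) for L in lengths[:-1])
print(D ** L_K, forbidden, D ** L_K - forbidden)  # 27 15 12 -> >= 1 leaf left for d
```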

Direct Theorem

The direct theorem provides a converse to the Kraft-McMillan theorem, showing that the inequality is not only a necessary condition but also a sufficient condition for the existence of a prefix code.

Let \(A\) be an input alphabet with \(|A| = K\) and \(B\) be an output alphabet with \(|B| = D\). Given a sequence of natural numbers \(L_1, L_2, \dots, L_K\) such that \(\sum_{i=1}^K D^{-L_i} \le 1\), there exists a prefix code \(\varphi: A \to B^*\) such that the lengths of the encodings are exactly \(L_1, L_2, \dots, L_K\).

Proof. The proof is constructive and follows the same logic as the proof of the inverse theorem for prefix codes. We construct the prefix code by assigning codewords to letters in increasing order of their lengths.

We start with a complete \(D\)-ary tree of depth \(L_K\). We choose a path of length \(L_1\) and label the corresponding node with \(a_1\). This removes \(D^{L_K - L_1}\) leaves from further consideration.

We then choose the next available node at depth \(L_2\) and label it with \(a_2\), and we continue this process until all letters are assigned codewords. The process never gets stuck: after the first \(j < K\) letters have been placed, the number of depth-\(L_{j+1}\) nodes blocked by earlier codewords is \(\sum_{i=1}^{j} D^{L_{j+1} - L_i} < D^{L_{j+1}}\), because \(\sum_{i=1}^{j} D^{-L_i} < \sum_{i=1}^{K} D^{-L_i} \le 1\). Since every letter is placed at a node lying below no earlier codeword, no codeword is a prefix of another. ◻

Let \(A = \{a, b, c, d, e\}\) and \(B = \{0, 1\}\). Consider the lengths \(L_1 = 1, L_2 = 3, L_3 = 3, L_4 = 3, L_5 = 3\). We have: \[\sum_{i=1}^5 2^{-L_i} = \frac{1}{2} + \frac{1}{8} + \frac{1}{8} + \frac{1}{8} + \frac{1}{8} = 1.\] Since the inequality is satisfied, we can construct a prefix code with these lengths:

  • \(\varphi(a) = 0\)

  • \(\varphi(b) = 100\)

  • \(\varphi(c) = 101\)

  • \(\varphi(d) = 110\)

  • \(\varphi(e) = 111\)
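The constructive proof translates directly into an algorithm: sort the lengths, walk down the tree, and always take the leftmost node that is not below an already-assigned codeword. A Python sketch along those lines (all function names are our own) reproduces the code above:

```python
def kraft_sum(lengths, D):
    """Kraft sum of a set of codeword lengths over a D-ary output alphabet."""
    return sum(D ** -L for L in lengths)

def to_base(n, D, width):
    """Write n in base D, left-padded with zeros to `width` digits."""
    digits = []
    for _ in range(width):
        n, r = divmod(n, D)
        digits.append(str(r))
    return "".join(reversed(digits))

def construct_prefix_code(lengths, D=2):
    """Build a D-ary prefix code with the given codeword lengths by
    assigning, in increasing order of length, the leftmost free node."""
    if kraft_sum(lengths, D) > 1:
        raise ValueError("Kraft inequality violated: no such prefix code")
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    code = [None] * len(lengths)
    value, prev_len = 0, 0  # index of the next free node at depth prev_len
    for i in order:
        L = lengths[i]
        value *= D ** (L - prev_len)  # descend to depth L along 0-labeled edges
        code[i] = to_base(value, D, L)
        value += 1                    # step to the next free node at this depth
        prev_len = L
    return code

print(construct_prefix_code([1, 3, 3, 3, 3]))
# ['0', '100', '101', '110', '111'] -- exactly the code listed above
```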

General Proof of the Inverse Theorem

Now, we present the general proof of the Kraft-McMillan theorem, which applies to all uniquely decodable codes, not just prefix codes.

Proof. Without loss of generality, assume that the input alphabet \(A\) is sorted such that \(L_1 \le L_2 \le \dots \le L_K\).

Let \(N_{n,h}\) be the number of strings of length \(n\) over the input alphabet \(A\) that have an encoding of length \(h\). The total number of strings of length \(n\) over \(A\) is \(K^n\).

Since the code is uniquely decodable, the number of strings of length \(n\) over \(A\) with an encoding of length \(h\) cannot exceed the number of strings of length \(h\) over \(B\), which is \(D^h\). Thus, \(N_{n,h} \le D^h\).

Consider the sum \(\sum_{i=1}^K D^{-L_i}\). We want to show that this sum is less than or equal to 1. Instead, we will study the sum raised to the power of \(n\): \[\left( \sum_{i=1}^K D^{-L_i} \right)^n.\] When we expand this expression, we get a sum of terms of the form \(D^{-(n_1 L_1 + n_2 L_2 + \dots + n_K L_K)}\), where \(n_1 + n_2 + \dots + n_K = n\) and \(n_i\) counts how many times the letter \(a_i\) is chosen. Each term arises from one choice of factor per position, that is, from one string of length \(n\) over \(A\).

For example, the term \(D^{-nL_1}\) corresponds to the encoding of the string \(a_1 a_1 \dots a_1\) (\(n\) times). The term \(D^{-((n-1)L_1 + L_2)}\) corresponds to the encoding of the string \(a_1 a_1 \dots a_1 a_2\) (\(n-1\) times \(a_1\) followed by \(a_2\)).

The exponent \(n_1 L_1 + n_2 L_2 + \dots + n_K L_K\) represents the length of the encoding of a particular string of length \(n\) over \(A\).

Let’s consider a specific term \(D^{-h}\) in the expanded sum. This term will appear multiple times, once for each string of length \(n\) over \(A\) whose encoding has length \(h\). The number of such strings is \(N_{n,h}\).

Therefore, we can rewrite the expanded sum as: \[\left( \sum_{i=1}^K D^{-L_i} \right)^n = \sum_{h=1}^{n L_K} N_{n,h} D^{-h}.\] Using the fact that \(N_{n,h} \le D^h\), we have: \[\sum_{h=1}^{n L_K} N_{n,h} D^{-h} \le \sum_{h=1}^{n L_K} D^h D^{-h} = \sum_{h=1}^{n L_K} 1 = n L_K.\] Thus, \[\left( \sum_{i=1}^K D^{-L_i} \right)^n \le n L_K.\] If \(\sum_{i=1}^K D^{-L_i} > 1\), then the left-hand side would grow exponentially with \(n\), while the right-hand side grows linearly. This is a contradiction, so we must have \(\sum_{i=1}^K D^{-L_i} \le 1\). ◻
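Both halves of this argument are easy to probe numerically. The sketch below (plain Python, names ours) first brute-forces the identity \(\left(\sum_i D^{-L_i}\right)^n = \sum_h N_{n,h} D^{-h}\) for a small set of lengths, then shows how the \(n\)-th power outruns the linear bound \(n L_K\) when the Kraft sum exceeds 1:

```python
from itertools import product

# 1) Brute-force the identity for lengths 1, 2, 2, 3 over a ternary alphabet:
lengths, D, n = [1, 2, 2, 3], 3, 4
lhs = sum(D ** -L for L in lengths) ** n
rhs = sum(D ** -sum(combo) for combo in product(lengths, repeat=n))
print(lhs, rhs)  # both equal (16/27)**4: one term per length-n string over A

# 2) Exponential vs. linear: three binary codewords of length 1 give
#    Kraft sum 1.5 > 1, so (1.5)**n quickly exceeds n * L_K = n.
s, L_K = 1.5, 1
for n in (1, 10, 50):
    print(n, s ** n, n * L_K)  # 1.5**50 is about 6.4e8, far above 50
```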

Shannon’s First Theorem

Having established the fundamental properties of uniquely decodable codes, we now introduce the concept of a probability distribution over the input alphabet and its implications for code efficiency.

Let the alphabet \(A\) have an associated probability distribution \(P = (p_1, p_2, \dots, p_K)\), where \(p_i\) represents the probability of the \(i\)-th letter \(a_i \in A\). This distribution reflects the likelihood of each letter appearing in a message generated by a source.

Given a code \(\varphi: A \to B^*\), the expected length (or code rate) of the code is defined as: \[E[\ell(X)] = \sum_{i=1}^K p_i L_i,\] where \(L_i\) is the length of the codeword assigned to the \(i\)-th letter \(a_i\).

The expected length \(E[\ell(X)]\) represents the average length of the encoding of a single letter, taking the probabilities of the letters into account: it measures the average number of symbols from the output alphabet \(B\) required to encode one letter from the input alphabet \(A\).

If a code \(\varphi: A \to B^*\) is uniquely decodable, then its expected length is bounded below by the Shannon entropy: \[E[\ell(X)] \ge H_D(X) = -\sum_{i=1}^K p_i \log_D p_i,\] where \(H_D(X)\) is the Shannon entropy of the source with respect to the output alphabet size \(D\).

Shannon’s First Theorem establishes a fundamental lower bound on the expected length of any uniquely decodable code. It states that the average length of the codewords cannot be less than the Shannon entropy of the source. This theorem highlights the importance of entropy as a measure of the inherent compressibility of a source.
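As a preview, the bound is easy to check numerically for the five-letter prefix code constructed earlier; here is a small Python sketch (the helper names are our own):

```python
import math

def entropy(p, D=2):
    """Shannon entropy H_D(X) = -sum p_i log_D p_i of a distribution."""
    return -sum(pi * math.log(pi, D) for pi in p if pi > 0)

def expected_length(p, lengths):
    """Expected codeword length E[l(X)] = sum p_i * L_i."""
    return sum(pi * L for pi, L in zip(p, lengths))

p = [0.5, 0.125, 0.125, 0.125, 0.125]
lengths = [1, 3, 3, 3, 3]             # the binary prefix code built earlier
print(entropy(p))                     # 2.0
print(expected_length(p, lengths))    # 2.0 -> bound met with equality
```

Equality holds here because \(p_i = 2^{-L_i}\) for every letter; in general, the bound is tight exactly when the codeword lengths match the distribution in this way.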

We will delve into the proof of this theorem in our next discussion.