Information Theory and Coding

Author

Your Name

Published

January 28, 2025

Introduction

This document provides a comprehensive summary of the lecture notes on information theory. The core topics covered include:

  • The classic 12-coin problem and its solution strategy.

  • The fundamental concept of entropy and its application in quantifying information.

  • The notion of mutual information and its role in understanding relationships between random variables.

  • The principles of source coding, including uniquely decodable and prefix codes.

These topics are essential for understanding the theoretical underpinnings of data compression and information transmission.

The 12-Coin Problem

The 12-coin problem is a classic puzzle in information theory. It involves identifying a counterfeit coin among 12 coins using a balance scale. The counterfeit coin is known to have a different weight than the genuine coins, and the goal is to determine both the identity of the counterfeit coin and whether it is heavier or lighter, using the minimum number of weighings.

Initial Possibilities

Initially, there are 24 possible scenarios, as each of the 12 coins could be either lighter or heavier than the genuine coins. These possibilities can be listed as follows:

  • Coin 1 is lighter (1 less), Coin 1 is heavier (1 more)

  • Coin 2 is lighter (2 less), Coin 2 is heavier (2 more)

  • … (and so on through coin 11)

  • Coin 12 is lighter (12 less), Coin 12 is heavier (12 more)

First Weighing

The first step involves comparing a group of coins against another group. A common strategy is to compare coins 1, 2, 3, and 4 against coins 5, 6, 7, and 8. This comparison yields three possible outcomes:

  1. Left Side Lighter: The group of coins (1, 2, 3, 4) is lighter than the group (5, 6, 7, 8).

  2. Equal Weight: Both groups of coins have the same weight.

  3. Right Side Lighter: The group of coins (5, 6, 7, 8) is lighter than the group (1, 2, 3, 4).

Case: Equal Weight in the First Weighing

If the first weighing results in equal weight, it indicates that the counterfeit coin is among the remaining coins: 9, 10, 11, and 12. This reduces the number of possibilities to 8:

  • Coin 9 is lighter (9 less), Coin 9 is heavier (9 more)

  • Coin 10 is lighter (10 less), Coin 10 is heavier (10 more)

  • Coin 11 is lighter (11 less), Coin 11 is heavier (11 more)

  • Coin 12 is lighter (12 less), Coin 12 is heavier (12 more)

Second Weighing

In the second weighing, we compare coins 9, 10, and 11 against three known genuine coins (e.g., coins 1, 2, and 3). This again results in three possible outcomes:

  1. Left Side Lighter: The group of coins (9, 10, 11) is lighter than the group (1, 2, 3).

  2. Equal Weight: Both groups of coins have the same weight.

  3. Right Side Lighter: The group of coins (1, 2, 3) is lighter than the group (9, 10, 11).

Case: Equal Weight in the Second Weighing

If the second weighing also results in equal weight, it means that the counterfeit coin is coin 12. However, we still need to determine whether it is heavier or lighter. This leaves us with two possibilities:

  • Coin 12 is lighter (12 less)

  • Coin 12 is heavier (12 more)

Third Weighing

To determine whether coin 12 is heavier or lighter, we compare it against a known genuine coin (e.g., coin 1). This final weighing has two possible outcomes:

  1. Coin 12 Lighter: Coin 12 is lighter than coin 1.

  2. Coin 12 Heavier: Coin 12 is heavier than coin 1.

This completes the identification of the counterfeit coin and its weight discrepancy.
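
The branch just traced can be checked mechanically: each weighing outcome eliminates every hypothesis that is inconsistent with it. The following minimal Python sketch (the helper names `hypotheses` and `outcome` are mine, not from the notes) enumerates the 24 possibilities and filters them by the observed outcomes "=", "=", "L" of the three weighings described above.

```python
from itertools import product

# The 24 hypotheses: (coin, 'less') means that coin is lighter, (coin, 'more') heavier.
hypotheses = list(product(range(1, 13), ('less', 'more')))

def outcome(left, right, hyp):
    """Result of weighing the coin sets `left` vs `right` if `hyp` were true:
    '=' balanced, 'L' left side lighter, 'R' right side lighter."""
    coin, direction = hyp
    if coin in left:
        return 'L' if direction == 'less' else 'R'
    if coin in right:
        return 'R' if direction == 'less' else 'L'
    return '='

# The three weighings of the branch traced above, with observed outcomes "= = L".
weighings = [({1, 2, 3, 4}, {5, 6, 7, 8}),
             ({9, 10, 11}, {1, 2, 3}),
             ({12}, {1})]

remaining = hypotheses
for (left, right), observed in zip(weighings, '==L'):
    remaining = [h for h in remaining if outcome(left, right, h) == observed]
    print(len(remaining), 'hypotheses remain')   # 8, then 2, then 1

print(remaining)   # [(12, 'less')]
```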

Entropy and Information Gain

Entropy is a fundamental concept in information theory that quantifies the uncertainty or randomness associated with a probability distribution. It measures the average amount of information needed to describe the outcome of a random variable.

Initial Entropy

At the start of the 12-coin problem, we have 24 equally likely possibilities (each coin could be heavier or lighter). The initial probability distribution \(P\) is uniform: \[P = \left(\frac{1}{24}, \frac{1}{24}, \dots, \frac{1}{24}\right)\] The entropy \(H(P)\) for this distribution is calculated as: \[\begin{aligned} H(P) &= -\sum_{i=1}^{24} P(i) \log_2(P(i)) \\ &= -\sum_{i=1}^{24} \frac{1}{24} \log_2\left(\frac{1}{24}\right) \\ &= \sum_{i=1}^{24} \frac{1}{24} \log_2(24) \\ &= \log_2(24) \end{aligned}\] This represents the initial uncertainty we have about the counterfeit coin.

Entropy After First Weighing

After the first weighing, if the scale balances, we are left with 8 equally likely possibilities (coins 9 through 12, each either heavier or lighter). The new probability distribution \(P'\) is: \[P' = \left(\frac{1}{8}, \frac{1}{8}, \dots, \frac{1}{8}\right)\] The entropy \(H(P')\) for this distribution is: \[\begin{aligned} H(P') &= -\sum_{i=1}^{8} P'(i) \log_2(P'(i)) \\ &= -\sum_{i=1}^{8} \frac{1}{8} \log_2\left(\frac{1}{8}\right) \\ &= \sum_{i=1}^{8} \frac{1}{8} \log_2(8) \\ &= \log_2(8) = 3 \end{aligned}\] The entropy has decreased, indicating a reduction in uncertainty.

Information Gain

The information gained from the first weighing is the difference between the initial entropy and the entropy after the first weighing. This quantifies how much uncertainty was resolved by the measurement:

\[\begin{aligned} \text{Information Gain} &= H(P) - H(P') \\ &= \log_2(24) - \log_2(8) \\ &= \log_2\left(\frac{24}{8}\right) \\ &= \log_2(3) \end{aligned}\] Since \(\log_2(3) \approx 1.585\), the information gain is greater than 1 bit. This is because the first weighing has three possible outcomes, which provides more information than a binary outcome. The information gain represents the amount of uncertainty reduced by the first weighing.
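
These values are easy to verify numerically. The short sketch below (the helper name `entropy` is my own) computes the entropy of a uniform distribution over a given number of outcomes and the information gained when the balanced first weighing reduces 24 possibilities to 8.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

H_initial = entropy([1 / 24] * 24)        # log2(24) ≈ 4.585 bits
H_after_balance = entropy([1 / 8] * 8)    # log2(8)  = 3 bits

print(H_initial - H_after_balance)        # ≈ 1.585 bits = log2(3)
```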

Encoding Procedure

The 12-coin problem can be reframed as an encoding problem, where the goal is to transmit information about the counterfeit coin and its weight discrepancy using a sequence of symbols. This perspective highlights the connection between inference and encoding.

Input and Output Alphabets

The input alphabet \(A\) represents all possible states of the counterfeit coin: \[A = \{1\text{ less}, 1\text{ more}, 2\text{ less}, 2\text{ more}, \dots, 12\text{ less}, 12\text{ more}\}\] This alphabet has 24 elements, corresponding to the 24 initial possibilities. The output alphabet \(B\) consists of the symbols representing the outcomes of each weighing: \[B = \{=, L, R\}\] where:

  • ‘=’ represents that both sides of the balance scale have equal weight.

  • ‘L’ represents that the left side of the balance scale is lighter.

  • ‘R’ represents that the right side of the balance scale is lighter.

This ternary alphabet allows us to encode the results of each weighing.

Encoding Example

Consider the sequence of weighings that leads to the identification of coin 12 as lighter. The corresponding message in the output alphabet would be "= = L". This message is interpreted as follows:

  1. First Weighing: The first weighing resulted in equal weight (‘=’), indicating that the counterfeit coin is among coins 9, 10, 11, and 12.

  2. Second Weighing: The second weighing also resulted in equal weight (‘=’), further narrowing down the possibilities to coin 12.

  3. Third Weighing: The third weighing indicated that coin 12 is lighter (‘L’).

Therefore, the message "= = L" uniquely decodes to "12 less". This demonstrates how the sequence of weighing outcomes can be used to encode the final result.

Optimal Strategy and Encoding

The strategy of comparing four coins against four in the first weighing is crucial for achieving an optimal solution: it minimizes the worst-case number of weighings required to identify the counterfeit coin. The three possible outcomes of the 4-versus-4 weighing split the 24 possibilities into three equal groups of 8, each of which can be resolved with two further weighings. A different split, such as five coins against five, performs worse in the worst case: an unbalanced result would leave 10 possibilities, more than the 9 outcomes that two further weighings can distinguish. The optimal strategy in the 12-coin problem thus corresponds to an optimal encoding in the sense of the shortest possible average message length. This connection is made precise by Shannon’s first theorem, which will be discussed later.
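
This worst-case claim can be made concrete. The sketch below (reusing the same hypothesis-and-outcome idea as the earlier sketch; all names are mine, not from the notes) counts how many of the 24 possibilities remain under each outcome of a 4-versus-4 and a 5-versus-5 first weighing.

```python
from itertools import product

hypotheses = list(product(range(1, 13), ('less', 'more')))   # the 24 possibilities

def outcome(left, right, hyp):
    """'=' balanced, 'L' left side lighter, 'R' right side lighter (as above)."""
    coin, direction = hyp
    if coin in left:
        return 'L' if direction == 'less' else 'R'
    if coin in right:
        return 'R' if direction == 'less' else 'L'
    return '='

for left, right in [({1, 2, 3, 4}, {5, 6, 7, 8}),
                    ({1, 2, 3, 4, 5}, {6, 7, 8, 9, 10})]:
    counts = {o: sum(outcome(left, right, h) == o for h in hypotheses) for o in '=LR'}
    print(len(left), 'vs', len(right), counts)

# 4 vs 4: {'=': 8, 'L': 8, 'R': 8}   -- worst case 8, solvable with two more weighings
# 5 vs 5: {'=': 4, 'L': 10, 'R': 10} -- worst case 10 > 3^2 = 9, not always solvable in two
```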

The Three Doors Problem

The three doors problem, also known as the Monty Hall problem, is a classic probability puzzle that often leads to counterintuitive conclusions. It illustrates how conditional probabilities can affect decision-making.

Problem Setup

The problem involves three doors. Behind one door is a prize, and behind the other two doors are goats (or nothing). The player initially chooses one door. The host, who knows where the prize is, then opens one of the other two doors, always revealing a goat. The player is then given the option to switch their choice to the remaining closed door or stay with their initial choice.

Initial Probabilities

Let \(H_i\) represent the hypothesis that the prize is behind door \(i\), where \(i \in \{1, 2, 3\}\). Initially, the probability of the prize being behind any of the three doors is equal: \[P(H_1) = P(H_2) = P(H_3) = \frac{1}{3}\]

Conditional Probabilities

Let \(D_i\) be the event that the host opens door \(i\). We are interested in comparing the probability of the prize being behind door 1 given that the host opened door 2, \(P(H_1 | D_2)\), with the probability of the prize being behind door 3 given that the host opened door 2, \(P(H_3 | D_2)\). We assume that the player initially chose door 1.

Using Bayes’ theorem, we can express these conditional probabilities as: \[P(H_1 | D_2) = \frac{P(D_2 | H_1) P(H_1)}{P(D_2)}\] \[P(H_3 | D_2) = \frac{P(D_2 | H_3) P(H_3)}{P(D_2)}\] where \(P(D_2)\) is the probability of the host opening door 2, which is the same for both calculations and can be ignored for comparison purposes.

Calculating Probabilities

We need to calculate the conditional probabilities \(P(D_2 | H_1)\) and \(P(D_2 | H_3)\):

  • \(P(D_2 | H_1) = \frac{1}{2}\): If the prize is behind door 1, the host can open either door 2 or door 3 with equal probability, since both doors have goats.

  • \(P(D_2 | H_3) = 1\): If the prize is behind door 3, the host must open door 2, since the player chose door 1 and the host cannot open the door with the prize.

Substituting these values into the Bayes’ theorem equations: \[P(H_1 | D_2) = \frac{\frac{1}{2} \cdot \frac{1}{3}}{P(D_2)}\] \[P(H_3 | D_2) = \frac{1 \cdot \frac{1}{3}}{P(D_2)}\] Since \(P(D_2)\) is the same in both equations, we can compare the numerators: \[P(H_1 | D_2) \propto \frac{1}{2} \cdot \frac{1}{3} = \frac{1}{6}\] \[P(H_3 | D_2) \propto 1 \cdot \frac{1}{3} = \frac{1}{3}\] Normalizing these probabilities, we get: \[P(H_1 | D_2) = \frac{1}{3}\] \[P(H_3 | D_2) = \frac{2}{3}\]
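
The 1/3 versus 2/3 split can also be checked by simulation. The sketch below is a simple Monte Carlo check, not part of the original notes: the player always picks door 1, the host opens a goat door (choosing at random when he has a choice), and we compare the winning frequencies of staying and switching.

```python
import random

def play(switch, trials=100_000):
    """Fraction of games won when the player always starts by picking door 1."""
    wins = 0
    for _ in range(trials):
        prize = random.randint(1, 3)
        choice = 1
        # The host opens a door that is neither the player's choice nor the prize.
        host = random.choice([d for d in (1, 2, 3) if d not in (choice, prize)])
        if switch:
            choice = next(d for d in (1, 2, 3) if d not in (choice, host))
        wins += (choice == prize)
    return wins / trials

print('stay  :', play(switch=False))   # ≈ 1/3
print('switch:', play(switch=True))    # ≈ 2/3
```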

Conclusion

The analysis shows that after the host opens a door revealing a goat, the probability of the prize being behind the remaining closed door (door 3 in this case) is \(\frac{2}{3}\), while the probability of the prize being behind the initially chosen door (door 1) is only \(\frac{1}{3}\). Therefore, it is always better to switch doors. By switching, the player doubles their chances of winning the prize. This counterintuitive result highlights the importance of considering conditional probabilities when making decisions.

Mutual Information

Mutual information is a measure that quantifies the amount of information that one random variable provides about another. It is a crucial concept in information theory, particularly when analyzing the relationships between different variables.

Joint Entropy

The joint entropy of two discrete random variables \(X\) and \(Y\), denoted as \(H(X, Y)\), measures the uncertainty associated with the joint distribution of \(X\) and \(Y\). It is defined as: \[H(X, Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(x, y)\] where \(p(x, y)\) is the joint probability mass function of \(X\) and \(Y\).

Conditional Entropy

The conditional entropy of \(Y\) given \(X\), denoted as \(H(Y|X)\), measures the uncertainty about \(Y\) when the value of \(X\) is known. It is defined as: \[H(Y|X) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(y|x)\] where \(p(y|x)\) is the conditional probability of \(Y\) given \(X\).

Mutual Information Definition

Mutual information \(I(X; Y)\) is defined as the reduction in uncertainty about one random variable due to knowledge of the other. It can be expressed in two equivalent forms:

  1. Using joint and conditional entropies: \[I(X; Y) = H(X, Y) - H(X|Y) - H(Y|X)\]

  2. Using individual and joint entropies: \[I(X; Y) = H(X) + H(Y) - H(X, Y)\]

Both definitions are equivalent and provide different perspectives on the concept of mutual information.
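
The equivalence of the two forms is easy to confirm numerically. The sketch below uses a small, arbitrarily chosen joint distribution (not taken from the notes) together with the chain rule \(H(X|Y) = H(X, Y) - H(Y)\) to evaluate both expressions.

```python
import math

# A small illustrative joint distribution p(x, y) for X, Y in {0, 1};
# the numbers are arbitrary and chosen only for the example.
p_xy = {(0, 0): 0.3, (0, 1): 0.2,
        (1, 0): 0.1, (1, 1): 0.4}

def H(dist):
    """Shannon entropy in bits of a dict of probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Marginal distributions p(x) and p(y).
p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

H_xy, H_x, H_y = H(p_xy), H(p_x), H(p_y)
H_x_given_y = H_xy - H_y          # chain rule: H(X|Y) = H(X, Y) - H(Y)
H_y_given_x = H_xy - H_x          # chain rule: H(Y|X) = H(X, Y) - H(X)

print(H_xy - H_x_given_y - H_y_given_x)   # I(X; Y) via form 1
print(H_x + H_y - H_xy)                   # I(X; Y) via form 2, same value ≈ 0.1245 bits
```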

Interpretation

Mutual information measures the amount of information that \(X\) provides about \(Y\) (or vice versa). It quantifies the statistical dependence between the two random variables.

  • Independence: If \(X\) and \(Y\) are independent, then knowing the value of \(X\) does not provide any information about \(Y\), and vice versa. In this case, \(I(X; Y) = 0\).

  • Complete Dependence: If \(X\) and \(Y\) are identical (or related by an invertible, one-to-one function), then knowing the value of \(X\) completely determines the value of \(Y\), and vice versa. In this case, \(I(X; Y) = H(X) = H(Y)\).

In general, the mutual information is non-negative, \(I(X;Y) \geq 0\), and it is symmetric, \(I(X;Y) = I(Y;X)\).

Relation to Divergence

Mutual information can also be expressed in terms of the Kullback-Leibler (KL) divergence, also known as relative entropy. The KL divergence \(D(p||q)\) measures the difference between two probability distributions \(p\) and \(q\). The mutual information between \(X\) and \(Y\) can be written as: \[I(X; Y) = D(p(x, y) || p(x)p(y))\] where \(p(x, y)\) is the joint distribution of \(X\) and \(Y\), and \(p(x)p(y)\) is the product of the marginal distributions of \(X\) and \(Y\). This expression shows that mutual information measures how much the joint distribution deviates from the assumption of independence.
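
Continuing the small numerical example above, the KL form yields the same value. The sketch below is again only illustrative, with the same arbitrary joint distribution and its marginals.

```python
import math

# Same illustrative joint distribution and marginals as in the previous sketch.
p_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}
p_x = {0: 0.5, 1: 0.5}
p_y = {0: 0.4, 1: 0.6}

# D( p(x,y) || p(x)p(y) ) = sum over (x, y) of p(x,y) * log2( p(x,y) / (p(x)p(y)) )
kl = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())
print(kl)   # ≈ 0.1245 bits, the same value as I(X; Y) computed above
```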

Source Codes

Source coding is the process of converting data from a source into a more efficient representation, typically for compression or transmission purposes. This section introduces the fundamental concepts of source codes, their properties, and different types of codes.

Definitions

Let \(A\) be the input alphabet, which is a finite set of symbols, and let \(B\) be the output alphabet, which is also a finite set of symbols. A source code is a function: \[\phi: A^* \to B^*\] where \(A^*\) is the set of all possible strings (sequences) over the input alphabet \(A\), and \(B^*\) is the set of all possible strings over the output alphabet \(B\). The function \(\phi\) maps a message (a string of symbols from \(A^*\)) to a codeword (a string of symbols from \(B^*\)).

Desirable Properties

A good source code should have the following properties:

  • Computability: The encoding function \(\phi\) should be computable, meaning that there exists an algorithm to perform the encoding.

  • Uniquely Decodable: The encoding function \(\phi\) should be injective, meaning that each message in \(A^*\) maps to a unique codeword in \(B^*\). This ensures that the encoded message can be unambiguously decoded back to the original message. Such codes are called uniquely decodable (UD).

  • Decodability: The inverse function (decoding function) should also be computable, allowing the receiver to recover the original message from the encoded message.

  • Efficiency: Ideally, both encoding and decoding should be computable in linear time and constant space, making the process efficient for large messages.

Extension of Codes

In practice, source codes are often defined as a mapping from individual symbols of the input alphabet to strings of the output alphabet: \[f: A \to B^*\] The code is then extended to strings by concatenating the codewords of the individual symbols: \[\phi(a_1 a_2 \dots a_n) = f(a_1) f(a_2) \dots f(a_n)\] where \(a_i \in A\) for \(i = 1, 2, \dots, n\).
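
As a concrete illustration, the sketch below defines a hypothetical binary symbol code \(f\) (the codewords are invented for the example, not taken from the notes) and extends it to strings by concatenation.

```python
# A hypothetical binary symbol code f: A -> B*, used only to illustrate the extension.
f = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}

def phi(message):
    """Extension of f to strings: concatenate the codewords of the individual symbols."""
    return ''.join(f[symbol] for symbol in message)

print(phi('abad'))   # '0' + '10' + '0' + '111' = '0100111'
```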

Examples

Here are some examples of source codes:

  • Morse Code: A variable-length code where each letter is encoded as a sequence of dots and dashes. For example, S is encoded as ‘...’ (000), O is encoded as ‘---’ (111), and E is encoded as ‘.’ (0).

  • ASCII: A block code where each character is encoded as a sequence of 7 bits. For example, the capital letter A is encoded as ‘1000001’.

Uniquely Decodable Codes

A code is uniquely decodable (UD) if its extension \(\phi\) is injective. This means that for any two distinct messages \(m_1, m_2 \in A^*\), their encoded messages \(\phi(m_1)\) and \(\phi(m_2)\) are also distinct. This property is essential for unambiguous decoding.

Prefix Codes

A code is prefix-free (or simply a prefix code) if no codeword is a prefix of any other codeword. Formally, for all distinct \(a_1, a_2 \in A\), \(f(a_1)\) is not a prefix of \(f(a_2)\). Prefix codes are also known as instantaneous codes because they can be decoded immediately, without looking ahead at subsequent symbols.

Theorem

Prefix codes are uniquely decodable and can be decoded without delay. This means that the decoding process can be performed in linear time and constant space, making them highly efficient.
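
Below is a minimal sketch of both properties, using the same hypothetical code as above: `is_prefix_free` checks the definition directly, and `decode` reads the encoded string left to right, emitting a symbol as soon as a complete codeword has been seen, with no look-ahead.

```python
# The same hypothetical prefix code as in the previous sketch.
f = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}

def is_prefix_free(code):
    """True if no codeword is a prefix of a different codeword."""
    words = list(code.values())
    return not any(u != v and v.startswith(u) for u in words for v in words)

def decode(encoded, code):
    """Instantaneous decoding: emit a symbol as soon as a full codeword is read."""
    inverse = {w: s for s, w in code.items()}
    message, buffer = [], ''
    for bit in encoded:
        buffer += bit
        if buffer in inverse:          # greedy matching is valid only for prefix codes
            message.append(inverse[buffer])
            buffer = ''
    return ''.join(message)

print(is_prefix_free(f))      # True
print(decode('0100111', f))   # 'abad'
```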

Optimal Compression

Prefix codes are sufficient for achieving optimal compression. This means that for any uniquely decodable code, there exists a prefix code that achieves the same or better compression. This result is significant because it simplifies the search for optimal codes. The relationship between prefix codes and optimal compression will be further explored in the context of Shannon’s first theorem.