Information Theory and Learning Algorithms - Lecture Notes

Author

Your Name

Published

January 28, 2025

Administrative Notes

  • The timetable and class schedule are subject to change during the first weeks. Please check regularly.

  • There is a possibility of moving Professor Pupis’s lecture earlier to create a longer lunch break; this will be discussed with him.

  • Lecture time slots can be changed if room availability and instructor schedule allow. Please communicate any scheduling conflicts.

  • Students enrolled in the International Master degree in Artificial Intelligence and Cyber Security will be officially enrolled in Udine for the first semester. They will be added to the Klagenfurt system in January.

  • Klagenfurt enrollment is for three semesters with a small fee (less than 20 euros per semester).

  • Students should complete their Klagenfurt courses within these three semesters to avoid additional tuition fees.

  • A meeting will be held in January to explain Erasmus grant rules for the semester in Klagenfurt.

  • Erasmus grants will be awarded based on the student’s performance in the first semester. Focus on achieving high grades and credits this semester.

Entropy: Review and Properties

Recap from Last Time

Last time, we introduced the concept of entropy and discussed some of its properties. The primary reference for this material is Chapter 2 of "Information Theory, Inference, and Learning Algorithms" by David J.C. MacKay [@MacKay2003].

Definition of Entropy

The entropy of a discrete random variable \(X\) with probability distribution \(P = (p_1, p_2, \dots, p_K)\) is defined as: \[H(P) = - \sum_{i=1}^K p_i \log_2 p_i = \sum_{i=1}^K p_i \log_2 \frac{1}{p_i}\]
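As a concrete illustration, here is a minimal Python sketch of this definition; the helper name `entropy` and the use of NumPy are my own choices, not something fixed by the lecture. It is reused in the later numerical checks.

```python
import numpy as np

def entropy(p):
    """Shannon entropy, in bits, of a discrete distribution p = (p_1, ..., p_K)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                        # convention: 0 * log2(0) = 0
    return float(-np.sum(p * np.log2(p)))

# A biased coin carries less than 1 bit of information per flip.
print(entropy([0.9, 0.1]))              # approximately 0.469 bits
print(entropy([0.5, 0.5]))              # exactly 1 bit
```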

Property 3: Entropy of Uniform Distribution

If \(P\) is a uniform distribution over \(K\) outcomes, then \(H(P) = \log_2 K\).

Proof. Since \(P\) is uniform, \(p_i = \frac{1}{K}\) for all \(i\). Thus, \[H(P) = \sum_{i=1}^K \frac{1}{K} \log_2 K = \frac{1}{K} \log_2 K \sum_{i=1}^K 1 = \log_2 K\] ◻

To prove that the entropy of any distribution \(P\) is less than or equal to the entropy of the uniform distribution, we will use Jensen’s inequality.

Jensen’s Inequality

Consider the function \(f(x) = -x \ln x\) for \(x > 0\).

The function \(f(x) = -x \ln x\) is concave for \(x > 0\).

Proof. The first derivative is \(f'(x) = -(\ln x + 1)\), and the second derivative is \(f''(x) = -\frac{1}{x}\). Since \(f''(x) < 0\) for \(x > 0\), \(f(x)\) is concave. ◻

For a concave function \(f(x)\), and any set of points \(x_i\) and non-negative weights \(\lambda_i\) such that \(\sum_i \lambda_i = 1\), we have \[f\left(\sum_i \lambda_i x_i\right) \geq \sum_i \lambda_i f(x_i)\]

Proof (sketch). For two points \(x_1\) and \(x_2\), the convex combination \(\lambda x_1 + (1-\lambda) x_2\) lies on the line segment connecting \(x_1\) and \(x_2\). By concavity, the function value at this combination is greater than or equal to the corresponding combination of the function values. The general case follows by induction on the number of points. ◻

Proof of \(H(P) \leq \log_2 K\)

For any probability distribution \(P\) over \(K\) outcomes, \(H(P) \leq \log_2 K\).

Proof. Let \(P = (p_1, p_2, \dots, p_K)\) be any probability distribution. Using Jensen’s inequality with \(f(x) = -x \log_2 x\), \(\lambda_i = \frac{1}{K}\), and \(x_i = p_i\), we get: \[\sum_{i=1}^K \frac{1}{K} (-p_i \log_2 p_i) \leq -\left(\sum_{i=1}^K \frac{1}{K} p_i\right) \log_2 \left(\sum_{i=1}^K \frac{1}{K} p_i\right)\] Since \(\sum_{i=1}^K p_i = 1\), the argument on the right-hand side equals \(\frac{1}{K}\), so \[\frac{1}{K} H(P) \leq -\frac{1}{K} \log_2 \frac{1}{K} = \frac{1}{K} \log_2 K\] Thus, \(H(P) \leq \log_2 K\). ◻
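A quick numerical check of this bound, reusing the `entropy` helper sketched earlier; sampling random distributions from a Dirichlet is just an illustrative choice:

```python
import numpy as np

K = 8
rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.dirichlet(np.ones(K))       # a random distribution over K outcomes
    assert entropy(p) <= np.log2(K) + 1e-12
print("H(P) <= log2 K =", np.log2(K), "held for all sampled distributions")
```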

Example: Flipping n Fair Coins

If we flip \(n\) fair coins, there are \(2^n\) possible outcomes, each with probability \(\frac{1}{2^n}\). The entropy is: \[H(Q) = \sum_{i=1}^{2^n} \frac{1}{2^n} \log_2 2^n = \log_2 2^n = n\] This means we receive \(n\) bits of information.

Joint Entropy

Definition

The joint entropy of two random variables \(X\) and \(Y\) with joint distribution \(P(X,Y)\) is: \[H(X,Y) = -\sum_{i,j} P(x_i, y_j) \log_2 P(x_i, y_j)\]

Property: Independent Events

If \(X\) and \(Y\) are independent, then \(H(X,Y) = H(X) + H(Y)\).

Proof. Since \(X\) and \(Y\) are independent, \(P(x_i, y_j) = P(x_i)P(y_j)\). \[H(X,Y) = -\sum_{i,j} P(x_i)P(y_j) \log_2 (P(x_i)P(y_j))\] Using the logarithm property, \[H(X,Y) = -\sum_{i,j} P(x_i)P(y_j) (\log_2 P(x_i) + \log_2 P(y_j))\] Splitting the sum, \[H(X,Y) = -\sum_i P(x_i) \log_2 P(x_i) \sum_j P(y_j) - \sum_j P(y_j) \log_2 P(y_j) \sum_i P(x_i)\] Since \(\sum_j P(y_j) = 1\) and \(\sum_i P(x_i) = 1\), \[H(X,Y) = -\sum_i P(x_i) \log_2 P(x_i) - \sum_j P(y_j) \log_2 P(y_j) = H(X) + H(Y)\] ◻
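A hedged numerical illustration of this property, reusing the `entropy` helper from above; the `joint_entropy` function and the example marginals are my own:

```python
import numpy as np

def joint_entropy(P_xy):
    """H(X, Y) in bits for a joint distribution given as a 2-D array."""
    p = np.asarray(P_xy, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

p_x = np.array([0.7, 0.3])
p_y = np.array([0.5, 0.25, 0.25])
P_xy = np.outer(p_x, p_y)               # independence: P(x, y) = P(x) P(y)

print(np.isclose(joint_entropy(P_xy), entropy(p_x) + entropy(p_y)))   # True
```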

General Case

In general, whether or not \(X\) and \(Y\) are independent, \[H(X,Y) \leq H(X) + H(Y)\] with equality if and only if \(X\) and \(Y\) are independent. Moreover, the chain rule \[H(X,Y) = H(X) + H(Y|X)\] holds, where \(H(Y|X)\) is the conditional entropy of \(Y\) given \(X\).
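To make these relations concrete, here is a hedged numerical check on a small correlated joint distribution, reusing the helpers above; the specific numbers are an illustrative assumption:

```python
import numpy as np

# A correlated joint distribution over X in {0, 1} and Y in {0, 1}.
P_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

p_x = P_xy.sum(axis=1)                  # marginal distribution of X
p_y = P_xy.sum(axis=0)                  # marginal distribution of Y

H_xy = joint_entropy(P_xy)
H_x, H_y = entropy(p_x), entropy(p_y)

# Conditional entropy H(Y|X) = sum_x P(x) H(Y | X = x).
H_y_given_x = sum(p_x[i] * entropy(P_xy[i] / p_x[i]) for i in range(len(p_x)))

print(H_xy <= H_x + H_y)                        # subadditivity holds: True
print(np.isclose(H_xy, H_x + H_y_given_x))      # chain rule holds: True
```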

Decomposability of Entropy

Let \(P = (p_1, p_2, \dots, p_K)\) be a probability distribution and fix an index \(1 \leq m < K\). Writing \(p = \sum_{i=1}^m p_i\), the entropy decomposes as: \[H(P) = H(p, 1-p) + p\, H\left(\frac{p_1}{p}, \dots, \frac{p_m}{p}\right) + (1-p)\, H\left(\frac{p_{m+1}}{1-p}, \dots, \frac{p_K}{1-p}\right)\]

The decomposability property allows us to compute the entropy of a distribution by splitting it into parts and considering the entropy of each part and the entropy of the split itself. This is useful in inference processes to understand how much information is gained at each step of a multi-stage experiment.

Proof (sketch). The result follows by algebraic manipulation using the definition of entropy and properties of logarithms: rewrite the sum over all probabilities as a sum over the two groups, then factor out \(p\) and \(1-p\), which yields the entropy of the split plus the entropies of the two within-group (conditional) distributions. ◻
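A hedged numerical illustration of the decomposition, reusing the `entropy` helper; the distribution and the split point m are my own example:

```python
import numpy as np

P = np.array([0.1, 0.2, 0.3, 0.4])
m = 2                                   # split off the first m outcomes
p = P[:m].sum()                         # p = p_1 + ... + p_m

lhs = entropy(P)
rhs = (entropy([p, 1 - p])              # entropy of the split itself
       + p * entropy(P[:m] / p)         # entropy within the first group
       + (1 - p) * entropy(P[m:] / (1 - p)))   # entropy within the second group

print(np.isclose(lhs, rhs))             # True
```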

Kullback-Leibler Divergence

Definition

The Kullback-Leibler divergence (or simply divergence) between two distributions \(P\) and \(Q\) over the same set of events is: \[D_{KL}(P||Q) = \sum_i p_i \log_2 \frac{p_i}{q_i}\] with the convention that terms with \(p_i = 0\) contribute zero, and assuming \(q_i > 0\) whenever \(p_i > 0\).

Interpretation

\(D_{KL}(P||Q)\) measures the extra bits needed to encode data from distribution \(P\) when using a code optimized for distribution \(Q\). It quantifies how much one distribution differs from another.

Properties

  • \(D_{KL}(P||Q) \geq 0\)

  • \(D_{KL}(P||Q) = 0\) if and only if \(P = Q\)
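A minimal Python sketch of the divergence and of these two properties; the helper name and the example distributions are my own choices:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) in bits; assumes q_i > 0 wherever p_i > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                        # convention: terms with p_i = 0 contribute zero
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

P = [0.5, 0.3, 0.2]
Q = [1/3, 1/3, 1/3]
print(kl_divergence(P, Q))              # > 0, since P differs from Q
print(kl_divergence(P, P))              # 0.0, since the distributions coincide
```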

Relation to Machine Learning

Divergence is related to maximum a posteriori (MAP) and minimum description length (MDL) learning methods.

MAP is a method of estimating the parameters of a statistical model by maximizing the posterior probability of the parameters given the observed data.

MDL is a principle for model selection that favors models that minimize the sum of the description length of the model and the description length of the data given the model.

When no prior information is available, MAP reduces to maximum likelihood estimation, which in turn corresponds to minimizing the divergence from the empirical distribution of the data to the model distribution.
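As a brief sketch of why this holds (a standard argument, stated here under the assumption of i.i.d. categorical data, not something worked out in the lecture): let \(\hat{P}\) be the empirical distribution of observations \(x_1, \dots, x_N\) and let \(Q_\theta\) be a model distribution with parameters \(\theta\). Then \[D_{KL}(\hat{P}||Q_\theta) = \sum_i \hat{p}_i \log_2 \frac{\hat{p}_i}{q_\theta(i)} = -H(\hat{P}) - \sum_i \hat{p}_i \log_2 q_\theta(i) = -H(\hat{P}) - \frac{1}{N} \sum_{n=1}^N \log_2 q_\theta(x_n)\] Since \(H(\hat{P})\) does not depend on \(\theta\), minimizing \(D_{KL}(\hat{P}||Q_\theta)\) over \(\theta\) is equivalent to maximizing the log-likelihood of the data.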

The connection between KL divergence and maximum likelihood estimation is a deep topic that requires further study in statistical learning theory.

Source Coding Theorem

Motivating Example: The Twelve Coins Problem

We have 12 coins, one of which is counterfeit and has a different weight (either heavier or lighter). We want to find the counterfeit coin and determine whether it is heavier or lighter using a two-pan balance.

Possible Outcomes

There are 24 possible outcomes (12 coins × 2 possibilities for each coin: heavier or lighter).

Minimum Number of Weighings

With each weighing, we have 3 possible outcomes (left pan heavier, right pan heavier, or equal). To identify one out of 24 possibilities, we therefore need at least 3 weighings, since \(3^2 = 9 < 24 \leq 27 = 3^3\).
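A tiny Python check of this counting argument (the variable names are mine):

```python
import math

outcomes = 24                                   # 12 coins x {heavier, lighter}
weighings = math.ceil(math.log(outcomes, 3))    # each weighing has 3 possible results
print(weighings)                                # 3, since 3**2 = 9 < 24 <= 27 = 3**3
```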

Strategy

Weighing 1: Place coins 1, 2, 3, 4 on the left pan and coins 5, 6, 7, 8 on the right pan.

Case 1: The pans balance. The counterfeit coin is among 9, 10, 11, 12, and coins 1 through 8 are known good. Weighing 2: Place coins 9, 10, 11 on the left pan and three known good coins (e.g., 1, 2, 3) on the right pan.

  • Case 1.1: The pans balance. Coin 12 is counterfeit. Weighing 3: Compare coin 12 with a known good coin to determine whether it is heavier or lighter.

  • Case 1.2: The left pan is heavier. One of coins 9, 10, 11 is heavier. Weighing 3: Compare coin 9 with coin 10; the heavier of the two is counterfeit, and if they balance, coin 11 is the heavier counterfeit.

  • Case 1.3: The right pan is heavier. One of coins 9, 10, 11 is lighter. Weighing 3: Compare coin 9 with coin 10; the lighter of the two is counterfeit, and if they balance, coin 11 is the lighter counterfeit.

Case 2: The left pan is heavier. The counterfeit coin is among 1, 2, 3, 4 (heavier) or 5, 6, 7, 8 (lighter), and coins 9 through 12 are known good. Weighing 2: Place coins 1, 2, 5 on the left pan and coins 3, 6, and a known good coin on the right pan.

  • Case 2.1: The left pan is heavier. Either coin 1 or 2 is heavier, or coin 6 is lighter. Weighing 3: Compare coin 1 with coin 2; the heavier of the two is counterfeit, and if they balance, coin 6 is the lighter counterfeit.

  • Case 2.2: The right pan is heavier. Either coin 3 is heavier or coin 5 is lighter. Weighing 3: Compare coin 3 with a known good coin; if it is heavier, coin 3 is counterfeit, otherwise coin 5 is the lighter counterfeit.

  • Case 2.3: The pans balance. Either coin 4 is heavier, or coin 7 or 8 is lighter. Weighing 3: Compare coin 7 with coin 8; the lighter of the two is counterfeit, and if they balance, coin 4 is the heavier counterfeit.

Case 3: The right pan is heavier. The counterfeit coin is among 1, 2, 3, 4 (lighter) or 5, 6, 7, 8 (heavier), and coins 9 through 12 are known good. Weighing 2: Place coins 1, 2, 5 on the left pan and coins 3, 6, and a known good coin on the right pan.

  • Case 3.1: The left pan is heavier. Either coin 5 is heavier or coin 3 is lighter. Weighing 3: Compare coin 3 with a known good coin; if it is lighter, coin 3 is counterfeit, otherwise coin 5 is the heavier counterfeit.

  • Case 3.2: The right pan is heavier. Either coin 1 or 2 is lighter, or coin 6 is heavier. Weighing 3: Compare coin 1 with coin 2; the lighter of the two is counterfeit, and if they balance, coin 6 is the heavier counterfeit.

  • Case 3.3: The pans balance. Either coin 4 is lighter, or coin 7 or 8 is heavier. Weighing 3: Compare coin 7 with coin 8; the heavier of the two is counterfeit, and if they balance, coin 4 is the lighter counterfeit.

The case analysis above provides a complete strategy for identifying the counterfeit coin and determining whether it is heavier or lighter using a two-pan balance in at most three weighings.
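To sanity-check that the case analysis above resolves all 24 possibilities within three weighings, here is a hedged Python sketch that simulates the decision tree for every (counterfeit coin, heavier/lighter) scenario; the encoding, the helper names, and the use of coin 9 as the known good coin in the single-coin comparisons are my own choices:

```python
def verify_strategy():
    """Check the three-weighing strategy on all 24 (coin, heavier/lighter) cases."""

    def make_weigh(fake, sign):
        # sign = +1 if the fake coin is heavier, -1 if it is lighter.
        def weigh(left, right):
            diff = sum(sign for c in left if c == fake) - sum(sign for c in right if c == fake)
            return 'L' if diff > 0 else 'R' if diff < 0 else 'E'
        return weigh

    def solve(weigh):
        # Returns (fake coin, +1 for heavier / -1 for lighter), following the strategy above.
        r1 = weigh([1, 2, 3, 4], [5, 6, 7, 8])
        if r1 == 'E':                                # Case 1: fake among 9..12
            r2 = weigh([9, 10, 11], [1, 2, 3])
            if r2 == 'E':                            # Case 1.1: fake is 12
                return (12, +1) if weigh([12], [1]) == 'L' else (12, -1)
            s = +1 if r2 == 'L' else -1              # Cases 1.2 / 1.3: one of 9, 10, 11 with sign s
            r3 = weigh([9], [10])
            if r3 == 'E':
                return (11, s)
            return (9, s) if (r3 == 'L') == (s == +1) else (10, s)
        s = +1 if r1 == 'L' else -1                  # Cases 2 / 3: 1..4 suspect with sign s, 5..8 with -s
        r2 = weigh([1, 2, 5], [3, 6, 9])
        if r2 == 'E':                                # Cases 2.3 / 3.3: candidates 4 (s), 7 and 8 (-s)
            r3 = weigh([7], [8])
            if r3 == 'E':
                return (4, s)
            return (7, -s) if (r3 == 'L') == (-s == +1) else (8, -s)
        if r2 == r1:                                 # Cases 2.1 / 3.2: candidates 1, 2 (s), 6 (-s)
            r3 = weigh([1], [2])
            if r3 == 'E':
                return (6, -s)
            return (1, s) if (r3 == 'L') == (s == +1) else (2, s)
        # Cases 2.2 / 3.1: candidates 3 (sign s), 5 (sign -s); coin 9 is known good here.
        return (3, s) if weigh([3], [9]) != 'E' else (5, -s)

    for fake in range(1, 13):
        for sign in (+1, -1):
            assert solve(make_weigh(fake, sign)) == (fake, sign)
    print("All 24 cases identified in at most 3 weighings.")

verify_strategy()
```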

Example: The Three Doors Problem

Read the statement of the problem about the three doors (Exercise 3.8 in [@MacKay2003]). You choose a door, the host opens another door revealing no prize, and you are given the option to switch. What is the optimal strategy?

This problem will be discussed in the next lecture.

References

David J.C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.