Introduction to Information Theory and Computational Complexity

Author

Your Name

Published

January 28, 2025

Course Overview

Resources and Materials

  • The primary course materials, including detailed references and a tentative program, are accessible via the designated e-learning platform.

  • Students have the option to request access to recordings of the previous year’s lectures, which may offer improved visibility of the blackboard content.

  • Students who wish to join last year’s team in order to access those recordings should send an enrollment request to the instructor.

Course Structure

The course is structured into three interconnected parts, each building upon the previous one:

  1. Information Theory: This section provides a foundational understanding of information, its quantification, and its application to diverse areas.

    • Core Concept: Introduction to Shannon entropy and its role in quantifying information.

    • Applications: Exploration of the practical use of Shannon entropy in two key domains:

      • Data compression, including code construction for efficient storage and transmission.

      • Inference processes within machine learning, emphasizing entropy reduction.

    • Key Concepts: The fundamental concepts to be covered include:

      • Entropy as a measure of uncertainty.

      • Coding theory for information representation.

      • Data compression algorithms and techniques.

      • The principles of inference and learning in the context of entropy.

    • Primary References: The theoretical basis will primarily rely on:

      • “Elements of Information Theory” by Cover and Thomas. A classic textbook that establishes fundamental concepts in Information Theory.

      • “Information Theory, Inference, and Learning Algorithms” by David J.C. MacKay. Focuses on applications in inference and learning algorithms.

      • “A Mathematical Theory of Communication” by Claude E. Shannon. The original paper where the concept of Shannon entropy was introduced.

    • Additional References: Further references for specific coding methods such as Lempel-Ziv and other relevant topics.

  2. Computational Complexity: This part shifts focus to the inherent difficulty of computational problems, exploring which of them admit efficient algorithms.

    • Core concept: An overview of classical complexity theory, its central questions, and important classes of problems.

    • Main Reference: “Computational Complexity” by Christos H. Papadimitriou. We will focus on specific relevant chapters.

    • Additional Materials: Supplementary materials will also be provided for further insight into computational complexity.

  3. Graph Algorithms and Data Structures (Time Permitting): This section focuses on techniques for solving graph problems, together with the data structures used to implement them.

    • Focus Areas: Should time permit, this will introduce key concepts related to:

      • Reachability as a basic property for graphs.

      • Representations and simulations.

      • Reductions between graph problems and the main challenges posed by these kinds of structures.

Teaching Approach

  • The primary teaching method is through traditional blackboard lectures, offering a step-by-step approach to building an intuition for new concepts.

  • Given that blackboard visibility may not be optimal in this lecture room, the instructor advises accessing the prior year’s recordings, where the board visibility is better.

  • The initial lecture provides a high-level and gentle introduction to information theory concepts with examples of messages. The remaining part of the course will focus on mathematical details, theorems and proofs.

Introduction to Shannon Entropy

Motivation: The Concept of Information

  • The information content of a message is fundamentally linked to its surprise effect. This perspective departs from simply observing the meaning of the content to also measure how unexpected or informative a message might be.

  • The lower the probability of a message or event, the higher the surprise and, consequently, the information it carries. Intuitively, rare or unlikely messages tend to give new insights, while common or more probable messages are less informative.

  • In mathematical terms, the notion of a message’s surprise or novelty can be expressed using its probability. The probability gives us an understanding of the importance or informational value a message might carry.

Example: Fatima and Mr. Muller

  • Consider the example where Mr. Muller says, “Fatima, it’s raining”. The actual amount of information conveyed by the message is subject to the context in which the message was emitted, namely the geographical location or particular circumstances.

  • In a desert environment, where rainfall is scarce and unexpected, the message would induce a large degree of surprise. Thus, it carries a large quantity of information: the occurrence has a low probability, so its observation is extremely relevant.

  • Conversely, in a city such as Udine, where rain is a frequent and typical occurrence, the message yields a smaller surprise: its high probability makes its occurrence less informative, reducing its relevance for Fatima.

Shannon’s Problem: Message Compression

  • The primary goal behind Shannon’s initial study was to achieve message compression for efficient transmission over communication channels, like communication wires.

  • To minimize the length of the code required to represent frequent messages, high-probability messages should be encoded using shorter code representations. In other terms, if a message is sent more frequently, it makes sense to assign it a short code, so that the average code length is also small.

  • Conversely, messages with a lower probability of occurrence can be represented using longer codes: since they rarely occur, they contribute little to the overall average code length.

  • Example: In the English alphabet, commonly occurring letters such as vowels (‘a’, ‘e’, ‘i’, ‘o’, ‘u’) should be assigned short code strings, while letters like ‘z’ and ‘v’, which have a low probability of occurrence, can be assigned longer codes to achieve better overall compression when composing texts (see the sketch below).
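A minimal Python sketch of this idea, using made-up letter frequencies (the probabilities below are purely illustrative): it builds a Huffman code, the frequency-based prefix-free code mentioned later among the key actors of information theory, and shows that frequent symbols end up with short codewords.

```python
import heapq

def huffman_code_lengths(probs):
    """Build a Huffman code for a {symbol: probability} dict and
    return the codeword length (in bits) assigned to each symbol."""
    # Each heap entry: (probability, unique tie-breaker, list of (symbol, depth)).
    heap = [(p, i, [(s, 0)]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)
        p2, _, group2 = heapq.heappop(heap)
        # Merging the two least probable subtrees adds one bit to every symbol inside them.
        merged = [(s, d + 1) for s, d in group1 + group2]
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return dict(heap[0][2])

# Illustrative (made-up) letter probabilities: frequent symbols get shorter codes.
probs = {"e": 0.40, "a": 0.30, "t": 0.15, "z": 0.10, "q": 0.05}
lengths = huffman_code_lengths(probs)
avg = sum(probs[s] * lengths[s] for s in probs)
print(lengths)                                   # {'e': 1, 'a': 2, 't': 3, 'q': 4, 'z': 4}
print(f"average length: {avg:.2f} bits/symbol")  # 2.05, vs. 3 bits for a fixed-length code
```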

Maximum Entropy

  • Message compression is not possible when all messages in an alphabet are equiprobable (they all have equal probability). Intuitively, because all symbols occur with the same frequency, they must all be coded with binary strings of the same length, so no savings over a fixed-length code are possible (a worked illustration follows this list).

  • This equiprobable case, in which compression becomes impossible because no symbol is used more often than any other, is known as maximum entropy. In other words, if every possible symbol is equiprobable, we have maximal uncertainty about messages from such a source.

  • Maximum entropy in information sources therefore corresponds to the uniform distribution over messages, representing a system that offers no predictive bias of events. This happens when symbols occur with the same probability within a text.
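As a small worked illustration (with numbers chosen only for concreteness): for an alphabet of \(K = 8\) equiprobable symbols, each symbol has probability \(\frac{1}{8}\), and a fixed-length code already uses \[\lceil \log_2 8 \rceil = 3 \text{ bits per symbol}.\] Anticipating the entropy formula introduced below, \[H(P) = -\sum_{i=1}^{8} \frac{1}{8} \log_2 \frac{1}{8} = \log_2 8 = 3 \text{ bits},\] so variable-length coding offers no savings over the fixed-length code in the equiprobable case.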

Actors in Information Theory

  • Huffman: Famous for creating the Huffman coding algorithm, an important compression technique that builds prefix-free codes based on symbol frequencies, achieving optimal average code length among such codes.

  • Lempel-Ziv: The creators of the Lempel-Ziv compression algorithms, which underlie the popular ZIP format commonly employed to compress different file types on personal computers and other devices.

  • Kolmogorov: One of the founders of the modern (axiomatic) theory of probability. He also introduced Kolmogorov complexity (also referred to as algorithmic complexity), which measures a finite sequence by the length of the shortest description required to define it.

  • Bayes: Originated the theorem underlying Bayesian inference, which provides a principled approach to probabilistic and statistical inference in machine learning and statistics.

Uncertainty and Information

  • In a system characterized by maximum entropy, i.e., a uniform probability over symbols, uncertainty is also maximized: without any additional information or preference among the message symbols, nothing about the message or event can be known or predicted in advance.

  • When an environment is uniformly random (for instance, when the symbols of an alphabet are equiprobable), there is no discernible structure available for analysis, so informative signals cannot be distinguished from noise; this observation has interesting applications when dealing with fake news.

The Library of Babel

  • A fictional short story by Jorge Luis Borges, the Library of Babel, provides an illustration of maximum entropy. It can also serve as a concrete application to analyze this aspect.

  • The library described in this short story contains all possible books of a fixed length that can be formed from a fixed character set. As a result, most of the books look random.

  • This scenario shows how having all possible knowledge available at once carries zero informational value for understanding the truth, because no individual message can stand out as carrying important information.

Entropy and Learning

  • Rule-Based Learning: This learning method involves rote memorization or the understanding of explicit rules, for instance the study of multiplication tables, which the learner remembers perfectly.

  • Example-Based Learning: In contrast, another kind of learning relies on trial and error, driven by rewards and punishments obtained by experimenting with the system: training examples in which good or bad actions bring positive or negative reinforcement.

  • Machine learning processes, particularly neural networks, aim to find, given a set of training examples, the probability distribution governing the samples in the training data set. They do so by adjusting the network parameters (such as the weights), mimicking the trial-and-error behaviour described above in order to reproduce the known results in the training data.

  • In these procedures, different divergence measures, directly connected to Shannon entropy, evaluate the difference between the learnt distribution and the actual or empirical distribution of the samples. Divergence can therefore guide how a machine learns the underlying distribution present in the training samples (see the sketch below).
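As a minimal sketch of one such measure (the distributions below are made up for illustration), the Kullback-Leibler divergence \(D(P \| Q)\) compares the empirical distribution \(P\) of the training samples against a model distribution \(Q\); a smaller divergence means the model is closer to the data.

```python
import math

def kl_divergence(p, q):
    """D(P || Q) = sum_i p_i * log2(p_i / q_i), in bits.
    Measures how far the model distribution q is from the empirical distribution p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical empirical distribution of training labels vs. two candidate models.
empirical = [0.7, 0.2, 0.1]
model_a   = [0.6, 0.3, 0.1]      # close to the data
model_b   = [1/3, 1/3, 1/3]      # uniform, ignores the data

print(kl_divergence(empirical, model_a))  # ~0.04 bits: small divergence
print(kl_divergence(empirical, model_b))  # ~0.43 bits: larger divergence
```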

Limitations of Probability

  • In real-world situations, the perception of a message’s informational content cannot rely only on its probability. Other factors, well beyond statistical considerations, can greatly influence how we perceive an incoming message.

  • The "Messi to Udinese" message illustrates the importance of the degree of plausibility in assessing the novelty of an incoming piece of information, when it goes against prior knowledge. Even with extremely low probabilities, sometimes messages can simply sound fake and discarded in advance.

Quantum World

  • The probability laws of the classical (macroscopic) world cannot be applied directly at the scale of the subatomic realm. Quantum mechanics reshapes the laws of probability, so we should also consider a different measure of informational content.

  • Von Neumann entropy emerges as an extension of Shannon’s measure applicable to a broader variety of circumstances, covering scenarios in quantum physics where quantum probabilities require a different interpretation of message entropy and informational content.

Probability Notions

Basic Definitions

Let \(A = \{A_1, A_2, \dots, A_n\}\) be a finite set of events representing possible outcomes within a particular experiment or a random phenomenon. We will refer to it as a discrete probability space.

A probability distribution \(P\) over \(A\) is a function that assigns to each event \(A_i\) a probability value \(P(A_i)\) such that:

  1. \(P(A_i) \geq 0\) for all \(i\) from \(1\) to \(n\) (each event has a non-negative probability), and

  2. \(\sum_{i=1}^n P(A_i) = 1\) (the sum of probabilities of all events equals one).

Joint Probability

The joint probability of two events \(A_i\) and \(B_j\), denoted as \(P(A_i, B_j)\), describes the probability of both events occurring together (or the simultaneous observation of two events when analyzing joint distributions from a probabilistic point of view).

Marginal Probability

The marginal probability of an event \(A_i\) refers to the probability of \(A_i\) occurring regardless of any specific event within another set B of random outcomes, where \(B = \{B_1, \ldots, B_m\}\). It is obtained by summing up the joint probabilities \(P(A_i, B_j)\) over all possible \(B_j\), and it is formally defined as: \[P(A_i) = \sum_{j} P(A_i, B_j)\]

Conditional Probability

The conditional probability of an event \(A_i\) occurring given that another event \(B_j\) has already occurred is denoted as \(P(A_i | B_j)\) and is computed as follows: \[P(A_i | B_j) = \frac{P(A_i, B_j)}{P(B_j)},\] where \(P(B_j) > 0\). The expression is undefined when \(P(B_j) = 0\).

Bayes’ Theorem

Starting from the definition of conditional probability, we can relate the joint probability of two events \(A_i\) and \(B_j\) in two different ways: \[P(A_i, B_j) = P(B_j) P(A_i | B_j) = P(A_i) P(B_j | A_i)\] Therefore, Bayes’ Theorem establishes a relation for conditional probabilities as: \[P(A_i | B_j) = \frac{P(B_j | A_i) P(A_i)}{P(B_j)}\]

Interpretation of Bayes’ Theorem

Bayes’ Theorem can be interpreted using the concepts of causes and effects:

  • \(P(B_j | A_i)\): This is interpreted as the probability of the cause \(B_j\) given the observation of the effect \(A_i\). It is also known as the a posteriori (or backward) probability, as it uses an available observation to update an initial or prior belief.

  • \(P(A_i | B_j)\): Conversely, this is the probability of observing the effect \(A_i\) given the assumption of the cause \(B_j\), also known as the forward probability (the likelihood of the effect under the hypothesized cause). It provides a predictive way to reason from a hypothesis before gathering new data.

  • Example: In the scenario of extracting balls of different colours from a set of urns (a numeric sketch follows this list):

    • Effect (\(A_i\)): An effect would be drawing or observing that an extracted ball is white.

    • Cause (\(B_j\)): The different urns can be interpreted as causes: the cause is the urn (say the urn numbered \(j\)) from which the ball was selected for extraction. Different causes might yield different outcomes (white or black balls).
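A minimal numeric sketch of this urn example (the compositions below are invented for illustration): two urns are equally likely to be chosen, urn 1 contains mostly white balls and urn 2 mostly black, and Bayes’ theorem tells us which cause (urn) most plausibly produced the observed effect (a white ball).

```python
# Hypothetical setup: two urns (the causes), chosen with equal prior probability.
prior = {"urn1": 0.5, "urn2": 0.5}
# Forward probabilities P(white | urn), given by the composition of each urn.
p_white_given = {"urn1": 0.9, "urn2": 0.2}

# Marginal probability of the effect: P(white) = sum_j P(white | urn_j) * P(urn_j).
p_white = sum(p_white_given[u] * prior[u] for u in prior)

# Bayes' theorem: P(urn | white) = P(white | urn) * P(urn) / P(white).
posterior = {u: p_white_given[u] * prior[u] / p_white for u in prior}

print(p_white)    # 0.55
print(posterior)  # {'urn1': 0.818..., 'urn2': 0.181...}: the white ball points to urn 1
```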

Bayesian vs. Frequentist Probability

These two major statistical methodologies employ fundamentally different conceptual definitions:

  • Frequentist: The frequentist perspective quantifies probability as the long-run frequency of occurrence of an event over extensive repeated experimental trials. In essence, probabilities can only be established through objective, measurable procedures, using frequencies observed over a large number of trials.

  • Bayesian: This perspective differs by defining probability as the degree of belief or confidence in the occurrence of a particular event or the truth of a proposition. It therefore introduces subjectivity into how we measure a probabilistic value.

    • Key insight: It uses existing assumptions about causes (which may stem from historical data) to establish probabilistic measures on events, and then refines them by incorporating new observations from samples. Thus, a priori beliefs about a set of events influence the probabilistic values obtained after the observed samples are processed (yielding posterior probabilities after data collection).

In summary, the probability notions introduced above can be recapped as follows:

  • Probabilities for the set of outcomes of an experiment must sum to 1.

  • Joint probabilities describe the simultaneous occurrence of outcomes from different experiments or variables, capturing how they are linked.

  • Marginal probabilities give the probability of an event from one experiment without reference to any specific outcome of the other random variable being examined (by summing the joint probabilities over that other variable).

  • Conditional probability gives the probability of one variable’s outcome given the observed value or occurrence of another.

  • Bayes’ theorem serves to update or correct probabilities, and to build statistical inferences from observed samples.

Formal Definition of Entropy

Surprise Effect and Information

  • There exists an inverse correlation between an event’s probability and its associated surprise effect: the smaller an event’s probability, the greater the surprise (and novelty) when it is observed.

  • In contrast, a high-probability event is usually interpreted as an everyday situation, from whose occurrence little or no new information can be derived. In summary: low probability corresponds to high surprise and high information; high probability means low surprise and low information.

  • The initial idea was that the quantity of information of a message about event \(A_i\) could be inversely proportional to its probability, e.g., \(\frac{1}{P(A_i)}\). However, some extra constraints on this quantity must be considered.

  • For this purpose a logarithmic transformation of \(\frac{1}{P(A_i)}\) was introduced: the function \(-\log(P(A_i)) = \log \frac{1}{P(A_i)}\) maps the probability scale \((0,1]\) into \([0,\infty)\), and it allows the informational content to satisfy other practical and desirable mathematical properties, as will become clear below.

Shannon Entropy

Let \(A = \{A_1, A_2, \dots, A_n\}\) be a finite set of events (the alphabet) and let \(P\) be a probability distribution defined on \(A\). The Shannon entropy, denoted as \(H(P)\), quantifies the average quantity of information, or uncertainty, in such a distribution, and it is formally given by: \[H(P) = - \sum_{i=1}^n P(A_i) \log_2 P(A_i) = \sum_{i=1}^n P(A_i) \log_2 \frac{1}{P(A_i)}\] where the base-2 logarithm yields an entropy expressed in bits; base 2 is the common choice because it corresponds to encoding observations as binary digits.
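A minimal Python sketch of this formula (using the usual convention that terms with \(P(A_i) = 0\) contribute nothing, consistent with \(\lim_{x \to 0^+} x \log x = 0\)):

```python
import math

def shannon_entropy(probs):
    """H(P) = -sum_i p_i * log2(p_i), in bits; zero-probability terms contribute 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))                # 1.0 bit  (fair coin, see below)
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits (four equiprobable symbols)
print(shannon_entropy([1.0, 0.0]))                # 0.0 bits (fully predictable outcome)
```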

Example: Fair Coin

  • Consider a standard fair coin with a sample space or alphabet: \(\{H, T\}\), with H for heads and T for tails respectively.

  • The two outcomes of a fair coin are equiprobable, so its probability distribution is \(P(H) = P(T) = \frac{1}{2}\).

  • To measure the information or entropy in one coin flip under such conditions we use the Shannon entropy: \[H(P) = -\frac{1}{2} \log_2 \frac{1}{2} - \frac{1}{2} \log_2 \frac{1}{2} = \frac{1}{2} + \frac{1}{2} = 1 \text{ bit}\]

  • Interpretation: Every time we flip this fair coin, the outcome yields on average 1 bit of information. In short, one flip of a fair coin introduces a degree of surprise quantified to 1 bit (either a 0 or 1 for H or T), when seen in a digital data context.

Example: Biased Coin

  • Let us now consider a biased or unfair coin. The probability of Heads is denoted by \(P(H) = p\), and that of Tails by \(P(T) = 1 - p\), where \(p\) ranges over \(0 \leq p \leq 1\); any value \(p \neq \frac{1}{2}\) corresponds to a biased coin.

  • The entropy of a single toss of the biased coin is expressed as a function of the parameter \(p\) by substituting into the Shannon entropy formula: \[H(P) = -p \log_2 p - (1-p) \log_2 (1-p)\]

  • If the probability is \(p = 1\), meaning that we always obtain Heads (\(H\)), the outcome is completely predictable and nothing new is learnt. For this special case the entropy becomes: \[H(P) = -1 \log_2 1 - 0 \log_2 0 = 0,\] using the convention justified by the limit \(\lim_{x\to 0^+} x \log x = 0\). Conversely, any departure from a fully deterministic outcome, however small, makes the entropy strictly positive.

  • Interpretation: This result conveys the fact that when a coin always yields Heads (\(p=1\)) we gain absolutely no new knowledge or surprise by repeating our observations (full predictability of a message brings 0 surprise).

  • As demonstrated later with the properties of the entropy function, this measure is also concave, and it has a unique maximum at \(p = \frac{1}{2}\). The function exhibits the different behaviours illustrated by our fair and unfair coin examples (as also depicted in Figure 1, and in the sketch below).
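A quick numerical sweep over \(p\) (a small sketch reusing an entropy helper like the one above) makes this behaviour visible:

```python
import math

def h_binary(p):
    """Entropy of a biased coin with P(H) = p, in bits."""
    return sum(-x * math.log2(x) for x in (p, 1 - p) if x > 0)

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"p = {p:.1f}  ->  H = {h_binary(p):.3f} bits")
# The values rise from 0 at p = 0, peak at 1 bit for p = 0.5,
# and fall back to 0 at p = 1, tracing the concave curve of Figure 1.
```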

Properties of Entropy

Let \(P\) be a probability distribution over a finite set of events \(A = \{A_1, \dots, A_K\}\) of cardinality \(K\). Then, the following hold for entropy:

  1. Non-negativity: The Shannon entropy \(H(P) \geq 0\), i.e., the value is always a non-negative number for every possible distribution.

  2. Zero Entropy Condition: \(H(P) = 0\) if and only if there is one event \(A_i\) with \(P(A_i) = 1\) (a deterministic event, without randomness, whose outcome is fully predictable).

  3. Upper bound: \(H(P) \leq \log_2 K\), with the maximum value attained when \(P\) is the uniform distribution over all \(K\) possible events, i.e., \(P(A_i) = \frac{1}{K}\) for each \(i\).

Proof. (The complete details of the proofs were not available from the lecture notes.) ◻

Figure 1: Shannon entropy of a biased coin as a function of \(p\), with a maximum at the uniform probability distribution \(p = \frac{1}{2}\).
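Although the full proofs are deferred, the three properties are easy to check numerically; the following is a small sketch (with randomly generated distributions, purely as a sanity check):

```python
import math
import random

def shannon_entropy(probs):
    """H(P) in bits, with zero-probability terms contributing 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

K = 8
for _ in range(1000):
    weights = [random.random() for _ in range(K)]
    total = sum(weights)
    P = [w / total for w in weights]          # random distribution over K events
    H = shannon_entropy(P)
    assert 0 <= H <= math.log2(K) + 1e-9      # properties 1 and 3

print(shannon_entropy([1 / K] * K), math.log2(K))   # 3.0 3.0: the bound is attained
print(shannon_entropy([1.0] + [0.0] * (K - 1)))     # 0.0: property 2
```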

Intuition Behind the Definition

  • Shannon’s entropy, represented as \(H(P)\), can be understood as the average information gained when new observations are made, given a probabilistic process modeled by some particular distribution P for messages from a certain sample space \(A\).

  • Each time a new event from the sample space is observed (e.g., by drawing symbols), the entropy quantifies the expected information content per new sample of the message source, reflecting the uncertainty associated with each new occurrence.

Example: The 12 Coins Problem

Problem Statement

  • Consider a set of twelve coins. One of them, the special or biased coin, differs from the others in weight: it could be either lighter or heavier (by a difference the measuring tool can detect), but it is unknown which of the two conditions holds. Our only prior knowledge is that the other eleven coins are standard and identical.

  • We also have available a two-pan balance, which compares two sets of coins and, after groups are placed on the pans, reports one of three distinct outcomes: the left side weighs more (\(>\)), the two sides weigh the same (\(=\)), or the left side weighs less (\(<\)). Every weighing therefore offers exactly 3 possible outcomes.

  • The task is to identify the unique biased coin using the smallest possible number of weighings on the two-pan balance, and also to discover whether the biased coin is lighter or heavier than the eleven standard coins.

Inefficient Strategy

  • Naive approaches to this kind of search problem rely on a lucky draw (hoping to shorten the search), but they are ineffective in many instances: both on average across problem instances and, especially, in the worst case, because each weighing extracts little useful information.

  • One such naive strategy compares two coins at a time, hoping that the biased coin has been sampled. If the two coins yield different weights, further comparisons with another coin isolate the biased one; otherwise, two other coins are picked at random and the process repeats until it succeeds.

Optimal Strategy

  • For this coin-weighing problem, a winning (optimal) identification algorithm relies on a strategy described by a decision structure similar to a balanced ternary tree. Such a structure makes full use of the information from each weighing (each experimental step), discarding wrong hypotheses so as to shorten the number of experiments needed to identify the coin.

  • In such an algorithm, a tree is constructed in which each node describes a state where a comparison (weighing) is performed. The edges out of a node correspond to the outcomes of that single weighing; since there are three possible outcomes (greater, equal, less), each internal node has three children, producing a ternary tree.

  • To achieve the minimum number of experiments, the algorithm minimizes the height of this tree: the height bounds the maximum number of weighings performed, as in any decision-tree analysis.

  • When the tree is balanced, each weighing discards a larger portion of the remaining hypotheses than in a badly balanced (asymmetric) structure. If a weighing rules out large portions of the branches, it sharply reduces the uncertainty about which cases remain possible, shortening the solution by leaving less work to narrow down the hypothesis set.

  • The goal, at each point of the search (each branch of the tree), is to choose the measurement on the scale that yields the maximum information, so that branches corresponding to false hypotheses need not be explored. Such a choice maximizes the overall amount of information acquired at each step.

Connection to Information Theory

  • At the beginning, with 12 coins to examine and no additional information, there are exactly 24 equiprobable possibilities (12 candidate coins, each of which could be lighter or heavier), so each case has probability \(1/24\).

  • Each step of the tree search that uses the balance (each experiment) yields one of three distinct outcomes. These outcomes modify the probabilities by discarding or restricting sets of hypotheses, leading towards the unique branch containing the biased coin while refining the probability distribution according to the measurements.

  • Each measurement from the scale thus refines the probability distribution over the outcomes, now conditioned on the results of the weighings, reducing the set of possibilities until the solution is reached. In fact, every path from the top of the search (the initial problem setup) brings new learning in the form of conditional information that the decision process can exploit, as in standard statistical procedures.

  • A winning approach that minimizes the number of weighings proceeds step by step, attempting at each experimental point to achieve a high information gain, where information gain follows the definition of Shannon entropy (or another related divergence measure). This also produces a reduction of uncertainty at each weighing, moving closer to the result (the convergence of the analysis).

  • In other words, the goal at every step is to maximize what we learn about the problem from the measurement obtained with the balance, using the minimum number of weighings needed to reach the answer by eliminating hypotheses; this is exactly an optimization-driven search strategy (a small calculation follows this list).
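A small back-of-the-envelope calculation (a sketch anticipating the formal treatment of the next lecture) makes the bound concrete: the 24 equiprobable hypotheses carry \(\log_2 24\) bits of uncertainty, while each weighing can provide at most \(\log_2 3\) bits, so at least three weighings are required.

```python
import math

hypotheses = 12 * 2                            # which coin is biased, and lighter vs. heavier
initial_uncertainty = math.log2(hypotheses)    # ~4.585 bits
info_per_weighing = math.log2(3)               # ~1.585 bits: three outcomes per weighing

min_weighings = math.ceil(initial_uncertainty / info_per_weighing)
print(initial_uncertainty, info_per_weighing, min_weighings)   # 4.58..., 1.58..., 3
# Equivalently: a ternary decision tree of depth d has at most 3**d leaves,
# and 3**2 = 9 < 24 <= 27 = 3**3, so depth 3 is necessary (and in fact sufficient).
```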

The formal connection between the 12-coins puzzle, entropy from information theory, and the notion of an optimal experimental strategy will be addressed in full detail in the next lecture of the course, together with further algorithms.