Lecture Notes: Information Theory and Kolmogorov Complexity

Published

January 28, 2025

Review of Shannon-Fano and Huffman Codes

In the previous lecture, we introduced Shannon-Fano and Huffman codes. We demonstrated through an example that Huffman coding can achieve better compression than Shannon-Fano in certain scenarios. While we will not prove it here, it is a well-established fact that Huffman coding is optimal among prefix codes: no prefix code has a smaller expected length for the same distribution.

Consider the game "Who is Who?", where the objective is to identify a hidden character by asking a series of yes/no questions. The most effective strategy is to ask questions that divide the remaining candidates into two roughly equal groups. For instance, if half of the remaining characters wear glasses and half do not, asking "Does the character wear glasses?" is a good question; if only one character has red hair while all the others are blonde, asking about hair color is less effective, because the answer almost never eliminates many candidates. This is precisely the splitting step of Shannon-Fano coding, where at each stage we divide the remaining probability mass as evenly as possible.

When the probability distribution is uniform, the Shannon-Fano and Huffman constructions lead to codes of the same expected length. In games like "Who is Who?", where all remaining candidates are equally likely, the halving strategy therefore builds an optimal code: under a uniform distribution, the even splits of Shannon-Fano match the tree that Huffman coding would produce, and when the number of candidates is a power of two the common codeword length, \(\log_2\) of that number, equals the entropy exactly. In such scenarios, the seemingly simpler Shannon-Fano method achieves the same compression as the more complex Huffman algorithm.
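To see this concretely, the following Python sketch (illustrative code, not taken from the lecture; the function name huffman_lengths and the heap-based bookkeeping are my own choices) computes Huffman codeword lengths by repeatedly merging the two least probable subtrees. For eight equally likely candidates, as in a round of "Who is Who?", every codeword comes out 3 bits long, exactly \(\log_2 8\):

    import heapq

    def huffman_lengths(probs):
        """Compute Huffman codeword lengths by repeatedly merging the two
        least probable subtrees (the standard Huffman construction)."""
        # Heap entries: (probability, unique tiebreaker, symbols contained in the subtree).
        heap = [(p, i, [i]) for i, p in enumerate(probs)]
        heapq.heapify(heap)
        lengths = [0] * len(probs)
        tiebreak = len(probs)
        while len(heap) > 1:
            p1, _, s1 = heapq.heappop(heap)
            p2, _, s2 = heapq.heappop(heap)
            for s in s1 + s2:          # each symbol in the merged subtree moves one level deeper
                lengths[s] += 1
            heapq.heappush(heap, (p1 + p2, tiebreak, s1 + s2))
            tiebreak += 1
        return lengths

    print(huffman_lengths([1/8] * 8))  # -> [3, 3, 3, 3, 3, 3, 3, 3], i.e. log2(8) bits each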

Asymptotic Optimality of Shannon Codes

We previously established that the expected length of a Shannon code, denoted \(L_S\), is bounded below by the entropy \(\mathcal{H}(P)\) and above by \(\mathcal{H}(P)+1\): \[\label{eq:shannon_code_length_bound} \mathcal{H}(P)\leq L_S < \mathcal{H}(P)+ 1\] We also observed instances where the Shannon code performed poorly, particularly when the probability distribution of the source alphabet is far from uniform. To address this, we now apply the Shannon coding technique to blocks of letters rather than to individual letters. This approach exploits the properties of longer sequences to achieve better compression.

Let \(A\) be the original alphabet, and let \(A^n\) represent the set of all possible strings of length \(n\) formed from the alphabet \(A\). We define the Shannon code over \(n\), denoted by \(S_n\), as a function \(S_n: A^n \rightarrow B^*\), where \(B\) is the output alphabet. This function maps each string of length \(n\) from \(A\) to a codeword in \(B^*\). In essence, instead of encoding individual symbols from \(A\), we are now encoding sequences of \(n\) symbols from \(A\). This allows us to consider the joint probabilities of these sequences, which can lead to more efficient encoding when the original distribution is not uniform.

Let \(P\) be the original probability distribution over the alphabet \(A\). Given that the source is stationary and memoryless, the probability distribution \(P_n\) over \(A^n\) is determined by the product of the individual probabilities: \[\label{eq:joint_probability} P_n(a_1 a_2 \dots a_n) = P(a_1) P(a_2) \dots P(a_n)\] This equation reflects the fact that, for a memoryless source, the probability of a sequence is simply the product of the probabilities of its constituent symbols.

Consider an alphabet \(A = \{A, B\}\) with probabilities \(P(A) = 1 - \frac{1}{32}\) and \(P(B) = \frac{1}{32}\). For \(n=2\), the set of all possible strings is \(A^2 = \{AA, AB, BA, BB\}\). The corresponding probabilities, calculated using [eq:joint_probability], are:

  • \(P_2(AA) = \left(1 - \frac{1}{32}\right)^2\)

  • \(P_2(AB) = P_2(BA) = \left(1 - \frac{1}{32}\right) \cdot \frac{1}{32}\)

  • \(P_2(BB) = \left(\frac{1}{32}\right)^2\)

The length of the encoding for \(AA\) is given by \(\left\lceil \log_2 \frac{1}{P_2(AA)} \right\rceil = 1\) bit, while for \(BB\) it is \(\left\lceil \log_2 \frac{1}{P_2(BB)} \right\rceil = \left\lceil \log_2 (32^2) \right\rceil = 10\) bits. Notice that the probability of \(BB\) is much smaller than that of \(AA\), which results in a much longer codeword for \(BB\) under Shannon coding.
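As a quick numerical check (a small sketch, not part of the original notes), the Shannon codeword lengths of the four blocks can be computed directly from their probabilities:

    from math import ceil, log2

    p_A, p_B = 1 - 1/32, 1/32
    blocks = {"AA": p_A * p_A, "AB": p_A * p_B, "BA": p_B * p_A, "BB": p_B * p_B}
    for block, prob in blocks.items():
        print(block, ceil(log2(1 / prob)))   # Shannon length: ceil(log2(1/probability))
    # Prints AA 1, AB 6, BA 6, BB 10 -- the rare block BB gets by far the longest codeword.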

Let \(L_{S_n}\) represent the expected length of the Shannon code when applied to strings of length \(n\). We have: \[\label{eq:shannon_code_length_n} \mathcal{H}(P_n) \leq L_{S_n} < \mathcal{H}(P_n) + 1\] This inequality is a direct application of the Shannon code length bound [eq:shannon_code_length_bound] to the alphabet \(A^n\) with distribution \(P_n\). Since the source is memoryless, the entropy of the joint distribution is the sum of the individual entropies, i.e., \(\mathcal{H}(P_n) = n \mathcal{H}(P)\). Thus, \[\label{eq:shannon_code_length_n_simplified} n \mathcal{H}(P)\leq L_{S_n} < n \mathcal{H}(P)+ 1\] This equation shows that the expected length of the code for a sequence of \(n\) symbols is approximately \(n\) times the entropy of the original distribution. To determine the average number of bits per letter in the original alphabet, we divide by \(n\): \[\label{eq:average_bits_per_letter} \mathcal{H}(P)\leq \frac{L_{S_n}}{n} < \mathcal{H}(P)+ \frac{1}{n}\] As \(n\) approaches infinity (\(n \to \infty\)), the term \(\frac{1}{n}\) approaches zero (\(\frac{1}{n} \to 0\)), and consequently, \(\frac{L_{S_n}}{n}\) converges to \(\mathcal{H}(P)\). This demonstrates that the Shannon code is asymptotically optimal. In other words, by encoding longer and longer sequences, we can make the average code length per symbol arbitrarily close to the entropy of the source.
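The convergence can also be observed numerically. The sketch below (illustrative; the function name shannon_block_rate is my own) computes \(\frac{L_{S_n}}{n}\) for the two-letter source of the previous example by grouping the \(2^n\) blocks according to their number of B's, since all blocks with the same composition have the same probability:

    from math import ceil, comb, log2

    def shannon_block_rate(p_B, n):
        """Expected Shannon-code bits per letter when encoding blocks of length n
        from a memoryless binary source with P(B) = p_B."""
        p_A = 1 - p_B
        expected_bits = 0.0
        for k in range(n + 1):                    # k = number of B's in the block
            prob = (p_A ** (n - k)) * (p_B ** k)  # probability of one such block
            length = ceil(log2(1 / prob))         # its Shannon codeword length
            expected_bits += comb(n, k) * prob * length
        return expected_bits / n                  # average bits per original letter

    entropy = -(1/32) * log2(1/32) - (31/32) * log2(31/32)   # about 0.2006 bits
    for n in (1, 2, 4, 8, 16, 32):
        print(n, round(shannon_block_rate(1/32, n), 4))      # approaches the entropy as n grows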

By applying the Shannon coding technique to increasingly longer blocks of letters, we mitigate the inefficiency caused by non-uniform distributions and bring the performance of the Shannon code arbitrarily close to the theoretical limit set by the entropy, thus achieving asymptotic optimality. The key point is not that \(P_n\) becomes uniform; it is that the rounding overhead of the code, which is less than one bit per block, is spread over the \(n\) letters of the block, so the loss per letter shrinks like \(1/n\). In addition, for large \(n\) most of the probability mass concentrates on "typical" sequences whose probabilities are all roughly \(2^{-n\mathcal{H}(P)}\), so for coding purposes the block distribution behaves almost like a uniform distribution over those typical sequences.

Lempel-Ziv Codes

Lempel-Ziv (LZ) codes represent a distinct approach to data compression, offering an alternative to Shannon-Fano and Huffman codes. They are variable-to-block codes: the input message is divided into segments of varying length, and each segment is encoded into a block of fixed length. This is the opposite of Huffman and Shannon-Fano codes, which are block-to-variable-length codes, mapping fixed-length input symbols to variable-length codewords.

  • LZ77 (1977): This algorithm employs a dynamic dictionary that is constructed from the message itself during the encoding process. The dictionary is essentially a sliding window of previously encoded text.

  • LZ78 (1978): Like LZ77, it adapts to the message, but instead of a sliding window it maintains an explicit dictionary of phrases, which is built incrementally as new phrases are encountered during encoding.

  • Lempel-Ziv-Welch (LZW, 1984): A simplified variant of LZ78, designed for easier implementation; it is widely used in practice.

Many of the LZ algorithms were subject to patents, which led to various legal and commercial issues. For instance, the LZW algorithm was patented, which caused some controversy regarding its use in formats like GIF. However, these patents have now expired, making the algorithms freely available for use. This expiration has facilitated the widespread adoption of these algorithms in various applications.

LZ77 Algorithm

The LZ77 algorithm operates using two sliding windows: a search buffer and a look-ahead buffer. These buffers move along the input message as the encoding progresses.

  • Search Buffer: This buffer contains the portion of the message that has been recently encoded. It acts as a dynamic dictionary, providing a reference for finding repeating patterns.

  • Look-Ahead Buffer: This buffer holds the next part of the message that is to be encoded. The algorithm attempts to find matches for prefixes of this buffer within the search buffer.

The core of the LZ77 algorithm involves identifying the longest prefix of the look-ahead buffer that matches a substring within the search buffer. The encoding is represented by triples of the form \((O, L, S)\), where:

  • \(O\) (Offset): Represents the distance between the starting position of the look-ahead buffer (\(I\)) and the starting position of the matching prefix in the search buffer (\(J\)). Specifically, \(O = I - J\). This offset indicates how far back in the search buffer the matching prefix begins.

  • \(L\) (Length): Indicates the length of the matching prefix. This specifies how many characters in the look-ahead buffer match the substring in the search buffer.

  • \(S\) (Symbol): Denotes the first character in the look-ahead buffer that does not match the prefix found in the search buffer. This is the character that follows the matching prefix.

Consider the message "BABBABABBABB". Assume that both the search and look-ahead buffers have a size of 5. The encoding process proceeds as follows:

  1. Initial State: The search buffer is initially empty (filled with a special symbol, denoted as \(\triangleright\)), and the look-ahead buffer contains "BABBA". The longest matching prefix in the search buffer is of length 0, so the first triple is \((1, 0, B)\).

  2. Step 2: The search buffer now contains "\(\triangleright\)B", and the look-ahead buffer is "ABBAB". The longest matching prefix is again of length 0, so the triple is \((1, 0, A)\).

  3. Step 3: The search buffer is "\(\triangleright\)BA", and the look-ahead buffer is "BBABA". The longest matching prefix is "B" at offset 2, so the triple is \((2, 1, B)\).

  4. Step 4: The search buffer is "\(\triangleright\)BABB", and the look-ahead buffer is "ABABB". The longest matching prefix is "AB" at offset 3, so the triple is \((3, 2, A)\).

  5. Step 5: The search buffer is "BBABA", and the look-ahead buffer is "BBABB". The longest matching prefix is "BBAB" at offset 5, so the triple is \((5, 4, B)\).

The encoded message is the sequence of triples: \((1, 0, B), (1, 0, A), (2, 1, B), (3, 2, A), (5, 4, B)\).

The dynamic dictionary in LZ77 is the search buffer, which evolves as the encoding and decoding processes progress. Both the encoding and decoding algorithms for LZ77 have a linear time complexity with respect to the length of the input message, assuming the buffer sizes are constant. This makes them efficient for practical applications, as the time required to encode or decode a message grows linearly with its size.

LZ77 encoding:

  Initialize the search buffer \(SB\) with special symbols and the look-ahead buffer \(LB\) with the start of the message.
  Repeat until the message is exhausted:
    \(I \gets\) current position in the message (start of \(LB\))
    Find the longest prefix of \(LB\) that matches a substring in \(SB\)
    \(J \gets\) starting position of the matching prefix in \(SB\)
    \(L \gets\) length of the matching prefix
    \(O \gets I - J\)
    \(S \gets\) first mismatching character in \(LB\) after the prefix
    Output the triple \((O, L, S)\)
    Move \(SB\) and \(LB\) forward by \(L+1\) positions; \(I\) becomes the new starting position of \(LB\)
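The encoding loop can be sketched in a few lines of Python (an illustration of the scheme above rather than reference code; like the worked example, it restricts matches to the search buffer and emits offset 1 when no match is found):

    def lz77_encode(msg, sb_size=5, lb_size=5):
        """Encode msg as a list of (offset, length, symbol) triples."""
        i, triples = 0, []
        while i < len(msg):
            sb = msg[max(0, i - sb_size):i]   # search buffer: recently encoded text
            lb = msg[i:i + lb_size]           # look-ahead buffer: text still to encode
            best_len, best_off = 0, 1         # offset 1 by convention when nothing matches
            for j in range(len(sb)):          # try every starting point in the search buffer
                l = 0
                while l < len(lb) - 1 and j + l < len(sb) and sb[j + l] == lb[l]:
                    l += 1
                if l > best_len:
                    best_len, best_off = l, len(sb) - j   # offset = distance back from position i
            triples.append((best_off, best_len, lb[best_len]))
            i += best_len + 1                 # slide both buffers past the match and the literal
        return triples

    print(lz77_encode("BABBABABBABB"))
    # -> [(1, 0, 'B'), (1, 0, 'A'), (2, 1, 'B'), (3, 2, 'A'), (5, 4, 'B')], as in the example above.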

LZ77 decoding:

  Initialize the search buffer \(SB\) with special symbols; the output message \(M\) is empty.
  For each triple \((O, L, S)\) read from the encoded stream:
    \(J \gets\) current position in \(SB\) minus \(O\)
    Copy \(L\) characters from \(SB\) starting at position \(J\) to \(M\)
    Append \(S\) to \(M\)
    Update \(SB\) with the newly decoded characters of \(M\)
  Output \(M\)
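A matching decoder sketch (same caveats) rebuilds the message by copying previously decoded characters and appending the literal symbol of each triple:

    def lz77_decode(triples):
        """Rebuild the message from a list of (offset, length, symbol) triples."""
        msg = []
        for off, length, sym in triples:
            start = len(msg) - off        # where the copied substring begins in the decoded text
            for k in range(length):       # copy character by character (handles overlapping copies)
                msg.append(msg[start + k])
            msg.append(sym)               # then append the literal symbol
        return "".join(msg)

    print(lz77_decode([(1, 0, 'B'), (1, 0, 'A'), (2, 1, 'B'), (3, 2, 'A'), (5, 4, 'B')]))
    # -> "BABBABABBABB"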

Kolmogorov Complexity

Kolmogorov complexity, introduced by Andrey Kolmogorov in 1965, provides a measure of the descriptive complexity of an object. It quantifies the minimum amount of information needed to describe a given object, focusing on the shortest program that can generate it. This concept is deeply rooted in the theory of computation and provides a theoretical foundation for understanding randomness and information content.

The Kolmogorov complexity of a string \(x\), denoted by \(K(x)\), is defined as the length, measured in bits, of the shortest binary program that, when executed on a universal Turing machine, produces \(x\) as its output. The value therefore depends on the computational model (i.e., the chosen universal Turing machine), although switching from one universal machine to another changes \(K(x)\) by at most an additive constant independent of \(x\). A shorter program implies a lower Kolmogorov complexity, indicating that the string is less complex.

Kolmogorov complexity is intrinsically linked to the concept of the shortest possible encoding of an object. It is grounded in a model of computation, specifically Turing machines, which provide a formal framework for defining what it means to compute or generate a string. Unlike Shannon’s entropy, which is based on probability distributions, Kolmogorov complexity is based on the shortest description of an object, making it a more fundamental measure of complexity. The shortest program that generates a string can be seen as its ultimate compression.
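As an informal illustration (in Python rather than on a Turing machine, and with the caveat that the length of any particular program only gives an upper bound on \(K(x)\)), compare two strings of one million characters:

    import random

    # A highly regular string: a rule of a few dozen characters produces it, so its
    # Kolmogorov complexity is tiny compared to its length of one million characters.
    regular = "01" * 500_000

    # A string with no evident structure: barring a hidden pattern, the shortest
    # description we can point to is essentially the string itself, written out verbatim.
    # (Strictly speaking, this pseudo-random string is itself produced by a short program,
    # the generator plus its seed; it only stands in for a truly random string.)
    scrambled = "".join(random.choice("01") for _ in range(1_000_000))

    print(len(regular), len(scrambled))   # equal lengths, very different description lengths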

Turing Machines

A Turing machine \(M\) is formally defined as a tuple \((Q, \Sigma, \delta, s)\), where:

  • \(Q\) is a finite set of states. These states represent the different configurations the machine can be in during its computation.

  • \(\Sigma\) is a finite alphabet that includes special symbols such as the start symbol \(\triangleright\) and the blank symbol \(\sqcup\). The alphabet contains the symbols that can be written on the tape.

  • \(\delta: Q \times \Sigma \to \Sigma \times \{L, R, -\} \times Q\) is the transition function. This function dictates the machine’s behavior based on its current state and the symbol it is reading. The transition function is a mapping from a state and a symbol to a new symbol, a direction to move the tape head, and a new state.

  • \(s \in Q\) is the starting state of the machine. This is the initial state of the machine before any computation begins.

The transition function \(\delta(q, a) = (b, D, q')\) specifies that if the machine is in state \(q\) and reads symbol \(a\), it will:

  • Write symbol \(b\) on the tape, replacing \(a\).

  • Move the tape head in direction \(D\), which can be Left (\(L\)), Right (\(R\)), or Stay (\(-\)).

  • Transition to state \(q'\).

This process is repeated until the machine reaches a halting state.

Consider a Turing machine designed to determine if a binary number \(X\) is even. The machine begins at the leftmost symbol, moves right until it reaches the end of the number (indicated by a blank symbol), then moves back one step to check the last digit. If the last digit is 0, the machine transitions to a "yes" state, indicating the number is even; otherwise, it transitions to a "no" state, indicating the number is odd. This example demonstrates the basic operation of a Turing machine in performing a simple computation.
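A small simulator makes this concrete (a minimal sketch; the dictionary representation of \(\delta\), the underscore used as the blank symbol \(\sqcup\), and the state names are illustrative choices, not notation from the lecture):

    def run_tm(delta, start, accepting, tape):
        """Simulate a Turing machine: delta maps (state, symbol) to
        (written symbol, direction in {'L', 'R', '-'}, next state)."""
        cells = list(tape) + ["_"]                    # '_' plays the role of the blank symbol
        state, head = start, 0
        while (state, cells[head]) in delta:          # halt when no transition is defined
            write, move, state = delta[(state, cells[head])]
            cells[head] = write
            head += {"L": -1, "R": 1, "-": 0}[move]
        return state in accepting

    # Evenness of a binary number: scan right to the blank, step back once,
    # and accept exactly when the last digit is 0.
    delta = {
        ("scan", "0"): ("0", "R", "scan"),
        ("scan", "1"): ("1", "R", "scan"),
        ("scan", "_"): ("_", "L", "check"),
        ("check", "0"): ("0", "-", "yes"),
        ("check", "1"): ("1", "-", "no"),
    }
    print(run_tm(delta, "scan", {"yes"}, "10110"))    # True: 10110 ends in 0, so it is even
    print(run_tm(delta, "scan", {"yes"}, "101"))      # False: 101 ends in 1, so it is odd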

Another example is a Turing machine that checks whether a binary string is a palindrome. The machine compares the first and last symbols (marking or erasing the symbols it has already checked), and then repeats the process on the inner substring. If all pairs of symbols match, the machine transitions to a "yes" state, indicating the string is a palindrome; otherwise, it transitions to a "no" state. This example illustrates how a Turing machine can perform more complex pattern-recognition tasks.

Universal Turing Machine

There exists a universal Turing machine \(U\) that can take as input a binary encoding of any other Turing machine \(M\) and an input \(X\) for \(M\). The universal machine \(U\) then simulates the behavior of \(M\) when run on input \(X\). This means that \(U\) can execute any program that can be executed by any other Turing machine. The existence of a universal Turing machine is a fundamental result in the theory of computation.

The universal Turing machine acts as an interpreter: it can execute any Turing machine, provided that the machine is encoded in a suitable format. Having a single fixed machine that can simulate every other machine is what makes the definition of Kolmogorov complexity possible, since the length of the shortest program is measured with respect to this one reference machine.
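In the informal Python sketch above, the function run_tm plays a role loosely analogous to a universal machine: it receives a description of an arbitrary machine (its transition table) together with an input and simulates that machine on that input. The actual construction of a universal Turing machine encodes both the machine and its input on the tape of a single fixed machine, but the underlying idea is the same.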

Figure: Schematic representation of a Turing machine and a universal Turing machine. The universal Turing machine takes an encoding of a Turing machine together with its input and produces the same output as the original machine.

Further Reading

For a more in-depth understanding of Turing machines and other models of computation, we recommend consulting the provided book excerpt (in Italian). This excerpt offers a detailed introduction to the formal definitions and concepts related to computation, including various models and their properties. It covers the theoretical foundations necessary to grasp the concepts discussed in this lecture, particularly Kolmogorov complexity.

Additionally, there are numerous equivalent resources available in English that cover the same material. These resources can provide further clarification and examples to enhance your understanding of these fundamental topics. Some recommended topics to explore include:

  • Formal definitions of Turing machines, including their components and transition functions.

  • Examples of Turing machines for various computational tasks, such as arithmetic operations and string manipulation.

  • The concept of a universal Turing machine and its implications for computability.

  • Alternative models of computation, such as finite automata, pushdown automata, and lambda calculus.

  • The Church-Turing thesis, which posits that all reasonable models of computation are equivalent to Turing machines.

Exploring these resources will provide a broader and more comprehensive understanding of the theoretical underpinnings of computer science and information theory.