Kolmogorov Complexity and Its Relation to Shannon Entropy
Introduction
In the previous lecture, we initiated our discussion on Kolmogorov complexity. To deepen your understanding, we recommend the following resources:
Information Theory: A Tutorial Introduction by Stone.
Elements of Information Theory by Cover and Thomas. Specifically, Chapter 7 of this book provides an in-depth treatment of Kolmogorov complexity.
These references cover essential topics including source coding, Shannon’s first theorem, and a detailed exploration of Kolmogorov complexity. This lecture will further expand on our previous discussion, focusing on the descriptive nature of Kolmogorov complexity and its connection to Shannon entropy. We aim to provide a clear and structured explanation of these concepts, building upon the foundation laid in the previous session.
Kolmogorov Complexity
Kolmogorov complexity provides a way to quantify the complexity of an object by measuring the length of the shortest program required to generate it. This approach offers a notion of descriptive complexity, particularly useful when considering strings or other data structures.
Turing Machines and Universal Turing Machines
The foundation of Kolmogorov complexity lies in the concept of a model of computation. In this context, we utilize Turing machines (TMs) as our model. For those less familiar with TMs, it can be helpful to think of them as analogous to programs written in a language like Java.
A crucial element is the Universal Turing Machine (UTM), which is a TM capable of simulating any other TM. A UTM takes as input the encoding of another TM, denoted as \(\langle \mathcal{M}\rangle\), along with an input \(X\) for that machine. The UTM then simulates the behavior of \(\mathcal{M}\) when given input \(X\). This is akin to an interpreter that takes a program and executes it; a minimal sketch follows the list below.
Input to UTM: A pair consisting of \(\langle \mathcal{M}\rangle\), the binary encoding of a TM \(\mathcal{M}\), and an input \(X\) for \(\mathcal{M}\).
Action of UTM: Simulates the execution of \(\mathcal{M}\) on input \(X\).
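To make the simulation idea concrete, here is a minimal Python sketch, an informal stand-in for the formal model: ordinary Python source strings play the role of machine encodings and a single function plays the role of the UTM. The names universal and run are illustrative assumptions, not standard.

```python
# Toy "universal machine": it receives the description of another program
# (here, Python source code as a string) together with an input, and
# simulates that program on the input.
def universal(program_source: str, x: str) -> str:
    namespace = {}
    exec(program_source, namespace)   # "load" the encoded machine
    return namespace["run"](x)        # simulate it on the given input

# The encoding of a machine that doubles its input string.
doubler = "def run(x):\n    return x + x\n"

print(universal(doubler, "01"))       # -> "0101"
```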
It is important to note that there are infinitely many UTMs: any UTM can be modified, for instance by adding states or instructions that are never used, without altering its fundamental input-output behavior, just as adding dead code to a Java program does not change the program’s output.
Encoding a Turing Machine
A Turing machine is formally defined by the following components:
A finite set of states, denoted by \(Q\).
A finite alphabet of symbols, denoted by \(\Sigma\).
A transition function, denoted by \(\delta\).
A designated starting state, denoted by \(q_0\).
To encode a TM into a binary string, we need to represent these components in a suitable format. This involves:
Encoding the cardinality of the set of states, \(|Q|\).
Encoding the cardinality of the alphabet, \(|\Sigma|\).
Encoding the transition function \(\delta\), which maps a state and a symbol to a new state, a new symbol, and a direction (Left, Right, or Stay). Formally, \(\delta: Q \times \Sigma \to Q \times \Sigma \times \{L, R, S\}\).
These components can be encoded as numbers, and we can use special separator symbols within the UTM’s alphabet (e.g., 0, 1, start, blank, separator) to distinguish between different parts of the encoding.
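As an illustration of what such an encoding might look like, the sketch below writes the components of a machine as unary numbers separated by zeros. This particular convention is an assumption made here for concreteness; any other fixed, decodable convention would serve equally well.

```python
# One possible binary encoding of a Turing machine: states and symbols are
# numbered, each number is written in unary (a run of 1s), and single 0s
# act as separators between fields.
def encode_tm(num_states, num_symbols, delta):
    """delta maps (state, symbol) -> (new state, new symbol, direction)."""
    directions = {"L": 0, "R": 1, "S": 2}
    unary = lambda n: "1" * (n + 1)          # encode the number n as n+1 ones

    parts = [unary(num_states), unary(num_symbols)]
    for (q, a), (q2, a2, d) in sorted(delta.items()):
        parts.extend([unary(q), unary(a), unary(q2), unary(a2),
                      unary(directions[d])])
    return "0".join(parts)

# A one-state machine over {0, 1} that always writes 1 and moves right.
delta = {(0, 0): (0, 1, "R"), (0, 1): (0, 1, "R")}
print(encode_tm(1, 2, delta))
```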
Definition of Kolmogorov Complexity
Given an alphabet \(\Sigma\) and a Universal Turing Machine \(\mathcal{U}\), the Kolmogorov complexity of a string \(X \in \Sigma^*\), denoted by \(\mathcal{K}_\mathcal{U}(X)\), is defined as the length of the shortest binary encoding of a Turing machine \(\mathcal{M}\) that, when executed by \(\mathcal{U}\), produces \(X\) as output. Formally: \[\mathcal{K}_\mathcal{U}(X) = \min \{ |\langle \mathcal{M}\rangle| : \mathcal{U}(\langle \mathcal{M}\rangle) = X \}\] where \(\langle \mathcal{M}\rangle\) represents the binary encoding of the Turing machine \(\mathcal{M}\), and \(|\langle \mathcal{M}\rangle|\) denotes the length of this encoding in bits.
This definition describes the Kolmogorov complexity of a string as the length of the most concise program that can generate that string.
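The formal definition refers to Turing machine encodings, but the idea of a shortest description can be illustrated with ordinary programs. In the sketch below, Python source strings stand in for machine encodings; comparing two programs that print the same highly regular string shows why its descriptive complexity is far smaller than its length.

```python
# Two Python programs that print the same string of one million ones.
# Both are valid descriptions of the string, so its descriptive
# complexity is at most the length of the shorter one.
literal_program = 'print("' + "1" * 1_000_000 + '")'   # about 10^6 characters
loop_program    = 'print("1" * 1_000_000)'             # about 25 characters

print(len(literal_program), len(loop_program))
```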
Conditional Kolmogorov Complexity
The conditional Kolmogorov complexity of a string \(X\) given a string \(Y\), denoted by \(\mathcal{K}_\mathcal{U}(X \mid Y)\), is the length of the shortest binary encoding of a Turing machine \(\mathcal{M}\) that, when executed by \(\mathcal{U}\) with input \(Y\), produces \(X\) as output. Formally: \[\mathcal{K}_\mathcal{U}(X \mid Y) = \min \{ |\langle \mathcal{M}\rangle| : \mathcal{U}(\langle \mathcal{M}\rangle, Y) = X \}\]
This definition describes the conditional Kolmogorov complexity of a string \(X\) given a string \(Y\) as the length of the most concise program that can generate \(X\) when given \(Y\) as input.
Providing additional information, such as the length of a string, can significantly reduce its Kolmogorov complexity. For instance, if the length \(n\) is supplied as the input \(Y\), a string of \(n\) ones can be generated by a fixed, constant-size program that simply prints "1" \(n\) times; without that information, the program must encode the length (or the string itself) explicitly.
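A minimal sketch of this point, with a Python function standing in for the program and its argument standing in for the auxiliary input \(Y\):

```python
# When the length n is supplied as the auxiliary input, one fixed,
# constant-size program produces 1^n for every n, so K(1^n | n) is
# bounded by a constant.  Without n, a program must encode the length
# itself, which costs roughly log2(n) additional bits.
def ones_given_length(n: int) -> str:
    return "1" * n

print(ones_given_length(100))   # a string of 100 ones
```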
For any strings \(X\) and \(Y\), the conditional Kolmogorov complexity is always less than or equal to the unconditional Kolmogorov complexity: \[\mathcal{K}_\mathcal{U}(X \mid Y) \le \mathcal{K}_\mathcal{U}(X)\]
This proposition states that knowing additional information \(Y\) can never increase the complexity of generating \(X\): intuitively, a shortest program for \(X\) can simply ignore the extra input \(Y\).
Dependence on the Universal Turing Machine
If \(\mathcal{U}\) and \(\mathcal{A}\) are two Universal Turing Machines, then there exists a constant \(c_{\mathcal{A}, \mathcal{U}}\) that depends only on \(\mathcal{A}\) and \(\mathcal{U}\) such that: \[\mathcal{K}_\mathcal{U}(X \mid Y) \le \mathcal{K}_\mathcal{A}(X \mid Y) + c_{\mathcal{A}, \mathcal{U}}\]
This proposition implies that the Kolmogorov complexity is independent of the specific UTM used, up to an additive constant. Therefore, we can often write \(\mathcal{K}(X \mid Y)\) without explicitly specifying the UTM, as the choice of UTM changes the complexity by at most an additive constant.
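The additive constant has a concrete reading: a program written for \(\mathcal{A}\) can be run on \(\mathcal{U}\) by prefixing it with a fixed interpreter for \(\mathcal{A}\), and the constant is that interpreter's length. The sketch below only tracks the lengths; both source strings are hypothetical placeholders.

```python
# Running an A-program on U by prefixing a fixed interpreter for A.
# The U-program is longer by exactly the interpreter's length, a
# constant c that does not depend on the string X being described.
interpreter_for_A  = "# ...fixed simulator of machine A, written for U...\n"
shortest_A_program = "# ...shortest A-program that prints X...\n"

program_for_U = interpreter_for_A + shortest_A_program
c = len(interpreter_for_A)
assert len(program_for_U) == len(shortest_A_program) + c
```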
Non-Computability of Kolmogorov Complexity
A fundamental limitation of Kolmogorov complexity is that it is not computable. This is a consequence of the Halting Problem, a well-known result in computability theory.
There exists no algorithm that can determine, for any given Turing machine \(\mathcal{M}\) and input \(X\), whether \(\mathcal{M}\) will eventually halt (terminate) when executed on \(X\).
This theorem states that it is impossible to create a general algorithm that can determine whether any given Turing Machine will halt for a given input.
The non-computability of Kolmogorov complexity means that we cannot create an algorithm that, given a string \(X\), will compute its Kolmogorov complexity. Instead, we can only estimate or provide bounds for the Kolmogorov complexity of specific strings.
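One standard way to see this is a Berry-paradox style argument. The sketch below assumes a hypothetical computable function K and indicates why no such function can exist.

```python
# If Kolmogorov complexity were computable by some function K, the short
# program below would find the first string whose complexity exceeds a
# huge bound N.  But this program, together with the roughly log2(N) bits
# needed to write N, is itself a description of that string far shorter
# than N, a contradiction.  The function K here is hypothetical; that is
# exactly the point.
from itertools import count, product

N = 10**9

def first_complex_string(K):
    for length in count(1):
        for bits in product("01", repeat=length):
            s = "".join(bits)
            if K(s) > N:          # no such K can actually be implemented
                return s
```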
Estimates of Kolmogorov Complexity
While the exact Kolmogorov complexity of a string is not computable, we can establish some useful bounds and estimates. These estimates provide insights into the relationship between the length of a string and its Kolmogorov complexity.
For any string \(X\), the Kolmogorov complexity of \(X\) is bounded above by its length plus a constant: \[\mathcal{K}(X) \le |X| + c\] where \(c\) is a constant that depends on the chosen universal Turing machine.
This proposition provides an upper bound for the Kolmogorov complexity of a string.
Proof. We consider a modified Universal Turing Machine \(\mathcal{U}'\), which we call a "printer machine". This machine operates as follows:
If the input to \(\mathcal{U}'\) starts with a ‘0’, \(\mathcal{U}'\) interprets the rest of the input as a string and outputs that string directly.
If the input to \(\mathcal{U}'\) starts with a ‘1’, \(\mathcal{U}'\) interprets the rest of the input as the encoding of a Turing machine and simulates that machine.
Now, for any string \(X\), we can construct an input for \(\mathcal{U}'\) by prepending a ‘0’ to \(X\), resulting in the input \(0X\). When \(\mathcal{U}'\) receives \(0X\) as input, it outputs \(X\). The length of the input \(0X\) is \(|X| + 1\), so \(\mathcal{K}_{\mathcal{U}'}(X) \le |X| + 1\). By the invariance property above, switching to any other fixed UTM adds at most a constant to this bound, which gives \(\mathcal{K}(X) \le |X| + c\). ◻
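A small sketch of the printer machine's dispatch on its first input bit; the simulate argument is a placeholder for the UTM's ordinary behaviour, which is not spelled out here.

```python
# The "printer machine" U': a one-bit flag says whether the remaining
# input is a literal string to output or the encoding of a machine to
# simulate.
def printer_machine(input_bits: str, simulate=lambda encoding: ""):
    flag, rest = input_bits[0], input_bits[1:]
    if flag == "0":
        return rest               # print the string literally: cost |X| + 1
    return simulate(rest)         # otherwise behave like the original UTM

print(printer_machine("0" + "110101"))   # -> "110101"
```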
For every length \(n\), there exists at least one string \(X\) of length \(n\) whose Kolmogorov complexity is at least its length: \[\exists X \in \{0,1\}^n : \mathcal{K}(X) \ge |X|\]
This proposition states that, for every length, there exist strings whose Kolmogorov complexity is at least as large as their length.
Proof. Fix a length \(n = |X|\) and count the Turing machines with encodings of length less than \(n\). The number of such machines is at most the number of binary strings of length \(i\) for \(i\) from 0 to \(n-1\), which is: \[\sum_{i=0}^{n-1} 2^i = 2^{n} - 1\] On the other hand, the number of strings of length \(n\) is \(2^{n}\). Since each Turing machine produces at most one output string, and there are more strings of length \(n\) than there are machines with encodings shorter than \(n\), it follows from the Pigeonhole Principle that at least one string \(X\) of length \(n\) cannot be produced by any machine with an encoding shorter than \(n\). For this string, \(\mathcal{K}(X) \ge |X|\). ◻
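A quick numeric check of the counting argument for a small length (n = 8 here is an arbitrary choice):

```python
# There are 2^n - 1 binary programs shorter than n, but 2^n strings of
# length n, so at least one string of each length has no shorter
# description.
n = 8
programs_shorter_than_n = sum(2**i for i in range(n))   # 2^8 - 1 = 255
strings_of_length_n = 2**n                              # 256
print(programs_shorter_than_n, strings_of_length_n)
```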
Strings that have a Kolmogorov complexity close to their length are often considered "random" or incompressible. This is because the shortest program to generate such strings is essentially a literal description of the string itself, bit by bit. There is no shorter, more concise way to describe these strings.
Connection between Kolmogorov Complexity and Shannon Entropy
This section explores the deep connection between Kolmogorov complexity and Shannon entropy, two fundamental concepts in information theory. We will show how these seemingly different measures of information are related through the idea of encoding and decoding.
Kolmogorov Encoding
Let \(A\) be an input alphabet, and let \(\mathcal{U}\) be a fixed Universal Turing Machine. We can define a Kolmogorov encoding that maps strings from \(A^*\) to binary strings in \(\{0, 1\}^*\). This encoding is defined as follows: \[\text{Encode}(X) = \langle \mathcal{M}\rangle\] where \(\langle \mathcal{M}\rangle\) is the binary encoding of the shortest Turing machine \(\mathcal{M}\) such that \(\mathcal{U}(\langle \mathcal{M}\rangle) = X\). In other words, the Kolmogorov encoding of a string \(X\) is the binary encoding of the shortest program that generates \(X\) when executed on the UTM \(\mathcal{U}\).
This encoding has the property of being uniquely decodable, meaning that given the encoded string, we can uniquely recover the original string. Furthermore, this encoding can be made prefix-free, which is a desirable property for codes.
First Direction: Kolmogorov to Shannon
Consider the Kolmogorov code for strings of length \(n\), mapping \(A^n\) to \(\{0, 1\}^*\). Let \(P\) be a probability distribution over the input alphabet \(A\), and suppose the \(n\) symbols of \(X\) are drawn independently and identically from \(P\). By Shannon’s first theorem (the source coding theorem), the expected length of any uniquely decodable code for strings of length \(n\) is bounded below by the entropy of the source. In the case of the Kolmogorov code, this means: \[\mathbb{E}[|\text{Encode}(X)|] \ge H_P(A^n) = n H(P)\] where \(H_P(A^n)\) is the entropy of the source producing strings of length \(n\), and \(H(P)\) is the entropy of the probability distribution \(P\) over the alphabet \(A\); the equality \(H_P(A^n) = n H(P)\) uses the i.i.d. assumption. The expectation is taken over all strings \(X\) of length \(n\).
Since the length of the Kolmogorov encoding of \(X\) is the Kolmogorov complexity of \(X\), we can rewrite the above inequality as: \[\mathbb{E}[\mathcal{K}(X)] \ge n H(P)\] Dividing both sides by \(n\), we get: \[\frac{\mathbb{E}[\mathcal{K}(X)]}{n} \ge H(P)\] This inequality shows that the average Kolmogorov complexity per symbol of a string of length \(n\) is bounded below by the Shannon entropy of the source.
Second Direction: Shannon to Kolmogorov
Now, let’s consider a uniquely decodable code \(C: A^n \to \{0, 1\}^*\) (we write \(C\) for the code to avoid confusion with the source distribution \(P\)). We can think of a UTM \(\mathcal{U}\) that takes as input the codeword \(C(X)\) and a binary string representing the decoder for \(C\), and outputs \(X\). This means that we can construct a program that, given the encoded string and the decoder, regenerates the original string.
The Kolmogorov complexity of \(X\) is then bounded above by the length of the codeword plus the length of the decoder: \[\mathcal{K}(X) \le |C(X)| + |\text{Decoder for } C|\] Taking the expectation over all strings of length \(n\), we get: \[\mathbb{E}[\mathcal{K}(X)] \le \mathbb{E}[|C(X)|] + |\text{Decoder for } C|\] For Shannon codes, we know that the expected codeword length is bounded by the entropy plus one: \[\mathbb{E}[|C(X)|] \le n H(P) + 1\] The length of the decoder for a Shannon code is essentially the Kolmogorov complexity of the probability distribution \(P\), which we denote as \(\mathcal{K}(P)\). Thus, we have: \[\mathbb{E}[\mathcal{K}(X)] \le n H(P) + 1 + \mathcal{K}(P)\] Dividing by \(n\), we get: \[\frac{\mathbb{E}[\mathcal{K}(X)]}{n} \le H(P) + \frac{1}{n} + \frac{\mathcal{K}(P)}{n}\]
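A minimal sketch of this direction: a description of \(X\) can consist of the codeword \(C(X)\) together with a decoder for the code, so the description length is \(|C(X)|\) plus a term that depends only on the code. The tiny prefix code below is an illustrative stand-in for a Shannon code.

```python
# A description of X = codeword C(X) + a decoder for the code.  The
# decoder's size depends only on the code (hence on the distribution P),
# not on X, which is the |Decoder for C| term in the bound above.
codebook = {"a": "0", "b": "10", "c": "11"}            # a prefix-free code
decode_table = {v: k for k, v in codebook.items()}

def decode(bits: str) -> str:                          # the "decoder"
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in decode_table:
            out.append(decode_table[buf])
            buf = ""
    return "".join(out)

x = "abca"
codeword = "".join(codebook[s] for s in x)             # C(X)
assert decode(codeword) == x
print(len(codeword), "bits for C(X), plus a fixed-size decoder")
```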
Combining the Directions
Combining the two directions, we have the following bounds: \[H(P) \le \frac{\mathbb{E}[\mathcal{K}(X)]}{n} \le H(P) + \frac{1}{n} + \frac{\mathcal{K}(P)}{n}\] As \(n\) approaches infinity, the terms \(\frac{1}{n}\) and \(\frac{\mathcal{K}(P)}{n}\) go to zero. Therefore, we have: \[\lim_{n \to \infty} \frac{\mathbb{E}[\mathcal{K}(X)]}{n} = H(P)\] This result demonstrates that the expected Kolmogorov complexity per symbol of a long string approaches the Shannon entropy of the source. This connection highlights the fundamental relationship between these two measures of information.
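Although \(\mathcal{K}(X)\) itself is not computable, any real compressor yields an upper bound on a string's description length, so the limiting relationship can be glimpsed empirically. The sketch below draws a long Bernoulli source string and compares zlib's compressed rate, a crude stand-in for \(\mathcal{K}(X)/n\), with the entropy \(H(P)\); the compressed rate should land somewhat above the entropy, consistent with the bounds.

```python
# Compare the per-symbol compressed length of a Bernoulli(p) string
# (an upper bound on its descriptive complexity per symbol) with the
# source entropy H(p).
import math
import random
import zlib

p, n = 0.1, 200_000
random.seed(0)
x = "".join("1" if random.random() < p else "0" for _ in range(n))

entropy = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
compressed_bits = 8 * len(zlib.compress(x.encode()))

print(f"H(P)      = {entropy:.3f} bits/symbol")
print(f"zlib rate = {compressed_bits / n:.3f} bits/symbol")
```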
Conclusion
In this lecture, we have explored the concept of Kolmogorov complexity, a measure of descriptive complexity that quantifies the length of the shortest program needed to generate a given object. We have also examined its profound connection to Shannon entropy, a measure of the average information content of a source.
We have demonstrated that Kolmogorov complexity is fundamentally linked to Shannon entropy through the concepts of encoding and decoding. By considering the Kolmogorov encoding, we established a lower bound on the average Kolmogorov complexity per symbol in terms of Shannon entropy. Conversely, by considering Shannon codes, we derived an upper bound on the average Kolmogorov complexity per symbol, also related to Shannon entropy.
The key result is that, as the length of the strings increases, the average Kolmogorov complexity per symbol of a long string approaches the Shannon entropy of the source. This convergence highlights the deep and intrinsic relationship between these two seemingly different measures of information. Kolmogorov complexity provides a program-centric view of information, while Shannon entropy provides a statistical view. The fact that they converge in the limit underscores the fundamental nature of information and its representation.
This connection not only enriches our understanding of both concepts but also provides a bridge between the theoretical world of computability and the practical world of information theory. The relationship is summarized by the two bounds and the limiting equality derived in the previous section.