KMP Algorithm and the Z-Algorithm

Author

Your Name

Published

February 8, 2025

Introduction

In this lecture, we will explore two efficient string searching algorithms: the Knuth-Morris-Pratt (KMP) algorithm and the Z-algorithm. Our discussion will begin with a review of the KMP algorithm, focusing on its core principles and time complexity. We will then introduce the Z-algorithm as an alternative approach to string matching, highlighting its unique methodology and broader applicability. We will delve into the definition and computation of the Z-function, and demonstrate how the Z-algorithm can be applied to solve the exact string matching problem. Finally, we will touch upon the relationship between the KMP and Z-algorithms, and briefly consider real-time processing aspects for both.

This lecture aims to:

  • Recap the key concepts and complexity of the KMP algorithm.

  • Introduce the Z-algorithm and the Z-function.

  • Explain how to use the Z-algorithm for exact string matching.

  • Describe the efficient computation of the Z-function.

  • Briefly discuss the connection between KMP and Z-algorithms.

  • Consider real-time processing implications for both algorithms.

KMP Algorithm: A Quick Recap

The Knuth-Morris-Pratt (KMP) algorithm is a linear-time string searching algorithm designed to efficiently find all occurrences of a pattern string \(P\) within a text string \(T\). The core idea behind KMP’s efficiency is to avoid redundant comparisons by leveraging information gained from previous matches. This is achieved through pre-processing the pattern \(P\) to compute a crucial auxiliary data structure called the prefix function (also known as the border array or failure function).

For a pattern \(P\) of length \(m\), the prefix function, denoted as \(sp(P)\) or \(\pi(P)\), is an array of length \(m\). For each position \(i\) (\(1 \le i \le m\)), \(sp(P)[i]\) (or \(sp_i(P)\)) is defined as the length of the longest proper prefix of \(P[1..i]\) that is also a suffix of \(P[1..i]\). A proper prefix of a string is a prefix that is not equal to the string itself.

In simpler terms, for each position \(i\) in the pattern \(P\), \(sp(P)[i]\) tells us the length of the longest string that is both a prefix and a suffix of the substring \(P[1..i]\), excluding \(P[1..i]\) itself. This information is critical for determining how much we can shift the pattern to the right when a mismatch occurs during the string matching process.

The prefix function essentially encodes the self-overlapping structure of the pattern. By knowing the prefix function, when a mismatch occurs at some position in the text, KMP uses the precomputed values to decide where to resume the pattern matching from, without having to backtrack in the text.

The computation of the prefix function \(sp(P)\) for a pattern \(P\) of length \(m\) can be done efficiently in \(O(m)\) time using dynamic programming. Once the prefix function is computed, the KMP algorithm can search for all occurrences of \(P\) in a text \(T\) of length \(n\) in \(O(n)\) time. Therefore, the overall time complexity of the KMP algorithm, including both pre-processing and searching, is \(O(n+m)\). This linear time complexity makes KMP significantly more efficient than naive string searching algorithms, especially when dealing with large texts and patterns or in scenarios where mismatches are frequent.
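To make the recap concrete, here is a minimal Python sketch of the standard \(O(m)\) prefix-function computation. It uses 0-based indexing (the prose above uses 1-based indices), and the name prefix_function is our own choice rather than anything fixed by the lecture.

```python
def prefix_function(p: str) -> list[int]:
    """sp[i] = length of the longest proper prefix of p[:i+1]
    that is also a suffix of p[:i+1] (0-based indexing)."""
    m = len(p)
    sp = [0] * m
    for i in range(1, m):
        k = sp[i - 1]                  # candidate border length from the previous position
        while k > 0 and p[i] != p[k]:  # shrink the border until it can be extended
            k = sp[k - 1]
        if p[i] == p[k]:               # extend the border by one character
            k += 1
        sp[i] = k
    return sp

# Example: the pattern "abacaba" has border lengths [0, 0, 1, 0, 1, 2, 3].
assert prefix_function("abacaba") == [0, 0, 1, 0, 1, 2, 3]
```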

The efficiency of KMP arises from the intelligent use of the prefix function to minimize comparisons. In a naive string matching approach, upon a mismatch, we might shift the pattern by just one position and restart the comparison from the beginning of the pattern. KMP, however, utilizes the prefix function to make larger shifts, skipping over portions of the text that are guaranteed not to yield a match based on the structure of the pattern itself. This optimized shifting is what leads to the linear time complexity and the practical effectiveness of the KMP algorithm.

Introducing the Z-Algorithm

Now, let’s turn our attention to the Z-algorithm, another linear-time string algorithm. The Z-algorithm is based on computing the Z-function of a string. The notes suggest that the Z-algorithm is a "more general" approach compared to KMP. Let’s explore this by defining the Z-function and examining its applications.

The Z-Function

Let \(A\) be a string of length \(n\). The Z-function of \(A\), denoted as \(Z(A)\) or simply \(Z\), is an array of integers of length \(n\). For each index \(i \in \{1, 2, ..., n\}\), \(Z[i]\) is the length of the longest common prefix (LCP) between the string \(A\) and the suffix of \(A\) starting at position \(i\), i.e., \(A[i..n]\). \[Z[i] = \text{length of LCP}(A[1..n], A[i..n])\]

In essence, \(Z[i]\) tells us how much of the string \(A\) starting from the first character matches with the string \(A\) starting from the \(i\)-th character. If \(Z[i] = k > 0\), it means that the first \(k\) characters of \(A\) are exactly the same as the first \(k\) characters of the suffix of \(A\) starting at position \(i\). If \(Z[i] = 0\), it means that the first character of \(A\) is different from the \(i\)-th character of \(A\), or there is no common prefix.

Let \(A = \text{"abacaba"}\). We calculate the Z-function \(Z(A)\) as follows:

  • \(Z[1] = \text{length of LCP}(\text{"abacaba"}, \text{"abacaba"}) = 7\). The entire string is a prefix of itself.

  • \(Z[2] = \text{length of LCP}(\text{"abacaba"}, \text{"bacaba"}) = 0\). The first characters ‘a’ and ‘b’ do not match.

  • \(Z[3] = \text{length of LCP}(\text{"abacaba"}, \text{"acaba"}) = 1\). Only the first character ‘a’ matches.

  • \(Z[4] = \text{length of LCP}(\text{"abacaba"}, \text{"caba"}) = 0\). The first characters ‘a’ and ‘c’ do not match.

  • \(Z[5] = \text{length of LCP}(\text{"abacaba"}, \text{"aba"}) = 3\). The prefix "aba" matches.

  • \(Z[6] = \text{length of LCP}(\text{"abacaba"}, \text{"ba"}) = 0\). The first characters ‘a’ and ‘b’ do not match.

  • \(Z[7] = \text{length of LCP}(\text{"abacaba"}, \text{"a"}) = 1\). Only the first character ‘a’ matches.

Thus, \(Z(\text{"abacaba"}) = [7, 0, 1, 0, 3, 0, 1]\).
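To make the definition concrete, the following small Python check computes the Z-function directly from the definition (0-based indexing, \(O(n^2)\) in the worst case) and reproduces the example above. The name z_naive is ours; the efficient linear-time computation is discussed later in this lecture.

```python
def z_naive(a: str) -> list[int]:
    """Z[i] = length of the longest common prefix of a and a[i:],
    computed directly from the definition (quadratic in the worst case)."""
    n = len(a)
    z = [0] * n
    for i in range(n):
        while i + z[i] < n and a[z[i]] == a[i + z[i]]:
            z[i] += 1
    return z

assert z_naive("abacaba") == [7, 0, 1, 0, 3, 0, 1]
```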

Exact String Matching using Z-Algorithm

One of the primary applications of the Z-algorithm is in exact string matching. To find all occurrences of a pattern \(P\) in a text \(T\), we ingeniously construct a new string \(S\) by concatenating the pattern \(P\), a special separator character \(\$\), and the text \(T\). The separator character \(\$\) must be chosen such that it does not appear in either the pattern \(P\) or the text \(T\). A common choice for \(\$\) is a character that is outside the alphabet of \(P\) and \(T\). Let \(S = P\$T\). We then compute the Z-function for this constructed string \(S\).

After computing \(Z(S)\), we look for indices \(i\) in the Z-array such that \(Z[i]\) is equal to the length of the pattern \(P\), i.e., \(Z[i] = |P|\). If we find such an index \(i\), and if \(i\) falls in the portion of \(S\) that corresponds to the text \(T\) (meaning \(i\) is beyond the pattern \(P\) and the separator \(\$\), specifically for \(i\) in the range \(|P|+2 \le i \le |S|\)), it signifies that we have found an occurrence of the pattern \(P\) in the text \(T\). The starting position of this occurrence in \(T\) (using 1-based indexing) is \(i - |P| - 1\).

The reason this works is that if \(Z[i] = |P|\), it means that the string \(S[1..|P|]\) (which is exactly \(P\)) is equal to the string \(S[i..i+|P|-1]\). Since we constructed \(S = P\$T\), if \(i > |P|+1\), then \(S[i..i+|P|-1]\) is a substring within \(T\). The condition \(Z[i] = |P|\) ensures that this substring is exactly equal to \(P\). The separator \(\$\) is crucial to prevent matches that might span across the boundary between \(P\) and \(T\).

Exact matching with the Z-algorithm (pseudocode):

\(S \gets P\$T\)
\(Z \gets \text{Zcalc}(S)\)
for \(i \gets |P| + 2\) to \(|S|\):
    if \(Z[i] = |P|\):
        Output "Pattern found at position" \(i - |P| - 1\)
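As a concrete illustration, here is a Python sketch of this matching procedure. It assumes a helper z_function(s) that returns the Z-array of s, for example the linear-time routine sketched in the next section (the naive z_naive above would also work); the function name find_pattern and the choice of a NUL byte as the separator are ours, picked purely for illustration.

```python
def find_pattern(p: str, t: str, sep: str = "\x00") -> list[int]:
    """Report all 1-based starting positions of p in t using the Z-array of
    s = p + sep + t.  Assumes `sep` occurs in neither p nor t, and that a
    z_function helper (e.g. the linear-time version below) is available."""
    s = p + sep + t
    z = z_function(s)                 # assumed helper: Z-array of s, 0-based
    positions = []
    # In 0-based terms, the text occupies indices len(p)+1 .. len(s)-1 of s.
    for i in range(len(p) + 1, len(s)):
        if z[i] == len(p):
            positions.append(i - len(p))   # 1-based position in t
    return positions

# Example: "aba" occurs in "abacababa" at 1-based positions 1, 5 and 7.
# print(find_pattern("aba", "abacababa"))  # -> [1, 5, 7]
```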

Z-Function Calculation

The power of the Z-algorithm lies not only in its conceptual simplicity but also in the efficiency with which the Z-function can be computed. A naive calculation of \(Z[i]\) for each \(i\) from 1 to \(n\) would involve comparing prefixes and suffixes character by character, leading to an \(O(n^2)\) time complexity for computing the entire Z-function. However, the Zcalc algorithm, presented below, cleverly computes the Z-function in linear time, \(O(n)\), by reusing previously computed values.

The algorithm employs a "sliding window" or "interval of prefix match" approach. It maintains two variables, \(l\) and \(r\), which define an interval \([l, r]\) that represents the rightmost extent of a prefix match found so far. Specifically, \([l, r]\) is the interval such that \(l\) is the starting position of some suffix in \(S\) (where a prefix match was found), \(r\) is the largest index such that \(S[l..r]\) is a prefix of \(S\), and \(r\) is maximized over all such intervals found so far. Initially, \(l=1\) and \(r=0\), indicating no prefix match interval has been established yet.

The algorithm iterates through the string \(S\) from the second position (\(i=2\)) to the end. For each position \(i\), it calculates \(Z[i]\) using one of two cases, based on whether the current position \(i\) falls within the current prefix match interval \([l, r]\).

Zcalc(\(S\)):

\(n \gets \text{length}(S)\)
Initialize \(Z\) as an array of size \(n\), all zeros.
\(l \gets 1\), \(r \gets 0\)
for \(i \gets 2\) to \(n\):
    if \(i > r\):  (Case 1)
        \(j \gets 0\)
        while \(i + j \le n\) and \(S[1 + j] = S[i + j]\): \(j \gets j + 1\)
        \(Z[i] \gets j\)
        \(l \gets i\), \(r \gets i + Z[i] - 1\)
    else:  (Case 2, \(i \le r\))
        \(k \gets i - l + 1\)
        if \(Z[k] < r - i + 1\):  (Case 2a)
            \(Z[i] \gets Z[k]\)
        else:  (Case 2b)
            \(j \gets \max(0, r - i + 1)\)
            while \(i + j \le n\) and \(S[1 + j] = S[i + j]\): \(j \gets j + 1\)
            \(Z[i] \gets j\)
            \(l \gets i\), \(r \gets i + Z[i] - 1\)
return \(Z\)
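As an illustration, here is a compact Python version of this computation. It is a sketch, not the only possible formulation: it folds Cases 1, 2a and 2b into a single loop (the min against the window edge plays the role of Case 2a, and the explicit comparison loop handles Cases 1 and 2b), uses 0-based indexing, and sets the first Z-value to \(n\) so that the output matches the earlier example.

```python
def z_function(s: str) -> list[int]:
    """Linear-time Z-array (0-based indexing), following the Zcalc structure."""
    n = len(s)
    z = [0] * n
    if n == 0:
        return z
    z[0] = n                  # by definition, the whole string matches itself
    l, r = 0, 0               # rightmost prefix-match window s[l..r] found so far
    for i in range(1, n):
        if i <= r:            # Case 2: i lies inside the window
            z[i] = min(r - i + 1, z[i - l])    # Case 2a value, capped at the window edge
        # Case 1, or Case 2b: extend the match by explicit character comparisons
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] - 1 > r:  # the new match reaches further right: update the window
            l, r = i, i + z[i] - 1
    return z

assert z_function("abacaba") == [7, 0, 1, 0, 3, 0, 1]
```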

Complexity Analysis of Zcalc

The Zcalc algorithm achieves linear time complexity, \(O(n)\), where \(n\) is the length of the input string \(S\). To understand this, we analyze how the indices \(i\) and \(r\) change throughout the execution. The outer loop iterates for \(i\) from 2 to \(n\), so \(i\) increases \(n-1\) times. The crucial observation is how the rightmost endpoint \(r\) of the prefix match interval behaves.

Initially, \(r=0\). Inside the loop, \(r\) is updated whenever explicit comparisons are performed, that is, in Case 1 and Case 2b, where it becomes \(i + Z[i] - 1 = i + j - 1\) with \(j = Z[i]\) the length of the newly found (or extended) prefix match. Importantly, \(r\) never decreases: in Case 1 we have \(i > r\), so the new value \(i + Z[i] - 1 \ge i - 1 \ge r\); in Case 2b the match extends at least up to \(r\), so the new value is again at least \(r\). Thus \(r\) starts at 0 and can only grow, up to at most \(n\).

Consider the number of character comparisons. Comparisons happen only inside the ‘while’ loops in Case 1 and Case 2b, and every such comparison examines a position \(i + j\) that lies strictly to the right of the current \(r\). Each successful comparison increments \(j\), and once the loop ends \(r\) is advanced past all the positions that were matched, so each position of \(S\) is successfully compared at most once. Since \(r\) only grows and is bounded by \(n\), the total number of successful comparisons across all iterations of the ‘while’ loops is at most \(n\). For unsuccessful comparisons, each iteration of the outer loop performs at most one failing comparison, namely the one that terminates the inner ‘while’ loop, so their number is also \(O(n)\).

Therefore, the total number of character comparisons is linear in \(n\). All other operations in the algorithm (initializations, assignments, comparisons of indices, subtractions, additions, and maximum operations) take constant time. Hence, the overall time complexity of the Zcalc algorithm is \(O(n)\).

Correctness of Z-Algorithm

The correctness of the Z-algorithm relies on the efficient and accurate computation of each \(Z[i]\) value. The algorithm correctly calculates \(Z[i]\) by considering two cases: when the current index \(i\) is outside the reach of the previously computed prefix match interval \([l, r]\) (Case 1), and when it is inside (Case 2).

In Case 1 (\(i > r\)), the algorithm performs a naive comparison from scratch to find the longest common prefix between \(S\) and \(S[i..n]\). This is the base case where no prior information is used.

In Case 2 (\(i \le r\)), the algorithm leverages the information from the interval \([l, r]\). It uses the value \(Z[k]\), where \(k = i - l + 1\). If \(Z[k]\) is small enough (Case 2a), namely \(Z[k] < r - i + 1\), so that the prefix match starting at position \(k\) ends strictly before the right end of the window \([1, r-l+1]\), then we can directly conclude that \(Z[i] = Z[k]\). This follows from the prefix property \(S[l..r] = S[1..r-l+1]\): the substring \(S[i..r]\) corresponds to \(S[k..r-l+1]\), and the character that caused the mismatch at position \(k\) also lies inside the window, so the match at \(i\) stops at exactly the same length as the match at \(k\).

If \(Z[k]\) is large (Case 2b), i.e., \(Z[k] \ge r - i + 1\), the potential prefix match at \(i\) might extend beyond \(r\). In this case we already know that a prefix of length at least \(r-i+1\) matches (because \(S[i..r] = S[k..r-l+1] = S[1..r-i+1]\)). The algorithm therefore starts from \(j = r - i + 1\) and continues comparing \(S[1+j]\) against \(S[i+j]\), beginning with \(S[r-i+2]\) against \(S[r+1]\), to extend the match and find the true \(Z[i]\).

By correctly handling these two cases and efficiently updating the interval \([l, r]\), the Zcalc algorithm ensures that each \(Z[i]\) is computed accurately and efficiently in linear time.

Conceptual Relation between KMP and Z-Algorithm

The lecture notes touch upon a "Reduction of KMP to Z," suggesting a deeper connection between these two linear-time string algorithms. While it’s not a reduction in the strict computational complexity sense (where one algorithm directly uses the other as a subroutine to achieve a complexity result), the relationship is more about shared algorithmic design principles and conceptual similarities in how they achieve efficiency.

Both the Knuth-Morris-Pratt (KMP) algorithm and the Z-algorithm are celebrated for their linear time complexity in string searching and related problems. This efficiency stems from their ability to avoid redundant character comparisons by intelligently leveraging information about prefixes and previously computed matches. At a high level, both algorithms embody a form of dynamic programming or memoization, where results from prior computations are used to speed up subsequent steps.

In KMP, the cornerstone is the prefix function, \(sp(P)\). This function, precomputed for the pattern \(P\), encapsulates knowledge about the internal structure of \(P\), specifically its prefix-suffix overlaps. When a mismatch occurs during the search in the text \(T\), KMP consults the prefix function to determine the optimal shift of the pattern. This shift is not arbitrary; it’s precisely calculated to align the longest possible prefix of \(P\) that is also a suffix of the already-matched portion of the text, thus preventing unnecessary backtracking and comparisons.

The Z-algorithm, on the other hand, revolves around the Z-function, \(Z(S)\), of a string \(S\). The Z-function identifies, for each position \(i\) in \(S\), the length of the longest prefix of \(S\) that matches a prefix of the suffix starting at \(i\). The efficient computation of the Z-function, achieved by the Zcalc algorithm, is where the algorithm’s ingenuity shines. It uses the interval \([l, r]\) to keep track of the rightmost extent of a prefix match found so far. By checking if the current index \(i\) falls within this interval, the algorithm can reuse previously computed Z-values to either directly determine \(Z[i]\) or to initialize the comparison process, significantly reducing redundant work.

Conceptual Parallel between KMP and Z-Algorithm

While there isn’t a straightforward reduction where you can directly plug in the Z-algorithm to compute the KMP prefix function, or vice versa, the "reduction" hinted at in the notes is more about recognizing that both algorithms stem from a similar philosophy of efficient string processing. The techniques employed in the Z-algorithm, particularly the dynamic maintenance of the \([l, r]\) interval and the reuse of Z-values, resonate with the optimization strategies in KMP. Understanding the Z-algorithm can indeed provide a fresh perspective on string matching and potentially inspire new ways to think about or optimize KMP-related problems.

The notes’ mention of modified prefix functions \(sp'(P)\) and \(sp'_{i,x}(P)\) further suggests an exploration of hybrid approaches or variations that might draw upon the strengths of both KMP and Z-algorithm paradigms, especially in specialized scenarios like real-time string processing. The quest for even more efficient or adaptable string algorithms often involves cross-pollination of ideas and techniques from algorithms like KMP and Z, which, despite their differences, share a common goal of linear-time string processing through intelligent prefix management.

Real-time Considerations for KMP and Z-Algorithm

An important aspect in algorithm design is their applicability to real-time or online scenarios, where input data arrives sequentially, as a stream, rather than being available all at once. The lecture notes pose the question: "How do KMP and Z-algorithms behave if the text \(T\) arrives as a stream of characters?" Let’s delve into the real-time capabilities of both algorithms.

KMP for Streaming Text

The Knuth-Morris-Pratt (KMP) algorithm is inherently well-suited for real-time string matching. Its design allows it to process the text \(T\) character by character as it arrives, without needing to know the entire text in advance. This makes KMP a natural fit for streaming applications.

In essence, KMP elegantly decouples the pattern analysis (precomputation) from the text processing (online matching). This decoupling is what enables its real-time capability. For each incoming character, KMP makes a decision based on its current state and the precomputed prefix function, allowing it to efficiently determine if a pattern occurrence is completed or if the matching process should continue.
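To illustrate this decoupling, here is a hedged Python sketch of an online KMP matcher that consumes the text one character at a time and reports, in \(O(1)\) amortized time per character, whether an occurrence of the pattern just ended. The class and method names (StreamingKMP, feed) are our own invention for illustration, not a standard API.

```python
class StreamingKMP:
    """Online KMP matcher: feed text characters one at a time."""

    def __init__(self, pattern: str):
        self.p = pattern
        self.sp = self._prefix_function(pattern)  # precomputed once, from the pattern alone
        self.state = 0   # length of the longest pattern prefix matching a suffix of the text seen so far

    @staticmethod
    def _prefix_function(p: str) -> list[int]:
        sp = [0] * len(p)
        for i in range(1, len(p)):
            k = sp[i - 1]
            while k > 0 and p[i] != p[k]:
                k = sp[k - 1]
            if p[i] == p[k]:
                k += 1
            sp[i] = k
        return sp

    def feed(self, c: str) -> bool:
        """Consume one text character; return True iff an occurrence ends here."""
        while self.state > 0 and c != self.p[self.state]:
            self.state = self.sp[self.state - 1]   # fall back using the prefix function
        if c == self.p[self.state]:
            self.state += 1
        if self.state == len(self.p):              # full match: report and keep scanning
            self.state = self.sp[self.state - 1]
            return True
        return False

# Example: occurrences of "aba" in the stream "abacababa" end at positions 3, 7 and 9.
matcher = StreamingKMP("aba")
ends = [i + 1 for i, c in enumerate("abacababa") if matcher.feed(c)]
assert ends == [3, 7, 9]
```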

Z-Algorithm and Streaming Text: Challenges and Adaptations

In contrast to KMP, the standard Z-algorithm, as described by Zcalc, is designed to compute the Z-function for a complete, static string. Directly applying Zcalc to a streaming text presents challenges because the algorithm, in its original form, assumes random access to the entire string.

The lecture notes mention modified prefix functions \(sp'(P)\) and \(sp'_{i,x}(P)\), which could be relevant to real-time adaptations. These modifications might offer a way to optimize state transitions in a KMP-like algorithm or to incorporate Z-algorithm-like efficiency into processing streaming text. The definitions suggest focusing on optimizing shifts based on the next character encountered in the text stream, aiming for constant-time processing per character.

In conclusion, while KMP is inherently real-time ready, the standard Z-algorithm requires significant adaptation to handle streaming text. Research into real-time string algorithms often explores modifications and hybrid approaches to combine the strengths of algorithms like KMP and Z, seeking efficient and memory-friendly solutions for processing continuous streams of textual data. The modified prefix functions hinted at in the notes likely represent steps in this direction, aiming to achieve faster and more responsive real-time string matching.

Conclusion

In this lecture, we have explored two fundamental and efficient string algorithms: the Knuth-Morris-Pratt (KMP) algorithm and the Z-algorithm. We began by revisiting the KMP algorithm, emphasizing its linear time complexity and the crucial role of the prefix function in optimizing string searching. We then introduced the Z-algorithm, focusing on the concept of the Z-function and how it can be efficiently computed in linear time using the Zcalc algorithm. We demonstrated the application of the Z-algorithm to the exact string matching problem, showcasing its versatility as an alternative to KMP.

Throughout the lecture, we highlighted the conceptual connections between KMP and the Z-algorithm, noting their shared principles of leveraging prefix information and avoiding redundant comparisons to achieve linear time complexity. While a direct computational reduction from KMP to the Z-algorithm (or vice versa) may not be straightforward, both algorithms embody a similar spirit of efficient string processing through dynamic programming-like techniques.

We also addressed the important aspect of real-time processing, emphasizing KMP’s inherent suitability for streaming text data. We discussed the challenges in directly applying the standard Z-algorithm to streaming scenarios and briefly touched upon potential adaptation strategies and the role of modified prefix functions, such as \(sp'(P)\) and \(sp'_{i,x}(P)\), in optimizing real-time string matching.

The Z-algorithm, often considered more conceptually straightforward and sometimes easier to implement than KMP, offers a powerful and versatile tool for a range of string problems beyond just exact string matching. Its linear time complexity and elegant approach make it a valuable addition to the toolkit of any computer scientist or programmer dealing with string manipulation and pattern matching tasks.

Follow-up Questions and Topics for Further Exploration:

Exercises

  1. Calculate Z-function: Compute the Z-function for the string "abababa" manually, following the definition of the Z-function. Verify your result using the Zcalc algorithm.

  2. String Matching with Z-Algorithm: Use the Z-algorithm to find all occurrences of the pattern "aba" in the text "abacababa". Show the constructed string \(S\), the computed Z-function \(Z(S)\), and identify the positions where the pattern is found.

  3. Implement Z-Algorithm: Implement the Zcalc algorithm in your favorite programming language. Ensure your implementation correctly computes the Z-function for various input strings. Test it with examples, including edge cases like empty strings, strings with no repeating prefixes, and strings with long prefix repetitions.

  4. Performance Comparison:

    1. Implement a naive string matching algorithm that compares the pattern with every possible substring of the text.

    2. Implement the Z-algorithm based string matching using your Zcalc implementation.

    3. Conduct experiments to compare the execution time of both algorithms for different pattern lengths (e.g., lengths 5, 10, 20, 50) and text lengths (e.g., lengths 100, 1000, 10000, 100000). Use randomly generated strings or real-world text data.

    4. Plot graphs to visualize the performance difference as pattern and text lengths vary. Analyze and discuss your observations. When does the Z-algorithm significantly outperform the naive approach?

  5. (Advanced) KMP Prefix Function and Z-Function Relationship:

    1. Research and describe in detail the relationship between the KMP prefix function and the Z-function.

    2. Can the KMP prefix function be computed efficiently using the Z-algorithm, or vice-versa? If yes, provide an algorithm or method. If not, explain why and discuss the conceptual similarities and differences that lead to their respective efficiencies.

    3. Explore if there are hybrid approaches that combine the strengths of both algorithms for specific string processing tasks.

  6. (Research) Applications Beyond Exact Matching:

    1. Investigate and list at least three applications of the Z-algorithm beyond exact string matching. For each application, briefly describe the problem and how the Z-algorithm is used to solve it efficiently. Examples might include:

      • Finding all palindromic substrings

      • Longest repeating substring

      • String compression algorithms

      • Bioinformatics sequence alignment

    2. Choose one application from your list and delve deeper. Explain the algorithm in detail, analyze its time complexity, and discuss its advantages and limitations compared to other approaches for the same problem.