Rabin-Karp Algorithm for Pattern Matching
Introduction
Pattern matching is a fundamental problem in computer science with wide-ranging applications, including text searching, bioinformatics, and network security. The core task is to locate instances of a specific pattern, which is a short string, within a larger text string.
Consider the problem of finding all occurrences of a pattern \(P\) of length \(m\) in a text \(T\) of length \(n\). A straightforward approach is the naive method, which slides the pattern across the text one position at a time and compares character by character at each shift. While simple to understand and implement, the naive approach is inefficient in the worst case, requiring \(O(nm)\) time.
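For reference, a minimal sketch of the naive method in Python (the function name and the slice-based comparison are illustrative choices):

```python
def naive_match(T: str, P: str) -> list[int]:
    """Slide P across T and compare at every shift: O(n*m) worst case."""
    n, m = len(T), len(P)
    return [s for s in range(n - m + 1) if T[s:s + m] == P]
```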
The Rabin-Karp algorithm offers an efficient alternative for string searching by employing hashing to detect pattern occurrences within a text. It is based on the principle of using a hash function to rapidly assess whether a substring of the text has the potential to match the pattern. By converting strings into numerical hash values, the algorithm can perform quick comparisons, significantly reducing the number of character-by-character checks needed.
This lecture note will explore the Rabin-Karp algorithm in detail, covering its underlying principles, implementation steps, and complexity analysis. We will also briefly touch upon other pattern matching techniques that were mentioned in the lecture, providing a broader context for string searching algorithms.
Pattern Matching Problem and Pre-processing
Problem Definition
Definition 1 (Pattern Matching Problem). Given a text \(T\) of length \(n\) and a pattern \(P\) of length \(m\), both defined over an alphabet \(\Sigma\), the pattern matching problem is to find all occurrences of \(P\) in \(T\). This involves identifying all starting positions \(s\) in \(T\) such that the substring of \(T\) starting at index \(s+1\) and of length \(m\), denoted as \(T[s+1 \dots s+m]\), is identical to the pattern \(P[1 \dots m]\).
Pre-processing Approaches
Pre-processing is a crucial step in optimizing pattern matching algorithms. By investing time upfront to analyze either the pattern or the text, we can significantly speed up the subsequent search process. Pre-processing techniques can be broadly categorized based on what is being pre-processed: the pattern or the text itself.
Pre-processing the Pattern
Pre-processing the pattern is particularly effective when we are searching for a fixed pattern in various texts or a streaming text. The goal is to extract information from the pattern that will help us quickly identify or reject potential matches as we scan the text. Common techniques include:
Finite Automata: This approach involves constructing a deterministic finite automaton (DFA) that recognizes occurrences of the pattern. The DFA is built based on the pattern and can then process the text in linear time. For each character in the text, the automaton transitions to a new state, indicating the progress in matching the pattern. When the automaton reaches an accepting state, a pattern match is found. While the search phase is very efficient, the pre-processing time to construct the automaton is \(O(m \cdot |\Sigma|)\), where \(m\) is the length of the pattern and \(|\Sigma|\) is the size of the alphabet. This method is particularly suitable when the alphabet size is relatively small.
Knuth-Morris-Pratt (KMP) Algorithm: The KMP algorithm also pre-processes the pattern, but instead of building a complete automaton, it computes a prefix function (also known as a failure function). For each position \(i\), the prefix function records the length of the longest proper prefix of \(P[1 \dots i]\) that is also a suffix of \(P[1 \dots i]\). This information is then used to avoid redundant comparisons when a mismatch occurs during the search. Specifically, upon a mismatch, the KMP algorithm uses the prefix function to shift the pattern to the right by a certain amount, resuming the comparison from a more informed position in the text. The pre-processing time for KMP is \(O(m)\), and the search time is \(O(n)\), resulting in an overall linear time complexity of \(O(n+m)\). The KMP algorithm is efficient and widely used due to its optimal time complexity and relatively simple implementation; a short sketch of the prefix-function computation follows below.
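To make this concrete, here is a minimal sketch of the prefix-function computation (the 0-based indexing and names are our own choices):

```python
def prefix_function(P: str) -> list[int]:
    """pi[i] = length of the longest proper prefix of P[:i+1]
    that is also a suffix of P[:i+1]; computed in O(m)."""
    m = len(P)
    pi = [0] * m
    k = 0  # length of the border currently being extended
    for i in range(1, m):
        while k > 0 and P[k] != P[i]:
            k = pi[k - 1]  # fall back to the next shorter border
        if P[k] == P[i]:
            k += 1
        pi[i] = k
    return pi
```

For example, `prefix_function("ababaca")` returns `[0, 0, 1, 2, 3, 0, 1]`.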
Pre-processing the Text
Pre-processing the text is advantageous when the text is static, and we need to perform multiple searches for different patterns within it. By building an index of the text, we can significantly speed up each subsequent search. A prominent example is:
- Suffix Tree: A suffix tree is a powerful tree-based data structure that represents all suffixes of a given text. Each path from the root to a leaf in the suffix tree corresponds to a suffix of the text. Suffix trees can be constructed in linear time, \(O(n)\), where \(n\) is the length of the text. Once a suffix tree is built for a text, searching for a pattern of length \(m\) can be done in \(O(m)\) time, plus the time to report occurrences. Suffix trees are incredibly versatile and support a wide range of string operations beyond basic pattern matching, such as finding the longest common substring, all repeated substrings, and more. They are particularly useful in applications like bioinformatics and text indexing where complex string queries are common.
The lecture notes also mention "advanced applications (using Tanan-Harel)" and "elementary applications." "Tanan-Harel" is almost certainly a misrendering of "Harel-Tarjan," the algorithm of Harel and Tarjan for answering lowest common ancestor (LCA) queries in constant time after linear pre-processing; combined with suffix trees, fast LCA queries enable sophisticated string operations such as constant-time longest-common-extension queries. "Elementary applications" presumably refers to basic uses of pattern matching, such as the search functionality of text editors. In these notes, however, we focus on the core techniques: Rabin-Karp, KMP, finite automata, and suffix trees.
Choosing the appropriate pre-processing technique depends on the specific application requirements, such as the frequency of searches, the variability of patterns, and the size and mutability of the text. For single pattern searches in variable texts, algorithms like KMP or Rabin-Karp are often preferred due to their efficiency and ease of implementation. For scenarios requiring multiple searches in a static text, especially with complex queries, suffix trees offer unparalleled performance after the initial pre-processing phase.
Rabin-Karp Algorithm
The Rabin-Karp algorithm employs hashing to solve the pattern matching problem. The central idea is to compute a hash value for the pattern and then compute hash values for substrings of the text that are the same length as the pattern. By comparing these hash values, we can quickly identify potential matches. When the hash value of a text substring matches the hash value of the pattern, we perform a character-by-character verification to confirm an actual match. This verification step is crucial for resolving potential spurious hits, also known as hash collisions, where different strings might have the same hash value.
Hashing for Pattern Matching
The Rabin-Karp algorithm leverages the concept of representing strings as numbers to enable efficient comparison. We can interpret a string of characters as a number in base \(d\), where \(d\) is the size of the alphabet \(\Sigma\). Each character is mapped to a numerical value. For instance, if we consider the alphabet of decimal digits \(\Sigma= \{0, 1, \dots, 9\}\), we can treat the string "123" as the number \(1 \times 10^2 + 2 \times 10^1 + 3 \times 10^0 = 123\). For alphabets like ASCII or Unicode, we can similarly map each character to its ASCII or Unicode value.
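To make this concrete, a small sketch of the base-\(d\) interpretation in Python, before any modular reduction; the function name and the character-to-value mappings are illustrative assumptions:

```python
def string_to_number(s, d, value_of):
    """Interpret s as a base-d number under the given character mapping."""
    num = 0
    for ch in s:
        num = num * d + value_of(ch)  # shift left by one base-d digit, add the new one
    return num

print(string_to_number("123", 10, int))   # 123, the decimal example above
print(string_to_number("abc", 256, ord))  # the same idea over byte values
```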
Given a pattern \(P = P[1 \dots m]\) and a text \(T = T[1 \dots n]\), our goal is to efficiently compare the pattern with every substring of \(T\) of length \(m\). Let’s consider a substring of \(T\) starting at position \(s+1\), denoted as \(T[s+1 \dots s+m]\).
To apply hashing, we compute a hash value for the pattern \(P\) and, for each possible shift \(s\), a hash value for the substring \(T[s+1 \dots s+m]\). If the hash value of a substring matches the hash value of the pattern, it indicates a potential match. We then need to perform a character-by-character comparison between \(P\) and \(T[s+1 \dots s+m]\) to verify if it is a true match or a spurious hit (a collision).
Choosing a Hash Function and Modulo Arithmetic
Directly computing and comparing hash values as large integers can be computationally expensive and may lead to overflow issues, especially with long patterns and texts. To manage the size of hash values and ensure computational efficiency, we use modulo arithmetic. We choose a prime number \(q\) as the modulus and compute all hash values modulo \(q\). This keeps the hash values within a manageable range and allows for faster computations.
For a pattern \(P = P[1 \dots m]\), we calculate its hash value \(p\) as follows: \[p = \left( \sum_{i=1}^{m} P[i] \cdot d^{m-i} \right) \pmod{q} = \left( P[1] \cdot d^{m-1} + P[2] \cdot d^{m-2} + \dots + P[m] \cdot d^0 \right) \pmod{q}\] Here, \(P[i]\) represents the numerical value of the \(i\)-th character of the pattern, and \(d = |\Sigma|\) is the base, typically the size of the alphabet. All calculations are performed modulo \(q\).
Similarly, for a substring \(T[s+1 \dots s+m]\), its hash value \(t_s\) is computed as: \[t_s = \left( \sum_{i=1}^{m} T[s+i] \cdot d^{m-i} \right) \pmod{q} = \left( T[s+1] \cdot d^{m-1} + T[s+2] \cdot d^{m-2} + \dots + T[s+m] \cdot d^0 \right) \pmod{q}\]
The choice of the prime modulus \(q\) is crucial. A sufficiently large prime number helps to minimize the probability of hash collisions. If \(q\) is too small, many non-matching substrings might have the same hash value as the pattern, leading to frequent spurious hits and increasing the verification overhead.
Horner’s Rule for Efficient Hash Calculation
To efficiently compute the hash values, especially for successive substrings in the text, we employ Horner’s rule. Horner’s rule provides an optimized way to evaluate polynomials. For calculating the pattern hash \(p\), we can rewrite the polynomial evaluation in a more efficient iterative form:
\[p = (\dots ((P[1] \cdot d + P[2]) \cdot d + P[3]) \dots ) \cdot d + P[m] \pmod{q}\] This formula allows us to compute \(p\) iteratively, starting from the first character of the pattern and accumulating the hash value as we process each subsequent character.
For example, to compute the hash of pattern \(P = "123"\) with base \(d=10\) and modulus \(q\), using Horner’s rule, we would calculate: \[\begin{aligned}h_1 &= P[1] = 1 \\h_2 &= (h_1 \cdot d + P[2]) \pmod{q} = (1 \cdot 10 + 2) \pmod{q} = 12 \pmod{q} \\p = h_3 &= (h_2 \cdot d + P[3]) \pmod{q} = (12 \cdot 10 + 3) \pmod{q} = 123 \pmod{q}\end{aligned}\]
This iterative approach is more efficient than direct polynomial evaluation, especially when combined with the rolling hash technique for processing text substrings.
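Folding the modular reduction into Horner's loop yields the hash computation itself; a minimal sketch, with the digit mapping left as a parameter (our choice, not part of the algorithm's specification):

```python
def horner_hash(s, d, q, value_of=int):
    """Hash s as a base-d number modulo q via Horner's rule, in O(len(s))."""
    h = 0
    for ch in s:
        h = (h * d + value_of(ch)) % q  # reduce at each step to keep h < q
    return h

print(horner_hash("123", 10, 11))  # 123 mod 11 = 2
```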
Rolling Hash for Substring Hash Updates
A key optimization in the Rabin-Karp algorithm is the efficient update of hash values for successive substrings of the text. Instead of computing the hash value for each substring from scratch, we can use a "rolling hash" technique to compute the hash value of the next substring in \(O(1)\) time, given the hash value of the current substring.
Suppose we have already computed the hash value \(t_s\) for the substring \(T[s+1 \dots s+m]\). We want to compute the hash value \(t_{s+1}\) for the next substring \(T[s+2 \dots s+m+1]\). We have: \[t_s = (T[s+1] \cdot d^{m-1} + T[s+2] \cdot d^{m-2} + \dots + T[s+m] \cdot d^0) \pmod{q}\] \[t_{s+1} = (T[s+2] \cdot d^{m-1} + T[s+3] \cdot d^{m-2} + \dots + T[s+m+1] \cdot d^0) \pmod{q}\]
To efficiently transition from \(t_s\) to \(t_{s+1}\), we can use the following update formula. First, pre-calculate \(h = d^{m-1} \pmod{q}\). Then, the update formula is: \[t_{s+1} = (d \cdot (t_s - T[s+1] \cdot h) + T[s+m+1]) \pmod{q}\] This formula works because we are effectively:
Subtracting the contribution of the leading character \(T[s+1]\) from \(t_s\). Since \(T[s+1]\) is multiplied by \(d^{m-1}\) in the polynomial representation of \(t_s\), we subtract \(T[s+1] \cdot d^{m-1} \pmod{q}\), which is \(T[s+1] \cdot h \pmod{q}\).
Multiplying the remaining hash value by \(d\), which effectively shifts all the powers of \(d\) up by one.
Adding the contribution of the new trailing character \(T[s+m+1]\), which is multiplied by \(d^0 = 1\).
All operations are performed modulo \(q\). This rolling hash technique allows us to update the hash value in constant time, \(O(1)\), assuming basic arithmetic operations take constant time.
For the initial substring \(T[1 \dots m]\), we compute its hash value \(t_0\) using Horner’s rule, just as we did for the pattern hash \(p\). Then, for each subsequent position \(s = 1, 2, \dots, n-m\), we use the rolling hash update formula to compute \(t_s\) from \(t_{s-1}\).
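The update itself is a one-liner; a hedged sketch with our own parameter names follows. Note that Python's `%` operator already returns a non-negative result for a positive modulus, so the possibly negative intermediate value needs no special handling; in a language like C, one would add \(q\) before the final reduction.

```python
def roll(t_s, lead, trail, d, h, q):
    """O(1) rolling update: hash of T[s+2..s+m+1] from hash of T[s+1..s+m].

    lead  -- numeric value of the outgoing character T[s+1]
    trail -- numeric value of the incoming character T[s+m+1]
    h     -- d^(m-1) mod q, pre-computed once
    """
    return (d * (t_s - lead * h) + trail) % q
```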
Rabin-Karp Algorithm Procedure
The complete Rabin-Karp algorithm is outlined below.
Rabin-Karp-Matcher(\(T, P, d, q\)):
\(n \leftarrow \text{length}(T)\); \(m \leftarrow \text{length}(P)\)
\(h \leftarrow d^{m-1} \pmod{q}\)
\(p \leftarrow 0\); \(t_0 \leftarrow 0\)
for \(i = 1\) to \(m\): (pre-processing)
    \(p \leftarrow (d \cdot p + P[i]) \pmod{q}\)
    \(t_0 \leftarrow (d \cdot t_0 + T[i]) \pmod{q}\)
for \(s = 0\) to \(n-m\): (matching)
    if \(p = t_s\) and \(P[1 \dots m] = T[s+1 \dots s+m]\): print "Pattern occurs with shift" \(s\)
    if \(s < n-m\): \(t_{s+1} \leftarrow (d \cdot (t_s - T[s+1] \cdot h) + T[s+m+1]) \pmod{q}\)
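Below is a runnable Python rendering of this pseudocode, hedged in two places: the default base \(d = 256\) (a byte alphabet) and modulus \(q = 10^9 + 7\) are illustrative choices, and the function reports 0-based shifts:

```python
def rabin_karp(T: str, P: str, d: int = 256, q: int = 1_000_000_007) -> list[int]:
    """Return all 0-based shifts s with T[s:s+m] == P."""
    n, m = len(T), len(P)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)                # d^(m-1) mod q, for the rolling update
    p = t = 0
    for i in range(m):                  # pre-processing: Horner's rule
        p = (d * p + ord(P[i])) % q
        t = (d * t + ord(T[i])) % q
    matches = []
    for s in range(n - m + 1):          # matching
        if p == t and T[s:s + m] == P:  # verify to reject spurious hits
            matches.append(s)
        if s < n - m:                   # rolling hash update
            t = (d * (t - ord(T[s]) * h) + ord(T[s + m])) % q
    return matches
```

For instance, `rabin_karp("31415926535", "26")` returns `[6]`, agreeing with the walkthrough in the next section.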
Example Walkthrough
Let’s illustrate the Rabin-Karp algorithm with an example. Suppose we want to find the pattern \(P = "26"\) in the text \(T = "31415926535"\). Let’s choose base \(d=10\) and a prime modulus \(q=11\).
Pre-processing:
\(m = 2\), \(P = "26"\), \(T = "31415926535"\), \(d = 10\), \(q = 11\).
Calculate pattern hash \(p\): \(p = (2 \times 10^1 + 6 \times 10^0) \pmod{11} = 26 \pmod{11} = 4\).
Pre-calculate \(h = d^{m-1} \pmod{q} = 10^{2-1} \pmod{11} = 10 \pmod{11} = 10\).
Calculate initial text substring hash \(t_0\) for \(T[1 \dots 2] = "31"\): \(t_0 = (3 \times 10^1 + 1 \times 10^0) \pmod{11} = 31 \pmod{11} = 9\).
Matching:
\(s=0\): \(t_0 = 9 \neq p = 4\). No match.
\(s=1\): Calculate \(t_1\) for \(T[2 \dots 3] = "14"\). Using rolling hash: \(t_1 = (10 \cdot (t_0 - T[1] \cdot h) + T[3]) \pmod{11} = (10 \cdot (9 - 3 \times 10) + 4) \pmod{11} = (10 \cdot (9 - 30) + 4) \pmod{11} = (10 \cdot (-21) + 4) \pmod{11} = (-210 + 4) \pmod{11} = -206 \pmod{11} = 3\). Alternatively, directly calculate \(t_1 = (1 \times 10^1 + 4 \times 10^0) \pmod{11} = 14 \pmod{11} = 3\). \(t_1 = 3 \neq p = 4\). No match.
\(s=2\): Calculate \(t_2\) for \(T[3 \dots 4] = "41"\). \(t_2 = (10 \cdot (t_1 - T[2] \cdot h) + T[4]) \pmod{11} = (10 \cdot (3 - 1 \times 10) + 1) \pmod{11} = (10 \cdot (-7) + 1) \pmod{11} = -69 \pmod{11} = 8\). Direct check: \(41 \pmod{11} = 8\). \(t_2 = 8 \neq p = 4\). No match.
\(s=3\): Calculate \(t_3\) for \(T[4 \dots 5] = "15"\). \(t_3 = (10 \cdot (t_2 - T[3] \cdot h) + T[5]) \pmod{11} = (10 \cdot (8 - 4 \times 10) + 5) \pmod{11} = (10 \cdot (-32) + 5) \pmod{11} = -315 \pmod{11} = 4\). \(t_3 = 4 = p\). Hash match! Verify: \(T[4 \dots 5] = "15" \neq P = "26"\). Spurious hit!
\(s=4\): Calculate \(t_4\) for \(T[5 \dots 6] = "59"\). \(t_4 = (10 \cdot (t_3 - T[4] \cdot h) + T[6]) \pmod{11} = (10 \cdot (4 - 1 \times 10) + 9) \pmod{11} = (10 \cdot (-6) + 9) \pmod{11} = -51 \pmod{11} = 4\). \(t_4 = 4 = p\). Hash match! Verify: \(T[5 \dots 6] = "59" \neq P = "26"\). Spurious hit!
\(s=5\): Calculate \(t_5\) for \(T[6 \dots 7] = "92"\). \(t_5 = (10 \cdot (t_4 - T[5] \cdot h) + T[7]) \pmod{11} = (10 \cdot (4 - 5 \times 10) + 2) \pmod{11} = (10 \cdot (-46) + 2) \pmod{11} = -458 \pmod{11} = 4\). \(t_5 = 4 = p\). Hash match! Verify: \(T[6 \dots 7] = "92" \neq P = "26"\). Spurious hit!
\(s=6\): Calculate \(t_6\) for \(T[7 \dots 8] = "26"\). \(t_6 = (10 \cdot (t_5 - T[6] \cdot h) + T[8]) \pmod{11} = (10 \cdot (4 - 9 \times 10) + 6) \pmod{11} = (10 \cdot (4 - 90) + 6) \pmod{11} = (10 \cdot (-86) + 6) \pmod{11} = (-860 + 6) \pmod{11} = -854 \pmod{11} = 4\). \(t_6 = 4 = p = 4\). Hash match! Verify: \(T[7 \dots 8] = "26" = P = "26"\). Match found at shift \(s=6\).
The remaining shifts produce no further matches: \(t_7 = 65 \pmod{11} = 10\), \(t_8 = 53 \pmod{11} = 9\), and \(t_9 = 35 \pmod{11} = 2\), none of which equals \(p = 4\).
In this example, we found one true match at shift 6 and three spurious hits at shifts 3, 4, and 5, each correctly identified and rejected during the verification step. The high collision rate is no accident: with \(q = 11\), every window whose value is congruent to \(26 \pmod{11}\) collides with the pattern; a larger prime modulus would make such spurious hits far rarer.
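The walkthrough can be verified mechanically; the following short script, under the same parameters (\(d = 10\), \(q = 11\), digit values via `int`), prints the spurious hits at shifts 3, 4, and 5 and the true match at shift 6:

```python
T, P, q = "31415926535", "26", 11
m = len(P)
p = int(P) % q  # pattern hash: 26 mod 11 = 4
for s in range(len(T) - m + 1):
    window = T[s:s + m]
    t_s = int(window) % q  # hash of the current window
    if t_s == p:
        kind = "true match" if window == P else "spurious hit"
        print(f"s={s}: window {window}, hash {t_s} -> {kind}")
```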
Complexity Analysis
The Rabin-Karp algorithm’s efficiency is determined by its time complexity, which can be analyzed in terms of pre-processing and matching phases.
Time Complexity
Pre-processing Time: The pre-processing step computes the hash value of the pattern \(P\) of length \(m\) using Horner’s rule and pre-calculates \(h = d^{m-1} \pmod{q}\). Computing the pattern hash requires a single pass over the pattern with a constant number of arithmetic operations per character, i.e., \(O(m)\) time. Calculating \(h\) takes \(O(\log m)\) time with binary exponentiation or \(O(m)\) with repeated multiplication; either way it is bounded by \(O(m)\). Computing the initial hash \(t_0\) for the first \(m\)-length substring of the text also takes \(O(m)\) time. Therefore, the total pre-processing time is \(O(m)\).
Matching Time: The matching phase consists of a loop that iterates \(n-m+1\) times, corresponding to each possible shift in the text \(T\) of length \(n\). Inside this loop, we perform the following operations:
Hash Comparison: Comparing the hash value of the pattern \(p\) with the current substring hash \(t_s\) takes \(O(1)\) time.
Rolling Hash Update: If it’s not the last substring, updating the hash value to the next substring using the rolling hash technique takes \(O(1)\) time.
Verification: In case of a hash match (\(p \equiv t_s \pmod{q}\)), we perform a character-by-character verification to confirm if \(P[1 \dots m] = T[s+1 \dots s+m]\). This verification step takes \(O(m)\) time in the worst case, as we might need to compare all \(m\) characters.
The crucial factor in the matching time complexity is the frequency of hash collisions, which lead to verifications.
Worst-Case Time Complexity
In the worst-case scenario, hash collisions occur frequently. Imagine a contrived situation where the hash function is poorly chosen, or we are dealing with strings designed to maximize collisions. In an extreme case, every substring of length \(m\) might have the same hash value as the pattern. If this happens, for every shift, we would have a hash match, and consequently, we would perform a full \(O(m)\) character-by-character verification. Since there are \(O(n)\) possible shifts, the total time complexity in this worst-case scenario degrades to \(O(n \cdot m)\), which is the same as the naive string matching algorithm.
Average-Case Time Complexity
In practice, with a well-chosen hash function and a sufficiently large prime modulus \(q\), the probability of spurious hits (hash collisions where strings are not actually equal) is significantly reduced. Assuming that hash collisions are rare, the verification step is performed infrequently.
Let’s consider the expected number of spurious hits. If we assume that the hash function behaves randomly and uniformly distributes hash values, the probability of a hash collision for any given substring is approximately \(1/q\). Therefore, the expected number of spurious hits over \(n-m+1\) substrings is roughly \((n-m+1)/q \approx n/q\). For each spurious hit, we perform an \(O(m)\) verification. The expected time spent on spurious hit verifications is then approximately \(O(\frac{nm}{q})\).
The expected number of actual pattern occurrences is denoted by \(k\). For each actual occurrence, we also perform a verification, taking \(O(m)\) time. So, the total time for verifying actual matches is \(O(km)\).
The total expected matching time is then the sum of the time for hash comparisons, rolling hash updates (both \(O(n)\)), and the expected verification time for spurious and actual hits: \(O(n) + O(\frac{nm}{q}) + O(km)\). If we choose a prime \(q\) large enough such that \(q \gg m\), then \(\frac{nm}{q}\) becomes small, ideally close to constant or less than \(O(n)\). In such cases, and if the number of actual matches \(k\) is not excessively large (e.g., \(k \ll n\)), the average-case time complexity of the Rabin-Karp algorithm approaches \(O(n + m)\), dominated by the initial hash computations and the linear scan through the text.
In summary, with an expected \(v = O(n/q)\) spurious hits and \(k\) actual matches, the expected running time is \(O(m) + O(n + (v+k)m) = O(n + m + (\frac{n}{q} + k)m)\); choosing a prime \(q \gg m\) makes the \(\frac{nm}{q}\) term negligible.
Choosing the Modulus \(q\)
The selection of the prime modulus \(q\) is a critical factor influencing the Rabin-Karp algorithm’s performance in practice.
In practice, choosing \(q\) as a large prime that still fits in a machine word (so that each modular operation remains a constant-time machine instruction) provides a good balance between minimizing collisions and maintaining computational speed, leading to near \(O(n+m)\) average-case performance.
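A common recipe, sketched below under our own naming, is to sample random odd candidates of the desired bit length and keep the first one that passes a Miller-Rabin primality test. Keeping \(q\) around 31 bits also ensures that the intermediate products in the rolling update fit comfortably in 64-bit arithmetic.

```python
import random

def is_probable_prime(n: int, rounds: int = 40) -> bool:
    """Miller-Rabin primality test; error probability at most 4**(-rounds)."""
    if n < 2:
        return False
    for small in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % small == 0:
            return n == small
    d, r = n - 1, 0
    while d % 2 == 0:        # write n-1 = 2^r * d with d odd
        d, r = d // 2, r + 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False     # a witnesses that n is composite
    return True

def random_prime(bits: int = 31) -> int:
    """Sample a random prime of the given bit length, e.g. for the modulus q."""
    while True:
        candidate = random.getrandbits(bits) | (1 << (bits - 1)) | 1
        if is_probable_prime(candidate):
            return candidate
```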
Complexity Summary
Pre-processing Time: \(O(m)\)
Average-Case Matching Time: \(O(n + m)\) (with a good hash function and large enough prime modulus)
Worst-Case Matching Time: \(O(n \cdot m)\) (in case of frequent hash collisions)
Space Complexity: \(O(1)\) (Rabin-Karp algorithm requires constant extra space beyond input strings)
It’s important to note that while the worst-case time complexity of Rabin-Karp is not optimal, its average-case performance is excellent, making it a practical and efficient algorithm for many pattern matching tasks, especially when implemented with careful consideration of the hash function and modulus choice.
Conclusion
Remark. The Rabin-Karp algorithm stands out as a practical and efficient solution for the pattern matching problem, primarily due to its use of hashing. By converting string comparison into numerical hash value comparisons, it achieves an average-case time complexity of \(O(n+m)\), making it highly suitable for a wide range of applications, from text searching in editors to complex tasks in bioinformatics and network security.
The algorithm’s efficiency hinges on two key techniques:
Hashing: Transforming patterns and text substrings into hash values allows for quick preliminary comparisons. This drastically reduces the number of full string comparisons needed, which are more computationally expensive.
Rolling Hash: The rolling hash technique enables the algorithm to update the hash value of a substring in constant time as the search window slides through the text, a significant optimization over re-computing each substring hash from scratch at \(O(m)\) per shift.
While the Rabin-Karp algorithm boasts an impressive average-case performance, it is important to acknowledge its worst-case time complexity of \(O(nm)\). This occurs when hash collisions are frequent, leading to numerous unnecessary verifications. However, in practice, the probability of encountering the worst-case scenario can be minimized by:
Careful Hash Function Design: Choosing a hash function that distributes string hash values uniformly and reduces the likelihood of collisions is crucial. Polynomial rolling hash functions, as used in Rabin-Karp, are generally effective for this purpose.
Prime Modulus Selection: Selecting a sufficiently large prime modulus \(q\) for hash computations significantly decreases the probability of collisions. The modulus should be large enough to accommodate the range of possible hash values and minimize coincidental matches between different strings.
In summary, the Rabin-Karp algorithm provides an excellent balance between performance and simplicity. It is relatively easy to understand and implement, yet it offers near-optimal performance in most practical scenarios. This makes it a valuable tool in the algorithm designer’s toolkit for string processing and pattern matching tasks. Its average-case linear time complexity, combined with the efficiency of rolling hash, positions Rabin-Karp as a highly effective algorithm for real-world applications where pattern matching is essential.
Exercises
Implement the Rabin-Karp algorithm in your preferred programming language. Test its performance with various texts and patterns, including scenarios with and without pattern occurrences.
Investigate the impact of different prime moduli \(q\) on the performance of your Rabin-Karp implementation. Experiment with small and large prime numbers and observe the number of spurious hits and the overall execution time.
Consider the alphabet \(\Sigma= \{a, b, c, \dots, z\}\). Choose an appropriate base \(d\) and prime modulus \(q\) for the Rabin-Karp algorithm. Find all occurrences of the pattern "aba" in the text "abacabababacaba". Manually trace the algorithm steps, including hash calculations and verifications.
Compare the Rabin-Karp algorithm with the naive string matching algorithm in terms of time complexity and practical performance. Under what conditions would you prefer one algorithm over the other?
Research and provide a brief explanation of how to select a "good" prime modulus \(q\) for the Rabin-Karp algorithm to minimize hash collisions in practical applications.