Boyer-Moore Algorithm for String Searching

Author

Your Name

Published

February 8, 2025

Introduction

String searching is a fundamental problem in computer science with applications in text editing, bioinformatics, web search, and more. Given a text \(T\) and a pattern \(P\), the goal is to find all occurrences of \(P\) within \(T\). The Boyer-Moore algorithm is an efficient string searching algorithm known for its good performance in practice, often outperforming simpler algorithms like naive string searching or even Knuth-Morris-Pratt (KMP) in many real-world scenarios.

These notes will explore the Boyer-Moore algorithm, focusing on the heuristics that make it efficient. We will discuss the core ideas, the "Bad Character Rule" and the "Good Suffix Rule", and consider some implementation details.

Boyer-Moore Algorithm

The Boyer-Moore algorithm distinguishes itself from simpler string searching algorithms by employing two key strategies:

  • Right-to-Left Comparison: Unlike naive algorithms that compare the pattern from left to right, Boyer-Moore starts comparing the pattern with the text from right to left within a potential match window.

  • Shift Heuristics: Boyer-Moore utilizes precomputed tables based on the pattern to determine how much to shift the pattern in case of a mismatch or a match. These shifts are crucial for skipping sections of the text and achieving efficiency.

The algorithm compares characters from right to left within a window and then shifts the pattern from left to right along the text. This right-to-left comparison, combined with the shift heuristics, allows for larger jumps in the text when mismatches occur.

We will now delve into the two main heuristics that drive the Boyer-Moore algorithm: the Bad Character Rule and the Good Suffix Rule.

Right-to-Left Comparison

Traditional naive string searching and algorithms like KMP perform comparisons from left to right. The Boyer-Moore algorithm innovatively reverses this direction within each potential alignment of the pattern against the text.

The advantage of right-to-left comparison is that it allows for the application of powerful shift heuristics based on characters encountered later in the pattern. If a mismatch occurs, these heuristics can often propose larger shifts by analyzing the mismatched character or the matched suffix, potentially skipping a significant portion of the text.

Shift Heuristics: Bad Character and Good Suffix Rules

To maximize efficiency, the Boyer-Moore algorithm employs two shift heuristics:

  • Bad Character Rule (BCR): This rule is based on the mismatched character in the text.

  • Good Suffix Rule (GSR): This rule is based on the suffix of the pattern that has been successfully matched before a mismatch occurs.

When a mismatch is found during the comparison, both rules suggest a possible shift amount. The Boyer-Moore algorithm then chooses the larger of the two shifts to move the pattern forward, aiming to skip as many characters in the text as possible. These heuristics are precalculated in a preprocessing step, allowing for quick shift decisions during the search phase. The goal of these heuristics is to reposition the pattern in a more advantageous way after a mismatch, maximizing the shift to avoid unnecessary comparisons. In favorable cases, especially with the Bad Character Rule, the algorithm can achieve sublinear time complexity, effectively skipping large portions of the text. For instance, consider searching for a pattern \(P\) in a text \(T\) where the characters of \(P\) are relatively rare in \(T\). In such scenarios, the Bad Character Rule can often propose shifts as large as the length of the pattern itself, leading to significant performance gains.

Bad Character Rule

The Bad Character Rule (BCR) leverages mismatches to determine a safe shift. During the comparison of the pattern \(P\) against a portion of the text \(T\), if a mismatch occurs at some position, and the character in the text at the point of mismatch is, say, \(c\), the BCR considers this character \(c\) (the "bad character").

The rule is defined as follows:

  • Case 1: Bad Character Not in Pattern: If the bad character \(c\) does not appear in the pattern \(P\) at all, then we can safely shift the pattern completely past the mismatched character. The shift length in this case is equal to the length of the pattern.

  • Case 2: Bad Character in Pattern: If the bad character \(c\) does appear in the pattern \(P\), we shift the pattern to the right to align the rightmost occurrence of \(c\) in \(P\) with the mismatched character \(c\) in \(T\). The shift length is the distance from the current mismatch position in the pattern to the rightmost occurrence of \(c\) in \(P\) to its left. If the rightmost occurrence of \(c\) is at or to the right of the mismatch position, a shift of 1 is performed.

Preprocessing for the Bad Character Rule

In the preprocessing phase, for each character in the alphabet \(\Sigma\), we need to store the position of its rightmost occurrence in the pattern \(P\). Let the pattern be \(P[1 \dots m]\) and the alphabet be \(\Sigma\). We create a table, often called ‘badChar’ or ‘lastOccurence’, of size \(|\Sigma|\). For each character \(x \in \Sigma\), ‘badChar[x]’ stores the index of the rightmost occurrence of \(x\) in \(P\). If \(x\) does not occur in \(P\), we can store a value that indicates this, such as -1 or 0, or even the pattern length \(m\) itself, depending on the implementation of the shift calculation.

For implementation simplicity, if a character \(c\) is not in \(P\), we can consider its last occurrence index as -1.

Applying the Bad Character Rule

Suppose we are at a current alignment of pattern \(P\) with text \(T\). We compare \(P\) with the corresponding substring of \(T\) from right to left. Let’s say a mismatch occurs at position \(j\) in \(P\) (from the left end, using 0-based indexing, \(j=0, 1, \dots, m-1\)) and the mismatched character in the text \(T\) is \(c = T[s+j]\), where \(s\) is the starting index of the current window in \(T\).

Let \(last\_occurrence(c)\) be the 0-based index of the rightmost occurrence of character \(c\) in \(P\), or -1 if \(c\) is not in \(P\). The shift suggested by BCR is calculated as follows:

If \(last\_occurrence(c) < 0\), it means \(c\) is not in \(P\). In this case, the shift is simply the length of the pattern, \(m\).

If \(last\_occurrence(c) \ge 0\), the shift is given by \(\max(1, j - last\_occurrence(c))\). We take the maximum with 1 to ensure a forward progress in all cases.

Example 1.

Let \(P= \text{CABAB}\) and \(T= \text{ABCABABAB}\).

Preprocessing for Bad Character Rule for \(P= \text{CABAB}\): Assuming alphabet \(\Sigma= \{A, B, C, \dots\}\)

Using 0-based indexing for positions in \(P\):

  • badChar[‘A’] = 3 (index of last ‘A’ in "CABAB")

  • badChar[‘B’] = 4 (index of last ‘B’ in "CABAB")

  • badChar[‘C’] = 0 (index of last ‘C’ in "CABAB")

  • For all other characters \(x \neq 'A', 'B', 'C'\), badChar[x] = -1.

Suppose we align \(P\) at the beginning of \(T\).

Text:     ABCABABAB
Pattern:  CABAB

Comparing from right to left:

  • ‘B’ matches ‘B’

  • ‘A’ matches ‘A’

  • ‘B’ matches ‘B’

  • ‘A’ in text mismatches ‘C’ in pattern. Mismatch at index 0 of \(P\) (character ‘C’). Bad character is ‘A’.

Mismatch at index \(j=0\) of \(P\). Bad character is \(c = 'A'\). \(last\_occurrence('A') = 3\). Shift\(_{BCR} = \max(1, j - last\_occurrence('A')) = \max(1, 0 - 3) = \max(1, -3) = 1\). So, we shift the pattern by 1 position to the right.

Next alignment:

Text:     ABCABABAB
Pattern:   CABAB

Comparing from right to left again:

  • ‘B’ matches ‘B’

  • ‘A’ matches ‘A’

  • ‘B’ matches ‘B’

  • ‘C’ matches ‘C’

  • ‘A’ matches ‘A’. Match found at position 1 of text.

Pattern found starting at index 1 of the text.

Good Suffix Rule

The Good Suffix Rule (GSR) is applied when a match of a suffix of the pattern is found before a mismatch occurs. It typically provides larger shifts than the Bad Character Rule in many cases.

Suppose we have matched a suffix \(S\) of the pattern, and then a mismatch occurs. The Good Suffix Rule considers two main cases to determine the shift:

  1. Case 1: Re-occurrence of Good Suffix: Look for another occurrence of the good suffix \(S\) within the pattern \(P\) itself. If found, shift \(P\) so that this other occurrence of \(S\) aligns with the matched suffix in the text. Ideally, we look for an occurrence that is not immediately preceded by the same character as the current occurrence to maximize the shift.

  2. Case 2: Prefix-Suffix Overlap: If Case 1 fails (i.e., no suitable re-occurrence of \(S\) is found), consider the longest prefix of \(P\) that is also a suffix of the matched good suffix \(S\). Shift \(P\) so that this prefix aligns with the corresponding suffix in the text. This is a fallback strategy to ensure progress even when a full re-occurrence of the good suffix is not possible.

Preprocessing for Good Suffix Rule

Preprocessing for GSR is more complex than for BCR. It involves precalculating shift distances based on all possible good suffixes. This typically requires constructing two auxiliary tables, often denoted as \(L'\) and \(l'\).

Definition 2.

Let \(P\in \Sigma^*\). For each position \(i\) in \(P\) (from 1 to \(m\), where \(m\) is the length of \(P\)), let \(L'(i)\) be the largest index such that:

  1. \(P[i \dots m]\) is a suffix of \(P[1 \dots L'(i)]\).

  2. And if \(i > 1\), then \(P[i-1] \neq P[L'(i)-(m-i+1)-1]\). (This condition ensures that the occurrence is not immediately preceded by the same character, aiming for a "further" distinct occurrence).

If no such \(L'(i)\) exists, \(L'(i) = 0\).

Theorem 3.

Description:* This theorem provides a method to compute \(L'(i)\) using \(N_j[P]\), where \(N_j[P]\) is the length of the longest suffix of \(P[1 \dots j]\) that is also a suffix of \(P\). Statement:** \(L'(i)\) is the maximum \(j\) such that \(N_{j}[P] = |P[i \dots m]| = m - i + 1\).*

Also, let \(N_j[P]\) be the length of the longest suffix of \(P[1 \dots j]\) that is also a suffix of \(P\).

This theorem provides a method to compute \(L'(i)\) using \(N_j[P]\). The value \(L'(i)\) is crucial for determining shifts based on the good suffix. The shift distance when a good suffix of length \(m-i+1\) is matched (and a mismatch occurs at position \(i-1\) from the start of the pattern) is given by \(m - L'(i)\).

In addition to \(L'\), another table, often denoted as \(l'\), is used to handle the case where no full re-occurrence of the good suffix is found (Case 2 above, prefix-suffix overlap). The details of calculating \(l'\) and the complete GSR shift table are more intricate and often involve concepts from string algorithms like the Z-algorithm or similar techniques for efficient suffix and prefix matching.

The notes mention "GSR si riduce a Z". The exact interpretation of ‘Z’ in this context is not definitively clear from the provided material. It is plausible that ‘Z’ refers to the Z-algorithm or Z-function, which is a linear-time algorithm used for string matching and finding repeated substrings. The Z-algorithm computes an array where each element \(Z[i]\) is the length of the longest common prefix between the string and the suffix starting at position \(i\). It’s possible that a variation or application of the Z-algorithm is used in efficient precomputation of the Good Suffix Rule shift tables. However, without further clarification from the lecture notes, the precise connection of ‘Z’ to GSR remains ambiguous and requires deeper investigation into specific Boyer-Moore implementations and related string algorithms.

Combining Bad Character and Good Suffix Rules

During the search phase of the Boyer-Moore algorithm, upon encountering a mismatch, we compute the shift suggestions from both the Bad Character Rule and the Good Suffix Rule. The algorithm then selects the larger of these two shifts to apply. This strategy ensures that we always choose the shift that allows us to skip the maximum possible number of characters in the text, thus optimizing the search process.

Mathematically, if \(shift_{BCR}\) is the shift suggested by the Bad Character Rule and \(shift_{GSR}\) is the shift suggested by the Good Suffix Rule, the actual shift applied by the Boyer-Moore algorithm is: \[\text{Shift} = \max(\text{Shift}_{BCR}, \text{Shift}_{GSR})\]

This combined approach leverages the strengths of both heuristics, leading to the Boyer-Moore algorithm’s renowned efficiency in practice.

Implementation Details

Bad Character Rule Implementation

For the Bad Character Rule, the most efficient and common implementation uses an array to store the last occurrence of each character in the pattern.

Array Implementation

For alphabets of reasonable size, such as ASCII or a limited Unicode range, an array ‘badChar’ indexed by character values is highly effective.

Algorithm 4.

Input: Pattern \(P\) of length \(m\), Alphabet \(\Sigma\)
Output: ‘badChar’ array storing last occurrence index for each character

Initialize an array ‘badChar’ of size \(|\Sigma|\), with all entries set to -1. \(char \leftarrow P[i]\) \(badChar[char] \leftarrow i\) return ‘badChar’

Complexity Analysis: The preprocessing takes \(O(m + |\Sigma|)\) time, where \(m\) is the length of the pattern and \(|\Sigma|\) is the size of the alphabet. Initialization of the array takes \(O(|\Sigma|)\) and iterating through the pattern takes \(O(m)\). The space complexity is \(O(|\Sigma|)\) to store the ‘badChar’ array.

During the search, when a mismatch occurs with a bad character \(c\) at position \(j\) in the pattern (0-based index), we can quickly retrieve the last occurrence index from ‘badChar[c]’ and calculate the shift as described in [subsection:Bad-Character-Rule-Applying].

Matrix Implementation (Less Efficient)

The notes mention a "matrice \(|\Sigma| \times |P|\) (avia)". This approach is less conventional and less efficient for implementing just the Bad Character Rule. It might refer to a matrix where each entry \((c, i)\) stores information related to character \(c\) at position \(i\) in the pattern. However, for the Bad Character Rule, we only need the rightmost occurrence of each character, which is efficiently handled by a simple array. Using a matrix would likely increase both space and time complexity without providing significant advantages for the BCR itself. Therefore, the array-based implementation is the standard and recommended approach for the Bad Character Rule.

Good Suffix Rule Implementation

Efficient implementation of the Good Suffix Rule is more complex and typically involves precomputing shift tables using algorithms that analyze suffixes and prefixes of the pattern. As indicated in the notes with the mention of ‘Z’, techniques related to the Z-algorithm or similar string processing methods might be employed to optimize the precomputation of these tables. Standard implementations often involve creating tables like \(L'\) and \(l'\) (or equivalent representations) to quickly determine the appropriate shift based on the length of the matched good suffix. Due to its complexity, a detailed algorithm for GSR precomputation is beyond the scope of a basic introduction, but it’s important to recognize that efficient implementations rely on sophisticated string algorithms to achieve optimal performance.

Complexity Analysis

Worst-Case Complexity

Despite its heuristics, the Boyer-Moore algorithm, in its basic form using just the Bad Character and Good Suffix rules as described, can still have a worst-case time complexity of \(O(n \cdot m)\), where \(n\) is the length of the text and \(m\) is the length of the pattern. This occurs in contrived scenarios, particularly with highly repetitive patterns and texts, where the shift heuristics might consistently suggest only small shifts. For example, consider searching for the pattern "AAAA" in the text "AAAAAAAAAAAA...". In such cases, the algorithm might degrade to behavior similar to naive string searching.

Average-Case Complexity

In practical scenarios, especially with typical text and patterns encountered in applications, the Boyer-Moore algorithm exhibits excellent average-case performance. The combination of right-to-left comparison and the powerful shift heuristics, particularly the Good Suffix Rule, allows the algorithm to skip significant portions of the text. The average-case time complexity is often considered to be sublinear, approaching \(O(n/m)\) in favorable situations. This sublinear behavior is what makes Boyer-Moore highly efficient for real-world text searching tasks.

Optimized Complexity: Apostolico-Giancarlo Algorithm

To address the worst-case quadratic time complexity, optimized variants of the Boyer-Moore algorithm have been developed. The Apostolico-Giancarlo algorithm is a notable example. By refining the Good Suffix Rule and incorporating memoization techniques to remember previously matched suffixes and avoid redundant comparisons, the Apostolico-Giancarlo algorithm achieves a guaranteed linear time complexity of \(O(n+m)\) in all cases, including the worst-case scenarios. This linear time complexity makes it theoretically optimal while retaining the practical efficiency of the Boyer-Moore approach.

Conclusion

The Boyer-Moore algorithm stands as a cornerstone in string searching algorithms, renowned for its practical efficiency and innovative use of right-to-left comparison and shift heuristics. The Bad Character Rule and Good Suffix Rule, when combined, enable Boyer-Moore to often skip large portions of the text, leading to sublinear average-case performance. While the basic algorithm has a quadratic worst-case complexity, optimized versions like Apostolico-Giancarlo achieve linear time guarantees, making the Boyer-Moore family of algorithms highly valuable tools in diverse applications requiring fast and efficient string searching.

Further considerations include the impact of alphabet size and pattern characteristics on performance. Larger alphabets generally enhance the effectiveness of the Bad Character Rule. Patterns with less repetition tend to benefit more from the Good Suffix Rule. Understanding these factors helps in appreciating the nuances of Boyer-Moore’s performance in different contexts.

It’s important to note that while Boyer-Moore is highly efficient in most practical cases, its performance can degrade under specific conditions. If there are very frequent occurrences of the pattern in the text, the algorithm’s behavior can approach that of a naive algorithm, diminishing the advantage of its heuristics. However, for typical text processing and search tasks, where patterns are not excessively repetitive within the text, Boyer-Moore and its optimized variants remain among the fastest string searching algorithms available. The Bad Character Rule is relatively simple to implement and provides a tangible speedup in many scenarios. The Good Suffix Rule, although more complex to implement, significantly enhances the algorithm’s ability to handle mismatches and achieve larger shifts, especially in cases where the alphabet size is smaller or patterns have internal repetitions.

In summary, the Boyer-Moore algorithm provides a significant advancement over simpler string searching methods by incorporating right-to-left comparison and powerful shift heuristics. Its practical efficiency and the existence of linear-time variations like Apostolico-Giancarlo solidify its place as a fundamental algorithm in computer science, widely applicable in areas such as text editors, search engines, bioinformatics, and network security, where efficient string matching is crucial.

Further thinking: Consider how the choice of alphabet size and pattern characteristics affects the performance of the Boyer-Moore algorithm. How would the algorithm perform with a very small alphabet versus a large alphabet? What types of patterns are best and worst-case scenarios for Boyer-Moore?

Exercises

  1. Consider the text \(T= \text{ABABABCBAB}\) and pattern \(P= \text{ABC}\). Manually trace the Boyer-Moore algorithm using only the Bad Character Rule to find occurrences of \(P\) in \(T\). Show the shifts and comparisons.

  2. Precompute the Bad Character Rule table for the pattern \(P= \text{banana}\) using the English alphabet.

  3. Explain in your own words the advantages of right-to-left comparison in the Boyer-Moore algorithm.

  4. Research and briefly describe the Apostolico-Giancarlo algorithm and how it achieves linear time complexity for string searching.

  5. Consider the pattern \(P= \text{ABAB}\) and text \(T= \text{ABABABAB}\). Analyze the performance of the Boyer-Moore algorithm (using both BCR and GSR conceptually) in this case. Are the shifts generally large or small?