Introductory Lecture: Advanced Algorithms

Author

Your Name

Published

February 8, 2025

Introduction

This introductory lecture for the Advanced Algorithms course will cover the course program, recommended textbooks, exam modalities, and practical details. We will then delve into our first algorithmic problem: exact pattern matching, setting the stage for more advanced techniques to be discussed in subsequent lectures.

Course Program

The course program will cover topics in advanced algorithms, focusing on efficient solutions for complex problems. Specific topics will be detailed in the course syllabus. The course aims to equip students with the theoretical foundations and practical tools necessary to design and analyze efficient algorithms for a variety of computational challenges.

Textbooks

The following textbooks are recommended for this course:

  1. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology by Dan Gusfield.

  2. Randomized Algorithms by Rajeev Motwani and Prabhakar Raghavan.

  3. Compact Data Structures: A Practical Approach by Gonzalo Navarro.

These books provide comprehensive coverage of the topics we will discuss and serve as excellent resources for further study. They offer different perspectives and in-depth analyses of the algorithms and data structures we will explore.

Exam Modalities

Details regarding the exam modalities will be provided separately. This will include information on exam format, grading criteria, and any specific requirements. Please refer to the course website or official announcements for detailed information as it becomes available.

Practical Details

  • Office Hours: Information on office hours and how to schedule appointments will be provided. Office hours are an excellent opportunity to discuss course material, ask questions, and receive individual guidance.

  • Schedule: The lecture schedule and any changes will be communicated through official channels. Please regularly check the course schedule to stay informed about lecture timings and any updates.

  • Communications: Announcements, lecture materials, and important updates will be communicated via [Specify communication method, e.g., course website, email list]. Please ensure you are subscribed to receive these communications to stay informed about all course-related matters.

Exact Pattern Matching

Problem Definition

We begin by considering a fundamental problem in computer science: Exact Pattern Matching. This problem is a cornerstone in various applications, from text editors to bioinformatics.

Given a text \(T\) and a pattern \(P\), both strings over an alphabet \(\Sigma\), the exact pattern matching problem is to find all occurrences of the pattern \(P\) within the text \(T\).

Input:

  • Text: \(T \in \Sigma^*\)

  • Pattern: \(P \in \Sigma^*\)

  • Alphabet: \(\Sigma\)

Let \(|T| = m\) be the length of the text and \(|P| = n\) be the length of the pattern.

Output: Find all indices \(i\) such that \(T[i, i+n-1] = P\). In other words, find all starting positions of occurrences of \(P\) in \(T\). We can formally define the output set as: \[\{ i \mid T[i, i+n-1] = P \}\] where \(T[i, i+n-1]\) denotes the substring of \(T\) starting at index \(i\) and of length \(n\). We assume 1-based indexing for simplicity in notation here, but in implementation, 0-based indexing is often used.

Naive Algorithm

The most straightforward approach to solve the exact pattern matching problem is the Naive Algorithm. This algorithm serves as a basic solution and a starting point for understanding more efficient algorithms. It is characterized by its simplicity and ease of implementation.

Algorithm Steps

The Naive Algorithm works by sliding the pattern \(P\) across the text \(T\), character by character, and comparing the pattern with the corresponding substring of the text at each position.

For each starting position \(i\) from \(1\) to \(m - n + 1\):

  1. Set \(j \leftarrow 1\).

  2. While \(j \leq n\) and \(T[i + j - 1] = P[j]\), set \(j \leftarrow j + 1\).

  3. If \(j = n + 1\), output "Pattern found at position \(i\)".

Complexity Analysis: The algorithm has a worst-case time complexity of \(O(m \cdot n)\) because, in the worst case, for each of the \(m-n+1\) possible starting positions in the text, we might compare up to \(n\) characters of the pattern.
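The sliding-and-comparing procedure just described can be sketched in Python as follows (a minimal illustration, using 0-based indexing internally but reporting 1-based positions as in our notation):

```python
def naive_search(text: str, pattern: str) -> list[int]:
    """Return all 1-based starting positions where pattern occurs in text."""
    m, n = len(text), len(pattern)
    occurrences = []
    # Try every possible alignment of the pattern against the text.
    for i in range(m - n + 1):
        j = 0
        # Compare characters until a mismatch or a full match.
        while j < n and text[i + j] == pattern[j]:
            j += 1
        if j == n:                     # all n characters matched
            occurrences.append(i + 1)  # report a 1-based position
    return occurrences

print(naive_search("ABABABAB", "ABAB"))  # → [1, 3, 5]
```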

Complexity Analysis of Naive Algorithm

Remark 1 (Complexity Analysis of Naive Algorithm). In the worst-case scenario, for each starting position in the text, we might compare all \(n\) characters of the pattern. Since there are \(m-n+1\) possible starting positions, the worst-case time complexity of the naive algorithm is \(O((m-n+1) \cdot n)\). When \(n\) is significantly smaller than \(m\), this is approximately \(O(m \cdot n)\); when \(m\) and \(n\) are of similar magnitude, this is quadratic in the input size.

Worst-Case Example: Consider text \(T = \text{a}^{m}\) (a string of \(m\) ‘a’s) and pattern \(P = \text{a}^{n-1}\text{b}\) (a string of \(n-1\) ‘a’s followed by a ‘b’). For almost every starting position, we will compare \(n-1\) characters before finding a mismatch in the last character. This leads to a large number of comparisons, illustrating the quadratic worst-case behavior. Specifically, for each starting position \(i\) from \(1\) to \(m-n+1\), the algorithm will compare \(T[i, i+n-2]\) with \(P[1, n-1]\), which are all ‘a’s, and then compare \(T[i+n-1]\) with \(P[n]\). The mismatch will occur at the \(n\)-th comparison. This happens for almost all starting positions, resulting in approximately \((m-n+1) \times n\) comparisons.

Remark 2 (Worst-Case Example for Naive Algorithm). The worst-case scenario for the Naive Algorithm occurs when the text and pattern are designed to maximize comparisons at each shift. For instance, with text \(T = \text{a}^{m}\) and pattern \(P = \text{a}^{n-1}\text{b}\), almost every shift requires \(n\) comparisons before a mismatch is found.
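We can verify this quadratic behavior empirically by instrumenting the naive algorithm to count character comparisons (a small sketch; the input sizes are chosen arbitrarily for illustration):

```python
def naive_comparisons(text: str, pattern: str) -> int:
    """Count the character comparisons performed by the naive algorithm."""
    m, n = len(text), len(pattern)
    count = 0
    for i in range(m - n + 1):       # every possible alignment
        for j in range(n):
            count += 1               # one character comparison
            if text[i + j] != pattern[j]:
                break                # mismatch: shift by one and restart
    return count

m, n = 1000, 10
# Worst case: T = a^m, P = a^(n-1) b — every alignment costs n comparisons.
worst = naive_comparisons("a" * m, "a" * (n - 1) + "b")
print(worst)  # (m - n + 1) * n = 991 * 10 = 9910
```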

Remark 3 (Complexity Analysis Summary of Naive Algorithm). The time complexity of the Naive Algorithm can be further categorized:

  • Worst-case: \(O(m \cdot n)\). This occurs in cases like \(T = \text{a}^{m}\) and \(P = \text{a}^{n-1}\text{b}\), or \(T = (\text{a}^{n-1}\text{b})^{k}\) and \(P = \text{a}^{n-1}\text{b}\).

  • Best-case: \(O(m)\). This occurs when the pattern is not found in the text, and the first character of the pattern immediately mismatches with the text at each position. For example, if \(T = \text{aaaaaa}\) and \(P = \text{b}\). In this case, for each starting position, only one comparison is performed.

  • Average-case: The average-case complexity is more complex to analyze and depends on the alphabet size and the statistical properties of the text and pattern. However, for random text and patterns, the average case performance is often much better than the worst-case, closer to \(O(m)\). If we assume a sufficiently large alphabet and random strings, the probability of a match at each position decreases, leading to fewer comparisons on average.
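A small experiment makes the average-case intuition concrete (a sketch with an illustrative alphabet of size 4 and seeded random strings; the lengths are arbitrary). For random inputs, a mismatch occurs at each comparison with probability \(1 - 1/|\Sigma|\), so the expected number of comparisons per alignment is roughly \(1/(1 - 1/|\Sigma|) = 4/3\) here:

```python
import random

def naive_comparisons(text: str, pattern: str) -> int:
    """Count the character comparisons performed by the naive algorithm."""
    m, n = len(text), len(pattern)
    count = 0
    for i in range(m - n + 1):
        for j in range(n):
            count += 1
            if text[i + j] != pattern[j]:
                break
    return count

random.seed(0)  # make the experiment reproducible
sigma = "abcd"  # alphabet of size 4 (an illustrative choice)
text = "".join(random.choice(sigma) for _ in range(10_000))
pattern = "".join(random.choice(sigma) for _ in range(20))

comps = naive_comparisons(text, pattern)
# Average comparisons per alignment: should be close to 4/3 ≈ 1.33,
# far below the worst-case bound of n = 20.
print(comps / (len(text) - len(pattern) + 1))
```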

Inefficiency of Naive Algorithm

Remark 4 (Inefficiency of Naive Algorithm). The main weakness of the naive algorithm lies in its redundant comparisons. When a mismatch occurs during pattern comparison, the algorithm shifts the pattern by only one position and restarts the comparison from the beginning of the pattern. This can lead to re-examining characters in the text multiple times, making it inefficient for certain types of inputs. This is because after a partial match, the Naive Algorithm does not utilize the information gained from the matched characters.

Example of Redundancy: Suppose we are searching for the pattern "ABAB" in the text "ABAABAB". Aligning the pattern at the first position of the text, we match the partial prefix "ABA", but a mismatch occurs at the fourth character (the pattern’s ‘B’ against the text’s ‘A’). The naive algorithm then completely restarts the comparison from the next position in the text, forgetting that we have already matched "ABA". Specifically, consider aligning the pattern at the first position of the text:

Text:     ABAABAB
Pattern:  ABAB
          |||X

A mismatch occurs at the 4th character. The Naive Algorithm then shifts the pattern by one position to the right and restarts the comparison from the beginning of the pattern:

Text:     ABAABAB
Pattern:   ABAB
           X

Notice that in an improved approach, we could utilize the fact that we already matched "ABA", whose longest proper prefix that is also a suffix is "A": shifting the pattern by two positions would align that "A" with the last matched character of the text, so the comparison could resume from the second character of the pattern. This redundancy motivates the need for more intelligent algorithms like the Knuth-Morris-Pratt (KMP) algorithm.

Example 1 (Redundancy in Naive Algorithm). After a mismatch that follows a partial match such as "ABA", the Naive Algorithm restarts the comparison from the beginning of the pattern at the next text position, discarding the information that "ABA" was already matched and thus performing redundant comparisons.

Knuth-Morris-Pratt (KMP) Algorithm Hint

To overcome the inefficiency of the naive algorithm, more advanced algorithms such as the Knuth-Morris-Pratt (KMP) algorithm have been developed. The KMP algorithm uses the prefix-suffix structure of the pattern to avoid redundant comparisons and achieves a time complexity of \(O(m+n)\), a significant improvement over the Naive Algorithm’s \(O(m \cdot n)\) worst case. Its efficiency comes from preprocessing the pattern to determine intelligent shifts when mismatches occur, thus avoiding unnecessary comparisons and re-examinations of the text.

Prefix-Suffix

The concept of prefix-suffix (often called border in string algorithms literature) is crucial for optimizing pattern matching. It allows us to understand the internal structure of the pattern and use this information to avoid unnecessary comparisons. The prefix-suffix property essentially captures the repetitive structure within the pattern itself.

For a given pattern \(P\), the prefix-suffix of \(P[1, j]\) (the prefix of \(P\) of length \(j\)) is the longest proper suffix of \(P[1, j]\) that is also a prefix of \(P\).

A proper suffix of a string is a suffix that is not equal to the string itself. The length of the prefix-suffix can be zero if no such suffix exists other than the empty string. The prefix-suffix for each prefix of the pattern will be pre-calculated and stored in a table, often called the prefix function or failure function, which is then used during the matching phase to determine the shift amount.

Example: Consider the pattern \(P = \text{ABAB}\). Let’s compute the prefix-suffix for each prefix of \(P\):

  • For \(P[1, 1] = \text{A}\): Prefixes are \(\{\epsilon, \text{A}\}\), proper suffixes are \(\{\epsilon\}\). The longest proper suffix that is also a prefix is \(\epsilon\) (empty string). Prefix-suffix length is 0.

  • For \(P[1, 2] = \text{AB}\): Prefixes are \(\{\epsilon, \text{A}, \text{AB}\}\), proper suffixes are \(\{\epsilon, \text{B}\}\). The longest proper suffix that is also a prefix is \(\epsilon\). Prefix-suffix length is 0.

  • For \(P[1, 3] = \text{ABA}\): Prefixes are \(\{\epsilon, \text{A}, \text{AB}, \text{ABA}\}\), proper suffixes are \(\{\epsilon, \text{A}, \text{BA}\}\). The longest proper suffix that is also a prefix is \(\text{A}\). Prefix-suffix length is 1.

  • For \(P[1, 4] = \text{ABAB}\): Prefixes are \(\{\epsilon, \text{A}, \text{AB}, \text{ABA}, \text{ABAB}\}\), proper suffixes are \(\{\epsilon, \text{B}, \text{AB}, \text{BAB}\}\). The longest proper suffix that is also a prefix is \(\text{AB}\). Prefix-suffix length is 2.

Example 2 (Prefix-Suffix Calculation for "ABAB"). For the pattern \(P = \text{ABAB}\), the prefix-suffix lengths for prefixes "A", "AB", "ABA", and "ABAB" are 0, 0, 1, and 2, respectively. This illustrates how prefix-suffix captures the repetitive structure within the pattern.
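The hand calculation above can be checked with a brute-force sketch that simply tries every proper suffix of each prefix, from longest to shortest (inefficient, but a useful correctness reference):

```python
def border_length(s: str) -> int:
    """Length of the longest proper suffix of s that is also a prefix of s,
    found by brute force (checking every proper suffix, longest first)."""
    for length in range(len(s) - 1, 0, -1):
        if s[-length:] == s[:length]:
            return length
    return 0  # only the empty string qualifies

P = "ABAB"
# Prefix-suffix length for each prefix P[1,1], ..., P[1,4].
print([border_length(P[:j]) for j in range(1, len(P) + 1)])  # → [0, 0, 1, 2]
```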

The prefix-suffix information helps in determining how much to shift the pattern to the right after a mismatch, without missing any potential occurrences. This intelligent shifting is the key to the efficiency of the KMP algorithm. By knowing the prefix-suffix for each prefix of the pattern, we can effectively reuse the already matched part of the text and avoid redundant comparisons. For example, if we are matching "ABAB" and we have matched "ABA" and then encounter a mismatch, the prefix-suffix of "ABA" is "A" (length 1). This tells us that we can shift the pattern such that the prefix "A" of the pattern aligns with the suffix "A" of the already matched text, effectively skipping redundant comparisons.
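To make the shift rule concrete, here is a small sketch that derives the safe shift from a matched prefix: the pattern may slide forward by the length of the matched prefix minus its prefix-suffix length (the border is computed by brute force here, purely for illustration):

```python
def shift_after_mismatch(matched_prefix: str) -> int:
    """Safe shift of the pattern after matching `matched_prefix` and then
    mismatching: matched length minus the prefix-suffix (border) length."""
    border = 0
    for length in range(len(matched_prefix) - 1, 0, -1):
        if matched_prefix[-length:] == matched_prefix[:length]:
            border = length  # longest proper suffix that is also a prefix
            break
    return len(matched_prefix) - border

# Having matched "ABA" (border "A", length 1), we may shift by 3 - 1 = 2.
print(shift_after_mismatch("ABA"))  # → 2
```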

Recursive Algorithm for Prefix-Suffix

A recursive approach can be used to understand how to determine the prefix-suffix of a pattern. To determine the maximum prefix-suffix of \(P[1, j]\), we can utilize the maximum prefix-suffix of \(P[1, j-1]\). Let’s denote the length of the prefix-suffix of \(P[1, j]\) as \(\text{PS}(j)\).

Remark 5 (Recursive Idea for Prefix-Suffix Calculation). To compute the prefix-suffix length for a prefix \(P[1, j]\), we can use the already computed prefix-suffix length for \(P[1, j-1]\). This recursive approach helps in understanding the relationship between prefix-suffixes of increasing lengths.

Recursive Idea: To compute \(\text{PS}(j)\), we assume we have already computed \(\text{PS}(j-1) = k\). This means that \(P[1, k] = P[j-k, j-1]\) is the longest proper prefix that is also a suffix of \(P[1, j-1]\). Now we want to find the longest proper prefix that is also a suffix of \(P[1, j]\).

We consider two cases:

  1. Case 1: \(P[j] = P[k+1]\). If the character at position \(j\) matches the character at position \(k+1\) in the pattern, it means we can extend the prefix-suffix of \(P[1, j-1]\) of length \(k\) by one more character. Therefore, the new prefix-suffix length for \(P[1, j]\) becomes \(k+1\). So, \(\text{PS}(j) = k+1\).

  2. Case 2: \(P[j] \neq P[k+1]\). If there is a mismatch, we cannot simply extend the prefix-suffix of length \(k\). We need to find a shorter prefix that is also a suffix. To do this, we consider the prefix-suffix of the prefix \(P[1, k]\). Let \(k' = \text{PS}(k)\). This means \(P[1, k'] = P[k-k'+1, k]\). Now we check if \(P[j] = P[k'+1]\).

    • If \(P[j] = P[k'+1]\), then we have found a prefix-suffix of length \(k'+1\) for \(P[1, j]\), so \(\text{PS}(j) = k'+1\).

    • If \(P[j] \neq P[k'+1]\), we repeat the process by considering an even shorter prefix-suffix. We set \(k = k'\) and repeat the check \(P[j] = P[k+1]\). We continue this process until we find a match (Case 1) or \(k\) becomes 0.

    • If \(k\) becomes 0, we check if \(P[j] = P[1]\).

      • If \(P[j] = P[1]\), then the prefix-suffix length is 1, so \(\text{PS}(j) = 1\).

      • If \(P[j] \neq P[1]\), then there is no non-empty prefix-suffix, so the prefix-suffix length is 0, \(\text{PS}(j) = 0\).

Remark 6 (Note on Recursive Prefix-Suffix Algorithm). This recursive description is conceptual. Efficient implementations of the prefix-suffix calculation, as used in the KMP algorithm, compute the table iteratively in linear time, avoiding the redundant calculations inherent in a purely recursive approach.
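Following the remark above, the case analysis translates into a standard iterative, linear-time computation. Here is a sketch in Python using the 1-based \(\text{PS}\) notation of this section (Case 2's fallback is the inner `while` loop; Case 1's extension is the `if`):

```python
def prefix_suffix_table(P: str) -> list[int]:
    """PS[j] = length of the longest proper prefix of P[1, j] that is also
    a suffix, computed iteratively in O(n) total time."""
    n = len(P)
    PS = [0] * (n + 1)  # PS[0] unused; indices follow the 1-based notation
    k = 0               # k = PS(j-1), the border of the previous prefix
    for j in range(2, n + 1):       # PS(1) is always 0
        # Case 2: on mismatch, fall back to ever-shorter borders.
        while k > 0 and P[j - 1] != P[k]:
            k = PS[k]
        # Case 1: extend the border when P[j] = P[k+1].
        if P[j - 1] == P[k]:
            k += 1
        PS[j] = k
    return PS[1:]

print(prefix_suffix_table("ABAB"))  # → [0, 0, 1, 2], matching Example 2
```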

Exercises

  1. Implementation of the Naive Algorithm: Implement the Naive Algorithm for exact pattern matching in your preferred programming language. Test its correctness and performance with various texts and patterns. Specifically, test it with:

    • Simple cases with short text and pattern.

    • Worst-case examples like \(T = \text{a}^{m}\) and \(P = \text{a}^{n-1}\text{b}\) to observe its quadratic behavior.

    • Cases where the pattern does not occur in the text.

    • Cases with multiple occurrences of the pattern.

    Measure the execution time for different input sizes to empirically verify the time complexity.

  2. Average-Case Complexity Analysis: Analyze the average-case time complexity of the Naive Algorithm. Consider an alphabet of size \(|\Sigma| > 1\).

    • What kind of inputs (text and pattern characteristics) would lead to better performance than the worst-case? Think about the probability of character matches in random strings.

    • Experiment with randomly generated texts and patterns. How does the performance change as you vary the alphabet size and the lengths of the text and pattern?

  3. Prefix-Suffix Calculation for "ababa": Consider the pattern \(P = \text{ababa}\). Calculate the prefix-suffix length for each prefix of \(P\) (i.e., for \(P[1, 1], P[1, 2], P[1, 3], P[1, 4], P[1, 5]\)). Show the prefixes and their corresponding longest proper prefix that is also a suffix. This exercise will help you understand the prefix-suffix concept concretely.

  4. Optimizing Pattern Shifting with Prefix-Suffix: Think about how the prefix-suffix information could be used to optimize the pattern shifting in case of a mismatch in the pattern matching process.

    • Suppose you are using the Naive Algorithm and you have matched a prefix of the pattern, but then a mismatch occurs. How can the prefix-suffix information of the matched prefix help you decide how much to shift the pattern to the right?

    • Sketch out a potential algorithm (in pseudocode or a high-level description) that utilizes the pre-calculated prefix-suffix information to improve upon the Naive Algorithm. Consider what information you would need to pre-calculate for the pattern and how you would use it during the text scanning phase. (Hint: Think about how to avoid restarting the comparison from the beginning of the pattern after a mismatch.)

Conclusion

In this introductory lecture, we have laid the groundwork for our exploration of advanced algorithms by:

  • Introducing the course program, recommended textbooks, exam modalities, and practical details for the Advanced Algorithms course, setting the stage for the semester and providing essential logistical information.

  • Defining the fundamental problem of Exact Pattern Matching, highlighting its importance as a basic building block in numerous computer science applications and setting the context for the course’s focus on efficient problem-solving.

  • Discussing the Naive Algorithm for exact pattern matching, presenting a simple and intuitive solution that serves as a baseline for comparison and further optimization.

  • Analyzing the worst-case quadratic time complexity of the Naive Algorithm, understanding its limitations and inefficiencies when dealing with large inputs or specific types of patterns and texts.

  • Identifying the inefficiency of the naive algorithm due to redundant comparisons, specifically illustrating how it re-examines parts of the text unnecessarily after mismatches, thus motivating the need for more sophisticated and efficient approaches.

  • Briefly introducing the concept of prefix-suffix and hinting at its crucial role in developing more efficient pattern matching algorithms, particularly the Knuth-Morris-Pratt (KMP) algorithm. We emphasized that prefix-suffix information allows for intelligent pattern shifts, avoiding redundant comparisons.

This lecture serves as a crucial foundation for our journey into advanced algorithms. We started with a simple, yet inefficient, algorithm and identified its weaknesses. In the next lectures, we will delve deeper into the Knuth-Morris-Pratt (KMP) algorithm, learning how to efficiently compute the prefix-suffix function and how to use it to achieve linear time complexity for exact pattern matching. The KMP algorithm will demonstrate the power of preprocessing and utilizing internal pattern structure to significantly improve algorithmic performance. We encourage you to review the material, attempt the exercises, and especially think about Exercise 4, as it directly leads to the core ideas behind the KMP algorithm. Understanding the limitations of the Naive Algorithm and the potential of prefix-suffix optimization is key to appreciating the elegance and efficiency of algorithms like KMP that we will explore in the upcoming lectures.