Keyword Tree for Exact Pattern Matching
Introduction
In the field of string algorithms, pattern matching is a fundamental problem. We often need to find occurrences of a given pattern within a larger text. A more generalized version of this problem is to search for multiple patterns simultaneously within a text, known as the multiple pattern matching problem.
Formally, given a text \(T\) and a set of patterns \(\mathcal{P}= \{P_1, P_2, \ldots, P_z\}\), the objective is to find all occurrences of all patterns in \(\mathcal{P}\) within the text \(T\).
A naive approach iterates through each pattern in \(\mathcal{P}\) and uses a single-pattern matching algorithm (such as Knuth-Morris-Pratt or Boyer-Moore) to search for it in \(T\). Using an efficient algorithm like KMP, which runs in \(O(|T| + |P_i|)\) time for each pattern \(P_i\), the total time for searching all \(z\) patterns is the sum of the individual search times: \(O(\sum_{i=1}^{z} (|T| + |P_i|)) = O(z \cdot |T| + \sum_{i=1}^{z} |P_i|)\). Writing \(m = \sum_{i=1}^{z} |P_i|\) for the total pattern length, this is \(O(z|T| + m)\); since \(m \le z \cdot L_{max}\) where \(L_{max} = \max_{i}|P_i|\), the cost can also be bounded by \(O(z \cdot (|T| + L_{max}))\).
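To make the baseline concrete, here is a minimal Python sketch of the naive approach, using the standard library's `str.find` as a stand-in for a linear-time single-pattern matcher; the function name and the (pattern, start index) output format are illustrative choices, not part of any standard API.

```python
def naive_multi_search(text, patterns):
    """Search for each pattern independently; the text is rescanned once
    per pattern, giving roughly O(z*|T| + m) total work."""
    occurrences = []
    for p in patterns:
        start = 0
        while True:
            idx = text.find(p, start)  # stand-in for a KMP-style linear scan
            if idx == -1:
                break
            occurrences.append((p, idx))  # (pattern, 0-based start index)
            start = idx + 1
    return occurrences
```

Note how the outer loop forces one full pass over `text` per pattern; this is exactly the redundancy the Keyword Tree removes.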
While correct, this approach is inefficient, especially when the number of patterns \(z\) is large, or when we need to perform this search repeatedly over the same set of patterns but with different texts. The primary inefficiency stems from scanning the text \(T\) multiple times, once for each pattern in \(\mathcal{P}\).
The naive approach to multiple pattern matching suffers from a time complexity that scales linearly with the number of patterns, \(z\). For each pattern, the entire text \(T\) is potentially scanned. This redundancy becomes significant when searching for a large collection of patterns, leading to substantial computational overhead. The goal is to reduce this dependency on \(z\) and process the text more efficiently, ideally in a single pass, regardless of the number of patterns.
To overcome this inefficiency, we can employ a data structure known as the Keyword Tree, also referred to as a Trie or Prefix Tree. The Keyword Tree allows us to process the text \(T\) only once, simultaneously searching for all patterns in \(\mathcal{P}\). The core idea is to combine all patterns into a single tree structure that exploits shared prefixes among the patterns. This shared prefix structure enables a single traversal of the text to identify occurrences of multiple patterns concurrently.
This lecture will delve into the concept of Keyword Trees, detailing their structure, construction algorithms, and their application in achieving efficient multiple pattern matching. We will explore how failure links are integrated into the Keyword Tree to optimize the search process, leading to algorithms with significantly improved performance compared to naive methods.
Keyword Tree: Definition and Properties
Definition
Given a set of patterns \(\mathcal{P}= \{P_1, P_2, \ldots, P_z\}\) over an alphabet \(\Sigma\), the keyword tree \(K\) (also known as a Trie or Prefix Tree) for \(\mathcal{P}\) is a rooted tree characterized by the following properties:
Edge Labels: Each edge in the tree is labeled with a single character from the alphabet \(\Sigma\).
Distinct Outgoing Edge Labels: For any node in the tree, the labels on the outgoing edges must be distinct. This ensures that for a given character, there is at most one path to follow from any node.
Prefix Representation: Each node in the keyword tree represents a prefix of one or more patterns in \(\mathcal{P}\). Specifically, the string formed by concatenating the edge labels on the path from the root to a node \(v\) is a prefix of at least one pattern in the set \(\mathcal{P}\).
Pattern End Nodes (Output Nodes): For each pattern \(P_i = c_1c_2\ldots c_{|P_i|} \in \mathcal{P}\), there exists a unique node \(v\) in \(K\) such that the path from the root to \(v\) is spelled out by the characters of \(P_i\), i.e., the edges on this path are labeled \(c_1, c_2, \ldots, c_{|P_i|}\) in sequence. These nodes, representing the end of a pattern, are designated as output nodes. We can mark these nodes or associate them with the pattern(s) that end at that node. Conversely, every path from the root to an output node spells out a pattern in \(\mathcal{P}\).
Keyword Trees are designed to efficiently store and retrieve a set of strings by leveraging shared prefixes. This structure is particularly effective for tasks involving prefix-based searches and multiple pattern matching.
Consider the pattern set \(\mathcal{P}= \{\text{potato, tatoo, theater, other}\}\). The Keyword Tree for \(\mathcal{P}\) visually represents the patterns, sharing common prefixes such as the initial ‘t’ of ‘tatoo’ and ‘theater’. Figure 1 illustrates such a Keyword Tree. Observe how paths from the root to the output nodes (marked with double circles) spell out the patterns in \(\mathcal{P}\).
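A Keyword Tree without failure links is essentially a trie, and it can be built with a few lines of Python. The following is a minimal sketch using nested dictionaries; the `children` and `output` field names are illustrative choices, not a standard API.

```python
def build_keyword_tree(patterns):
    """Build a keyword tree (trie): each node is a dict with a 'children'
    map (edge label -> child node) and an 'output' list of patterns
    ending at that node."""
    root = {"children": {}, "output": []}
    for p in patterns:
        node = root
        for ch in p:
            # Reuse the existing edge if this prefix was already inserted,
            # otherwise create a new node -- this is the prefix sharing.
            node = node["children"].setdefault(ch, {"children": {}, "output": []})
        node["output"].append(p)  # mark an output node
    return root

tree = build_keyword_tree(["potato", "tatoo", "theater", "other"])
```

Here ‘tatoo’ and ‘theater’ share the single edge labeled ‘t’ out of the root, so the root ends up with only three children: ‘p’, ‘t’, and ‘o’.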
Properties and Dimensions
Let \(K(\mathcal{P})\) be the keyword tree constructed for a pattern set \(\mathcal{P}\). Keyword Trees exhibit several key properties that are crucial for understanding their efficiency and utility.
Prefix Sharing and Space Efficiency: The most significant advantage of Keyword Trees is their efficient representation of strings through prefix sharing. Common prefixes among patterns are represented by shared paths from the root. This drastically reduces redundancy, especially when dealing with a large set of patterns that share common initial substrings. In terms of space complexity, this sharing can lead to a substantial reduction compared to storing each pattern individually.
Path Representation of Prefixes and Patterns: Every path from the root to any node in \(K(\mathcal{P})\) corresponds to a prefix of at least one pattern in \(\mathcal{P}\). Specifically, paths from the root to the designated output nodes precisely represent the complete patterns in \(\mathcal{P}\). This direct path-to-string correspondence is fundamental to the pattern matching process.
Number of Arcs (Edges): The total number of arcs in \(K(\mathcal{P})\) depends on the total length of the patterns and on the extent of prefix sharing. Let \(m = \sum_{P_i \in \mathcal{P}} |P_i|\) be the sum of the lengths of all patterns. The number of arcs is at most \(m\); this bound is reached only when the patterns share no prefixes at all, while significant prefix sharing makes the actual count considerably smaller.
Number of Nodes: In any rooted tree, the number of nodes is exactly one more than the number of arcs, since every node except the root has precisely one incoming arc. Therefore, \(K(\mathcal{P})\) has at most \(m + 1\) nodes, and typically fewer due to prefix sharing.
Height of the Tree and Maximum Pattern Length: The height of the Keyword Tree \(K(\mathcal{P})\) is determined by the length of the longest pattern in the set \(\mathcal{P}\). Let \(L_{max} = \max_{P_i \in \mathcal{P}} |P_i|\). Then, the height of \(K(\mathcal{P})\) is exactly \(L_{max}\). This is because the longest path from the root to any output node corresponds to the longest pattern, and the height of a rooted tree is the length of the longest path from the root to a leaf (or in this case, to an output node).
Fan-out and Alphabet Size: The fan-out of any node in \(K(\mathcal{P})\), which is the number of outgoing edges from that node, is limited by the size of the alphabet \(\Sigma\). In fact, the maximum fan-out is at most \(|\Sigma|\). This is because for each node, there can be at most one outgoing edge for each character in the alphabet, due to the property that edges emanating from the same node must have distinct labels. In practice, the average fan-out is often much smaller than \(|\Sigma|\), especially for nodes closer to the root, as not all characters in the alphabet may be prefixes of the patterns.
Simplified Assumption
For simplicity in initial explanations, it’s often assumed that no pattern in \(\mathcal{P}\) is a prefix of another pattern in \(\mathcal{P}\). This simplifies the initial discussion and algorithm design. However, this assumption is not a fundamental requirement for Keyword Trees.
When some patterns are prefixes of others, reaching a node in the Keyword Tree might signify the end of multiple patterns. For instance, if \(\mathcal{P}= \{\text{he, her, hers}\}\), reaching the node for ‘he’ also implies that ‘he’ is a prefix of ‘her’ and ‘hers’. In such cases, output reporting needs to be handled carefully to ensure all valid matches are reported. Specifically, when we reach an output node, we must report not only the pattern ending at that node but also any other patterns for which the current pattern is a prefix and are also present in \(\mathcal{P}\). The core structure and algorithms for Keyword Trees remain valid even without the non-prefix assumption; only the output reporting mechanism needs to be refined to accommodate overlapping pattern matches. We will address this in more detail when discussing pattern matching algorithms using Keyword Trees.
Failure Links
To enhance the efficiency of pattern matching with a Keyword Tree, especially when a mismatch occurs during text traversal, we introduce failure links. Failure links are a crucial component that allows the pattern matching algorithm to avoid redundant comparisons and efficiently shift the search within the Keyword Tree.
Definition and Intuition
For each node \(v\) in the keyword tree \(K\), let \(L(v)\) denote the string spelled by the path from the root to \(v\). The failure link of \(v\), denoted \(f(v)\), is a pointer to the node \(v'\) in \(K\) such that \(L(v')\) is the longest proper suffix of \(L(v)\) that is also a prefix of some pattern in \(\mathcal{P}\). If no non-empty proper suffix of \(L(v)\) is a prefix of a pattern, \(f(v)\) points to the root.
The primary purpose of failure links is to enable efficient backtracking in the Keyword Tree during text scanning. When a mismatch occurs at a node \(v\) (i.e., there is no outgoing edge for the current text character), instead of restarting the search from the root, we follow the failure link from \(v\) to \(f(v)\). This effectively shifts our focus to the longest known prefix of a pattern that is also a suffix of the string we have just processed. This "shift" ensures that we do not miss any potential pattern matches and maximize the utilization of already matched prefixes.
Consider we are traversing the text and reach a node \(v\) in the Keyword Tree, representing a prefix \(L(v)\). If the next character in the text does not match any outgoing edge from \(v\), a naive approach would be to restart the pattern matching from the root. However, the failure link provides a smarter alternative. By following the failure link from \(v\) to \(f(v) = v'\), we transition to a new state \(v'\) that represents the longest suffix of \(L(v)\) which is also a prefix of some pattern. We then attempt to continue matching from \(v'\) with the same text character that caused the mismatch at \(v\).
Analogy to KMP Failure Function
The concept of failure links in Keyword Trees is directly analogous to the failure function (also known as the prefix function or border array) used in the Knuth-Morris-Pratt (KMP) algorithm for single pattern matching. In KMP, the failure function for a pattern at position \(i\) indicates the length of the longest proper prefix of the pattern that is also a suffix of the pattern up to position \(i\).
Failure links generalize this idea to a set of patterns and a tree structure. For each node in the Keyword Tree, the failure link points to the node representing the longest proper suffix that is also a prefix of any pattern in the given set. This generalization allows for efficient handling of multiple patterns simultaneously, extending the efficiency of KMP’s optimized shifting to the multiple pattern matching problem.
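For reference, the KMP failure function described above can be computed as follows; this is the standard formulation, shown here in Python with 0-based indices.

```python
def kmp_failure(pattern):
    """Compute the KMP failure (prefix) function: fail[i] is the length of
    the longest proper prefix of pattern[:i+1] that is also its suffix."""
    fail = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]  # fall back, just like following a failure link
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail
```

The inner `while` loop is exactly the "follow failure links upward" step that the Keyword Tree generalizes from a single pattern to a tree of patterns.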
Illustrative Example
Imagine a Keyword Tree for the patterns \(\{\text{she}, \text{hers}, \text{he}\}\). Consider a node \(v\) in the tree reached by the path spelling "she", so \(L(v) = \text{she}\). We need to find the longest proper suffix of "she" that is also a prefix of some pattern in the set. The proper suffixes of "she" are "he", "e", and "" (the empty string). Among these, "he" is a prefix of the pattern "he" in our set. Therefore, the failure link of node \(v\) (representing "she") points to the node \(v'\) that represents "he".
In Figure 2, the dashed red arrow illustrates the failure link from the node representing "she" to the node representing "he". If, during text processing, we are at the "she" node and encounter a character that does not extend any pattern from "she", we follow this failure link to "he" and continue our search from there. This ensures we consider the possibility of patterns starting with "he" that might occur immediately after a prefix ending in "she".
Algorithm for Keyword Tree Construction with Failure Links
The Keyword Tree and its failure links can be constructed efficiently, typically using a breadth-first search (BFS) approach for failure link computation.
1. Initialize: Create a root node for the Keyword Tree \(K\), and initialize an empty queue \(Q\) for the BFS.
2. Insert Patterns: For each pattern \(P_i \in \mathcal{P}\), insert it into \(K\): starting from the root, traverse \(K\) according to the characters of \(P_i\); whenever a character has no matching edge, create a new node and edge. Mark the node reached after the last character of \(P_i\) as an output node and associate it with \(P_i\).
3. Initialize Failure Links for Depth-1 Nodes: For every node \(v\) at depth 1 (the children of the root), set \(f(v) = \text{root}\) and enqueue \(v\) into \(Q\).
4. BFS for Failure Links: While \(Q\) is not empty, dequeue a node \(u\). For each child \(v\) of \(u\), reached by an edge labeled \(c\):
   - Set \(fail\_node = f(u)\).
   - While \(fail\_node\) is not the root and \(fail\_node\) has no child with edge label \(c\), set \(fail\_node = f(fail\_node)\).
   - If \(fail\_node\) has a child \(v'\) with edge label \(c\), set \(f(v) = v'\); otherwise set \(f(v) = \text{root}\).
   - Enqueue \(v\) into \(Q\).
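The construction can be sketched in Python as follows; the `Node` class and its field names are illustrative choices, and edges are stored in per-node dictionaries.

```python
from collections import deque

class Node:
    """Keyword tree node; field names are illustrative."""
    def __init__(self):
        self.children = {}  # edge label -> child Node
        self.fail = None    # failure link
        self.output = []    # patterns whose path ends at this node

def build_automaton(patterns):
    # Steps 1-2: create the root and insert every pattern.
    root = Node()
    for p in patterns:
        node = root
        for ch in p:
            node = node.children.setdefault(ch, Node())
        node.output.append(p)
    # Step 3: depth-1 nodes fail to the root.
    queue = deque()
    for child in root.children.values():
        child.fail = root
        queue.append(child)
    # Step 4: BFS; parents are dequeued before children, so f(u) is
    # already known when we compute f(v) for a child v of u.
    while queue:
        u = queue.popleft()
        for c, v in u.children.items():
            fail_node = u.fail
            while fail_node is not root and c not in fail_node.children:
                fail_node = fail_node.fail
            v.fail = fail_node.children.get(c, root)
            queue.append(v)
    return root
```

Because \(f(u)\) always points to a strictly shallower node, the inner `while` loop only ever moves upward, which is what makes the amortized analysis below work.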
Complexity Analysis: The construction of the Keyword Tree and failure links is linear in the total length of the patterns, \(O(m)\), where \(m = \sum_{P_i \in \mathcal{P}} |P_i|\). Each edge and failure link is processed a constant number of times. The BFS traversal ensures level-by-level processing, and failure link calculation for each node takes amortized constant time by efficiently traversing failure links upwards.
Using Keyword Tree for Pattern Matching
Once the Keyword Tree is constructed and failure links are computed, we can efficiently locate all occurrences of the patterns \(\mathcal{P}= \{P_1, P_2, \ldots, P_z\}\) within a given text \(T\). The pattern matching algorithm processes the text in a single pass, achieving a time complexity that is linear in the length of the text, plus the number of occurrences.
1. Initialization: Set \(current\_node\) to the root of the Keyword Tree \(K\).
2. Text Traversal: For each character \(c\) in the input text \(T\), in order from the first to the last character:
   - Transition: While \(current\_node\) is not the root and has no outgoing edge labeled \(c\), set \(current\_node \leftarrow f(current\_node)\). Then, if \(current\_node\) has an outgoing edge labeled \(c\), move to the child it leads to; otherwise remain at the root.
   - Output Check: Starting from \(current\_node\), follow failure links zero or more times. For every output node encountered along this chain, report an occurrence of each associated pattern \(P_i\) ending at the current position in \(T\).
Complexity Analysis: The pattern matching algorithm has a time complexity of \(O(|T| + \text{number of occurrences})\). Processing each character of the text involves traversing edges in the Keyword Tree and potentially following failure links; in the worst case, a single character may trigger several failure-link moves, possibly all the way up to the root. However, the total number of failure-link traversals over the entire text is bounded by \(|T|\): each character processed increases the depth in the tree by at most one (when we move down a tree edge), while every failure-link move strictly decreases the depth, and the depth can never become negative. The output reporting phase takes time proportional to the number of patterns found at each position. Therefore, the overall time complexity is linear in the text length plus the total number of pattern occurrences reported.
Detailed Explanation of the Algorithm
1. Initialization: We begin at the root of the Keyword Tree. The ‘current_node’ variable keeps track of our current state in the tree as we process the text.
2. Text Traversal and Transition: For each character in the text \(T\), we attempt to extend the current prefix match.
Matching Character: If there is an outgoing edge from the ‘current_node’ that matches the current text character \(c\), we follow that edge to the next node in the tree. This signifies a successful extension of the current prefix match.
Mismatch and Failure Link Handling: If there is no matching outgoing edge for the current character \(c\) from the ‘current_node’, it indicates a mismatch. In this situation, we utilize the failure link. We repeatedly follow the failure link from the ‘current_node’ to its failure node, until we either reach a node that has a child edge matching the character \(c\), or we reach the root. This process effectively shifts to the longest proper suffix of the currently matched string that is also a prefix of some pattern. If even from the root, there is no transition for character \(c\), we effectively remain at the root for the next character processing.
3. Output Check and Reporting: After each character is processed and the node transition made, we check whether the ‘current_node’, or any node reachable from it by following failure links, is an output node. Reaching an output node \(v\) means that the string spelled from the root to \(v\) is a pattern that ends exactly at the current position in the text.
Example Walkthrough
Let’s consider the pattern set \(\mathcal{P}= \{\text{he, she, hers}\}\) and search in the text \(T= \text{ushers}\). Assume we have already constructed the Keyword Tree with failure links for \(\mathcal{P}\).
| Step | Text Character | Current Node | Action | Output Patterns Reported |
| :--- | :--- | :--- | :--- | :--- |
| 1 | u | root | No ‘u’ edge from root; stay at root | None |
| 2 | s | root | ‘s’ edge from root; move to node ‘s’ | None |
| 3 | h | ‘s’ | ‘h’ edge from ‘s’; move to node ‘sh’ | None |
| 4 | e | ‘sh’ | ‘e’ edge from ‘sh’; move to node ‘she’ (output); failure link leads to ‘he’ (also output) | "she", "he" |
| 5 | r | ‘she’ | No ‘r’ edge from ‘she’; follow failure link to ‘he’; ‘r’ edge from ‘he’; move to node ‘her’ | None |
| 6 | s | ‘her’ | ‘s’ edge from ‘her’; move to node ‘hers’ (output) | "hers" |
In this example, the algorithm correctly identifies the occurrences of "she", "he", and "hers" in "ushers". The use of failure links is visible in steps 4 and 5: in step 4, following failure links from ‘she’ uncovers the overlapping match "he"; in step 5, upon encountering ‘r’ after ‘she’, the algorithm efficiently shifts the search context to the suffix "he" and continues matching toward "hers", without restarting from the root.
This algorithm provides a robust and efficient method for multiple pattern matching, leveraging the Keyword Tree structure and failure links to minimize redundant comparisons and process the text in linear time relative to its length.
Conclusion
The Keyword Tree, when augmented with failure links, provides a highly efficient and elegant solution to the multiple pattern matching problem. This data structure allows for the simultaneous search of a set of patterns within a text in a time complexity that is linear with respect to the length of the text, specifically \(O(|T| + \text{number of occurrences})\). The construction of the Keyword Tree itself is also efficient, taking time proportional to the total length of the patterns, \(O(m)\), where \(m = \sum_{P_i \in \mathcal{P}} |P_i|\).
This approach marks a significant advancement over naive methods that require processing the text multiple times, once for each pattern. The Keyword Tree achieves its efficiency by sharing common prefixes among patterns, processing the text in a single pass, and using failure links to avoid redundant comparisons after mismatches.
The linear time complexity of text searching makes Keyword Trees particularly valuable in applications where large texts must be searched against extensive sets of patterns, in fields as diverse as bioinformatics, network intrusion detection, and large-scale text processing.
In summary, the Keyword Tree is not just a theoretical construct but a practical and powerful data structure in stringology. Its ability to solve the multiple pattern matching problem efficiently has made it a cornerstone technique in numerous computational applications, especially those dealing with large-scale text and pattern analysis. The combination of its structural elegance and algorithmic efficiency underscores its fundamental importance in computer science and related disciplines.
Exercises
Construct a Keyword Tree for the pattern set \(\mathcal{P}= \{\text{he, she, his, hers}\}\). Draw the tree and compute the failure links for each node.
Using the Keyword Tree constructed in Exercise 1, find all occurrences of the patterns in \(\mathcal{P}\) in the text \(T= \text{ushers}\).
Explain why the algorithm for constructing failure links is linear in the total length of the patterns.
Discuss the space complexity of storing a Keyword Tree. How does it depend on the number of patterns and the alphabet size?
Consider the case where some patterns in \(\mathcal{P}\) are prefixes of other patterns. How would you modify the output reporting step in the pattern matching algorithm to ensure all matches are found?
Solutions
Keyword Tree Construction for \(\mathcal{P}= \{\text{he, she, his, hers}\}\):
Keyword Tree Structure:
Figure 3: Keyword Tree for \(\mathcal{P}= \{\text{he, she, his, hers}\}\) with failure links (red dashed arrows).
Failure Link Calculation:
Node ‘h’: The only proper suffix of "h" is the empty string, so \(f(\text{h}) = \text{root}\).
Node ‘s’: Similarly, \(f(\text{s}) = \text{root}\).
Node ‘he’: The proper suffixes of "he" are "e" and the empty string; "e" is not a prefix of any pattern, so \(f(\text{he}) = \text{root}\).
Node ‘hi’: The proper suffixes of "hi" are "i" and the empty string; neither is a non-empty prefix of a pattern, so \(f(\text{hi}) = \text{root}\).
Node ‘sh’: The proper suffixes of "sh" are "h" and the empty string; "h" is a prefix of "he", "his", and "hers", so \(f(\text{sh}) = \text{h}\).
Node ‘she’: The proper suffixes of "she" are "he", "e", and the empty string; "he" is a prefix of "he" and "hers", so \(f(\text{she}) = \text{he}\).
Node ‘his’: The proper suffixes of "his" are "is", "s", and the empty string; "s" is a prefix of "she", so \(f(\text{his}) = \text{s}\).
Node ‘her’: The proper suffixes of "her" are "er", "r", and the empty string; none of the non-empty ones is a prefix of any pattern, so \(f(\text{her}) = \text{root}\).
Node ‘hers’: The proper suffixes of "hers" are "ers", "rs", "s", and the empty string; "s" is a prefix of "she", so \(f(\text{hers}) = \text{s}\).
The failure links are shown as red dashed arrows in Figure 3.
Pattern Matching in \(T= \text{ushers}\):
Starting at the root.
Read ‘u’: No ‘u’ edge from root. Stay at root. Current node: root. No output.
Read ‘s’: ‘s’ edge from root. Move to node ‘s’. Current node: ‘s’. No output.
Read ‘h’: ‘h’ edge from ‘s’. Move to node ‘sh’. Current node: ‘sh’. No output.
Read ‘e’: ‘e’ edge from ‘sh’. Move to node ‘she’. Current node: ‘she’. Output node "she". Report "she" at position 4 (ending at ‘e’). Follow failure links from ‘she’ to ‘he’. ‘he’ is also an output node. Report "he" at position 4.
Read ‘r’: No ‘r’ edge from ‘she’. Follow failure link to ‘he’. ‘he’ has an ‘r’ edge (toward "hers"). Move to node ‘her’. Current node: ‘her’. No output.
Read ‘s’: ‘s’ edge from ‘her’. Move to node ‘hers’. Current node: ‘hers’. Output node "hers". Report "hers" at position 6 (ending at the final ‘s’). Following failure links from ‘hers’ leads to ‘s’ and then to the root, neither of which is an output node.
Occurrences found: "she" ending at position 4, "he" ending at position 4, and "hers" ending at position 6.
Linearity of Failure Link Construction:
The algorithm for constructing failure links uses Breadth-First Search (BFS). Each node in the Keyword Tree is enqueued and dequeued exactly once. For each node \(u\), and for each character \(c\) in the alphabet, we attempt to find the failure link for the child node \(v\) reached by character \(c\). The crucial part is finding the appropriate failure node by traversing failure links upwards (“While \(fail\_node\) is not the root and there is no child of \(fail\_node\) with edge label \(c\) ...”). Although in the worst case, we might follow a long chain of failure links, the amortized cost is constant.
Consider the depth of a node in the Keyword Tree as the length of the string represented by the path from the root, and fix a single pattern \(P_i\), following its path from the root. When we compute the failure link for a node \(v\) on this path which is a child of \(u\), we start from \(f(u)\), whose depth is strictly less than the depth of \(u\); each traversal of a failure link strictly decreases the current depth, while moving from \(u\) to \(v\) increases the candidate depth by at most one. Since the depth never becomes negative, the total number of failure-link traversals along the entire path of \(P_i\) is at most \(|P_i|\). Summing over all patterns, the failure-link computation performs \(O(m)\) traversals in total, amortized constant work per edge of the tree. Since the number of edges is linear in the total length of the patterns, the total time for failure link construction is \(O(m)\).
Space Complexity of Keyword Tree:
The space complexity of a Keyword Tree is primarily determined by the number of nodes and edges it contains. In the worst case, where there is no prefix sharing among the patterns, the number of nodes and edges is proportional to the sum of the lengths of all patterns, \(m = \sum_{i=1}^{z} |P_i|\): the tree has at most \(m + 1\) nodes and \(m\) edges. The per-node storage depends on how child edges are represented: an array indexed by the alphabet uses \(O(|\Sigma|)\) space per node, giving \(O(m \cdot |\Sigma|)\) overall, whereas hash maps or sorted child lists reduce the total to \(O(m)\) at the cost of slightly slower edge lookups. Prefix sharing among patterns reduces the node count further.
In summary, the space complexity is primarily influenced by the total length of the patterns and the degree of prefix sharing. A larger alphabet size can increase space usage, especially in naive implementations, but efficient implementations can mitigate this dependency.
Handling Prefix Patterns in Output Reporting:
When some patterns in \(\mathcal{P}\) are prefixes of other patterns, simply checking if the ‘current_node’ is an output node is insufficient to report all matches. Consider \(\mathcal{P}= \{\text{he, hers, her}\}\). If we are at the node representing "her", it’s an output node for "her". However, "he" is a prefix of "her", and if "he" is also in \(\mathcal{P}\), we should also report "he" as a match ending at the same position.
To ensure all matches are found, we need to modify the output reporting step in the pattern matching algorithm. When we reach a ‘current_node’, we should not only check if it is an output node but also traverse its failure links. For each node encountered while following failure links (including the ‘current_node’ itself), we check if it is an output node. If it is, we report all patterns associated with that output node. This is because a failure link points to a node representing a suffix that is also a prefix. If the current node represents a matched string \(S\), and its failure link points to a node representing \(S'\), then \(S'\) is a suffix of \(S\) and a prefix of some pattern. If \(S'\) is also a pattern (i.e., the node for \(S'\) is an output node), then we have found another pattern match ending at the same position.
Therefore, the modified output reporting step in the algorithm should be:
Modified Output Check:
Let \(temp\_node = current\_node\). While \(temp\_node\) is not the root: if \(temp\_node\) is an output node, report an occurrence of each pattern \(P_i\) associated with \(temp\_node\) ending at the current position in \(T\); then set \(temp\_node \leftarrow f(temp\_node)\).
This ensures that we report all patterns that end at the current text position, including those that are prefixes of other matched patterns.
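The modified output check can be sketched as a small Python helper. It assumes nodes are dictionaries carrying `output` and `fail` fields (an illustrative representation, consistent with the dictionary-based sketches earlier in these notes), and the `report` callback is a hypothetical parameter standing in for whatever the caller does with a match.

```python
def report_outputs(current_node, position, root, report):
    """Walk the failure-link chain from current_node and report every
    pattern stored at an output node along the way."""
    node = current_node
    while node is not root:
        for pattern in node["output"]:
            report(pattern, position)  # pattern ends at this text position
        node = node["fail"]
```

For \(\mathcal{P}= \{\text{he, her}\}\), calling this helper at the node for "her" reports both "her" and, via its failure chain, the prefix pattern "he".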