Suffix Trees and Applications

Author

Your Name

Published

February 8, 2025

Introduction

In this lecture, we delve into the concept of suffix trees, a pivotal data structure in string algorithms. Suffix trees are renowned for their efficiency and versatility, finding extensive applications in diverse fields such as bioinformatics, text processing, and data compression. Our exploration will commence with a seemingly straightforward problem: identifying the longest repeated substring within a given string. This problem serves as an excellent entry point to appreciate the power and utility of suffix trees.

Consider the following problem:

Problem: Given a string \(T\), efficiently determine the longest substring that appears at least twice within \(T\).

As hinted in our notes, suffix trees offer an elegant and efficient approach to solve this problem, along with a multitude of other string-related challenges. The assertion that "with a suffix tree of \(T\), we can easily solve the problem" underscores the significance of this data structure. In this lecture, we aim to elucidate why this statement holds true and to thoroughly understand the construction and application of suffix trees.

Suffix Tries and Suffix Trees

Suffix Trie: \(\mathcal{K}(T)\)

Let \(T\) be a string of length \(m\). A suffix of \(T\) is any substring that starts at some position \(j\) and extends to the end of \(T\). We denote the \(j\)-th suffix of \(T\) as \(T_{j} = T[j..m]\).

Example 1.

For \(T= \text{"banana"}\), the suffixes are: \[\begin{array}{ll}T_{1} = \text{"banana"}, & T_{4} = \text{"ana"}, \\T_{2} = \text{"anana"}, & T_{5} = \text{"na"}, \\T_{3} = \text{"nana"}, & T_{6} = \text{"a"}.\end{array}\]

The suffix trie is a fundamental data structure for string processing. It is essentially a trie that stores all suffixes of a given text, also referred to in the notes as "Keyword tree of the text!".

Definition 1 (Suffix Trie).

The Suffix Trie of \(T\), denoted as \(\mathcal{K}(T)\), is a trie that stores all suffixes of \(T\). Each path from the root to a leaf in the suffix trie represents a suffix of \(T\).

To construct \(\mathcal{K}(T)\), we insert all suffixes of \(T\) into a trie. For example, to build the suffix trie for "banana", we would insert "banana", "anana", "nana", "ana", "na", and "a" into a trie.

Size of Suffix Trie

A crucial question is the space complexity of the suffix trie. As indicated in the notes, "Dimension of \(\mathcal{K}(T)\) can be quadratic". Let’s examine the worst-case scenario for the size of \(\mathcal{K}(T)\).

Example 2 (Worst-case size).

Consider a string like \(T= \text{"a}^n\text{b}^n"\). The suffixes are \(\text{"a}^n\text{b}^n"\), \(\text{"a}^{n-1}\text{b}^n"\), ..., \(\text{"ab}^n"\), \(\text{"b}^n"\), \(\text{"b}^{n-1}"\), ..., \(\text{"b"}\), and the empty string. When we construct a trie from these suffixes, in the worst case, each suffix may contribute a path whose length is nearly the length of the suffix itself. For a string of length \(m\), the sum of the lengths of all suffixes is approximately \(m + (m-1) + ... + 1 = \frac{m(m+1)}{2} = O(m^2)\). Thus, the size of the suffix trie, in terms of the number of nodes and edges, can indeed be quadratic in the length of the string in the worst-case scenario. For instance, in the string "aaaaaaaaaa", each suffix adds a new branch for each character, leading to a quadratic number of nodes. Another example is a string consisting of \(n\) distinct characters, like "abcdefg...". Each suffix will branch off from the root in a unique path, leading to a trie with \(O(m^2)\) nodes and edges.

This quadratic space complexity can be prohibitive for long texts. To overcome this, we introduce the concept of a compacted suffix trie, or suffix tree.

Compacted Suffix Trie and Suffix Tree

To mitigate the potential quadratic size of suffix tries, we use a more space-efficient representation: the compacted suffix trie, commonly known as the suffix tree. The core idea is to eliminate non-branching paths in the suffix trie.

Definition 2 (Suffix Tree).

The Suffix Tree of \(T\), denoted as \(\mathcal{ST}(T)\), is a compacted trie of all suffixes of \(T\). It is derived from the suffix trie by applying the following compaction rule: replace any maximal path of internal nodes with degree two (i.e., each node in the path, except possibly the first and last, has exactly one child) with a single edge. The edge is labeled with the concatenation of the labels of the edges comprising the path. In a suffix tree, each internal node (except possibly the root) has at least two children, and each edge is labeled with a non-empty substring of \(T\). Leaves in the suffix tree correspond to the suffixes of \(T\).

As indicated in the notes, we aim to "replace sequence of nodes with a single child: can be replaced by a single arc?". The challenge then becomes "Problem: the labels". Instead of labeling edges with single characters, we label them with strings (substrings of the original text). To efficiently represent these substring labels, we use pairs of integers \([i, j]\) that correspond to the substring \(T[i..j]\). This approach avoids redundant storage of substrings; we only need to store the original text \(T\) once, and the edges in the suffix tree contain references to substrings within \(T\). As the notes emphasize: "MA devo tenere in memoria anche il testo" (BUT I also have to keep the text in memory) and "My data structure: K’(T)-compacted + T".

Compaction Process in Detail

The compaction process transforms a suffix trie into a suffix tree by iteratively identifying and compressing non-branching paths. Here are the detailed steps for compaction:

1. **Identify Non-branching Paths:** Start from the root of the suffix trie. Traverse downwards. If a node \(v\) (other than a leaf) has only one child \(u\), it is the start of a non-branching path. 2. **Determine Maximal Path:** Follow the path from \(v\) through its unique children until a node is reached that is either a leaf or a branching node (has more than one child) or the root is reached (in some cases, though root is typically branching in suffix tries unless only one suffix is present). Let’s say this path is \(v \rightarrow n_1 \rightarrow n_2 \rightarrow \dots \rightarrow u\), where \(v\) is the starting node, and \(u\) is the ending node which is either a leaf or a branching node, or the path ends because the next node would be branching. 3. **Concatenate Edge Labels:** Collect the edge labels along the path \(v \rightarrow n_1, n_1 \rightarrow n_2, \dots, n_{k-1} \rightarrow u\). Concatenate these labels in order to form a new string label. 4. **Replace Path with Single Edge:** Remove the intermediate nodes \(n_1, n_2, \dots, n_{k-1}\) and replace the entire path from \(v\) to \(u\) with a single edge from \(v\) to \(u\). Label this new edge with the concatenated string obtained in the previous step. If \(u\) was originally a node in the suffix trie, it remains a node in the suffix tree. If \(u\) was reached because we ran out of single-child nodes, then the path ends at the last single-child node, and the edge goes to that node. 5. **Repeat:** Repeat steps 1-4 until no more non-branching paths of internal nodes with out-degree one exist in the trie.

After compaction, every internal node in the suffix tree (except possibly the root) will be a branching node, having at least two children. The edges will be labeled with substrings of the original text \(T\).

Illustration of path compaction in a suffix trie to form a suffix tree. Paths of single-child nodes are compressed into single edges with string labels. For simplicity, only a part of a hypothetical suffix trie is shown.

Figure 1 illustrates a simplified example of path compaction. The paths of single-child nodes are replaced by single edges with concatenated string labels, resulting in a more compact tree structure.

Observation: Substrings and Suffixes

A fundamental observation that highlights the power of suffix trees is the relationship between substrings and suffixes.

Proposition 1.

Description: This proposition states a fundamental relationship between substrings and suffixes, showing that any substring of a text is a prefix of some suffix of the same text. Every substring of \(T\) is a prefix of a suffix of \(T\).

Proof. Proof. Let \(S\) be any substring of \(T\). Then \(S\) occurs in \(T\) starting at some position \(i\) and ending at position \(j\). Thus, \(S = T[i..j]\). Consider the suffix of \(T\) starting at position \(i\), which is \(T_{i} = T[i..m]\). Since \(S = T[i..j]\) and \(T_{i} = T[i..m]\) with \(j \leq m\), it is clear that \(S\) is a prefix of \(T_{i}\). ◻

This observation is crucial because it implies that every substring of \(T\) is represented by a path from the root in the suffix tree \(\mathcal{ST}(T)\). This path may not necessarily end at a leaf, but it corresponds to reading the characters of the substring along the edges of the tree starting from the root.

Using Suffix Tree for Pattern Matching

Suffix trees provide an efficient way to solve the pattern matching problem. Given a pattern \(P\) and a text \(T\), we want to find all occurrences of \(P\) in \(T\). Using \(\mathcal{ST}(T)\), we can perform this search efficiently.

To find all occurrences of a pattern \(P\) in \(T\), we traverse the suffix tree \(\mathcal{ST}(T)\) starting from the root. We attempt to match the characters of \(P\) sequentially along a path in the tree.

1. Start at the root of \(\mathcal{ST}(T)\). 2. For each character in \(P\), starting from the first character, try to follow an edge from the current node whose label starts with this character. 3. If a matching edge is found, advance along the edge, matching as many characters as possible from the edge label with the remaining characters of \(P\). 4. If all characters of \(P\) are matched successfully, we reach a node \(v\) in the suffix tree (it could be in the middle of an edge or at a node). The occurrences of \(P\) in \(T\) are precisely the starting positions of the suffixes represented by the leaves in the subtree rooted at \(v\). 5. To find all starting positions, traverse the subtree rooted at \(v\) and collect the starting indices of all suffixes represented by the leaves in this subtree.

The number of leaves in the subtree rooted at \(v\) corresponds to the number of occurrences of \(P\) in \(T\). Each leaf in the suffix tree typically stores the starting position of the suffix it represents.

Algorithm 1.

Input: Suffix Tree \(\mathcal{ST}(T)\) of text \(T\), pattern \(P\)
Output: Starting positions of all occurrences of \(P\) in \(T\)

\(v \leftarrow \text{root of } \mathcal{ST}(T)\) \(i \leftarrow 1\) \(m \leftarrow |P|\) while \(i \leq m\) do Find an edge \((v, u)\) starting from node \(v\) such that the edge label starts with \(P[i]\) if no such edge exists then return "Pattern not found" else Let the edge label be \(T[start..end]\) \(j \leftarrow 1\) while \(i \leq m\) and \(start + j - 1 \leq end\) and \(P[i] = T[start + j - 1]\) do \(i \leftarrow i + 1\) \(j \leftarrow j + 1\) if \(i > m\) then Let \(v_{match}\) be the node reached after matching \(P\) Collect starting positions from all leaves in the subtree rooted at \(v_{match}\) return Starting positions \(v \leftarrow u\) return "Pattern not found"

Complexity Analysis: The time complexity of pattern matching using a suffix tree is \(O(|P| + k)\), where \(|P|\) is the length of the pattern and \(k\) is the number of occurrences of \(P\) in \(T\). The traversal down the tree takes at most \(O(|P|)\) time because in a suffix tree, we follow a path corresponding to the pattern length. Reporting all occurrences involves traversing the subtree, which takes \(O(k)\) time if there are \(k\) occurrences, as there are \(k\) leaves in the subtree. This is considered optimal because it is linear in the pattern length plus the output size.

Strategy for Suffix Tree Construction

Efficiently constructing a suffix tree is crucial for its practical applications. Remarkably, suffix trees for a string \(T\) of length \(m\) can be constructed in \(O(m)\) time. Several linear-time algorithms exist for suffix tree construction, including Ukkonen’s algorithm and McCreight’s algorithm.

Remark. Remark 1.

Suffix trees for a string \(T\) of length \(m\) can be constructed in \(O(m)\) time. Algorithms like Ukkonen’s and McCreight’s achieve this linear time complexity, making suffix trees highly efficient for practical applications. These algorithms are typically based on incremental construction methods.

These algorithms are sophisticated and build the suffix tree incrementally. Ukkonen’s algorithm, for instance, is an online algorithm that constructs the suffix tree by processing the text character by character from left to right. McCreight’s algorithm is another linear-time algorithm that uses a different approach, processing suffixes in a specific order to achieve linear time complexity.

While the detailed explanation of these algorithms is beyond the scope of this lecture, it is important to recognize that linear-time construction is achievable, making suffix trees a highly efficient data structure for various string processing tasks.

In practice, the choice of construction algorithm might depend on factors such as implementation complexity and constant factors in the time complexity, but the theoretical linear time bound underscores the efficiency of suffix trees.

Conclusion

Suffix trees stand as a cornerstone data structure in string processing, renowned for their power and efficiency. They provide a remarkably effective representation of all substrings within a text, thereby enabling swift and elegant solutions to a wide array of string-related problems. Their versatility makes them indispensable in numerous applications, spanning from fundamental string operations to complex computational tasks in diverse fields.

Key applications of suffix trees include:

  • Pattern Matching: Efficiently finding all occurrences of a given pattern within a text, as detailed in [alg:pattern_matching].

  • Longest Repeated Substring Problem: Solving the problem of finding the longest substring that appears at least twice in a text. This problem, which served as our initial motivation, is readily solved using suffix trees by finding the deepest internal node.

  • Longest Common Substring Problem: Identifying the longest substring common to two or more strings. Suffix trees can be generalized to handle multiple strings, making this problem solvable in linear time.

  • Bioinformatics Applications: Extensive use in sequence analysis, including genome alignment, finding motifs in DNA sequences, and phylogenetic tree construction.

  • Text Compression and Indexing: Serving as a basis for advanced text compression techniques and efficient text indexing for rapid information retrieval.

  • String Algorithms and Data Structures Research: As a fundamental data structure, suffix trees are often used as a building block or comparison point in the development of new string algorithms and data structures.

A significant advantage of suffix trees over suffix tries is their space efficiency. While suffix tries can exhibit quadratic space complexity in the worst case, suffix trees achieve linear space complexity, scaling linearly with the length of the input text. This compaction, combined with the existence of linear-time construction algorithms like Ukkonen’s and McCreight’s algorithms, solidifies suffix trees as an invaluable and practical tool in stringology, bioinformatics, and various areas of computer science where efficient string manipulation is paramount. Their ability to handle complex string queries with optimal or near-optimal time complexity makes them a preferred choice for many advanced applications.

Exercises

  1. Construct the suffix trie and the suffix tree for the string "banana$", where ‘$’ is a terminator symbol not present in the alphabet. Explain the steps for compaction in detail.

  2. Using the suffix tree of "banana$" constructed in Exercise 1, find all occurrences of the pattern "ana". Illustrate the tree traversal and identify the relevant subtree.

  3. Explain in detail how you would use a suffix tree to solve the longest repeated substring problem in a string \(T\).

  4. Compare and contrast the space complexity of a suffix trie and a suffix tree. In what specific scenarios is a suffix tree significantly more space-efficient than a suffix trie, and why?

  5. Research and briefly describe one linear-time algorithm for constructing suffix trees (e.g., Ukkonen’s algorithm or McCreight’s algorithm). Focus on the high-level approach and key ideas rather than detailed steps.