On-Line Construction of Suffix Trees

Author

Your Name

Published

February 8, 2025

Introduction

Suffix trees are a cornerstone data structure in string algorithms, finding extensive use in diverse applications such as pattern matching, text indexing, and bioinformatics. These trie-like structures efficiently represent all suffixes of a given string, enabling rapid solutions to complex string problems. For instance, in bioinformatics, suffix trees are invaluable for tasks like genome analysis and sequence alignment, where identifying repeated patterns or motifs is crucial. In text indexing, they allow for efficient substring searches, forming the basis of many search engines and text editors. While linear-time algorithms for suffix tree construction have been known for some time, notably Weiner’s algorithm in 1973 and McCreight’s algorithm in 1976, they are often perceived as challenging to implement and understand due to their intricate logic and reliance on non-intuitive construction steps.

This lecture delves into the work of E. Ukkonen, who introduced a remarkably elegant and conceptually accessible on-line algorithm for linear-time suffix tree construction. Ukkonen’s algorithm distinguishes itself by processing the input string character by character from left to right. At each step, it maintains the suffix tree of the prefix of the string processed so far, ensuring that the suffix tree is always "on-line" and ready for querying. This on-line property is particularly advantageous in scenarios where data arrives sequentially or in streaming fashion. Furthermore, the algorithm’s clarity stems from its intuitive approach, building upon a simpler, albeit quadratic-time, suffix trie construction method. Ukkonen’s contribution is thus highly significant, offering both practical efficiency and pedagogical clarity to the field of suffix tree construction.

In this lecture, we will systematically explore the following key aspects:

  • Suffix Tries and Their Construction: We will begin by formally defining suffix tries and detailing their step-by-step construction process. This will lay the groundwork for understanding suffix trees by first examining a simpler, related structure.

  • Suffix Trees: Space-Efficient Representation: We will then transition to suffix trees, highlighting them as a space-optimized alternative to suffix tries. We will discuss the critical differences in representation that lead to linear space complexity, contrasting them with the potentially quadratic space usage of suffix tries.

  • Ukkonen’s On-line Algorithm: The core of this lecture will be a detailed, in-depth explanation of Ukkonen’s on-line suffix tree construction algorithm. We will dissect the algorithm’s logic, emphasizing its on-line nature and linear time complexity.

  • Key Procedures: update, test-and-split, and canonize: Finally, we will scrutinize the three fundamental procedures that underpin Ukkonen’s algorithm: update, test-and-split, and canonize. Understanding these procedures is essential for grasping the inner workings and efficiency of the overall algorithm.

Suffix Tries

Definitions

Definition 1 (String, Substring, Suffix). Let \(T = t_1t_2...t_n\) be a string over an alphabet \(\Sigma\).

  • A string \(x\) is a substring of \(T\) if \(T = uxv\) for some strings \(u\) and \(v\).

  • For \(1 \leq i \leq n+1\), a string \(T_i = t_i...t_n\) is a suffix of \(T\). Note that \(T_{n+1} = \varepsilon\) is the empty suffix.

We denote the set of all suffixes of \(T\) as \(\sigma(T)\). For example, if \(T = \text{"cacao"}\), then \(\sigma(T) = \{ \text{"cacao"}, \text{"acao"}, \text{"cao"}, \text{"ao"}, \text{"o"}, \varepsilon\}\).
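As a quick sanity check, the suffix set \(\sigma(T)\) can be enumerated in a few lines (a minimal Python sketch; the function name `sigma` is ours, not from the paper):

```python
def sigma(t):
    """All suffixes of t, including the empty suffix (the case i = n + 1)."""
    return {t[i:] for i in range(len(t) + 1)}
```

For \(T = \text{"cacao"}\) this yields exactly the six suffixes listed above, with the empty string standing in for \(\varepsilon\).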

Definition 2 (Suffix Trie). The suffix trie of \(T\), denoted as \(STrie(T)\), is a trie that represents the set of all suffixes \(\sigma(T)\).

Formally, we define the suffix trie as an augmented Deterministic Finite Automaton (DFA): \(STrie(T) = (Q \cup \{\perp\}, root, F, g, f)\), where:

  • \(Q \cup \{\perp\}\) is the set of states.

  • \(root\) is the initial state, representing the empty string \(\varepsilon\).

  • \(F \subseteq Q \cup \{\perp\}\) is the set of final states, representing the set of suffixes \(\sigma(T)\).

  • \(g\) is the transition function: \(g(x, a) = y\) for states \(x, y \in Q\) and character \(a \in \Sigma\), where, if state \(x\) corresponds to a string \(u\), state \(y\) corresponds to the string \(ua\).

  • \(f\) is the suffix function (or suffix link). For each state \(x \in Q\), \(x \neq root\), if \(x\) corresponds to a string \(ay\) with \(a \in \Sigma\), then \(f(x) = \bar{y}\), where \(\bar{y}\) is the state corresponding to the string \(y\). We define \(f(root) = \perp\).

  • \(\perp\) is an auxiliary state. It facilitates uniform handling of empty and non-empty suffixes. It is connected to the trie by \(g(\perp, a) = root\) for every \(a \in \Sigma\). The suffix function \(f(\perp)\) is undefined.

The states in \(Q\) are in one-to-one correspondence with the substrings of \(T\). The initial state \(root\) corresponds to the empty string \(\varepsilon\). The final states \(F\) are those states that represent suffixes of \(T\). The transition function defines the edges of the trie: if state \(x\) represents a string \(u\) and \(g(x, a) = y\), then state \(y\) represents the string \(ua\).

The suffix function \(f\) (or suffix link) is crucial for efficient suffix trie construction and is defined for each state \(x \neq root\). If a state \(x\) represents a string \(ay\), where \(a\) is a character and \(y\) is a string, then \(f(x)\) points to the state \(\bar{y}\) that represents the string \(y\). For the root state, \(f(root) = \perp\).

The auxiliary state \(\perp\) is introduced to simplify the algorithm’s logic, particularly in dealing with transitions related to the root and empty string. Transitions from \(\perp\) are defined such that \(g(\perp, a) = root\) for all characters \(a\) in the alphabet \(\Sigma\). This can be conceptually understood as \(\perp\) representing the inverse \(a^{-1}\) of all symbols \(a \in \Sigma\), ensuring consistency with other transitions since \(a^{-1}a = \varepsilon\), and \(root\) corresponds to \(\varepsilon\).

Example: Step-by-step Construction for "cacao"

Let’s illustrate the construction of a suffix trie for \(T = \text{"cacao"}\) by processing prefixes of \(T\) step-by-step.

Initial State: \(T^0 = \varepsilon\)

The suffix trie for the empty string \(T^0 = \varepsilon\), denoted \(STrie(T^0)\), consists of only the root state and the auxiliary state \(\perp\). There are no transitions yet, except for the implicit transitions from \(\perp\) to \(root\) for every character in the alphabet (though not explicitly shown in the figure for clarity).

Processing \(t_1 = \text{'c'}\), \(T^1 = \text{"c"}\)

The suffixes of \(T^1 = \text{"c"}\) are \(\{ \text{"c"}, \varepsilon\}\). Starting from \(STrie(T^0)\), we add transitions for the character ‘c’. From the \(root\), we add a transition on ‘c’ to a new state representing the suffix "c". Similarly, from \(\perp\), we also have an implicit transition on ‘c’ to \(root\).

Processing \(t_2 = \text{'a'}\), \(T^2 = \text{"ca"}\)

The suffixes of \(T^2 = \text{"ca"}\) are \(\{ \text{"ca"}, \text{"a"}, \varepsilon\}\). We extend \(STrie(T^1)\) to \(STrie(T^2)\). For each suffix of \(T^1\), we append ‘a’ and insert it into the trie.

  • For suffix "c" of \(T^1\): We traverse from the \(root\) following ‘c’. From the state representing "c", we add a transition on ‘a’ to a new state, representing "ca".

  • For suffix \(\varepsilon\) of \(T^1\) (represented by \(root\)): From the \(root\), we add a transition on ‘a’ to a new state, representing "a".

We also need to consider the empty suffix \(\varepsilon\), which is already represented by the \(root\).

Processing \(t_3 = \text{'c'}\), \(T^3 = \text{"cac"}\)

Suffixes of \(T^3 = \text{"cac"}\) are \(\{ \text{"cac"}, \text{"ac"}, \text{"c"}, \varepsilon\}\). We extend \(STrie(T^2)\).

  • For suffix "ca" of \(T^2\): From the state representing "ca", add a transition on ‘c’ to a new state for "cac".

  • For suffix "a" of \(T^2\): From the state representing "a", add a transition on ‘c’ to a new state for "ac".

  • For suffix \(\varepsilon\) of \(T^2\): From the \(root\), the transition on ‘c’ already exists (leading to the state for "c" created when processing \(t_1\)), so no new state is needed.

Processing \(t_4 = \text{'a'}\), \(T^4 = \text{"caca"}\)

Suffixes of \(T^4 = \text{"caca"}\) are \(\{ \text{"caca"}, \text{"aca"}, \text{"ca"}, \text{"a"}, \varepsilon\}\). We extend \(STrie(T^3)\).

  • For suffix "cac" of \(T^3\): From the state representing "cac", add a transition on ‘a’ to a new state for "caca".

  • For suffix "ac" of \(T^3\): From the state representing "ac", add a transition on ‘a’ to a new state for "aca".

  • For suffixes "c" and \(\varepsilon\) of \(T^3\): the transitions on ‘a’ already exist (leading to the states for "ca" and "a"), so no new states are created.

Processing \(t_5 = \text{'o'}\), \(T^5 = \text{"cacao"}\)

Suffixes of \(T^5 = \text{"cacao"}\) are \(\{ \text{"cacao"}, \text{"acao"}, \text{"cao"}, \text{"ao"}, \text{"o"}, \varepsilon\}\). Since ‘o’ does not occur in \(T^4\), no existing state has an ‘o’-transition: the construction walks the entire boundary path, adding new states for "cacao", "acao", "cao", "ao", and "o".

Step-by-step construction of \(STrie(\text{"cacao"})\). State transitions are shown by bold arrows, suffix links by thin arrows. Note: For visual clarity, only the suffix links for the last two layers are explicitly depicted. (Adapted from Figure 1 in [@Ukkonen1995])

Figure 1 illustrates the complete step-by-step construction of \(STrie(\text{"cacao"})\). The bold arrows represent the state transitions based on the characters of the suffixes, while the thin arrows indicate the suffix links, crucial for the on-line construction algorithm.

On-line Construction Principle

The suffix trie \(STrie(T)\) can be constructed on-line, meaning we process the input string \(T = t_1t_2...t_n\) from left to right, one character at a time. Let \(T^i = t_1...t_i\) denote the prefix of \(T\) of length \(i\). Our goal is to efficiently build \(STrie(T^i)\) from the previously constructed \(STrie(T^{i-1})\).

The fundamental principle behind on-line suffix trie construction is based on the relationship between the suffixes of \(T^i\) and \(T^{i-1}\). The set of suffixes of \(T^i\), \(\sigma(T^i)\), can be derived from the set of suffixes of \(T^{i-1}\), \(\sigma(T^{i-1})\), by applying two operations:

  1. Extension: For each suffix \(s\) in \(\sigma(T^{i-1})\), we form a new string by concatenating the current character \(t_i\) to it, resulting in \(st_i\).

  2. Initialization: We include the empty suffix \(\varepsilon\), which is always a suffix of any string (including \(T^i\)).

Thus, we can express the set of suffixes of \(T^i\) as: \[\sigma(T^i) = \{ st_i \mid s \in \sigma(T^{i-1}) \} \cup \{ \varepsilon\}\]
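This recurrence can be checked directly on a small example (a minimal sketch; `sigma_step` is our name for the one-step extension):

```python
def sigma_step(prev_suffixes, ti):
    """sigma(T^i) from sigma(T^(i-1)): extend every old suffix by t_i,
    then add back the empty suffix."""
    return {s + ti for s in prev_suffixes} | {""}

suffixes = {""}          # sigma(T^0) = {epsilon}
for ch in "cacao":       # process T = "cacao" character by character
    suffixes = sigma_step(suffixes, ch)
```

After the loop, `suffixes` holds exactly \(\sigma(\text{"cacao"})\).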

To implement this update from \(STrie(T^{i-1})\) to \(STrie(T^i)\), we focus on the final states \(F_{i-1}\) of \(STrie(T^{i-1})\). These final states represent the suffixes of \(T^{i-1}\). For each final state \(r \in F_{i-1}\), we check if there is already a transition from \(r\) on the character \(t_i\).

  • If there is no transition on \(t_i\) from \(r\), we create a new state \(r'\) and add a transition \(g(r, t_i) = r'\). This new state \(r'\) becomes a new leaf in the trie, representing the extended suffix.

  • If there is already a transition on \(t_i\) from \(r\), it means the suffix \(st_i\) is already represented in \(STrie(T^{i-1})\) (as a substring, not necessarily as a suffix ending at a final state). In this case, we do not need to create a new branch for this suffix.

The new set of final states \(F_i\) of \(STrie(T^i)\) consists of all states reachable from the states in \(F_{i-1}\) via a transition on \(t_i\) (whether newly added or pre-existing), together with the root state (representing the empty suffix).

Algorithm for On-line Suffix Trie Construction

Algorithm [alg:build_suffix_trie] provides a detailed procedure for constructing \(STrie(T^i)\) from \(STrie(T^{i-1})\). This algorithm is adapted from Algorithm 1 in Ukkonen’s 1995 paper [@Ukkonen1995].

  1.  \(r \leftarrow top\)
  2.  while \(g(r, t_i)\) is undefined do
  3.      Create a new state \(r'\)
  4.      Create a new transition \(g(r, t_i) = r'\)
  5.      if \(r \neq top\) then
  6.          Create a new suffix link \(f(oldr') = r'\)
  7.      \(oldr' \leftarrow r'\)
  8.      \(r \leftarrow f(r)\)
  9.  Create a new suffix link \(f(oldr') = g(r, t_i)\)
  10. \(top \leftarrow g(top, t_i)\)

Explanation of Algorithm [alg:build_suffix_trie]:

The algorithm starts at the state \(top\), which, in the \(i\)-th iteration, represents the prefix \(T^{i-1} = t_1...t_{i-1}\) (the longest suffix of \(T^{i-1}\), excluding \(\varepsilon\)). The core idea is to traverse the boundary path of \(STrie(T^{i-1})\). The boundary path is implicitly defined by following suffix links starting from \(top\) and moving towards the root (and eventually \(\perp\)). For each state \(r\) encountered on this path, the algorithm checks if a transition on character \(t_i\) is already defined from \(r\).

  • While loop (lines 2-8): This loop iterates through the boundary path as long as no \(t_i\)-transition is found from the current state \(r\).

    • Lines 3-4: If \(g(r, t_i)\) is undefined, it means that the suffix represented by the path from the root to \(r\), when extended by \(t_i\), is not yet in the trie as a suffix. Thus, a new state \(r'\) and a transition \(g(r, t_i) = r'\) are created, adding a new branch to the trie.

    • Lines 5-6: Suffix links are maintained during the construction. If \(r\) is not the initial \(top\) state (meaning we are not processing the longest suffix in the first iteration of the while loop), a suffix link from the state created in the previous iteration (\(oldr'\)) to the newly created state \(r'\) is established: \(f(oldr') = r'\). This ensures the correct suffix link structure is built.

    • Line 7: Update \(oldr'\) to the current newly created state \(r'\) for the next suffix link creation in the subsequent iteration (if any).

    • Line 8: Move to the next state on the boundary path by following the suffix link: \(r \leftarrow f(r)\). This effectively processes the next shorter suffix of \(T^{i-1}\).

  • Line 9: After the while loop terminates (when a state \(r\) on the boundary path already has a transition on \(t_i\)), a final suffix link is created: \(f(oldr') = g(r, t_i)\). This links the last created state to the state reached by the existing \(t_i\)-transition.

  • Line 10: Finally, the \(top\) state is updated to represent the longest suffix of \(T^i\). This is done by following the \(t_i\)-transition from the previous \(top\) state: \(top \leftarrow g(top, t_i)\). This new \(top\) will be the starting point for the next iteration (processing \(t_{i+1}\)).
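Under these conventions, Algorithm [alg:build_suffix_trie] translates almost line for line into Python. This is a sketch, not the paper's code: transitions \(g\) become a per-state dictionary, suffix links \(f\) an attribute, and `contains` is an extra helper for inspecting the result:

```python
class State:
    """A suffix trie state: outgoing transitions g and suffix link f."""
    def __init__(self):
        self.g = {}     # transitions: character -> State
        self.f = None   # suffix link

def build_suffix_trie(text):
    """On-line suffix trie construction; returns (root, top).
    Worst-case O(|text|^2) time and space."""
    bot, root = State(), State()   # bot plays the role of the state ⊥
    root.f = bot
    for ch in set(text):           # g(⊥, a) = root for every character a
        bot.g[ch] = root
    top = root                     # top = state of the whole prefix so far
    for ti in text:
        r, oldr = top, None
        while ti not in r.g:       # walk the boundary path via suffix links
            new = State()
            r.g[ti] = new
            if oldr is not None:   # i.e. r != top: link previous new state
                oldr.f = new
            oldr, r = new, r.f
        if oldr is not None:       # final link f(oldr') = g(r, t_i)
            oldr.f = r.g[ti]
        top = top.g[ti]            # top now represents the longer prefix
    return root, top

def contains(root, s):
    """s is a substring of the text iff it spells a path from the root."""
    node = root
    for ch in s:
        if ch not in node.g:
            return False
        node = node.g[ch]
    return True
```

Building the trie for "cacao" and probing it with `contains` reproduces the step-by-step example above, including the suffix-link chain from the deepest leaf back to the root.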

Complexity Analysis of Suffix Trie Construction

In the worst case, for each character of the input string \(T\), we might traverse a significant portion of the existing suffix trie and add new nodes and edges. In scenarios like \(T = \text{"a}^n\text{b}^n\text{"}\), the number of distinct substrings, and hence the size of the suffix trie, grows quadratically with the length of \(T\). Therefore, the time complexity for constructing a suffix trie using Algorithm [alg:build_suffix_trie] is proportional to the size of the resulting suffix trie.

Theorem 1. Suffix trie \(STrie(T)\) can be constructed on-line using Algorithm [alg:build_suffix_trie] in time proportional to the size of \(STrie(T)\). In the worst case, this construction time is \(O(|T|^2)\).

Description: This theorem states the time complexity of the on-line suffix trie construction algorithm. It highlights that the construction time is related to the size of the suffix trie, which can be quadratic in the worst case with respect to the input string length.
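The quadratic growth is easy to observe empirically. Since \(STrie(T)\) has one state per distinct nonempty substring of \(T\) (besides \(root\) and \(\perp\)), a small brute-force count over strings \(\text{a}^n\text{b}^n\) (a sketch independent of the construction algorithm) exhibits the \(\Theta(n^2)\) behavior:

```python
def num_trie_states(text):
    """Number of states of STrie(text) other than root and the auxiliary
    state: one state per distinct nonempty substring of text."""
    n = len(text)
    return len({text[i:j] for i in range(n) for j in range(i + 1, n + 1)})

# For a^n b^n the distinct substrings are exactly the strings a^i b^j
# with 0 <= i, j <= n and i + j >= 1, i.e. (n + 1)^2 - 1 of them.
sizes = {n: num_trie_states("a" * n + "b" * n) for n in (4, 8, 16)}
```

Doubling \(n\) roughly quadruples the state count, confirming the quadratic bound.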

While suffix tries are conceptually straightforward, their quadratic space complexity in the worst case makes them less suitable for very long strings. This motivates the need for a more space-efficient representation, leading to the concept of suffix trees, which we will discuss in the next section.

Suffix Trees

Motivation for Space Optimization

Suffix tries, while conceptually simple and serving as a foundational stepping stone, suffer from a significant drawback: space inefficiency. In the worst-case scenario, the number of nodes and edges in a suffix trie can be quadratic with respect to the length of the input string \(T\). Consider, for instance, a string of the form \(T = \text{"a}^n\text{b}^n\text{"}\): every string \(a^ib^j\) with \(0 \leq i, j \leq n\) (not both zero) is a distinct substring, and each distinct substring requires its own state, so the trie has \(\Theta(n^2)\) states. For long strings, this quadratic space complexity becomes prohibitive, limiting the practicality of suffix tries in real-world applications dealing with large datasets.

Suffix trees emerge as a space-optimized alternative to suffix tries. The core idea behind suffix trees is to represent suffix tries in a more compact manner, achieving linear space complexity, which is a significant improvement. Suffix trees retain the essential functionalities of suffix tries but drastically reduce the storage requirements, making them suitable for handling large strings and datasets. This space optimization is achieved by introducing the concepts of explicit and implicit states and by employing generalized transitions, which we will detail in the following subsections.

Explicit vs. Implicit States

To achieve space compression, suffix trees distinguish between two types of states: explicit states and implicit states. This distinction is crucial for reducing the number of nodes in the tree while preserving all suffix information.

Definition 3 (Explicit States). Explicit states in \(STree(T)\) are a subset of the states from \(STrie(T)\), denoted as \(Q' \cup \{\perp\}\). The set \(Q'\) includes:

  • Branching states: These are states in \(STrie(T)\) from which there are two or more outgoing transitions, indicating a branching point in the trie structure.

  • Leaves: These are states in \(STrie(T)\) that have no outgoing transitions. Each leaf node corresponds to the end of a unique suffix of \(T\).

  • The root state: The starting state of the suffix tree, representing the empty prefix.

  • The auxiliary state \(\perp\): Used for technical reasons, as in suffix tries, to handle transitions and suffix links consistently.

In essence, explicit states are the critical junctions and endpoints in the suffix trie structure that must be preserved to maintain the representation of all suffixes.

Definition 4 (Implicit States). Implicit states are all other states in \(STrie(T)\) that are not explicit. These are states (excluding the root and \(\perp\)) that have exactly one outgoing transition. Implicit states reside in the middle of "paths" in the suffix trie where there is no branching.

Implicit states are effectively compressed out in the suffix tree representation. Instead of explicitly representing every state along a single-child path, suffix trees represent these paths more compactly using edge labels that can represent strings, as we will see in the next section on generalized transitions.

Generalized Transitions

The space optimization in suffix trees is further achieved through the use of generalized transitions. In suffix tries, transitions are labeled with single characters. In contrast, suffix trees allow transitions between explicit states to be labeled with strings.

Definition 5 (Generalized Transitions). A generalized transition in \(STree(T)\) from an explicit state \(s\) to another explicit state \(r\) is denoted as \(g'(s, w) = r\), where \(w\) is a string. This string \(w\) represents the concatenation of characters along the path in \(STrie(T)\) from state \(s\) to state \(r\), where \(s\) and \(r\) are explicit states, and all intermediate states (if any) on this path in \(STrie(T)\) are implicit.

To efficiently store these string labels without using excessive space, suffix trees employ a pointer-based representation. Instead of storing the string \(w\) directly on the edge, we represent it using a pair of indices \((k, p)\) that refer to a substring \(t_k...t_p\) within the original input string \(T\).

Definition 6 (Pointer Representation of Transitions). A generalized transition \(g'(s, w) = r\) is implemented in \(STree(T)\) using a pointer pair \((k, p)\) such that \(w = t_k...t_p\). Thus, the transition is represented as \(g'(s, (k, p)) = r\). The index \(k\) is the starting position, and \(p\) is the ending position of the substring \(w\) in \(T\).

This pointer representation ensures that each transition in the suffix tree consumes constant space, regardless of the length of the string it represents, as we only store two integer indices.
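A minimal illustration of the pointer representation (the `Edge` class and its `label` method are hypothetical, for exposition only; 1-based indices as in the lecture):

```python
from dataclasses import dataclass

T = "cacao"   # the input string; edges point into it instead of copying it

@dataclass
class Edge:
    """Hypothetical edge record: the label t_k...t_p stored as two indices."""
    k: int    # 1-based start index of the label in T
    p: int    # 1-based end index of the label in T

    def label(self) -> str:
        return T[self.k - 1 : self.p]   # constant space, label on demand
```

Note that a pair with \(p < k\), such as \((p+1, p)\), naturally encodes the empty label.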

Reference Pairs

To refer to any state in the suffix tree, whether explicit or implicit, we use the concept of reference pairs.

Definition 7 (Reference Pair). A reference pair \((s, w)\) for a state \(r\) in the suffix tree consists of two components:

  • An explicit state \(s\) that is an ancestor of \(r\) in \(STree(T)\).

  • A string \(w\) that needs to be traversed from state \(s\) to reach state \(r\). In other words, if we start at state \(s\) and follow the path spelled out by \(w\) in \(STree(T)\), we arrive at state \(r\).

Reference pairs provide a way to represent any position within the conceptual suffix trie structure using only explicit states and string offsets. To ensure uniqueness and efficiency, we use canonical reference pairs.

Definition 8 (Canonical Reference Pair). A reference pair \((s, w)\) is canonical if the explicit state \(s\) is the closest explicit ancestor of the referenced state. This means that there are no explicit states on the path from \(s\) to the state represented by \((s, w)\), except for \(s\) itself and, potentially, the referenced state if it is also explicit.

For an explicit state \(r\), the canonical reference pair is simply \((r, \varepsilon)\), as the closest explicit ancestor of an explicit state \(r\) is \(r\) itself, and no string needs to be traversed from \(r\) to reach \(r\). In pointer notation, this is represented as \((r, (p+1, p))\), where \(p+1 > p\) indicates an empty string.

Canonicalization is a process that, given any reference pair, finds the corresponding canonical reference pair representing the same state. This is crucial for the efficiency of suffix tree algorithms, as it ensures that each state is represented in a unique and minimized form.

Ukkonen’s On-line Construction Algorithm

Algorithm [alg:build_suffix_tree] (Algorithm 2 in [@Ukkonen1995]) details the on-line construction of suffix trees. It refines the suffix trie construction (Algorithm [alg:build_suffix_trie]) by incorporating optimizations that lead to linear space and time complexity. The key improvements stem from the use of implicit suffix tree representation, generalized transitions, and the specific procedures designed to efficiently update the tree as each character of the input string is processed.

Algorithm Overview

Ukkonen’s algorithm constructs the suffix tree \(STree(T)\) iteratively, processing the input string \(T\) character by character. Algorithm [alg:build_suffix_tree] outlines the main steps.

  1. Create states \(root\) and \(\perp\)
  2. for \(j \leftarrow 1, \ldots, |\Sigma|\) do create transition \(g'(\perp, (-j, -j)) = root\)
  3. Create suffix link \(f'(root) = \perp\)
  4. \(s \leftarrow root\); \(k \leftarrow 1\); \(i \leftarrow 0\)
  5. while \(t_{i+1} \neq \#\) do
  6.     \(i \leftarrow i + 1\)
  7.     \((s, k) \leftarrow \texttt{update}(s, (k, i))\)
  8.     \((s, k) \leftarrow \texttt{canonize}(s, (k, i))\)

Initialization (Lines 1-4): The algorithm begins by creating the root state and the auxiliary state \(\perp\). It then initializes transitions from \(\perp\) to \(root\) for each character in the alphabet. These transitions are represented using pointer pairs \((-j, -j)\) as a technical detail to handle alphabet characters uniformly. A suffix link from the root to \(\perp\) is also initialized.

Iteration (Lines 5-8): The algorithm then iterates through the input string \(T\), character by character, until it encounters a special end marker \(\#\). Each iteration first advances \(i\) (Line 6) and then incorporates the character \(t_i\):

  • Update (Line 7): The update procedure is called with the reference pair \((s, (k, i))\), which now covers the new character \(t_i\). It adds to the tree all suffixes of \(T^i\) that are not yet present and returns a potentially updated active point \((s, k)\).

  • Canonize (Line 8): The canonize procedure is then called to ensure that the active point \((s, (k, i))\) is in its canonical form. This step is crucial for efficiency and correctness.

The algorithm maintains an active point represented by a canonical reference pair \((s, (k, i))\). The active point marks the position where the next insertion may be required: it corresponds to the longest suffix of the processed prefix \(T^i\) that also occurs elsewhere in \(T^i\), since longer suffixes already end at leaves and extend automatically via the open transitions introduced below. The update and canonize procedures work together to extend the suffix tree with the next character while keeping the active point in canonical form.

Update Procedure

The update\((s, (k, i))\) procedure (Algorithm [alg:update]) is the heart of Ukkonen’s algorithm. It efficiently incorporates all new suffixes ending at the current position \(i\) into the suffix tree.

  1.  \(oldr \leftarrow root\)
  2.  \((end\_point, r) \leftarrow \texttt{test\_and\_split}(s, (k, i-1), t_i)\)
  3.  while not \(end\_point\) do
  4.      Create a new transition \(g'(r, (i, \infty)) = r'\), where \(r'\) is a new leaf state
  5.      if \(oldr \neq root\) then
  6.          Create a new suffix link \(f'(oldr) = r\)
  7.      \(oldr \leftarrow r\)
  8.      \((s, k) \leftarrow \texttt{canonize}(f'(s), (k, i-1))\)
  9.      \((end\_point, r) \leftarrow \texttt{test\_and\_split}(s, (k, i-1), t_i)\)
  10. if \(oldr \neq root\) then
  11.     Create a new suffix link \(f'(oldr) = s\)
  12. return \((s, k)\)

Procedure Breakdown:

  1. Initialization (Line 1): Initialize a variable \(oldr\) to the root. This variable is used to keep track of the previously created internal node for suffix link creation.

  2. Test-and-Split (Line 2): Call the test_and_split procedure with the current active point \((s, (k, i-1))\) and the new character \(t_i\). test_and_split checks if the active point is already an endpoint for the extension with \(t_i\). It returns:

    • \(end\_point\): A boolean flag indicating whether the active point is an endpoint.

    • \(r\): The explicit state where a new branch should be created if \(end\_point\) is false. This might be a newly created split node or the current state \(s\).

  3. While Loop (Lines 3-9): The loop continues as long as \(end\_point\) is false, meaning new suffixes need to be added.

    • Create Open Transition (Line 4): Create a new leaf node \(r'\) and an open transition \(g'(r, (i, \infty)) = r'\) from state \(r\) to \(r'\) labeled with the open range \((i, \infty)\). Open transitions are a key optimization: the end index \(\infty\) signifies that the end position of the string on this edge is currently the end of the processed string and will implicitly extend as more characters are added.

    • Suffix Link Creation (Lines 5-7): If \(oldr\) is not the root (meaning a split occurred in a previous iteration), create a suffix link from \(oldr\) to the current node \(r\): \(f'(oldr) = r\). This links the internal node created in the previous step to its corresponding suffix node. Update \(oldr\) to \(r\) for the next iteration’s suffix link.

    • Canonize Active Point (Line 8): Update the active point \((s, k)\) by moving to the suffix link of the current active state \(s\) and then canonizing the reference pair: \((s, k) \leftarrow \texttt{canonize}(f'(s), (k, i-1))\). This efficiently moves the active point to represent the next shorter suffix.

    • Test-and-Split Again (Line 9): Call test_and_split again from the new active point \((s, (k, i-1))\) and character \(t_i\) to check if further extensions are needed. Update \(end\_point\) and \(r\).

    • Final Suffix Link (Lines 10-11): After the loop terminates (because test_and_split returns \(end\_point = true\)), if \(oldr\) is not root, create a final suffix link \(f'(oldr) = s\). This handles the suffix link for the last internal node created in the loop.

  4. Return (Line 12): Return the updated active point \((s, k)\).

The update procedure efficiently adds all new suffixes by traversing an implicit boundary path using suffix links and open transitions, ensuring linear time complexity in conjunction with test-and-split and canonize.

Test-and-Split Procedure

The test-and-split\((s, (k, p), t)\) procedure (Algorithm [alg:test_and_split]) is crucial for determining if the current active point is an endpoint for a new extension and for making implicit states explicit when necessary.

  1.  if \(k \leq p\) then
  2.      Let \(g'(s, (k', p')) = s'\) be the \(t_k\)-transition from \(s\)
  3.      if \(t = t_{k'+p-k+1}\) then return \((true, s)\)
  4.      else
  5.          Create a new explicit state \(r\)
  6.          Replace the transition by \(g'(s, (k', k'+p-k)) = r\)
  7.          and \(g'(r, (k'+p-k+1, p')) = s'\)
  8.          return \((false, r)\)
  9.  else
  10.     if there is a \(t\)-transition from \(s\) then return \((true, s)\)
  11.     else
  12.         return \((false, s)\)

Procedure Breakdown:

  1. Case 1: Active point is on an edge (Line 1): If \(k \leq p\) in the reference pair \((s, (k, p))\), the active point is located within an edge originating from state \(s\), representing a string \(t_k...t_p\).

    • Character Match (Line 3): Retrieve the \(t_k\)-transition \(g'(s, (k', p')) = s'\) from \(s\), whose label starts at index \(k'\) in \(T\). Check whether the character \(t\) matches the character immediately following the part of the label consumed by the active point, i.e., \(t_{k'+p-k+1}\). If they match, the extension by \(t\) is already implicitly present, so return \((true, s)\): an endpoint is found and no split is needed.

    • Character Mismatch (Line 5): If the characters do not match, it signifies that the edge needs to be split to make the implicit active point explicit.

      • Split Edge (Line 6-7): Create a new explicit state \(r\). Replace the original transition \(g'(s, (k', p'))\) with two new transitions:

        • \(g'(s, (k', k'+p-k)) = r\): From \(s\) to the new state \(r\), representing the substring up to the active point.

        • \(g'(r, (k'+p-k+1, p')) = s'\): From the new state \(r\) to \(s'\), representing the remainder of the original edge.

      • Return Split (Line 8): Return \((false, r)\). \(false\) indicates that it was not an endpoint (a split occurred), and \(r\) is the new explicit state created at the split point.

  2. Case 2: Active point is at an explicit state (Line 9): If \(k > p\) (or \(k=p+1\)), the active point is at an explicit state \(s\).

    • Transition Exists (Line 10): Check if there is already a transition from \(s\) on character \(t\). If such a transition exists, it means the extension is already present, so return \((true, s)\), indicating an endpoint is found.

    • Transition Does Not Exist (Line 12): If there is no transition from \(s\) on character \(t\), it means a new branch needs to be created from \(s\). Return \((false, s)\). \(false\) indicates no endpoint, and \(s\) is the state from which a new branch will be created in the update procedure.

test-and-split efficiently determines if an extension is already present and handles the crucial edge splitting operation to maintain the explicit state structure of the suffix tree, which is essential for linear space and time complexity.
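The index arithmetic of the split (Line 6-7) is easy to get wrong, so it may help to isolate it (a hypothetical helper; it merely computes the two replacement labels):

```python
def split_indices(kp, pp, k, p):
    """Mirror of the two replacement transitions of test-and-split: the
    edge labeled (kp, pp) (i.e. k', p') is split after the first
    p - k + 1 characters consumed by the active point (s, (k, p))."""
    first = (kp, kp + p - k)            # label of g'(s, ...) = r
    second = (kp + p - k + 1, pp)       # label of g'(r, ...) = s'
    return first, second
```

For example, splitting an edge labeled \(t_1...t_5\) after its first two characters (so \(p - k + 1 = 2\)) yields the labels \((1, 2)\) and \((3, 5)\).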

Canonize Procedure

The canonize\((s, (k, p))\) procedure (Algorithm [alg:canonize]) ensures that a given reference pair \((s, (k, p))\) is always in its canonical form. Canonicalization is essential for efficiently navigating the suffix tree and maintaining the linear time complexity of Ukkonen’s algorithm.

  1.  if \(p < k\) then
  2.      return \((s, k)\)
  3.  Find the \(t_k\)-transition \(g'(s, (k', p')) = s'\) from \(s\)
  4.  while \(p' - k' + 1 \leq p - k + 1\) do
  5.      \(k \leftarrow k + p' - k' + 1\)
  6.      \(s \leftarrow s'\)
  7.      if \(k \leq p\) then
  8.          ▹ part of the string \(t_k \ldots t_p\) remains to be consumed
  9.          Find the \(t_k\)-transition \(g'(s, (k', p')) = s'\) from \(s\)
  10.     else
  11.         return \((s, k)\)
  12. ▹ the remaining edge label is longer than \(t_k \ldots t_p\)
  13. return \((s, k)\)

Procedure Breakdown:

  1. Empty String Check (Line 1): If \(p < k\), it implies that the string \(w\) in the reference pair \((s, w)\) is empty. In this case, the reference pair is already canonical, so return \((s, k)\) as is.

  2. Find Transition (Line 3): Find the transition \(g'(s, (k', p')) = s'\) from state \(s\) that is labeled with a string starting with the character \(t_k\) (the first character of the remaining string \(w = t_k...t_p\)).

  3. While Loop (Lines 4-11): The loop continues as long as the length of the edge label \((p' - k' + 1)\) is less than or equal to the length of the remaining string to be canonized \((p - k + 1)\). This condition checks if we can move further down the tree to a closer explicit ancestor.

    • Advance along Edge (Lines 5-6): If the condition in the ‘while’ loop is true, it means we can move down to the next explicit state \(s'\).

      • Update \(k\): Advance the starting index \(k\) by the length of the edge label: \(k \leftarrow k + p' - k' + 1\). This effectively consumes the portion of the string \(w\) corresponding to the edge label.

      • Update \(s\): Move the explicit state \(s\) to the next explicit state \(s'\): \(s \leftarrow s'\). Now, \(s'\) becomes the new starting explicit state in the reference pair.

    • Check if More Canonization Needed (Lines 7-10): After moving to \(s'\), check whether a non-empty remaining string still needs to be canonized (\(k \leq p\)).

      • Continue Canonization (Line 9): If \(k \leq p\), find the next transition from the new state \(s\) starting with the character \(t_k\) to continue the canonization process in the next iteration of the ‘while’ loop.

      • Canonical Pair Found (Line 11): If \(k > p\), it means the entire string \(w\) has been traversed, and the reference pair \((s, (k, p))\) is now canonical. Return \((s, k)\).

  4. Return Canonical Pair (Line 13): If the ‘while’ loop terminates because the edge label is longer than the remaining string, it means we have found the closest explicit ancestor \(s\), and the remaining string \(t_k...t_p\) is the minimal string needed to reach the state. Return the canonical reference pair \((s, k)\).

The canonize procedure efficiently adjusts the reference pair to its canonical form by traversing down the suffix tree along edges as long as possible, ensuring that the explicit state in the pair is always the closest explicit ancestor. This process is crucial for the overall linear time complexity of Ukkonen’s algorithm.
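Under a hypothetical dict-based edge representation (`trans[(state, first_char)] = (k', p', target)` with closed, 0-based label indices into `text`), the walk described above can be sketched in a few lines of Python; this mirrors the procedure's logic rather than reproducing the article's pseudocode verbatim.

```python
def canonize(trans, text, s, k, p):
    """Rewrite the reference pair (s, (k, p)) in canonical form by
    descending from s while whole edge labels fit inside text[k..p]."""
    if p < k:
        return s, k                     # the string w is empty: already canonical
    kp, pp, sp = trans[(s, text[k])]
    while pp - kp <= p - k:             # edge label fits inside t_k..t_p
        k += pp - kp + 1                # consume the edge label
        s = sp                          # descend to the next explicit state
        if k <= p:                      # more characters remain: fetch next edge
            kp, pp, sp = trans[(s, text[k])]
    return s, k                         # s is now the closest explicit ancestor
```

Note that when \(k\) overtakes \(p\), the loop condition fails on the next check (the edge-label length \(p' - k' \geq 0\) exceeds the now-negative \(p - k\)), so the loop exits with the canonical pair.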

Theorem 2. Ukkonen’s on-line suffix tree construction algorithm (Algorithm [alg:build_suffix_tree]), using the update (Algorithm [alg:update]), test-and-split (Algorithm [alg:test_and_split]), and canonize (Algorithm [alg:canonize]) procedures, constructs the suffix tree \(STree(T)\) for a string \(T\) of length \(n\) in \(O(n)\) time.

Description: This theorem formally states the linear time complexity of Ukkonen’s algorithm for suffix tree construction. It highlights that, by combining the procedures update, test-and-split, and canonize, the algorithm achieves an optimal time complexity of \(O(n)\), where \(n\) is the length of the input string.

The linear time complexity is achieved through amortized analysis. While individual update steps might seem complex, the total work done across all updates is linear in the length of the input string. The key optimizations like suffix links and open edges, combined with the canonical reference pair representation and efficient procedures, ensure that the algorithm processes each character in amortized constant time, leading to an overall linear time construction.
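To tie the pieces together, the following self-contained Python sketch assembles update, test-and-split, and canonize into a complete on-line construction. The representation (a dict mapping `(state, first_char)` to `(k, p, target)` with closed, 0-based indices), the sentinel `-1` standing in for the auxiliary state \(\bot\) ("bottom"), and the large constant used for \(\infty\) on open edges are all implementation choices of this sketch, not prescribed by the paper.

```python
INF = 10**9  # stands in for "infinity" on open (leaf) edges

class SuffixTree:
    """On-line suffix tree construction in the style of Ukkonen's
    update / test-and-split / canonize decomposition. State 0 is the
    root; -1 plays the role of the auxiliary state "bottom"."""

    def __init__(self, text):
        self.text = text
        self.trans = {}       # (state, first_char) -> (k, p, target)
        self.slink = {0: -1}  # suffix links; f(root) = bottom
        self.next_id = 1
        s, k = 0, 0           # active point: (root, empty string)
        for i in range(len(text)):
            s, k = self._update(s, k, i)
            s, k = self._canonize(s, k, i)

    def _edge(self, s, c):
        # bottom has a one-character transition to the root on every char
        return (0, 0, 0) if s == -1 else self.trans[(s, c)]

    def _test_and_split(self, s, k, p, t):
        if k <= p:  # implicit state inside an edge
            kp, pp, sp = self._edge(s, self.text[k])
            if self.text[kp + p - k + 1] == t:
                return True, s
            r = self.next_id; self.next_id += 1  # split: make state explicit
            self.trans[(s, self.text[kp])] = (kp, kp + p - k, r)
            self.trans[(r, self.text[kp + p - k + 1])] = (kp + p - k + 1, pp, sp)
            return False, r
        if s == -1:
            return True, s  # bottom has a transition on every character
        return ((s, t) in self.trans), s

    def _canonize(self, s, k, p):
        if p < k:
            return s, k
        kp, pp, sp = self._edge(s, self.text[k])
        while pp - kp <= p - k:  # whole edge label fits: descend
            k += pp - kp + 1
            s = sp
            if k <= p:
                kp, pp, sp = self._edge(s, self.text[k])
        return s, k

    def _update(self, s, k, i):
        oldr = 0  # sentinel: root
        end_point, r = self._test_and_split(s, k, i - 1, self.text[i])
        while not end_point:  # walk the boundary path via suffix links
            leaf = self.next_id; self.next_id += 1
            self.trans[(r, self.text[i])] = (i, INF, leaf)  # open edge
            if oldr != 0:
                self.slink[oldr] = r
            oldr = r
            s, k = self._canonize(self.slink[s], k, i - 1)
            end_point, r = self._test_and_split(s, k, i - 1, self.text[i])
        if oldr != 0:
            self.slink[oldr] = s
        return s, k

    def contains(self, pattern):
        """Factor query: does `pattern` occur anywhere in the text?"""
        s, i = 0, 0
        while i < len(pattern):
            edge = self.trans.get((s, pattern[i]))
            if edge is None:
                return False
            k, p, s = edge
            j = k
            while j <= min(p, len(self.text) - 1) and i < len(pattern):
                if self.text[j] != pattern[i]:
                    return False
                j += 1
                i += 1
        return True
```

`SuffixTree("cacao").contains("acao")` then answers substring queries by walking edges from the root; the construction performs exactly the per-character update and canonize calls whose total cost the theorem bounds by \(O(n)\).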

Step-by-step construction of \(STree(\text{"cacao"})\). State transitions are shown by bold arrows, suffix links by thin arrows. Note: For visual clarity, only the suffix links for the last two layers are explicitly depicted. (Adapted from Figure 2 in [@Ukkonen1995])

Figure 2 illustrates the step-by-step construction of the suffix tree for the string "cacao" using Ukkonen’s algorithm. It visually demonstrates how the tree evolves as each character is processed, highlighting the compact representation achieved by suffix trees compared to suffix tries.

Conclusion

Ukkonen’s algorithm stands out as a highly efficient and conceptually elegant method for constructing suffix trees in linear time. Its on-line nature, processing the input string character by character, makes it particularly suitable for streaming data and real-time applications. The algorithm’s efficiency is further enhanced by the use of canonical reference pairs, which provide a compact and unique way to represent states, and open transitions, which defer the explicit update of edge endpoints, saving computational effort. These techniques, combined with the clever use of suffix links, allow Ukkonen’s algorithm to achieve linear time complexity, a significant improvement over naive quadratic-time suffix trie construction.

The linear time complexity of Ukkonen’s algorithm is formally proven through amortized analysis of the canonize and update procedures. While individual steps within these procedures may take varying amounts of time depending on the input string and the current state of the tree, the total work summed over the entire construction remains linear in the input string length; equivalently, each character is processed in amortized constant time. This amortized linearity is a hallmark of Ukkonen’s algorithm and is crucial for its practical applicability to large-scale string data.

Moreover, Ukkonen’s algorithm provides a valuable pedagogical framework for understanding more advanced string data structures. The suffix trie augmented with suffix links, which forms the basis of Ukkonen’s method, serves as an intuitive stepping stone toward understanding and constructing suffix automata, also known as Directed Acyclic Word Graphs (DAWGs). Suffix automata are minimal deterministic finite automata (DFAs) that recognize all suffixes of a string, offering an even more space-efficient representation than suffix trees in some applications. Ukkonen’s algorithm therefore not only solves the suffix tree construction problem efficiently but also illuminates the relationships between different string indexing structures, making it a cornerstone concept in string algorithms and a gateway to further explorations in the field.

In conclusion, Ukkonen’s on-line suffix tree construction algorithm is a landmark achievement in stringology. Its linear time complexity, on-line processing capability, and conceptual clarity have made it an algorithm of choice for both theoretical advancements and practical applications requiring efficient and scalable string indexing and manipulation. The algorithm’s impact resonates across diverse fields, underscoring its fundamental importance in computer science and related disciplines.

Exercises

  1. Construct the suffix trie and suffix tree for the string "banana" using the on-line approach described in this lesson. Trace the steps for each character addition and illustrate the resulting structures.

  2. Trace the execution of Algorithm [alg:build_suffix_tree] for the input string "abaaba". For each character processed, show the state of the suffix tree, including explicit and implicit states, transitions, and suffix links. Highlight the active point and endpoint at each step.

  3. Explain in detail the role of suffix links in Ukkonen’s algorithm. How do suffix links contribute to achieving linear time complexity? Provide specific examples to illustrate their function during the update and canonize procedures.

  4. What are the key advantages of using suffix trees over suffix tries? Discuss the space and time complexity trade-offs between these two data structures. In what specific application scenarios would you choose suffix trees over suffix tries, and conversely, when might suffix tries be more appropriate despite their space inefficiency? Consider factors such as string length, alphabet size, and the specific operations to be performed.

  5. Define the following terms in your own words, ensuring clarity and precision. For each term, provide a concise example and explain its significance within the context of Ukkonen’s algorithm and suffix tree construction:

    • Explicit State:

    • Implicit State:

    • Reference Pair:

    • Canonical Reference Pair:

    • Open Transition:

    • Boundary Path:

    Highlight how these concepts contribute to the efficiency and space optimization of suffix trees built using Ukkonen’s algorithm.

References

  1. A. Aho and M. Corasick, Efficient string matching: an aid to bibliographic search, Comm. ACM, 18 (1975), 333-340.

  2. A. Amir and M. Farach, Adaptive dictionary matching, Proc. 32nd IEEE Ann. Symp. on Foundations of Computer Science, 1991, pp. 760-766.

  3. A. Apostolico, The myriad virtues of subword trees, in Combinatorial Algorithms on Words (A. Apostolico and Z. Galil, eds.), Springer-Verlag, New York, 1985, pp. 85-95.

  4. A. Blumer, J. Blumer, D. Haussler, A. Ehrenfeucht, M. T. Chen, and J. Seiferas, The smallest automaton recognizing the subwords of a text, Theoret. Comput. Sci., 40 (1985), 31-55.

  5. M. Crochemore, Transducers and repetitions, Theoret. Comput. Sci., 45 (1986), 63-86.

  6. M. Crochemore, String matching with constraints, in Mathematical Foundations of Computer Science 1988 (M. P. Chytil, L. Janiga, and V. Koubek, eds.), Lecture Notes in Computer Science, vol. 324, Springer-Verlag, Berlin, 1988, pp. 44-58.

  7. Z. Galil and R. Giancarlo, Data structures and algorithms for approximate string matching, J. Complexity, 4 (1988), 33-72.

  8. M. Kempf, R. Bayer, and U. Güntzer, Time optimal left to right construction of position trees, Acta Inform., 24 (1987), 461-474.

  9. E. McCreight, A space-economical suffix tree construction algorithm, J. Assoc. Comput. Mach., 23 (1976), 262-272.

  10. E. Ukkonen, Constructing suffix trees on-line in linear time, in Algorithms, Software, Architecture. Information Processing 92, vol. I (J. van Leeuwen, ed.), Elsevier, Amsterdam, 1992, pp. 484-492.

  11. E. Ukkonen, Approximate string-matching over suffix trees, in Combinatorial Pattern Matching, CPM ’93 (A. Apostolico, M. Crochemore, Z. Galil, and U. Manber, eds.), Lecture Notes in Computer Science, vol. 684, Springer-Verlag, Berlin, 1993, pp. 228-242.

  12. E. Ukkonen and D. Wood, Approximate string matching with suffix automata, Algorithmica, 10 (1993), 353-364.

  13. P. Weiner, Linear pattern matching algorithms, Proc. IEEE 14th Ann. Symp. on Switching and Automata Theory, 1973, pp. 1-11.

  14. E. Ukkonen, On-line construction of suffix trees, Algorithmica, 14 (1995), 249-260.