Advanced Algorithms: Lowest Common Ancestor and String Alignment
Introduction
This lecture introduces two important topics in advanced algorithms: the Lowest Common Ancestor (LCA) problem in trees and the problem of string alignment. Understanding the Lowest Common Ancestor is crucial in various tree-based algorithms and data structures, with applications in phylogenetic analysis and network routing. We will explore efficient techniques for LCA computation, including a constant-time retrieval method after suitable preprocessing. The second part of the lecture shifts focus to string alignment, a fundamental problem in bioinformatics and text processing. String alignment is essential for comparing biological sequences, text documents, and for tasks like spell checking and DNA sequencing. We will define different types of alignments, discuss distance metrics such as Levenshtein and Hamming distance, and lay the groundwork for more advanced alignment algorithms to be discussed in subsequent lectures. By the end of this lecture, you should understand the basic concepts of LCA and string alignment, and appreciate the need for efficient algorithms to solve these problems.
Lowest Common Ancestor (LCA)
Ancestor Relationship and Lemma 8.7.1
We begin by formalizing the ancestor-descendant relationship in rooted trees and introduce a lemma concerning mappings between such trees that preserve this relationship. Let \(T\) and \(B\) be two rooted trees, and let \(I\) be a mapping from the nodes of \(T\) to the nodes of \(B\). We assume that the mapping \(I\) "respects" the ancestor relationship, as described in Lemma 1.
Lemma 1 (Lemma 8.7.1). If \(z\) is an ancestor of \(x\) in tree \(T\), then \(I(z)\) is an ancestor of \(I(x)\) in tree \(B\). More specifically, if \(z\) is an ancestor of \(x\) in \(T\), then either \(I(z) = I(x)\) or \(I(z)\) is a proper ancestor of \(I(x)\) in \(B\).
Proof. Interpretation and Proof Sketch (based on notes). The lemma essentially states that the mapping \(I\) preserves the ancestor-descendant relationship. If \(z\) is an ancestor of \(x\) in \(T\), then its image \(I(z)\) is an ancestor of the image of \(x\), \(I(x)\), in \(B\).
Let \(\mathcal{C}(x)\) denote the set of ancestors of \(x\) in \(T\), and \(\mathcal{B}(I(x))\) denote the set of ancestors of \(I(x)\) in \(B\). The lemma can be formally written as: \[z \in \mathcal{C}(x) \implies I(z) \in \mathcal{B}(I(x))\]
To prove this, we consider the heights of nodes. Let \(h(v)\) denote the height of a node \(v\) in a tree.
Assume \(z\) is an ancestor of \(x\) in \(T\). We want to show that \(I(z)\) is an ancestor of \(I(x)\) in \(B\).
Case 1: \(I(z) = I(x)\). In this case, \(I(z)\) is trivially an ancestor of \(I(x)\).
Case 2: \(I(z) \neq I(x)\). Assume, for contradiction, that \(I(z)\) is not an ancestor of \(I(x)\) in \(B\); then \(I(z)\) does not lie on the path from the root of \(B\) to \(I(x)\). The notes suggest deriving the contradiction from binary representations and heights. Under the Schieber–Vishkin reading sketched below (an assumption, since the notes do not define \(I\)), the argument would run: because \(z\)'s subtree contains \(x\)'s subtree, \(h(I(z)) \geq h(I(x))\), and \(I(x)\) agrees with \(I(z)\) on every bit position above \(h(I(z))\); in the complete binary tree \(B\) this is exactly the condition for \(I(z)\) to be an ancestor of \(I(x)\), contradicting the assumption.
Further Clarification Needed: The proof sketch in the notes is incomplete and lacks crucial details about the mapping \(I\) and the trees \(T\) and \(B\). The references to binary representations and heights suggest a tree encoding that preserves ancestor relationships through bit prefixes, as in the Schieber–Vishkin scheme, but the notes do not say so explicitly, and a fully rigorous proof cannot be reconstructed from them alone. (Detail missing: teacher explanation unclear regarding the intended proof method and the nature of mapping \(I\).) ◻
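To make the setting concrete, the following is a minimal sketch of one mapping \(I\) that satisfies Lemma 1: the Schieber–Vishkin mapping as presented in Gusfield's Chapter 8, which the "Lemma 8.7.1" label suggests these notes follow. This is an assumption, not something the notes state; the helper names `h`, `compute_I`, and `children` are illustrative.

```python
# A minimal sketch of one mapping I consistent with Lemma 1 -- the
# Schieber-Vishkin mapping (an assumption: the notes never define I).
# Nodes carry DFS (preorder) numbers 1..n; I(v) is the number in v's
# subtree with the most trailing zero bits, and B is the complete binary
# tree in which a number's height equals its count of trailing zeros.

def h(k: int) -> int:
    """Height in B: position of the least-significant 1-bit of k > 0."""
    return (k & -k).bit_length() - 1

def compute_I(children: dict, root: int) -> dict:
    """children maps each DFS number to the DFS numbers of its children."""
    I = {}
    def dfs(v: int) -> int:
        best = v                      # v's own number is a candidate
        for c in children.get(v, []):
            cand = dfs(c)             # best number within c's subtree
            if h(cand) > h(best):
                best = cand
        I[v] = best                   # max-trailing-zeros number in subtree
        return best
    dfs(root)
    return I

# Toy check on the path 1 - 2 - 3: I(1) = I(2) = 2 and I(3) = 3, and the
# image of an ancestor is indeed an ancestor-or-equal in B.
assert compute_I({1: [2], 2: [3]}, 1) == {1: 2, 2: 2, 3: 3}
```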
Theorems for LCA Computation
We now discuss two key theorems that pave the way for efficient LCA computation, particularly aiming for constant-time retrieval after preprocessing.
Theorem 1: Efficient Determination of LCA
Theorem 2 (Theorem 1: Efficient Determination of LCA). Given two nodes \(x\) and \(y\) in a tree \(T\), let \(z = \text{lca}(x, y)\) be their lowest common ancestor. If we know the height of the mapped node \(I(z)\) in tree \(B\), denoted as \(h(I(z))\), we can efficiently determine \(z\) in \(T\). Specifically, after some pre-processing, \(z\) can be found in \(O(1)\) time.
Proof. Algorithm Sketch and Explanation (from notes). Theorem 2 suggests that if we can determine the height of \(I(z)\) in \(B\), we can efficiently locate \(z\) in \(T\). Let \(j = h(I(z))\). The algorithm aims to traverse \(T\) to find the node \(w\) such that \(h(I(w)) = j\) and \(w\) is the LCA.
Algorithm Sketch for Theorem 2:
Input: Nodes \(x, y\) in tree \(T\).
Goal: Find \(z = \text{lca}(x, y)\) in \(T\).
Pre-processing: (Requires pre-computation of heights and potentially ancestor information for efficient traversal, details not fully specified in notes). This step is crucial for achieving \(O(1)\) query time. It likely involves techniques to quickly access nodes at a given height and to navigate ancestor-descendant relationships efficiently.
Algorithm Steps:
Compute \(j = h(I(z))\) (Assume we can obtain this height from Theorem 3, discussed next).
Start traversal from the root of \(T\) (or potentially a relevant subtree if the structure allows for optimization).
Traverse down the tree \(T\), comparing \(h(I(w))\) with \(j\) for each node \(w\) encountered (the notes break off here; the intended comparison and stopping rule are not recorded).
Note on Algorithm Clarity: The algorithm description from the notes is highly abstract and lacks precise steps. The concepts of "runs," "leader," and the operation "to the left of \(h(I(w))\)" are not defined, making it difficult to reconstruct a concrete, working algorithm from this description alone. Significant details are missing regarding the pre-processing and the exact traversal and comparison methods. (Detail missing: teacher explanation unclear on the algorithmic details and the meaning of "run", "leader", and bit operations). ◻
Theorem 2: Determining LCA Height in Mapped Tree
Theorem 3 (Theorem 2: Determining LCA Height in Mapped Tree). Given nodes \(x\) and \(y\) in tree \(T\), and some representation of these nodes, denoted as \(A_x\) and \(A_y\), we can determine the height of \(I(z)\) in tree \(B\), where \(z = \text{lca}(x, y)\) in \(T\).
Proof. Determining \(h(I(z))\) and Proof Sketch (from notes). Theorem 3 proposes a method to calculate \(h(I(z))\) from representations \(A_x\) and \(A_y\) of nodes \(x\) and \(y\). The notes define \(h(I(z)) = j\) as: \[j = \min \{k \mid k \geq h(\text{lca}(I(x), I(y))) \land A_x^{(k)} = A_y^{(k)} = 1 \}\] where \(A_x^{(k)}\) and \(A_y^{(k)}\) denote the \(k\)-th bits of \(A_x\) and \(A_y\). The notes do not specify \(A_x\) and \(A_y\) themselves; under the Schieber–Vishkin reading assumed earlier (see the sketch after this proof), bit \(k\) of \(A_x\) is set exactly when some ancestor \(u\) of \(x\) in \(T\), including \(x\) itself, satisfies \(h(I(u)) = k\), i.e., \(A_x\) records the heights in \(B\) of the images of \(x\)'s root path.
Proof Sketch for Theorem 3: Let \(k = h(I(z))\). We need to show that \(k \geq j\) and \(k \leq j\).
Show \(k \geq j\): The notes state: "If \(A_x\) and \(A_y\) have a 1 in position \(k\), then \(k \geq j\)", without recording the reasoning. Under the reading of \(A\) above, the step completes as follows: since \(z\) is a common ancestor of \(x\) and \(y\), position \(k = h(I(z))\) is set in both \(A_x\) and \(A_y\); and by Lemma 1, \(I(z)\) is an ancestor in \(B\) of both \(I(x)\) and \(I(y)\), hence an ancestor of \(b = \text{lca}(I(x), I(y))\), so \(k \geq h(b)\). Thus \(k\) belongs to the set over which \(j\) is minimized, giving \(k \geq j\).
Show \(k \leq j\): Let \(b = \text{lca}(I(x), I(y))\) in tree \(B\) with height \(h(b)\), and let \(j = \min \{k \mid k \geq h(b) \land A_x^{(k)} = A_y^{(k)} = 1 \}\). Under the reading of \(A\) above, the fragmented argument in the notes can be completed as follows. Since bit \(j\) is set in both \(A_x\) and \(A_y\), there are ancestors \(x'\) of \(x\) and \(y'\) of \(y\) in \(T\) with \(h(I(x')) = h(I(y')) = j\). By Lemma 1, \(I(x')\) and \(I(y')\) are ancestors of \(I(x)\) and \(I(y)\) in \(B\); both sit at the same height \(j \geq h(b)\), and the root paths of \(I(x)\) and \(I(y)\) coincide above \(b\), so \(I(x') = I(y')\). (The notes' statement "Since \(I(x') = I(y')\), we have \(I(x') = I(y')\)" is tautological as transcribed; this equality is presumably what was meant.) Then \(x'\) and \(y'\) lie on the same run of \(T\); taking \(z'\) to be the one nearer the root, \(z'\) is an ancestor of both \(x\) and \(y\), hence of \(z = \text{lca}(x, y)\). By Lemma 1 again, \(h(I(z')) \geq h(I(z))\), that is, \(j \geq k\).
N.B.: Theorems 2 and 3, combined with appropriate pre-processing and precise definitions of the representations \(A_x, A_y\) and the mapping \(I\), are intended to enable constant-time LCA queries after linear-time preprocessing. Based on the provided notes alone, however, the details are too sparse to reconstruct a complete and rigorous method. ◻
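The following sketch (same assumptions and helper `h` as the earlier sketch) shows the bitmask \(A_v\) this reading presumes: bit \(i\) of \(A_v\) is 1 exactly when some ancestor \(u\) of \(v\), including \(v\) itself, has \(h(I(u)) = i\). This is the property both directions of the proof lean on.

```python
# Sketch of the bitmasks A_v presumed by this reading (assumption): a
# common ancestor z of x and y sets bit h(I(z)) in both A_x and A_y.

def compute_A(children: dict, I: dict, root: int) -> dict:
    A = {}
    def dfs(v: int, mask: int) -> None:
        mask |= 1 << h(I[v])          # record height of I(v) on this path
        A[v] = mask                   # A_v inherits all ancestor bits
        for c in children.get(v, []):
            dfs(c, mask)
    dfs(root, 0)
    return A
```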
Constant-time LCA Retrieval Algorithm
The notes outline an algorithm for constant-time LCA retrieval, presumably leveraging Theorems 2 and 3. However, given the lack of clarity in the theorems and their proofs, the algorithm description is also challenging to interpret directly.
Input: Nodes \(x, y\) in tree \(T\). Output: \(z = \text{lca}(x, y)\) in tree \(T\).
1. Find the lowest common ancestor \(b = \text{lca}(I(x), I(y))\) in tree \(B\).
2. Find the smallest position \(j \geq h(b)\) such that both numbers \(A_x\) and \(A_y\) have 1-bits in position \(j\); this \(j\) equals \(h(I(z))\).
3. Find node \(\bar{x}\), the closest node to \(x\) on the same run as \(z\) (even though \(z\) is not yet known), as follows:
3a. Find the position \(l\) of the right-most 1-bit in \(A_x\).
3b. If \(l = j\), then \(x\) and \(z\) are on the same run in \(T\); set \(\bar{x} = x\) and go to step 4.
3c. Else (\(l < j\)): Find the position \(k\) of the left-most 1-bit in \(A_x\) that is to the right of position \(j\). Form the number consisting of the bits of \(I(x)\) to the left of position \(k\), followed by a 1-bit in position \(k\), followed by all zeros; this number is \(I(w)\). Look up the node \(L(I(w))\), which must be \(w\) itself (presumably because \(w\) heads its run, so the lookup table \(L\) returns it directly). Set \(\bar{x}\) to be the parent of node \(w\) in \(T\).
4. Find node \(\bar{y}\), the closest node to \(y\) on the same run as \(z\), by the same approach as in step 3.
5. If \(\bar{x} < \bar{y}\) (under a fixed ordering on nodes, e.g., preorder numbers, where an ancestor precedes its descendants), set \(z = \bar{x}\); else set \(z = \bar{y}\). Return \(z\).
Note: Tree \(B\) is used conceptually in step 1 to find \(\text{lca}(I(x), I(y))\), but it is never explicitly constructed or traversed; the steps operate purely on the numbers \(I(x)\), \(I(y)\), \(A_x\), \(A_y\). The algorithm relies on the mapping \(I\) and a reverse lookup table \(L\), whose constructions the notes do not provide, and the comparison \(\bar{x} < \bar{y}\) in step 5 presupposes an ordering on the nodes of \(T\).
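As a concrete illustration, here is a sketch of steps 1 and 2 under the conventions assumed in the earlier sketches (heights in \(B\) are trailing-zero counts, so \(B\) is never built). The function names `height_of_b` and `find_j` are illustrative, not from the notes.

```python
# Step 1: h(b) falls out of the most significant bit where I(x) and I(y)
# differ, capped below by the heights of I(x) and I(y) themselves.

def height_of_b(ix: int, iy: int) -> int:
    """h(lca(I(x), I(y))) in the complete binary tree B."""
    if ix == iy:
        return h(ix)
    diff = (ix ^ iy).bit_length() - 1   # most significant differing bit
    return max(diff, h(ix), h(iy))

def find_j(ax: int, ay: int, hb: int) -> int:
    """Step 2: smallest position j >= h(b) set in both A_x and A_y."""
    common = (ax & ay) >> hb << hb      # clear positions below h(b)
    # common is nonzero here: bit h(I(root)) is set in every A_v and
    # h(I(root)) >= h(b), since I(root) is an ancestor of b in B.
    return h(common)                    # least-significant surviving bit
```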
Efficient Computation Techniques
To implement efficient algorithms, especially for operations within the LCA retrieval process and potentially for string alignment, we often rely on techniques for constant-time operations and lookup tables.
Constant-Time Operations on Bit Strings
We assume that for bit strings of length \(O(\log n)\), where \(n\) is the size of the input (e.g., number of nodes in a tree, length of strings), we can perform the following operations in \(O(1)\) time:
Read, write, and address access to individual bits.
Arithmetic operations: Addition, subtraction, multiplication, division.
Bitwise shift operations:
Shift left by \(i\) positions: equivalent to multiplication by \(2^i\) (the notes add the proviso \(i \in O(\log n)\)).
Shift right by \(i\) positions: equivalent to division by \(2^i\).
Bitwise logical operations: AND, OR, XOR, NOT.
These assumptions are reasonable for typical word-RAM models of computation when dealing with numbers that fit within a machine word (typically \(O(\log n)\) bits if \(n\) is the input size).
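A short illustration of these operations (Python integers are arbitrary-precision; on a word-RAM each line below is a single instruction on an \(O(\log n)\)-bit word):

```python
# Word-RAM style O(1) operations on small integers.
x = 0b0110
assert x << 3 == x * 2**3       # shift left by i  = multiply by 2^i
assert x >> 1 == x // 2**1      # shift right by i = divide by 2^i
assert x & 0b0011 == 0b0010     # bitwise AND
assert x | 0b1000 == 0b1110     # bitwise OR
assert x ^ 0b1111 == 0b1001     # bitwise XOR
```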
Lookup Tables for Complex Operations
For more complex bit manipulations or operations that are not directly supported in constant time, we can use pre-computed lookup tables. This trades space for time.
Example: Leftmost 1 Bit Position
Finding the position of the leftmost (most significant) 1-bit in a binary string is a common operation.
Example 4. Suppose we are working with 4-bit numbers; we can then build a table giving the position of the leftmost 1 for every input. Positions are indexed from right to left, starting at 0.
| Input (Binary) | Leftmost 1 Position |
|---|---|
| 0000 | -1 (sentinel: no 1-bits; some conventions use 0) |
| 0001 | 0 |
| 0010 | 1 |
| 0011 | 1 |
| 0100 | 2 |
| 0101 | 2 |
| 0110 | 2 |
| 0111 | 2 |
| 1000 | 3 |
| 1001 | 3 |
| 1010 | 3 |
| 1011 | 3 |
| 1100 | 3 |
| 1101 | 3 |
| 1110 | 3 |
| 1111 | 3 |
For input ‘0110’ (decimal 6), the leftmost 1 is at position 2 (counting from right starting at 0).
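A sketch of how such a table is built once and then queried by a single array access; `leftmost_one` and the width `WIDTH = 4` are illustrative choices matching the table above:

```python
# Build the table once; each query is then a single array lookup.
WIDTH = 4                                  # bits per input, as above
leftmost_one = [-1] * (1 << WIDTH)         # -1 encodes "no 1-bits"
for v in range(1, 1 << WIDTH):
    leftmost_one[v] = v.bit_length() - 1   # position of the leftmost 1

assert leftmost_one[0b0110] == 2           # the '0110' example above
assert leftmost_one[0b0000] == -1
```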
Tables for Binary Operations
For binary operations like XOR, AND, OR applied to \(\log n\)-bit numbers, a naive table approach might seem too large.
Naive Approach
A direct lookup table for a binary operation on two \(\log n\)-bit numbers has an input space of \(2 \log n\) bits, i.e., \(2^{\log n} \times 2^{\log n} = n^2\) entries. If each entry stores a \(\log n\)-bit result, the total space is \(O(n^2 \log n)\) bits. This quadratic space is generally prohibitive for larger \(n\) (the notes reject the approach with a flat "NO").
Divide and Conquer Strategy for Table Size Reduction
To reduce the table size, we can use a divide and conquer approach. Assume we want to perform a binary operation (e.g., XOR) on two \(\log n\)-bit numbers \(x\) and \(y\). Divide each \(\log n\)-bit number into two halves of \(k = \frac{1}{2}\log n = \log \sqrt{n}\) bits each (assuming \(\log n\) is even for simplicity; adjust by one bit if odd). Let \(x = x_{high} \cdot 2^k + x_{low}\) and \(y = y_{high} \cdot 2^k + y_{low}\), where \(x_{high}, x_{low}, y_{high}, y_{low}\) are \(k\)-bit numbers.
To compute \(x \text{ XOR } y\), we can observe that bitwise operations are independent for each bit position. Therefore, we can compute the XOR of the corresponding halves separately: \((x \text{ XOR } y) = (x_{high} \text{ XOR } y_{high}) \cdot 2^k + (x_{low} \text{ XOR } y_{low})\).
We can pre-compute lookup tables for the binary operation (e.g., XOR, AND, OR) on \(k\)-bit numbers.
This divide and conquer approach reduces the space from \(O(n^2)\) entries for a direct table to \(O(n)\): a single table on \(k\)-bit halves has \(2^k \times 2^k = \sqrt{n} \cdot \sqrt{n} = n\) entries, and each query combines two lookups in \(O(1)\) time.
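A sketch of the scheme, using XOR as a stand-in for an arbitrary bitwise operation; `build_table` and `apply_op` are illustrative names, and the toy parameters treat 8-bit values as the \(\log n\)-bit words:

```python
# Divide and conquer tabulation: one table for k-bit halves, combined
# per query with two lookups, a shift, and an OR.
import operator

def build_table(k: int, op) -> list:
    size = 1 << k                      # 2^k = sqrt(n) possible halves
    return [[op(a, b) for b in range(size)] for a in range(size)]

def apply_op(x: int, y: int, k: int, table: list) -> int:
    mask = (1 << k) - 1
    lo = table[x & mask][y & mask]                  # low halves
    hi = table[(x >> k) & mask][(y >> k) & mask]    # high halves
    return (hi << k) | lo                           # recombine

k = 4                                  # toy setting: 8-bit "words"
xor_table = build_table(k, operator.xor)
assert apply_op(0xA5, 0x5A, k, xor_table) == 0xA5 ^ 0x5A
```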
String Alignment Fundamentals
We now transition to the problem of string alignment, a crucial concept in bioinformatics, text processing, and other areas where sequence comparison is important.
Introduction to String Alignment and its Applications
String alignment is concerned with arranging two or more sequences (strings) to identify regions of similarity, which may be a consequence of functional, structural, or evolutionary relationships between the sequences. In bioinformatics, sequences can represent DNA, RNA, or protein sequences. In text processing, they can be words, sentences, or documents.
Basic Questions in String Alignment:
What does it mean to align two or more sequences (e.g., of DNA)? Aligning sequences involves arranging them in a way that highlights their similarities and differences, often by inserting gaps in the sequences.
Why is the alignment problem suitable for illustrating algorithms? String alignment problems are excellent examples for demonstrating various algorithmic techniques, including dynamic programming, greedy algorithms, and heuristic approaches. They also illustrate concepts of optimization, distance metrics, and computational complexity.
What are our objectives in string alignment? The primary objective is usually to find an alignment that optimizes a certain criterion, such as maximizing similarity or minimizing the "distance" (cost of transformations) between the sequences.
What data do we use for experiments in string alignment? For bioinformatics, typical data includes DNA, RNA, and protein sequences from databases like GenBank, EMBL, and UniProt. For text alignment, one might use text documents, code snippets, or sentences.
Alignment Representation using Edit Strings
Aligning two strings means converting the first string into the second string using a sequence of edit operations.
Consider two strings \(\sigma_1\) and \(\sigma_2\):
σ₁ ≡ a c g t c a t c a
σ₂ ≡ t a a g t g t c a
To transform \(\sigma_1\) into \(\sigma_2\), we can use a series of basic edit operations.
Alignment as Edit String
An alignment of \(\sigma_1\) and \(\sigma_2\) can be represented as an edit-string over the alphabet \(\{m, i, d, s\}\), where each symbol represents an edit operation:
\(m\) (match): Characters at the current positions in \(\sigma_1\) and \(\sigma_2\) are the same. No operation needed.
\(i\) (insert): Insert a character into \(\sigma_1\) to match the character in \(\sigma_2\) at the current position. Effectively, we are inserting a gap in \(\sigma_1\) opposite to a character in \(\sigma_2\).
\(d\) (delete): Delete a character from \(\sigma_1\). Effectively, we are inserting a gap in \(\sigma_2\) opposite to a character in \(\sigma_1\).
\(s\) (substitute): Substitute a character in \(\sigma_1\) with a character to match \(\sigma_2\) at the current position.
Example Alignment and Edit String: For \(\sigma_1 = \text{acgtcatca}\) and \(\sigma_2 = \text{taagtgtca}\), a possible alignment and corresponding edit-string is:
σ₁ ≡ a c g t c a t c a
i m s m m s d m m m
σ₂ ≡ t a a g t g t c a
The edit-string is "imsmmsdmmm". Read left to right, this sequence of operations transforms \(\sigma_1\) into \(\sigma_2\); swapping the roles of \(i\) and \(d\) transforms \(\sigma_2\) into \(\sigma_1\).
Cost of Operations: Each edit operation can be assigned a cost. The cost reflects the "penalty" for using that operation. Common cost assignments are:
match (\(m\)): cost 0 (no penalty, or sometimes a reward/negative cost is used to maximize score instead of minimizing cost).
insert (\(i\)), delete (\(d\)), substitute (\(s\)): cost 1 (a penalty of 1 for each operation). Other costs are also possible.
The total cost of an alignment is the sum of the costs of all operations in the edit-string. The goal is often to find an alignment with the minimum total cost.
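The following sketch makes the bookkeeping explicit: it replays an edit-string against \(\sigma_1\) and \(\sigma_2\), reconstructing \(\sigma_2\) and summing the unit costs above (the function name `apply_edit_string` is illustrative):

```python
# Replay an edit-string over {m, i, d, s} with the unit costs from the
# text: match 0; insert, delete, substitute 1 each.
COST = {'m': 0, 'i': 1, 'd': 1, 's': 1}

def apply_edit_string(s1: str, s2: str, edits: str):
    out, p1, p2, cost = [], 0, 0, 0
    for op in edits:
        cost += COST[op]
        if op == 'm':                 # match: characters must agree
            assert s1[p1] == s2[p2]
            out.append(s1[p1])
            p1 += 1
            p2 += 1
        elif op == 's':               # substitute: take s2's character
            out.append(s2[p2])
            p1 += 1
            p2 += 1
        elif op == 'i':               # insert: gap in s1, char from s2
            out.append(s2[p2])
            p2 += 1
        else:                         # 'd': delete, gap in s2
            p1 += 1
    return ''.join(out), cost

result, cost = apply_edit_string("acgtcatca", "taagtgtca", "imsmmsdmmm")
assert result == "taagtgtca" and cost == 4   # 2 subs + 1 ins + 1 del
```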
Key Problems in String Alignment
Given two strings, the main problems in string alignment are:
1. Find an alignment between them.
2. Establish whether the alignment found is optimal.
Global vs. Local Alignment Approaches
Global Alignment: Align the entire sequences end to end.
Local Alignment: Find the best-matching substrings (contiguous regions) of the two sequences.
Distance Metrics for String Alignment
The cost of an alignment can be viewed as a distance metric.
Levenshtein Distance: Edit Distance
Definition 5 (Levenshtein Distance). The Levenshtein Distance is the minimum number of edits (insertions, deletions, substitutions) to transform one string into another, with each edit costing 1. Matches cost 0.
For example, \(d_L(\text{kitten}, \text{sitting}) = 3\): substitute k \(\to\) s, substitute e \(\to\) i, and insert g.
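Since the definition is operational, a short memoized sketch of the underlying recurrence may help; the bottom-up dynamic program it corresponds to is deferred to the exercises:

```python
# The recurrence behind d_L, memoized top-down.
from functools import lru_cache

def levenshtein(a: str, b: str) -> int:
    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        if i == 0:
            return j                             # insert the rest of b
        if j == 0:
            return i                             # delete the rest of a
        sub = 0 if a[i - 1] == b[j - 1] else 1   # match is free
        return min(d(i - 1, j) + 1,              # delete a[i-1]
                   d(i, j - 1) + 1,              # insert b[j-1]
                   d(i - 1, j - 1) + sub)        # match / substitute
    return d(len(a), len(b))

assert levenshtein("kitten", "sitting") == 3
```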
Hamming Distance: Substitution Distance
Definition 6 (Hamming Distance). The Hamming Distance is the number of positions with differing characters for strings of equal length. Only substitutions are counted; insertions and deletions are not allowed.
For example, \(d_H(\text{ACTG}, \text{AGTC}) = 2\). Hamming distance is undefined for strings of unequal length.
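A direct sketch, with the equal-length requirement enforced:

```python
# Hamming distance: count differing positions of equal-length strings.
def hamming(a: str, b: str) -> int:
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))

assert hamming("ACTG", "AGTC") == 2   # positions 2 and 4 differ
```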
Exercise: For equal length strings \(\sigma_1, \sigma_2\), prove that \(d_L(\sigma_1, \sigma_2) \leq d_H(\sigma_1, \sigma_2)\).
Conclusion
This lecture introduced Lowest Common Ancestor (LCA) in trees and String Alignment. We discussed theorems for efficient LCA computation and constant-time retrieval algorithms (though details require further clarification). For string alignment, we defined edit strings, Levenshtein and Hamming distances, and global vs. local alignment. These are foundational concepts for advanced algorithm design and bioinformatics.
Exercises
Provide a more detailed proof for Lemma 1.
Elaborate on pre-processing for constant-time LCA retrieval.
Implement the constant-time LCA retrieval algorithm described in this lecture.
Calculate Levenshtein and Hamming distances for:
\(\sigma_1 = \text{kitten}\), \(\sigma_2 = \text{sitting}\)
\(\sigma_1 = \text{ACTG}\), \(\sigma_2 = \text{AGTC}\)
For which pairs is Hamming distance defined?
Prove that for equal length strings \(\sigma_1, \sigma_2\), \(d_L(\sigma_1, \sigma_2) \leq d_H(\sigma_1, \sigma_2)\).
Design a dynamic programming algorithm to compute Levenshtein distance.
Remark 7. Hamming distance is only defined for strings of equal length. For strings of unequal length, we cannot directly compare character positions in a one-to-one manner to count differences.