Sequence Alignment: Cost, Scoring, and Dynamic Programming

Author

Your Name

Published

February 8, 2025

“’latex

Introduction

Sequence alignment is a cornerstone technique in bioinformatics and computational biology. It involves arranging sequences of DNA, RNA, or proteins to pinpoint regions of similarity, which often reflect functional, structural, or evolutionary relationships. This lecture introduces the fundamental concepts of cost and scoring in sequence alignment and explores dynamic programming as a powerful method for global alignment. We will cover:

  • Terminology related to sequence alignment.

  • The concepts of cost and score for evaluating alignments.

  • Scoring matrices like BLOSUM.

  • Gap penalties, including affine gap penalties.

  • Distance metrics such as Levenshtein and Hamming distance.

  • Dynamic programming for global sequence alignment.

By the end of this lecture, you will understand how to quantify the quality of a sequence alignment and the basic principles behind dynamic programming approaches to find optimal global alignments.

Terminology and Significance: Cost and Score in Alignments

When aligning two biological sequences, we aim to find the optimal series of operations to transform one sequence into the other. These operations, which form an edit-string, typically include:

  • Match: Corresponding characters in the alignment are identical.

  • Substitution (or Mismatch): Corresponding characters are different.

  • Insertion: Introducing a in the first sequence to correspond with a character in the second.

  • Deletion: Introducing a in the second sequence to correspond with a character in the first (deletion in one sequence is equivalent to insertion in the other).

To assess the quality of an alignment, we need a method to quantify its "cost" or "score."

Ordering Alignments: Cost and Score

We can assign a cost to each edit operation within an edit-string , which represents the sequence of operations transforming one string into another. When using costs, the goal is to minimize the total cost of the alignment. As indicated in the lecture slides, we aim to minimize the cost (vogliamo \(\downarrow\)).

Alternatively, we can assign a score to each operation. In this approach, the objective is to maximize the total score of the alignment. As indicated in the lecture slides, we aim to maximize the score (vogliamo \(\uparrow\)).

The choice between using cost or score depends on the application and field conventions. Minimizing cost and maximizing score are often inversely related. For instance, matches might receive a low cost (or high score), while mismatches, insertions, and deletions receive a high cost (or low score).

Assigning a Score

Let’s define a scoring system for alignment operations. A common strategy is to reward matches and penalize mismatches and gaps, reflecting biological intuition that matches are more favorable than mismatches or gaps in evolutionary contexts.

  • Match (Good): Assign a positive, high score. Example: +1.

  • Substitution/Mismatch (Bad): Assign a zero or negative, low score. Examples: 0 or -1.

  • Insertion/Deletion/Gap (Bad): Assign a zero or negative, low score. Examples: 0 or -1.

As noted in the lecture slides, matches are considered "good" and assigned a high score, while substitutions and indels are "bad" and assigned low scores.

Example of a Scoring Matrix for Nucleotides:

For DNA sequences, we can use a scoring matrix to represent similarities between nucleotide pairs. Let’s consider a scenario where purines (Adenine (a), Guanine (g)) are considered similar, and pyrimidines (Cytosine (c), Thymine (t)) are also considered similar. A possible scoring matrix is:

\[\begin{array}{c|cccc} & a & c & g & t \\\hlinea & 2 & -1 & 1 & -1 \\c & -1 & 2 & -1 & 1 \\g & 1 & -1 & 2 & -1 \\t & -1 & 1 & -1 & 2 \\\end{array}\] In this matrix, matches between similar types (purine-purine or pyrimidine-pyrimidine, e.g., ‘a’ with ‘a’ or ‘c’ with ‘c’) receive a positive score (e.g., 2), while mismatches or matches between dissimilar types receive lower or negative scores (e.g., -1 or 1). This is a simple example; more sophisticated scoring matrices exist for biological sequences. As indicated in the lecture slides, this is a "non-banale" (non-trivial) scoring matrix example.

Matrix: An Example of a Cost Matrix

(Blocks of Amino Acid Substitution Matrix) matrices are widely used in protein sequence alignment. These substitution matrices are derived from observed alignments of protein families. matrices assign scores for substitutions of amino acids, reflecting the likelihood of these substitutions occurring in evolutionarily related proteins. Higher numbers (e.g., 62) are designed for more divergent sequences, while lower numbers (e.g., 80) are for more similar sequences. matrices are often presented as cost matrices, where lower values indicate more favorable substitutions (or higher scores, depending on the context).

The following is a fragment of a cost matrix (specifically, a portion of 62). Note that values in matrices are log-odds scores, which can be interpreted as costs or scores based on context and algorithm implementation. Here, we consider it a cost matrix where lower values are better:

\[\begin{array}{c|ccccccccccccccccc} & \text{Ala} & \text{Arg} & \text{Asn} & \text{Asp} & \text{Cys} & \text{Gln} & \text{Glu} & \text{Gly} & \text{His} & \text{Ile} & \text{Leu} & \text{Lys} & \text{Met} & \text{Phe} & \text{Pro} & \text{Ser} & \text{Thr} \\\hline\text{Ala} & 4 & -1 & -2 & -2 & 0 & -1 & -1 & 0 & -2 & -1 & -1 & -1 & -1 & -2 & -1 & 1 & 0 \\\text{Arg} & -1 & 5 & 0 & -2 & -3 & 1 & 0 & -2 & 0 & -3 & -2 & 2 & -1 & -3 & -2 & -1 & -1 \\\text{Asn} & -2 & 0 & 6 & 1 & -3 & 0 & 0 & 0 & 1 & -3 & -3 & 0 & -2 & -3 & -2 & 0 & 0 \\\text{Asp} & -2 & -2 & 1 & 6 & -3 & 0 & 2 & -1 & -1 & -3 & -4 & -1 & -3 & -3 & -1 & -1 & -2 \\\text{Cys} & 0 & -3 & -3 & -3 & 9 & -3 & -4 & -3 & -3 & -1 & -1 & -3 & -1 & -2 & -3 & -1 & -3 \\\text{Gln} & -1 & 1 & 0 & 0 & -3 & 5 & 2 & -2 & 0 & -3 & -2 & 1 & 0 & -2 & -1 & 0 & -1 \\\text{Glu} & -1 & 0 & 0 & 2 & -4 & 2 & 5 & -2 & 0 & -4 & -3 & 1 & -2 & -3 & -1 & 0 & -1 \\\text{Gly} & 0 & -2 & 0 & -1 & -3 & -2 & -2 & 6 & -2 & -4 & -3 & -2 & -3 & -3 & -2 & 0 & -2 \\\text{His} & -2 & 0 & 1 & -1 & -3 & 0 & 0 & -2 & 8 & -3 & -3 & -1 & -2 & -1 & -2 & -1 & -2 \\\text{Ile} & -1 & -3 & -3 & -3 & -1 & -3 & -4 & -4 & -3 & 4 & 2 & -3 & 2 & 0 & -3 & -2 & 1 \\\text{Leu} & -1 & -2 & -3 & -4 & -1 & -2 & -3 & -3 & -3 & 2 & 4 & -2 & 2 & 0 & -3 & -2 & -1 \\\text{Lys} & -1 & 2 & 0 & -1 & -3 & 1 & 1 & -2 & -1 & -3 & -2 & 5 & -1 & -3 & -1 & -1 & -1 \\\text{Met} & -1 & -1 & -2 & -3 & -1 & 0 & -2 & -3 & -2 & 2 & 2 & -1 & 5 & 0 & -2 & -1 & -1 \\\text{Phe} & -2 & -3 & -3 & -3 & -2 & -2 & -3 & -3 & -1 & 0 & 0 & -3 & 0 & 6 & -4 & -2 & -2 \\\text{Pro} & -1 & -2 & -2 & -1 & -3 & -1 & -1 & -2 & -2 & -3 & -3 & -1 & -2 & -4 & 7 & -1 & -2 \\\text{Ser} & 1 & -1 & 1 & -1 & -1 & 0 & 0 & 0 & -1 & -2 & -2 & -1 & -1 & -2 & -1 & 4 & 1 \\\text{Thr} & 0 & -1 & 0 & -2 & -1 & -1 & -1 & -2 & -2 & 1 & -1 & -1 & -1 & -2 & -2 & 1 & 5 \\\end{array}\] This is just a fragment of the full matrix, which includes scores for all 20 amino acids.

Gap Penalties: Gap Penalties

Gaps in sequence alignments often represent insertions or deletions that occurred during evolution. A simple constant penalty for each gap might not be biologically accurate. gap penalties are often more appropriate as they differentiate between the cost of opening a gap and the cost of extending it. As mentioned in the lecture slides, affine gap penalties are used to penalize l’apertura di un gap (gap opening) and estensione (gap extension).

gap penalties distinguish between:

  • Gap opening penalty: The cost incurred when starting a new gap.

  • Gap extension penalty: The cost for each position by which an existing gap is extended.

Typically, the gap opening penalty is significantly higher than the gap extension penalty. This reflects the biological reality that initiating a gap is a more substantial event than extending an existing gap.

Example:

Consider a gap opening penalty of -5 and a gap extension penalty of -1. The total penalty for a gap of length 3 is calculated as:

\[\text{Gap Penalty} = \text{Gap Opening Penalty} + (\text{Gap Length} - 1) \times \text{Gap Extension Penalty}\] \[\text{Gap Penalty} = -5 + (3 - 1) \times (-1) = -5 - 2 = -7\]

Choosing the Right Distance Metric

Selecting an appropriate distance metric or scoring scheme is crucial and depends on the characteristics of the sequences being aligned. As prompted in the lecture slides: Che distanza devo usare? (Which distance should I use?)

Rule of Thumb

  • Longer Strings \(\Leftrightarrow\) Distance: distance (edit distance) is generally more suitable for longer sequences where insertions and deletions are more frequent. It accounts for insertions, deletions, and substitutions, making it robust for sequences that may have undergone significant evolutionary changes. As noted in the lecture slides: Stringhe lunghe \(\Leftrightarrow\) Levensthein (Long strings \(\Leftrightarrow\) Levenshtein).

  • Shorter Strings \(\Leftrightarrow\) Distance: distance is more appropriate for comparing strings of equal length where substitutions are the primary type of difference. It only counts positions with differing symbols, ignoring insertions and deletions. It is computationally simpler and effective for closely related short sequences. As noted in the lecture slides: Stringhe corte \(\Leftrightarrow\) Hamming (Short strings \(\Leftrightarrow\) Hamming).

Definition 1 (Levenshtein Distance). The distance between two strings is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other. It is also known as edit distance.

Definition 2 (Hamming Distance). The distance between two strings of equal length is the number of positions at which the corresponding symbols are different. It only considers substitutions and is applicable only to strings of the same length.

For global sequence alignment, especially when considering gaps and various scoring schemes, dynamic programming offers a powerful and widely used algorithmic approach.

Dynamic Programming for Global Alignment

Idea!

The fundamental idea behind dynamic programming is to solve a complex problem by breaking it down into smaller, overlapping subproblems. For sequence alignment, we can construct an optimal alignment step by step, character by character. As highlighted in the lecture slides: Idea! Se dobbiamo trovare una stringa (i.e. \(\eta\)) determiniamone un carattere alla volta. (Idea! If we need to find a string (i.e., \(\eta\)), let’s determine one character at a time.)

If we aim to find an optimal alignment, represented by an edit-string \(\eta\), we can determine it character by character.

Recursive Approach

Consider aligning two sequences \(\sigma_{1}\) of length \(m\) and \(\sigma_{2}\) of length \(n\). We want to compute the optimal alignment for prefixes \(\sigma_{1}[1...i]\) and \(\sigma_{2}[1...j]\). Let \(D(i, j)\) be the minimum cost (or maximum score) of aligning \(\sigma_{1}[1...i]\) and \(\sigma_{2}[1...j]\).

To decide the \((k+1)^{th}\) operation in the edit string \(\eta\), assuming we have already determined the first \(k\) operations, we consider possible operations for the \((k+1)^{th}\) step. As mentioned in the lecture slides: Immagino di aver determinato \(\eta[1, ..., k]\) e voglio \(\eta[k + 1]\). Diciamo che \(\eta[1, ..., k]\) allinei \(\sigma_{1}[1, ..., i - 1]\) e \(\sigma_{2}[1, ..., j - 1]\). 4 casi: \(\eta[k + 1] \in \{m, s, i, d\}\). (Imagine we have determined \(\eta[1, ..., k]\) and want \(\eta[k + 1]\). Assume \(\eta[1, ..., k]\) aligns \(\sigma_{1}[1, ..., i - 1]\) and \(\sigma_{2}[1, ..., j - 1]\). 4 cases: \(\eta[k + 1] \in \{m, s, i, d\}\).) Here, \(m\) denotes match, \(s\) substitution, \(i\) insertion, and \(d\) deletion.

Based on the lecture slides, the four cases are further detailed:

  1. Match/Substitution (m, s): If \(\eta[k+1]\) is a match or a substitution, then \(\eta[1, ..., k+1]\) aligns \(\sigma_{1}[1, ..., i]\) and \(\sigma_{2}[1, ..., j]\). We advance in both sequences. Se \(\eta[k + 1] \in \{m, s\}\), allora \(\eta[1, ..., k + 1]\) allinea \(\sigma_{1}[1, ..., i]\) e \(\sigma_{2}[1, ..., j]\).

  2. Insertion (i): If \(\eta[k+1]\) is an insertion (insert a gap in \(\sigma_{1}\)), then \(\eta[1, ..., k+1]\) aligns \(\sigma_{1}[1, ..., i-1]\) and \(\sigma_{2}[1, ..., j]\). We advance only in the second sequence (\(\sigma_{2}\)). Se \(\eta[k + 1] \in \{i\}\), allora \(\eta[1, ..., k + 1]\) allinea \(\sigma_{1}[1, ..., i - 1]\) e \(\sigma_{2}[1, ..., j]\).

  3. Deletion (d): If \(\eta[k+1]\) is a deletion (insert a gap in \(\sigma_{2}\)), then \(\eta[1, ..., k+1]\) aligns \(\sigma_{1}[1, ..., i]\) and \(\sigma_{2}[1, ..., j-1]\). We advance only in the first sequence (\(\sigma_{1}\)). Se \(\eta[k + 1] \in \{d\}\), allora \(\eta[1, ..., k + 1]\) allinea \(\sigma_{1}[1, ..., i]\) e \(\sigma_{2}[1, ..., j - 1]\).

We determine \(\eta[k+1]\) by selecting the operation that yields the minimum cost (or maximum score). This decision process leads to a recursive algorithm. As concluded in the slides: Determino \(\eta[k + 1]\) scegliendo quello che mi da il costo minimo. (I determine \(\eta[k + 1]\) by choosing the one that gives me the minimum cost.) This recursive approach is fundamental to dynamic programming solutions for sequence alignment.

Recursive Algorithm (Exponential Time Complexity)

Algorithm 1 presents a recursive approach to calculate the alignment cost. This algorithm, named allineamento_exp (exponential alignment), illustrates the recursive logic but is computationally inefficient due to exponential time complexity.

Algorithm 1: allineamento_exp\((\sigma_{1}, \sigma_{2}, i, j)\)

\(i + j\) \(c_{m} \leftarrow \Call{allineamento\_exp}{\sigma_{1}, \sigma_{2}, i - 1, j - 1}\) \(c_{m} \leftarrow \infty\) \(c_{s} \leftarrow \Call{allineamento\_exp}{\sigma_{1}, \sigma_{2}, i - 1, j - 1} + 1\) \(c_{i} \leftarrow \Call{allineamento\_exp}{\sigma_{1}, \sigma_{2}, i - 1, j} + 1\) \(c_{d} \leftarrow \Call{allineamento\_exp}{\sigma_{1}, \sigma_{2}, i, j - 1} + 1\) \(\min(c_{m}, c_{s}, c_{i}, c_{d})\)

In this simplified algorithm, we assume a cost of 0 for a match and a cost of 1 for substitution, insertion, and deletion. The base case is when we reach the beginning of either string. The cost of aligning a string with an empty string is simply the length of the non-empty string (all insertions or deletions).

This recursive algorithm has exponential time complexity due to redundant computations of the same subproblems. Dynamic programming optimizes this by storing and reusing the results of subproblems, leading to efficient algorithms like the Needleman-Wunsch algorithm for global alignment. As noted in the lecture slides, this recursive algorithm is esponenziale (exponential).

Conclusion

This lecture introduced fundamental concepts in sequence alignment, including cost and scoring. We explored different scoring schemes, such as scoring matrices and gap penalties, particularly gap penalties. We also introduced dynamic programming as a powerful strategy for efficient sequence alignment. The presented recursive algorithm provides a basic understanding of dynamic programming’s application, albeit with high computational cost. Future studies will delve into efficient dynamic programming algorithms like Needleman-Wunsch for global alignment and Smith-Waterman for local alignment. Key takeaways from this lecture include:

  • Understanding the distinction between cost minimization and score maximization in sequence alignment.

  • Familiarity with scoring matrices like and their application in protein alignment.

  • Knowledge of gap penalties, especially gap penalties, and their biological relevance.

  • Basic understanding of the dynamic programming approach for sequence alignment through a recursive algorithm.

Further exploration into dynamic programming will reveal more efficient algorithms that build upon these foundational concepts to solve sequence alignment problems in practical time frames.

Exercises

  1. Calculate the cost of aligning the sequences "GATTACA" and "TAGAC" using the recursive algorithm presented in Algorithm 1. Assume a match cost of 0 and a cost of 1 for substitution, insertion, and deletion.

  2. Consider the scoring matrix for Purines and Pyrimidines provided in Subsection 2.3. Calculate the score of aligning "AG" with "CT" using (a) only matches and mismatches, (b) allowing gaps with a gap penalty of -2.

  3. Explain the difference between cost minimization and score maximization in sequence alignment. Provide examples of scenarios where you might prefer one over the other.

  4. Why are gap penalties often more biologically relevant than constant gap penalties? Explain in terms of biological processes.

  5. (Further Study) Research and briefly describe the Needleman-Wunsch algorithm. How does it improve upon the recursive approach discussed in this lecture in terms of efficiency?