Section 9-3: Principle of the Analytical Comparison of Two Sequences

[ Previous chapter ][ This chapter ][ Next chapter ] The problem of the schematic alignment is its lacking ability to deal with gaps - elements which have no counterpart in the opposite sequence. Mathematics become difficult as we introduce this element with two rather than one property:


Subsection 9.3.1

Motivation

If we try this without two sample sequences as already known from the schematic comparison with

 

  
            Match value:  2
  
         Mismatch value: -1
  
            Gap penalty: -5
  
     Gap length penalty: -1
  

  
We can improve some of our alignments. For technical reasons of printability, the two sequences are shortened to tgatggtcaagtaaactatgaag and atggtaatggcacaattgacttt (which does not affect the principle). Examples of alignments could be
 
tgatggtcaa.gtaaactat.gaag       score: 14*2+5*(-1)+4*(-5)+4*(-1) = -1
  
  ||||| || |   || || ||                14 match 5 mismatch       
  
  atggt.aatg.gcacaattgacttt             4 gaps 4 gap length 
  

  
   tgatggtcaagtaaactatgaag       score:9*2+11*(-1)+1*(-5)+4*(-1) = -2
  
   || |  |   |  |    |||                9 match 11 mismatch       
  
  atggtaat...ggcacaattgacttt            1 gap 3 gap length 
  
The assignment of values is fairly arbitrary and allows for significant changes in the result. If we take the standard values of the gap program (as detailed below)
 

  
            Match value:  1
  
         Mismatch value:  0
  
            Gap penalty: -5
  
     Gap length penalty: -0.3
  

  
we end up with the following calculation:
 

  
tgatggtcaa.gtaaactat.gaag       score: 14*1+5*(0)+4*(-5)+4*(-0.3) = -7.2
  
  ||||| || |   || || ||                14 match 5 mismatch       
  
  atggt.aatg.gcacaattgacttt             4 gaps 4 gap length 
  

  
   tgatggtcaagtaaactatgaag       score:9*1+11*(0)+1*(-5)+4*(-0.3) =  2.8
  
   || |  |   |  |    |||                9 match 11 mismatch       
  
  atggtaat...ggcacaattgacttt            1 gap 3 gap length 
  
As these two examples show, the insertion and modification of gaps allows for a very broad variation of possibilities. Additionally, the judgement of the two possible alignments is possibly not satisfactory as we have not evaluated other possible alignments of the two sequences. In particular, the optimisation of the score at a given set of parameters, i.e. the calculation of best alignment, would require to try out all possible combinations, which is a fairly time- consuming task. An automatic procedure, therefore, is required to evaluate the possible solutions extensively. The algorithm for this purpose was first described by Smith and Watermann and usually implemented in a dynamic programming approach.


Subsection 9.3.2

Letter-by-Letter Alignment Prerequisites

We need a defined scoring table, which defines values for

More advanced implementations also use a parameter to define the change of the gap elongation penalty, which results in the creation of "affine gaps":

The values for a match must not necessarily be characterised by a single value. Amino acid comparisons, in particular, need to reflect the relation between two amino acids in more detail than a yes/no decision. The value for this comparison is read from a symbol comparison table which reflects either evolutionary or more pragmatic interpretations of symbol values.


Subsection 9.3.3

Symbol Comparison Tables

Nucleic acid sequence alignment will succeed if the scoring proceeds according to a simple match/mismatch schema. Sophisticated approaches will include ambiguity tables and score a match of G to S as half the match score - S is the ambiguity letter for G or C. However, most DNA sequences follow a 4-letter alphabet and, therefore, will be satisfactorily handled with straightforward matching score calculation.

Protein comparisons are more sophisticated. One option to compare amino acids by property is to set up a classification for

etc.

As it is difficult to quantify these properties, measurable values could be used like

etc.

The comparison tables which are generated this way might be based on correct values, however, miss the target as the value of a comparison matrix will be determined by the potential to follow the basic paradigm which we imply in performing the alignment:

 

  
                DNA sequence 
  
       
  
       determines  | 
  
                   V
  
                                                              
  
                protein sequence 
  
                
  
       determines  |
  
                   V
  
                   
  
                secondary structure 
  
                
  
       determines  |
  
                   V
  
                   
  
                tertiary structure (protein)
  
                
  
       determines  |
  
                   V
  
                   
  
                protein function
  

  
Two approaches have been used successfully in standard implementations of biocomputing software:

Many more matrices have been proposed but are not currently widely used.

NOTE: It is essential to understand the impact of the symbol comparison matrix for the result of the calculation; in particular as the statistics or numerical values of results will reflect the underlying matrix. "Good" or "meaningful" alignments as numerically computed by a program are only as good as the algorithm, but the relevance for biology is also affected by the relevance or applicability of the matrix used for comparison.


Subsection 9.3.4

Alignment Path Matrices

NOTE:

The following description tries to explain the method but does not show the real implementation for reasons of simplicity and brevity. Please refer to the original papers for further reference. Advanced users should skip this section and proceed reading the "programs"section .

The basic principle of an alignment path matrix is the stepwise creation of values. Figure E shows the initial step: As in the 'dotplot' program, sequences were painted in a crossword-like fashion. Instead of a dot, the match or mismatch value is printed.

Next, additional values are calculated by adding the value of the current field (match or mismatch value) in addition to the value of an alignment path seen before arriving at this step.

Figure F shows the alignment path matrix in an intermediate stage. Each new value is printed as the minimum of one of several possibilities. To get to the value (X) in the Figure, one could use one of several possible pathways:

 

  
                              one gap: 1 (previous) -5 (gap) -1 (mismatch)
  
                             /                                    = -5
  
                            / direct: -3 (previous) -1 (mismatch) = -4 
  
     +------+------+------+// one gap: 0 (previous) -5 (gap) -1 (mismatch)
  
  c  |  -1  |  -2  |   0  /// one longer gap:                     = -6
  
     +------+------______////----+     6 (previous) -5 (gap) -1 (gap length) 
  
  t  |  -1  |   1 /|  -3 /// -1  |                  -1 (mismatch) = -1
  
     +------+------+-----||------+
  
  g  |  -1  |  -2  |   0/ |   8  |
  
     +------+------+-----/+------+     min of (-5, -4, -6, -1)    = -1
  
  g  |  -1  |  -2  |   6/ |  -1  |     value printed, therefore:    -1
  
     +------+------+------+------+
  
  ...|      |      |      |
  
         a      t      g      g 
  

  
To compute all beneficial fields quickly, some implementations do not compute "chanceless" pathways and reduce, therefore, memory and time requirements. This, typically, applies to the edges of sequence alignment path matrices (in our example, the upper left and lower right corner). The GCG program implementation claims a "value not guaranteed to be optimal" but usually the edges do not yield a much better value.



Figure E: Alignment path matrix, step 1 Scores are computed as outlined in the text.

 
  g -1
  
  a  2
  
  a  2
  
  g -1
  
  t -1
  
  a  2
  
  t -1
  
  c -1
  
  a  2
  
  a  2
  
  a  2
  
  t -1
  
  g -1
  
  a  2
  
  a  2
  
  c -1
  
  t -1
  
  g -1
  
  g -1
  
  t -1
  
  a  2
  
  g -1
  
  t -1  2 -1 -1  2 -1 -1  2 -1 -1 -1 -1 -1 -1 -1  2  2 -1 -1 -1  2  2  2
  
     a  t  g  g  t  a  a  t  g  g  c  a  c  a  a  t  t  g  a  c  t  t  t
  

  
Figure F: Alignment path matrix in an advanced stage. Scores are calculated as described in the text using the following values: direct: score + previous. score value = 2 gap: score + previous -5 mismatch = -1 gap length = -1 Note the (X) which is explained in detail in the text.
 
  g -1  1  3
  
  a  2  1 -3
  
  a  2 -2 -3
  
  g -1 -2  6
  
  t -1  4 -3
  
  a  2 -2  0 
  
  t -1  1  0 
  
  c -1  1  0 
  
  a  2  1  0 
  
  a  2  1 -3
  
  a  2 -2  0
  
  t -1  1  0
  
  g -1  1  3
  
  a  2  1 -3
  
  a  2 -2 -3
  
  c -1 -2  0 (X)
  
  t -1  1 -3 -1
  
  g -1 -2  0  8
  
  g -1 -2  6 -1
  
  t -1  4 -3 -4
  
  a  2 -2 -3  3  0  0  3 -3 -3  3  0  0 -3  0  0 -3 -3  0  6 -3 -3 -3  0
  
  g -1 -2  4  1 -2  1 -2 -2  4  1 -2 -2 -2 -2 -2 -2  1  4 -2 -2 -2  1  1
  
  t -1  2 -1 -1  2 -1 -1  2 -1 -1 -1 -1 -1 -1 -1  2  2 -1 -1 -1  2  2  2
  
     a  t  g  g  t  a  a  t  g  g  c  a  c  a  a  t  t  g  a  c  t  t  t
  

  

  
Figure G:
Alignment path matrix in an completed stage. Scores are calculated as described in the text using values listed in Figure E. Note the positive numbers of the two most promising alignments.
 
  g -1  1  3 -1 -1  3  1  3  3 -2  0 -1 -1  2  0  4  5  7  4 11 12  6  3
  
  a  2  1 -3  0  4  2  4  1 -4 -2  0  0  3  1  5  6  5  5 12 13  7  4  3
  
  a  2 -2 -3  5  0  2  2 -4 -1  1 -2  4 -1  3  7  3  6 10 14  6  5  4  4
  
  g -1 -2  6 -1 -2  0 -3  1  2 -1  2  0  1  5  2  5 11 12  7  6  5  5  6
  
  t -1  4 -3 -1  1 -3  2  0 -3  0  1  0  6  3  6 12 10  8  7  6  5  7 13
  
  a  2 -2  0 -1 -2  3 -2 -2  1  2  1  7  4  7 10  8  9  8  7  3  5 11 10
  
  t -1  1  0 -1  1 -4 -4  2  3  2  5  5  5  8  9 10  9  5  1  4 12 11  6
  
  c -1  1  0 -1 -5 -3 -1  4  3  6  6  3  9 10  8  7  6  2  2 11  9  4  3
  
  a  2  1  0 -4 -2  0  5  4  4  1  4  7 11  9  8  7  3  3  9 10  5  4  3
  
  a  2  1 -3 -1 -2  3  5  5  2  5  5 12  7  6  8  4  3  7 11  6  2  1  4
  
  a  2 -2  0 -1  1  3  6  3  6  6 10  8  4  6  5  4  8  9  7  3 -2  5  4
  
  t -1  1  0  2  1  4  4  7  7 11  6  5  4  3  5  9 10  2  4 -1  3  2 -1
  
  g -1  1  3 -1 -1  5  5  8 12  7  3  2  4  6  7  8  3  5 -2  1  0 -3  1
  
  a  2  1 -3 -4  0  6  8 10  5  4  2  5  7  8  9  2  0 -1  2  1 -2  2 -3
  
  a  2 -2 -3 -1  1  7 11  3  2  3  3  8  6  7  3 -1 -2  0  2 -4  3 -2 -2
  
  c -1 -2  0 -1  2  9  4  3  4 -1  6  7  5  1  0 -1  1  0 -3  4 -2 -1 -1
  
  t -1  1 -3 -1 10  2  1  5 -2  1  8  3  2  1  0  2  1 -3  2 -1  1  1 -2
  
  g -1 -2  0  8  0 -1  3 -2  2  9  1  0  0 -2 -3 -4 -3  3  0 -1 -1 -3  1
  
  g -1 -2  6 -1 -3  4 -1 -2  7  2 -2  1 -2 -2 -5 -2  1  1 -5  0  4  2  1
  
  t -1  4 -3 -4  5 -1 -1  5 -3 -2  2 -1 -1 -4 -1  2 -1 -4 -1  5  3  2 -4 
  
  a  2 -2 -3  3  0  0  3 -3 -3  3  0  0 -3  0  0 -3 -3  0  6 -2 -3 -3  0
  
  g -1 -2  4  1 -2  1 -2 -2  4  1 -2 -2 -2 -2 -2 -2  1  4 -2 -2 -2  1  1
  
  t -1  2 -1 -1  2 -1 -1  2 -1 -1 -1 -1 -1 -1 -1  2  2 -1 -1 -1  2  2  2
  
     a  t  g  g  t  a  a  t  g  g  c  a  c  a  a  t  t  g  a  c  t  t  t
  

  
Figure H:
Alignment path matrix in an completed stage. Scores are calculated as described in the text using values listed in Figure E. Note the positive numbers of the two most promising alignments.
 
  g                                                             12  
  
  a                                                          13/
  
  a                                                       14/
  
  g                                                 ___12/
  
  t                                             12/                   13
  
  a                                           10/                  11/
  
  t                                           |                 12/   
  
  c                                        10/              11/  
  
  a                                     11/                  |   
  
  a                                  12/                  11/ 
  
  a                               10/                   9/   
  
  t                            11/                  10/  
  
  g                         12/                   8/ 
  
  a                      10/                   9/  
  
  a                   11/                ___7/  
  
  c                 9/                7/  
  
  t             10/                8/  
  
  g           8/                9/  
  
  g        6/                7/  
  
  t     4/                5/ 
  
  a  2/                3/  
  
  g                 1/  
  
  t              2/  
  
     a  t  g  g  t  a  a  t  g  g  c  a  c  a  a  t  t  g  a  c  t  t  t
  

  
The final result of the alignments is painted in Figure G.
The evaluation of the "best" alignment is achieved by seeking the highest value or in the edges, and processing along degressing score, starting (in our example) at the upper right moving downwards. Figure H shows this exemplarically for the two best diagonals found. In a letter-by-letter alignment, these two diagonals read as follows:
 

  
tgatggtcaagtaaac.tatgaag                 tgatgg.tcaagtaaactatgaag 
  
  ||||| | |  |   |  ||                   | ||||  ||| |  ||| |  
  
  atggtaatggcacaatt.gacttt           atggtaatggcacaatt.gacttt   
  
  
  

[
previous chapter ],[ this chapter ][ next chapter ] , [next page/section] , or [overview] , or [table of contents]