[ Previous chapter ][
This chapter ][ Next chapter ]
The problem of the schematic alignment is its lacking
ability
to deal with
gaps
- elements which have no counterpart in the opposite
sequence.
Mathematics become difficult as we introduce this element with
two
rather
than one property:
We could cope for the effect that
there is no counterpart letter by assigning a reasonable
bad score
such as a
gap penalty,
this value is typically 3 to 8 times higher
in value than the
best "match" value, and is subtracted from the total score.
The longer the gap is, the more unfavourable it will become.
We can compensate for overemphasising
gaps by telling the program
that we need an insertion penalty, but each elongation is scored
much weaker, as once a gap is there its length is less crucial
than the fact that there is
a gap at all. This parameter is called
gap length penalty.
If we try this without two sample sequences as already known
from the schematic comparison
with
We need a defined scoring table, which defines
values for
More advanced implementations also use a parameter to define the change
of the
gap
elongation penalty,
which results in the creation of
"affine gaps":
The values for a match must not necessarily be characterised by
a single value. Amino
acid comparisons, in particular, need
to reflect the relation between two amino acids in more
detail than
a yes/no decision. The value for this comparison is read from a
symbol
comparison table
which reflects either evolutionary or more pragmatic interpretations
of symbol values.
Nucleic acid sequence alignment will succeed if the scoring proceeds
according to a simple
match/mismatch schema. Sophisticated approaches
will include ambiguity tables and score a match
of
G
to
S
as half the match score - S is the ambiguity letter for G or C.
However, most DNA sequences
follow a 4-letter alphabet and, therefore,
will be satisfactorily handled with straightforward
matching score
calculation.
Protein
comparisons are more sophisticated. One option to compare amino
acids by property is to set
up a classification for
etc.
As it is difficult to quantify these properties, measurable values
could be used like
etc.
The comparison tables which are generated this way might be based
on correct values, however,
miss the target as the value of a
comparison matrix
will be determined by the potential to follow
the basic paradigm which we imply in performing
the alignment:
The analysis of globular protein structure evaluated with X-ray crystallography
allowed
Dayhoff
et al.
to set up a comparison matrix which honours
or penalises amino acid substitutions.
As proteins were found to allow exchange at certain positions,
whereas
other substitutions were never found, it was possible to postulate
a comparison matrix
based on the number of
accepted point mutations
(PAM 250), which is widely used. Note that the values therein are averages
which are calculated
on a limited number of proteins - the sequence
databases are much larger by today as compared
to the crystallographic
databases known at the time of the PAM250 matrix creation.
Based on the
PROSITE
database of Amos Bairoch, the alignments
of many protein motifs were used
by
Hennikoff et al.
to compute a matrix
which is commonly known as
BLOSUM62
in the most widely used variant. The benefit of this matrix is that
a very large
number of alignments was the basis for its creation,
which allowed more "sensitive" comparisons
than the PAM250 matrix.
Many more matrices have been proposed but are not currently widely used.
NOTE: It is essential to understand the impact of the symbol comparison
matrix for the
result of the calculation; in particular as the
statistics or numerical values of results will
reflect the underlying
matrix. "Good" or "meaningful" alignments as numerically computed
by a program are only as good as the algorithm, but the relevance
for biology is also affected
by the relevance or applicability of the
matrix used for comparison.
NOTE:
The following description tries to explain the
method but does not show the real implementation
for reasons of simplicity
and brevity. Please refer to the original papers for further reference.
Advanced users should skip this section and proceed reading the
"programs"section .
The basic principle of an alignment path matrix is the
stepwise
creation
of values.
Figure E
shows the initial
step: As in the 'dotplot' program, sequences were
painted in a crossword-like fashion. Instead
of a dot, the match or
mismatch value is printed.
Next, additional values are calculated by adding the value of the
current field (match or mismatch
value) in addition to the
value of an alignment path seen before arriving at this step.
Figure F
shows the alignment path matrix in an intermediate stage. Each new
value
is printed as the minimum of one of several possibilities. To get
to the value
(X)
in the Figure, one could use one of several possible pathways:
Figure E:
Alignment path matrix, step 1
Scores are computed as outlined
in the text.
Subsection 9.3.1 Motivation
Match value: 2
Mismatch value: -1
Gap penalty: -5
Gap length penalty: -1
We can improve some of our alignments. For
technical reasons of
printability, the two sequences are shortened to
tgatggtcaagtaaactatgaag
and
atggtaatggcacaattgacttt
(which does not affect the principle).
Examples of alignments could be
tgatggtcaa.gtaaactat.gaag score: 14*2+5*(-1)+4*(-5)+4*(-1) = -1
||||| || | || || || 14 match 5 mismatch
atggt.aatg.gcacaattgacttt 4 gaps 4 gap length
tgatggtcaagtaaactatgaag score:9*2+11*(-1)+1*(-5)+4*(-1) = -2
|| | | | | ||| 9 match 11 mismatch
atggtaat...ggcacaattgacttt 1 gap 3 gap length
The assignment of values is fairly arbitrary and allows for significant
changes in the
result. If we take the standard values of the
gap
program (as detailed below)
Match value: 1
Mismatch value: 0
Gap penalty: -5
Gap length penalty: -0.3
we end up with the following calculation:
tgatggtcaa.gtaaactat.gaag score: 14*1+5*(0)+4*(-5)+4*(-0.3) = -7.2
||||| || | || || || 14 match 5 mismatch
atggt.aatg.gcacaattgacttt 4 gaps 4 gap length
tgatggtcaagtaaactatgaag score:9*1+11*(0)+1*(-5)+4*(-0.3) = 2.8
|| | | | | ||| 9 match 11 mismatch
atggtaat...ggcacaattgacttt 1 gap 3 gap length
As these two examples show, the insertion and modification of gaps
allows for a very
broad variation of possibilities. Additionally,
the judgement of the two possible alignments
is possibly not
satisfactory as we have not evaluated other possible alignments of
the two
sequences. In particular, the
optimisation
of the score at a given set of
parameters, i.e. the calculation of best alignment,
would require to try out all
possible
combinations, which is a fairly time- consuming task. An
automatic
procedure,
therefore, is required to evaluate the possible solutions
extensively. The algorithm for this
purpose was first described by
Smith and Watermann
and usually implemented
in a
dynamic programming
approach.
Subsection 9.3.2 Letter-by-Letter Alignment Prerequisites
Subsection 9.3.3 Symbol Comparison Tables
DNA sequence
determines |
V
protein sequence
determines |
V
secondary structure
determines |
V
tertiary structure (protein)
determines |
V
protein function
Two approaches have
been used successfully in standard implementations
of biocomputing software:
Subsection 9.3.4 Alignment Path Matrices
one gap: 1 (previous) -5 (gap) -1 (mismatch)
/ = -5
/ direct: -3 (previous) -1 (mismatch) = -4
+------+------+------+// one gap: 0 (previous) -5 (gap) -1 (mismatch)
c | -1 | -2 | 0 /// one longer gap: = -6
+------+------______////----+ 6 (previous) -5 (gap) -1 (gap length)
t | -1 | 1 /| -3 /// -1 | -1 (mismatch) = -1
+------+------+-----||------+
g | -1 | -2 | 0/ | 8 |
+------+------+-----/+------+ min of (-5, -4, -6, -1) = -1
g | -1 | -2 | 6/ | -1 | value printed, therefore: -1
+------+------+------+------+
...| | | |
a t g g
To compute all beneficial fields quickly, some implementations
do not compute "chanceless"
pathways and reduce, therefore, memory and
time requirements. This, typically, applies to the
edges of sequence
alignment path matrices (in our example, the upper left and lower right
corner).
The GCG program implementation claims a "value
not guaranteed to be optimal" but
usually the edges do not yield a much
better value.
g -1
a 2
a 2
g -1
t -1
a 2
t -1
c -1
a 2
a 2
a 2
t -1
g -1
a 2
a 2
c -1
t -1
g -1
g -1
t -1
a 2
g -1
t -1 2 -1 -1 2 -1 -1 2 -1 -1 -1 -1 -1 -1 -1 2 2 -1 -1 -1 2 2 2
a t g g t a a t g g c a c a a t t g a c t t t
Figure F:
Alignment path matrix in an advanced stage. Scores are calculated
as described in the text using the following values:
direct: score + previous. score value
= 2
gap: score + previous -5 mismatch = -1 gap length = -1
Note the (X) which is explained
in detail in the text.
g -1 1 3
a 2 1 -3
a 2 -2 -3
g -1 -2 6
t -1 4 -3
a 2 -2 0
t -1 1 0
c -1 1 0
a 2 1 0
a 2 1 -3
a 2 -2 0
t -1 1 0
g -1 1 3
a 2 1 -3
a 2 -2 -3
c -1 -2 0 (X)
t -1 1 -3 -1
g -1 -2 0 8
g -1 -2 6 -1
t -1 4 -3 -4
a 2 -2 -3 3 0 0 3 -3 -3 3 0 0 -3 0 0 -3 -3 0 6 -3 -3 -3 0
g -1 -2 4 1 -2 1 -2 -2 4 1 -2 -2 -2 -2 -2 -2 1 4 -2 -2 -2 1 1
t -1 2 -1 -1 2 -1 -1 2 -1 -1 -1 -1 -1 -1 -1 2 2 -1 -1 -1 2 2 2
a t g g t a a t g g c a c a a t t g a c t t t
Figure G:
Alignment path matrix in an completed
stage. Scores are calculated
as described in the text using values listed in Figure E.
Note
the positive numbers of the two most promising alignments.
g -1 1 3 -1 -1 3 1 3 3 -2 0 -1 -1 2 0 4 5 7 4 11 12 6 3
a 2 1 -3 0 4 2 4 1 -4 -2 0 0 3 1 5 6 5 5 12 13 7 4 3
a 2 -2 -3 5 0 2 2 -4 -1 1 -2 4 -1 3 7 3 6 10 14 6 5 4 4
g -1 -2 6 -1 -2 0 -3 1 2 -1 2 0 1 5 2 5 11 12 7 6 5 5 6
t -1 4 -3 -1 1 -3 2 0 -3 0 1 0 6 3 6 12 10 8 7 6 5 7 13
a 2 -2 0 -1 -2 3 -2 -2 1 2 1 7 4 7 10 8 9 8 7 3 5 11 10
t -1 1 0 -1 1 -4 -4 2 3 2 5 5 5 8 9 10 9 5 1 4 12 11 6
c -1 1 0 -1 -5 -3 -1 4 3 6 6 3 9 10 8 7 6 2 2 11 9 4 3
a 2 1 0 -4 -2 0 5 4 4 1 4 7 11 9 8 7 3 3 9 10 5 4 3
a 2 1 -3 -1 -2 3 5 5 2 5 5 12 7 6 8 4 3 7 11 6 2 1 4
a 2 -2 0 -1 1 3 6 3 6 6 10 8 4 6 5 4 8 9 7 3 -2 5 4
t -1 1 0 2 1 4 4 7 7 11 6 5 4 3 5 9 10 2 4 -1 3 2 -1
g -1 1 3 -1 -1 5 5 8 12 7 3 2 4 6 7 8 3 5 -2 1 0 -3 1
a 2 1 -3 -4 0 6 8 10 5 4 2 5 7 8 9 2 0 -1 2 1 -2 2 -3
a 2 -2 -3 -1 1 7 11 3 2 3 3 8 6 7 3 -1 -2 0 2 -4 3 -2 -2
c -1 -2 0 -1 2 9 4 3 4 -1 6 7 5 1 0 -1 1 0 -3 4 -2 -1 -1
t -1 1 -3 -1 10 2 1 5 -2 1 8 3 2 1 0 2 1 -3 2 -1 1 1 -2
g -1 -2 0 8 0 -1 3 -2 2 9 1 0 0 -2 -3 -4 -3 3 0 -1 -1 -3 1
g -1 -2 6 -1 -3 4 -1 -2 7 2 -2 1 -2 -2 -5 -2 1 1 -5 0 4 2 1
t -1 4 -3 -4 5 -1 -1 5 -3 -2 2 -1 -1 -4 -1 2 -1 -4 -1 5 3 2 -4
a 2 -2 -3 3 0 0 3 -3 -3 3 0 0 -3 0 0 -3 -3 0 6 -2 -3 -3 0
g -1 -2 4 1 -2 1 -2 -2 4 1 -2 -2 -2 -2 -2 -2 1 4 -2 -2 -2 1 1
t -1 2 -1 -1 2 -1 -1 2 -1 -1 -1 -1 -1 -1 -1 2 2 -1 -1 -1 2 2 2
a t g g t a a t g g c a c a a t t g a c t t t
Figure H:
Alignment path matrix in an completed
stage. Scores are calculated
as described in the text using values listed in Figure E.
Note
the positive numbers of the two most promising alignments.
g 12
a 13/
a 14/
g ___12/
t 12/ 13
a 10/ 11/
t | 12/
c 10/ 11/
a 11/ |
a 12/ 11/
a 10/ 9/
t 11/ 10/
g 12/ 8/
a 10/ 9/
a 11/ ___7/
c 9/ 7/
t 10/ 8/
g 8/ 9/
g 6/ 7/
t 4/ 5/
a 2/ 3/
g 1/
t 2/
a t g g t a a t g g c a c a a t t g a c t t t
The final result of the alignments is painted in
Figure G.
The evaluation of the "best" alignment is achieved by seeking the highest value or in the
edges, and processing along degressing score, starting (in our
example) at the upper right
moving downwards.
Figure H
shows this exemplarically for the two best diagonals
found.
In a letter-by-letter alignment, these two diagonals read as follows:
tgatggtcaagtaaac.tatgaag tgatgg.tcaagtaaactatgaag
||||| | | | | || | |||| ||| | ||| |
atggtaatggcacaatt.gacttt atggtaatggcacaatt.gacttt
[ previous chapter ],[
this chapter ][ next chapter ]
, [next page/section] , or [overview] , or [table of contents]