JAMF Archive
BioCompanion as published in 1995
THIS IS THE REFERENCE CODE AS PUBLISHED. Doelz, R. Optimal production of biological documentation: the JAM format. Comput. Applic. Biosci. 11, 224-226 (1995).
The version you are currently viewing is the one printed and distributed via the Internet from the server of BioComputing Basel. Version 3.1 of the BioCompanion was published with version 2 of the JAMF software. The server that was indicated in the documentation has ceased to exist. Version 3.2 of the BioCompanion was not publicly available for free but was shareware that was distributed with GCG's software release 9. For the purpose of enhanced editing, JAMF was partially rewritten and the proprietary version 3.x of JAMF was used from 1996 onwards. The BioCompanion is available in a current version from the publisher. It has significantly changed both in software and content.
Let us assume the following two sequences:
One additional value is missing: if the two sequences differ in
size, some symbols of one sequence will never have a counterpart.
The score of any symbol against "nothing" is therefore
assumed to be 0.
To get started, we write the two sequences one below the
other, as in the listing above. However, we can align
the beginning or the end, or arrange the sequences arbitrarily.
Figure A below shows our
two sequences shifted by various positions.
The score is determined according to the table
above.
Figure A: Sequence alignments produced by shifting.
Scores are calculated using a match value of 1 and a mismatch
value of 0. Numbers in parentheses refer to the calculation
with a mismatch value of -0.5.
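The shift-and-score procedure behind Figure A can be sketched in a few lines. This is only an illustration: the function name `shift_score` and the sign convention (a positive shift moves the upper sequence to the left, so that `top[i]` is paired with `bottom[i - shift]`) are assumptions made here, chosen to agree with how the figure appears to be drawn. Symbols without a counterpart contribute 0, as stated in the text. Hand-drawn match bars are easy to miscount, so the sketch is best used to check individual shifts.

```python
def shift_score(top, bottom, shift, match=1.0, mismatch=0.0):
    """Score the ungapped overlap of two sequences for one shift.

    top[i] is paired with bottom[i - shift]; symbols that fall outside
    the other sequence have no counterpart and score 0.
    Returns (score, length of the overlap).
    """
    matches = 0
    length = 0
    for i, symbol in enumerate(top):
        j = i - shift
        if 0 <= j < len(bottom):
            length += 1
            if symbol == bottom[j]:
                matches += 1
    mismatches = length - matches
    return matches * match + mismatches * mismatch, length


def rank_shifts(top, bottom, match=1.0, mismatch=0.0):
    """Return (score, shift, length) for every shift with an overlap, best first."""
    results = []
    for shift in range(-len(bottom) + 1, len(top)):
        score, length = shift_score(top, bottom, shift, match, mismatch)
        results.append((score, shift, length))
    return sorted(results, reverse=True)
```

Calling `rank_shifts(seq1, seq2, match=1.0, mismatch=-0.5)` with the two sequences of this chapter lists all shifts under the penalised scoring, not only the ten shown in the figure.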
Figure A allows some elementary conclusions:
The scoring table, if written as best-score listing of the
top four alignments, will read as:
This type of scoring favours long alignments: the longer the
alignment, the higher the score tends to be. However,
mismatches are not penalised, so long stretches
of differing sequence may be included in the alignment. The score
therefore improves as the alignment gets longer,
regardless of the number of mismatches encountered. In order to
discriminate better between genuinely similar sequences and those which
merely show accidental similarity over a long range of
symbols, as is expected in G/C rich sequences, we need to change
the scoring to penalise mismatches. As an example,
we use the scoring
The dotplot method allows visual inspection
of all possible alignments in schematic fashion, and is shown
in Figures B and C.
Figure B displays the very basic
dotplot: Two sequences are plotted as a matrix, and identical
symbols get an x (for technical reasons only,
in graphic output this is a "dot"). As the two sequences are
28 and 36 base pairs in length, we will have 28 x 36 = 1008 positions
to calculate. Our DNA alphabet is a 4-letter alphabet (as we
treat T and U as equal), which means that the random
chance of an identical symbol at any given position
is 1/4 = 0.25. Therefore, if the two sequences are totally unrelated,
we expect 0.25 x 1008 = 252 dots, or, as formula,
Figure B: Dot-plot created
by painting a dot (x) at each match.
Obviously, we still need to improve the so-called
signal-to-noise ratio. The signal is a
line or otherwise visible pattern which we could use in the
visual inspection. The noise is what we can
expect from statistics: in a four-letter alphabet, we expect
a random hit probability of 25%, which is too high if we are looking
for weak similarities. Remember that the best score we had
was 15, for shift -4.
We need an improvement for the dotplots with
the word method:
In our example, on a length of 28 base pairs, we have only 15
matches, which is approximately every second nucleotide. This
is a relatively weak signal. Therefore, we use a first approximation:
we no longer paint a dot for each matching nucleotide; instead
we use oligomers (called words) and paint a
dot only if these words match. This reduces the
chance of a random match. If we use di-nucleotides, accidental
matches will occur with a probability of (1/4)*(1/4) = 1/16 = 6.25%, which is already much
lower than the 25% obtained earlier. The GCG program suite uses
a default word size of 6;
this is (1/4) to the power of 6, which results in a random
match probability of about 0.025%. There is, however, an undesired
side-effect: the larger the word size, the lower the probability
that a given word matches between two different sequences.
Expressed as formula, this will read
Figure C: Dot-plot painting a dot (o) at each
matching di-nucleotide - more details are described in the text.
The problem with the sensitivity (too few consecutive identities
in the case of low similarity) can be overcome by permitting
mismatches within a word. This is the already
known window technique:
We select a window and require that a minimal
number of matches is found within this window. The GCG programs
call this minimum the stringency. Using a
window/stringency algorithm, we
paint a dot at the middle of each qualifying window rather
than at a matching word. Given the values 9/5 for
window/stringency, we obtain the plot shown
in Figure D.
Figure D:
Dot-plot with a dot (O) at each matching window of 9, with a
minimum of 5 matches (stringency) per window.
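A minimal sketch of the window/stringency idea follows. It is not the GCG 'compare' implementation; the function name `window_dots` and the choice to report 0-based coordinates of the window centre are assumptions made for illustration.

```python
def window_dots(vertical, horizontal, window=9, stringency=5):
    """Return the centre positions (i, j) of all diagonal windows that
    contain at least `stringency` matching symbol pairs.

    Positions are 0-based indices into the vertical and horizontal
    sequence respectively.
    """
    half = window // 2
    dots = []
    for i in range(len(vertical) - window + 1):
        for j in range(len(horizontal) - window + 1):
            # Count identical symbols along the diagonal of this window.
            matches = sum(
                vertical[i + k] == horizontal[j + k] for k in range(window)
            )
            if matches >= stringency:
                dots.append((i + half, j + half))
    return dots
```

Because the dot is placed at the centre of the window, a reported diagonal is shorter than the underlying region of similarity by about half a window at each end.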
Looking at Figure D, two main diagonals can
be identified:
We will assume that all these
setup operations have been successfully completed. Note
that the methods described below are valid for both DNA
and protein sequences.
compare
calculates the dots to be displayed later. It comes
with two different algorithms, with "window/stringency" as the default.
You might try compare/word for really large
sequences.
dotplot
displays the dots calculated by compare, and will look nicer if you
use additional options like the following. For an overview of
possible options, use dotplot with the "check" option.
Recommendation for the 'dotplot' program:
Nicer figures will be obtained if you give the following command
line options. If you use WPI, make
sure that the command line options are ticked as indicated below.
dotplot /tickaxes /symbol=2 /font=3
================================= Begin Exercise 8
Schematic pairwise DNA analysis: Compare two sequences
using the 'dotplot' technique.
Using previous exercise results, you should have two DNA sequences
by now: my1.seq as the typed-in sequence, and
my2.seq as the reading-frame extracted DNA sequence
from the seqed exercise. You shall compare these
two sequences now.
To solve this problem, follow this schedule:
================================= End Exercise 8
It is important to know as much about the sequence of interest
as possible. The dotplot as explained above
may also be used to analyse sequences internally, i.e.,
you may compare a sequence against itself. Such a dotplot
is perfectly symmetrical. The GCG implementation of the
'dotplot' program will recognise that the sequences on both axes
are identical and will, therefore, plot only half of the plot.
You might want to force a full display with
$ dotplot /all
The benefit of an internal repeat analysis will be obvious
if you encounter gene duplication or the occurrence
of several functional protein motifs.
================================= Begin Exercise 9
Internal repeat analysis: Analyse a single sequence using
the 'dotplot' technique in order to find internal repeats.
Using previous exercise results, you should have a database
DNA sequence my2.seq, and the translated sequence,
m19311.pep as peptide. You shall now compare each of these
two sequences with itself, on the DNA and protein level respectively,
and compare the results. Note that, in particular on the protein
level, the adjustment of window and stringency values might
be a lengthy process.
To solve this problem, follow this schedule:
Tip for the Superimposition of Dotplots: All
GCG graphics routines offer a "density" to plot the output. If
you do not accept the default value but use a number divisible
by three for DNA, and the corresponding number for the protein
comparison, the two plots can be superimposed.
================================= End Exercise 9
The problem of the schematic alignment is its inability
to deal with gaps, i.e. elements which have no counterpart
in the opposite sequence. The mathematics becomes more difficult because
this element comes with two properties rather than
one:
We can account for the fact that there is no counterpart letter
by assigning a reasonably bad score, the gap penalty.
This value is typically 3 to 8 times larger in magnitude
than the best "match" value, and is subtracted from the total
score.
The longer the gap is, the more unfavourable it will become.
To avoid overemphasising long gaps, we tell the program
that opening a gap costs the full gap penalty, but each elongation is penalised
much more weakly, since once a gap is there its length is less crucial
than the fact that there is a gap at all. This parameter is called
the gap length penalty.
If we try this with our two sample sequences as already known
from the schematic comparison with
We need a defined scoring table, which defines values for
More advanced implementations also use a parameter to define
the change of the gap elongation penalty, which
results in the creation of "affine gaps":
The values for a match need not necessarily be characterised
by a single value. Amino acid comparisons, in particular, need
to reflect the relation between two amino acids in more detail
than a yes/no decision. The value for this comparison is read
from a symbol comparison table which reflects
either evolutionary or more pragmatic interpretations of symbol
values.
Nucleic acid sequence alignment will succeed if the scoring
proceeds according to a simple match/mismatch schema. Sophisticated
approaches will include ambiguity tables and score a match of
G to S as half the match score
- S is the ambiguity letter for G or C. However, most DNA sequences
follow a 4-letter alphabet and, therefore, will be satisfactorily
handled with straightforward matching score calculation.
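One way to realise the "G to S scores half a match" idea is to weight the match score by the fraction of base combinations that agree. The sketch below is an assumption about how such a table could be derived, not the scheme of any particular program; the dictionary `IUPAC` lists only a subset of the ambiguity codes (B, D, H and V are left out for brevity) and the function name is hypothetical.

```python
# Subset of the IUPAC nucleotide ambiguity codes (assumed sufficient here).
IUPAC = {
    "a": "a", "c": "c", "g": "g", "t": "t", "u": "t",
    "s": "cg", "w": "at", "r": "ag", "y": "ct",
    "k": "gt", "m": "ac", "n": "acgt",
}


def ambiguity_score(x, y, match=1.0, mismatch=0.0):
    """Score a symbol pair as the expected score over all bases that the
    two (possibly ambiguous) symbols can stand for."""
    bases_x = set(IUPAC[x.lower()])
    bases_y = set(IUPAC[y.lower()])
    # Fraction of base combinations that are identical.
    fraction = len(bases_x & bases_y) / (len(bases_x) * len(bases_y))
    return fraction * match + (1 - fraction) * mismatch


print(ambiguity_score("G", "S"))  # 0.5, half the match score
```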
Protein comparisons are more sophisticated.
One option to compare amino acids by property is to set up a
classification for etc.
As it is difficult to quantify these properties, measurable values
could be used like etc.
Comparison tables generated this way might be based
on correct values, yet they miss the target: the value of a
comparison matrix is determined by how well it follows
the basic paradigm which we imply in performing the alignment:
The analysis of globular protein structure evaluated with X-ray
crystallography allowed Dayhoff et al. to set
up a comparison matrix which honours or penalises amino acid
substitutions. As proteins were found to allow exchange at certain
positions, whereas other substitutions were never found, it was
possible to postulate a comparison matrix based on the number
of accepted point mutations (PAM 250), which
is widely used. Note that the values therein are averages
calculated from a limited number of proteins; today's sequence
databases are much larger than the crystallographic
databases known at the time the PAM250 matrix was created.
Based on the PROSITE database of Amos Bairoch,
the alignments of many protein motifs were used by Henikoff
et al. to compute a matrix which is commonly known as
BLOSUM62 in its most widely used variant. The
benefit of this matrix is that a very large number of alignments
was the basis for its creation, which allows more "sensitive"
comparisons than the PAM250 matrix.
Many more matrices have been proposed but are not currently
widely used.
NOTE: It is essential to understand the impact of the
symbol comparison matrix for the result of the calculation; in
particular as the statistics or numerical values of results will
reflect the underlying matrix. "Good" or "meaningful" alignments
as numerically computed by a program are only as good as the
algorithm, but the relevance for biology is also affected by
the relevance or applicability of the matrix used for comparison.
NOTE:
The following description tries to explain the method but does
not show the real implementation for reasons of simplicity and
brevity. Please refer to the original papers for further reference.
Advanced users should skip this section and proceed to the
"programs" section.
The basic principle of an alignment path matrix is the stepwise
creation of values. Figure E shows
the initial step: as in the 'dotplot' program, the sequences are
written in a crossword-like fashion. Instead of a dot, the match
or mismatch value is printed.
Next, additional values are calculated by adding the value of
the current field (match or mismatch value) to the
value of an alignment path seen before arriving at this step.
Figure F shows the alignment path matrix in
an intermediate stage. Each new value is printed as the maximum
of several possibilities. To get to the value (X)
in the Figure, one could use one of several possible
pathways:
Figure E: Alignment path matrix, step 1. Scores
are computed as outlined in the text.
Figure G: Alignment path matrix in a completed
stage. Scores are calculated as described in the text using values
listed in Figure E. Note the positive numbers of the two most
promising alignments.
Programs may distinguish between
the
Best suited for
comparing homologous sequences from different species, or similar
sequences with approximately the same length:
$ gap
Best suited
for comparing sequences discovered in searches, or sequences
with site homology rather than integral similarity.
$ bestfit
Best suited
for comparing sequences discovered in searches, or sequences
with site homology and a suspicious reading frame shift.
$ framealign
Text
publish can display alignments (DNA or protein)
in formatted fashion. The
EGCG version epublish can display the
translation of the second sequence in 1 or 3 letter code.
Graphic
Best suited for visualising
overlaps or regions of homology. Needs Graphics - remember to
have set the graphics environment with
setplot correctly
if you work with GCG locally.
X-Windows setups have to set the
DISPLAY environment correctly.
$ gap/out
or
$ bestfit/out
Next, display the graphics
using the ".out" files generated by the commands
given above.
$ gapshow
Preparation of Data
Randomisation during Alignment
Best suited for estimating whether
the alignment produced is
(statistically) significant. Should be used with significantly
more than the default 10 randomisations (try at least 50).
$ gap/ran=50
or
$ bestfit/ran=50
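The idea behind randomisation can be illustrated as follows: shuffle one sequence (keeping its length and composition), score it against the other, repeat, and compare the real score with the mean and standard deviation of the shuffled scores. The sketch is an assumption about the general principle only; it does not reproduce what 'gap' or 'bestfit' do internally, and it uses a simple best ungapped shift score in place of a real alignment. All function names are hypothetical.

```python
import random
import statistics


def best_ungapped_score(a, b, match=1.0, mismatch=-0.5):
    """Highest score over all ungapped shifts of a against b."""
    best = float("-inf")
    for shift in range(-len(b) + 1, len(a)):
        score = 0.0
        for i, x in enumerate(a):
            j = i - shift
            if 0 <= j < len(b):
                score += match if x == b[j] else mismatch
        best = max(best, score)
    return best


def randomisation_test(a, b, score=best_ungapped_score, n=50, seed=0):
    """Return (observed score, mean and standard deviation of the scores
    obtained with n shuffled copies of b)."""
    rng = random.Random(seed)       # fixed seed for reproducible runs
    observed = score(a, b)
    symbols = list(b)
    shuffled_scores = []
    for _ in range(n):
        rng.shuffle(symbols)        # same composition, random order
        shuffled_scores.append(score(a, "".join(symbols)))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    return observed, mean, sd
```

An observed score that lies several standard deviations above the shuffled mean suggests that the similarity is not just a consequence of base composition.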
If you happen to have access to W. Pearson's sequence analysis
software you could try the rdf (or
rdf2) program. This requires that you first
convert the sequences of interest to STADEN format with
the program tostaden (or the program
readseq).
================================= Begin Exercise 10
Pairwise sequence analysis: Understand the use of comparison
matrices in the alignment procedure of protein sequences. Apply
different algorithms to the sequences obtained from DNA after
translation, and evaluate significance of the result on both
DNA and Protein level.
Using previous exercise results, you should have two DNA sequences
by now: my1.seq as the typed-in sequence, and
my2.seq as the reading-frame extracted DNA sequence
from the seqed exercise. You shall compare these
two sequences now on DNA and protein level. If you haven't translated
the sequences to protein level already, you should do this now.
To solve this problem, follow this schedule:
================================= End Exercise 10
Schematic Comparison
Principle of Sequence Alignment
My1.seq tgatggtcaagtaaactatgaagagttt
unknown seq atggtaatggcacaattgactttcctgaatttctga
If we want to align those, we will try
to write the two sequences in a way which allows a pairwise comparison
of each sequence symbol. As you might guess, there are lots of
possible options to do so, and the longer the sequences are the
more options to align two sequences will exist. In order to find
the best alignment, we need to judge the quality
of the alignment. To allow computations and comparisons, this
judgement shall result in a numerical value, which is called
a score . The determination of this score relies
on a symbol comparison table, where each symbol
pairing gets a value assigned, in order to determine the overall
score by adding up the comparison value of each observed pair
in our alignment. These tables are very important in the protein
field, but also used in DNA comparison. A typical, simple scoring
table for nucleotides will give a value of 1 to
a match (treating U and T as "match"), and assign a value of
0 to each mismatch:
Match value: 1
Mismatch value: 0
+-----+-----+-----+-----+-----+-----+
| | A | G | C | T | U |
+-----+-----+-----+-----+-----+-----+
| A | 1 | 0 | 0 | 0 | 0 |
+-----+-----+-----+-----+-----+-----+
| G | 0 | 1 | 0 | 0 | 0 |
+-----+-----+-----+-----+-----+-----+
| C | 0 | 0 | 1 | 0 | 0 |
+-----+-----+-----+-----+-----+-----+
| T | 0 | 0 | 0 | 1 | 1 |
+-----+-----+-----+-----+-----+-----+
| U | 0 | 0 | 0 | 1 | 1 |
+-----+-----+-----+-----+-----+-----+
This matrix is perfectly symmetric and would be sufficient
if printed as half-populated table.
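The table above can also be generated rather than typed in. The following sketch builds it as a dictionary; the names `nucleotide_score` and `build_table` are made up for this illustration, and the only rule encoded is the one stated in the text (match 1, mismatch 0, T and U treated as equal).

```python
def nucleotide_score(x, y, match=1, mismatch=0):
    """Score one symbol pair; U is treated as T before comparing."""
    canonical = {"u": "t"}
    x, y = x.lower(), y.lower()
    same = canonical.get(x, x) == canonical.get(y, y)
    return match if same else mismatch


def build_table(alphabet="agctu"):
    """Return the full symbol comparison table as {(x, y): score}."""
    return {(x, y): nucleotide_score(x, y) for x in alphabet for y in alphabet}


table = build_table()
# The table is symmetric, so half of it would be sufficient.
assert all(table[(x, y)] == table[(y, x)] for (x, y) in table)
```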
tgatggtcaagtaaactatgaagagttt
| | | || shift 4: score 5 (-4.5)
atggtaatggcacaattgactttcctgaatttctga
tgatggtcaagtaaactatgaagagttt
| || || || | | shift 3: score 9 (+1.0)
atggtaatggcacaattgactttcctgaatttctga
tgatggtcaagtaaactatgaagagttt
||||| | | | || shift 2: score 10 (+2.0)
atggtaatggcacaattgactttcctgaatttctga
tgatggtcaagtaaactatgaagagttt
| | | | | | shift 1: score 6 (-5.5)
atggtaatggcacaattgactttcctgaatttctga
tgatggtcaagtaaactatgaagagttt
| | shift 0: score 2 (-11.0)
atggtaatggcacaattgactttcctgaatttctga
tgatggtcaagtaaactatgaagagttt
|| | || | shift -1: score 6 (-5.0)
atggtaatggcacaattgactttcctgaatttctga
tgatggtcaagtaaactatgaagagttt
| | | | shift -2: score 4 (-8.0)
atggtaatggcacaattgactttcctgaatttctga
tgatggtcaagtaaactatgaagagttt
| || || shift -3: score 5 (-6.5)
atggtaatggcacaattgactttcctgaatttctga
tgatggtcaagtaaactatgaagagttt
| |||| | | ||| || ||| shift -4: score 15 (+8.5)
atggtaatggcacaattgactttcctgaatttctga
tgatggtcaagtaaactatgaagagttt
| ||| | | | | || shift -5: score 10 (+1.0)
atggtaatggcacaattgactttcctgaatttctga
Score   Shift   Length
------------------------
15      -4      28
10       2      26
10      -5      28
 9       3      25
This means that one alignment with shift -4 is
calculated to be "best" but the alignments with
shift 2, -5 and 3 are of a similar score.
Match value: +1.0
Mismatch value: -0.5
and recalculate scores.
Figure A shows the values in parenthesis. The scoring
table, if written as best-score listing of the top four alignments,
will now read as:
Score   Shift   Length
------------------------
+8.5    -4      28
+2.0     2      26
+1.0    -5      28
+1.0     3      25
The main benefit of this scoring scheme is a quality
discrimination: all alignments which have more than twice as many
mismatches as matches will score negatively. This
implies that we can now introduce a threshold and
define a "reasonable" alignment as one with a "positive score".
However, we have not tried all of the possible shifts, and it
is not easily feasible to compare several kb of sequence this
way. Therefore, we need an automated method which allows us to judge sequence
alignments by visual inspection.
Principle of Dotplots
Number of possible dots =
(probability of pair) * (length of sequence A) * (length of Sequence B)
Counting the x in Figure B gives 278
dots, which is fairly close to the expected value.
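The count and the expectation can be reproduced with a short sketch. The helper names `dot_matrix` and `expected_dots` are chosen here for illustration; the two sequences are the ones used throughout this chapter.

```python
SEQ_V = "tgatggtcaagtaaactatgaagagttt"          # vertical axis, 28 symbols
SEQ_H = "atggtaatggcacaattgactttcctgaatttctga"  # horizontal axis, 36 symbols


def dot_matrix(vertical, horizontal):
    """One row per vertical symbol; True wherever the two symbols are identical."""
    return [[v == h for h in horizontal] for v in vertical]


def expected_dots(len_a, len_b, alphabet_size=4):
    """(probability of pair) * (length of sequence A) * (length of sequence B)."""
    return len_a * len_b / alphabet_size


matrix = dot_matrix(SEQ_V, SEQ_H)
observed = sum(sum(row) for row in matrix)
expected = expected_dots(len(SEQ_V), len(SEQ_H))
print(observed, expected)  # 278 252.0

# Print the plot with the first vertical symbol at the bottom, as in Figure B.
for symbol, row in reversed(list(zip(SEQ_V, matrix))):
    print(symbol, " ".join("x" if hit else " " for hit in row))
print(" ", " ".join(SEQ_H))
```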
  t   x     x     x               x x       x x x     x       x x x   x
  t   x     x     x               x x       x x x     x       x x x   x
__t   x     x     x               x x       x x x     x       x x x   x
25g     x x         x x               x                 x               x
  a x         x x         x   x x       x                 x x             x
  g     x x         x x               x                 x               x
  a x         x x         x   x x       x                 x x             x
__a x         x x         x   x x       x                 x x             x
20g     x x         x x               x                 x               x
  t   x     x     x               x x       x x x     x       x x x   x
  a x         x x         x   x x       x                 x x             x
  t   x     x     x               x x       x x x     x       x x x   x
__c                     x   x             x       x x               x
15a x         x x         x   x x       x                 x x             x
  a x         x x         x   x x       x                 x x             x
  a x         x x         x   x x       x                 x x             x
  t   x     x     x               x x       x x x     x       x x x   x
__g     x x         x x               x                 x               x
10a x         x x         x   x x       x                 x x             x
  a x         x x         x   x x       x                 x x             x
  c                     x   x             x       x x               x
  t   x     x     x               x x       x x x     x       x x x   x
__g     x x         x x               x                 x               x
5 g     x x         x x               x                 x               x
  t   x     x     x               x x       x x x     x       x x x   x
  a x         x x         x   x x       x                 x x             x
  g     x x         x x               x                 x               x
  t   x     x     x               x x       x x x     x       x x x   x
    a t g g t a a t g g c a c a a t t g a c t t t c c t g a a t t t c t g a
            |5        |10       |15       |20       |25       |30       |35
Looking at Figure B, we can draw several
conclusions:
tgatggtcaagtaaactatgaagagttt
| |||| | | ||| || ||| shift -4: score 15 (+8.5)
atggtaatggcacaattgactttcctgaatttctga
Dotplot Principle - Improved
Number of possible dots =
(probability of word) * (length of sequence A) * (length of Sequence B)
The result of applying this to a di-nucleotide
match (word size 2) is shown in Figure C:
In total, 65 dots (painted as o) are
found, compared with (1008 x 0.0625) = 63 expected. In the
figure, dots (.) have been added as a guide in order to
show the position of the two best
hits obtained in the alignments displayed in
Figure A. The reason for their weak appearance is the low
similarity of the two sequences.
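The word method can be sketched in the same style as the single-symbol plot. The function names below are hypothetical, and the sketch reports the 0-based start position of each matching word; a program may anchor the dot elsewhere in the word and may treat the sequence ends differently, so totals need not agree exactly with a hand-made figure.

```python
def word_dots(vertical, horizontal, word=2):
    """Return all (i, j) where the words of length `word` starting at
    vertical[i] and horizontal[j] are identical."""
    return [
        (i, j)
        for i in range(len(vertical) - word + 1)
        for j in range(len(horizontal) - word + 1)
        if vertical[i:i + word] == horizontal[j:j + word]
    ]


def random_word_probability(word, alphabet_size=4):
    """Chance that two random words of the given length are identical."""
    return (1 / alphabet_size) ** word


for size in (1, 2, 6):
    print(size, random_word_probability(size))
```

The loop prints the probabilities quoted in the text: 0.25 for single symbols, 0.0625 for di-nucleotides, and about 0.00024 (roughly 0.025%) for a word size of 6.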
t o o o o o
t o o o o o
__t o
25g
a o o
g
a o o
__a o o o
20g o o o o
t o o o
a o o o o
t .o o o
__c o .o
15a o . o . o
a o . o . o
a . .
t o . .
__g . .
10a o. . o o
a . o. o
c . . o o
t .o .
__g .o .o
5 g .o .o o o
t .o .o o o
a . o o
g o o o o o
t
a t g g t a a t g g c a c a a t t g a c t t t c c t g a a t t t c t g a
|5 |10 |15 |20 |25 |30 |35
Dotplot Principle - Improved Again
t
t
__t
25g O O
a O O
g O
a
__a
20g
t
a O
t O
__c O O
15a O
a O
a O O
t O
__g O
10a O O
a O O
c O O O O
t O O
__g O
5 g O O O
t
a
g
t
a t g g t a a t g g c a c a a t t g a c t t t c c t g a a t t t c t g a
|5 |10 |15 |20 |25 |30 |35
Three conclusions can be drawn from Figure D:
Interpretation of Dotplots
vertical horizontal
sequence sequence
short diagonal 5-10 4-9
long diagonal 5-15 9-21
This is an important conclusion, as the vertical sequence
obviously has a region near its beginning which is similar to the
horizontal sequence in two different areas (4-9, and 9-15, respectively).
However, be careful if you use window/stringency, as
the coordinates plotted as diagonals will be affected by the
size of the window. The actual region of similarity,
therefore, will need to be expanded by (window size/2). If we
schematically write the sequences in a letter-by-letter format,
however, it will become immediately obvious that the window/stringency
algorithm averages tremendously:
tgatggtcaagtaaactatgaagagttt vertical sequence
||||| | | | || (short diagonal)
atggtaatggcacaattgactttcctgaatttctga horizontal sequence
| |||| | | ||| || ||| (long diagonal)
tgatggtcaagtaaactatgaagagttt vertical sequence
In this case, experimental evidence will
be required to establish whether
the "short" or the "long" diagonal is of biological relevance.
A protein comparison will be valuable if available
or possible.
Tip for the Interpretation of Dotplots:
Always try to write down diagonals of interest in the
way depicted above. If you need computerised assistance, use
the 'gap' program of the
GCG package with very high
gap penalty values (e.g., 50). Explanations on 'gap' can be found
in a later section of this chapter.
GCG's Implementation of Schematic Comparison
Comparison Calculation
Display Program
Detection of Internal Repeats
Principle of the Analytical Comparison of Two Sequences
Motivation
Match value: 2
Mismatch value: -1
Gap penalty: -5
Gap length penalty: -1
We can improve some of our alignments. For technical reasons
of printability, the two sequences are shortened to tgatggtcaagtaaactatgaag
and atggtaatggcacaattgacttt (which
does not affect the principle). Examples of alignments could
be
tgatggtcaa.gtaaactat.gaag score: 14*2+5*(-1)+4*(-5)+4*(-1) = -1
||||| || | || || || 14 match 5 mismatch
atggt.aatg.gcacaattgacttt 4 gaps 4 gap length
tgatggtcaagtaaactatgaag score:9*2+11*(-1)+1*(-5)+4*(-1) = -2
|| | | | | ||| 9 match 11 mismatch
atggtaat...ggcacaattgacttt 1 gap 3 gap length
The assignment of values is fairly arbitrary and allows
for significant changes in the result. If we take the standard
values of the gap program (as detailed below)
we end up with the following calculation:
Match value: 1
Mismatch value: 0
Gap penalty: -5
Gap length penalty: -0.3
tgatggtcaa.gtaaactat.gaag score: 14*1+5*(0)+4*(-5)+4*(-0.3) = -7.2
||||| || | || || || 14 match 5 mismatch
atggt.aatg.gcacaattgacttt 4 gaps 4 gap length
tgatggtcaagtaaactatgaag score:9*1+11*(0)+1*(-5)+4*(-0.3) = 2.8
|| | | | | ||| 9 match 11 mismatch
atggtaat...ggcacaattgacttt 1 gap 3 gap length
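The arithmetic of the worked examples can be checked with a small helper. The function name `alignment_score` is an assumption made for this sketch; the penalties are passed as negative numbers and added, exactly as in the score lines above, and the counts are taken from those lines as printed.

```python
def alignment_score(matches, mismatches, gaps, gap_length,
                    match=2, mismatch=-1, gap_penalty=-5, gap_length_penalty=-1):
    """Total score from the four counts and the four scoring values."""
    return (matches * match
            + mismatches * mismatch
            + gaps * gap_penalty
            + gap_length * gap_length_penalty)


# First parameter set (match 2, mismatch -1, gap -5, gap length -1).
print(alignment_score(14, 5, 4, 4))    # -1
print(alignment_score(9, 11, 1, 4))    # -2

# Second parameter set (match 1, mismatch 0, gap -5, gap length -0.3).
second = dict(match=1, mismatch=0, gap_penalty=-5, gap_length_penalty=-0.3)
print(round(alignment_score(14, 5, 4, 4, **second), 1))   # -7.2
print(round(alignment_score(9, 11, 1, 4, **second), 1))   # 2.8
```

The ranking of the two alignments flips between the two parameter sets, which is the point made in the text: the choice of values is fairly arbitrary and changes the result.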
As these two examples show, the insertion and modification
of gaps allows for a very broad variation of possibilities. Additionally,
the judgement of the two possible alignments is not necessarily
satisfactory, as we have not evaluated other possible alignments
of the two sequences. In particular, the optimisation
of the score at a given set of parameters, i.e. the
calculation of the best alignment, would require trying out all
possible combinations, which is a fairly time-consuming task.
An automatic procedure, therefore, is required
to evaluate the possible solutions extensively. The algorithm
for this purpose was first described by Smith and Waterman
and is usually implemented with a dynamic programming
approach.
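To give an impression of what a dynamic programming approach looks like, here is a textbook-style sketch of the Smith-Waterman recurrence for the best local alignment score. It is a simplification and not the GCG implementation: it charges one fixed cost per gap symbol instead of separate gap and gap length penalties, it returns only the score and no alignment, and the default values are assumptions taken from the example table above.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-5):
    """Best local alignment score of a and b (linear gap cost).

    Only the previous and the current row of the path matrix are kept,
    which is enough when just the score is wanted.
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if x == y else mismatch)
            up = previous[j] + gap          # gap in sequence b
            left = current[j - 1] + gap     # gap in sequence a
            cell = max(0, diagonal, up, left)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best
```

Each cell is computed once from three neighbours, so the work grows with the product of the two sequence lengths instead of with the number of possible alignments.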
Letter-by-Letter Alignment Prerequisites
Symbol Comparison Tables
DNA sequence
determines |
V
protein sequence
determines |
V
secondary structure
determines |
V
tertiary structure (protein)
determines |
V
protein function
Two approaches have been used successfully in standard
implementations of biocomputing software:
Alignment Path Matrices
one gap: 1 (previous) -5 (gap) -1 (mismatch)
/ = -5
/ direct: -3 (previous) -1 (mismatch) = -4
+------+------+------+// one gap: 0 (previous) -5 (gap) -1 (mismatch)
c | -1 | -2 | 0 /// one longer gap: = -6
+------+------______////----+ 6 (previous) -5 (gap) -1 (gap length)
t | -1 | 1 /| -3 /// -1 | -1 (mismatch) = -1
+------+------+-----||------+
g | -1 | -2 | 0/ | 8 |
+------+------+-----/+------+ max of (-5, -4, -6, -1) = -1
g | -1 | -2 | 6/ | -1 | value printed, therefore: -1
+------+------+------+------+
...| | | |
a t g g
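The choice made for the field (X) can be retraced in code. The four candidate values are the ones worked out in the figure; the largest of them is the one that is printed. The labels and the function name `best_predecessor` are made up for this illustration.

```python
MISMATCH, GAP, GAP_LENGTH = -1, -5, -1


def best_predecessor(candidates):
    """Return (label, score) of the highest-scoring way to reach a field."""
    label = max(candidates, key=candidates.get)
    return label, candidates[label]


# Ways of arriving at (X); the current field itself is a mismatch.
candidates = {
    "one gap (previous 1)": 1 + GAP + MISMATCH,                        # -5
    "direct (previous -3)": -3 + MISMATCH,                             # -4
    "one gap (previous 0)": 0 + GAP + MISMATCH,                        # -6
    "one longer gap (previous 6)": 6 + GAP + GAP_LENGTH + MISMATCH,    # -1
}

print(best_predecessor(candidates))  # ('one longer gap (previous 6)', -1)
```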
To compute all beneficial fields quickly, some implementations
do not compute "chanceless" pathways and reduce, therefore, memory
and time requirements. This, typically, applies to the edges
of sequence alignment path matrices (in our example, the upper
left and lower right corner). The GCG program implementation
claims a "value not guaranteed to be optimal" but usually the
edges do not yield a much better value.
g -1
a 2
a 2
g -1
t -1
a 2
t -1
c -1
a 2
a 2
a 2
t -1
g -1
a 2
a 2
c -1
t -1
g -1
g -1
t -1
a 2
g -1
t -1 2 -1 -1 2 -1 -1 2 -1 -1 -1 -1 -1 -1 -1 2 2 -1 -1 -1 2 2 2
a t g g t a a t g g c a c a a t t g a c t t t
Figure F: Alignment path matrix in an
advanced stage. Scores are calculated as described in the text
using the following values: match score = 2, mismatch = -1,
gap length = -1. A direct step is scored as score + previous;
a gap step is scored as score + previous -5. Note
the (X) which is explained in detail in the text.
g -1 1 3
a 2 1 -3
a 2 -2 -3
g -1 -2 6
t -1 4 -3
a 2 -2 0
t -1 1 0
c -1 1 0
a 2 1 0
a 2 1 -3
a 2 -2 0
t -1 1 0
g -1 1 3
a 2 1 -3
a 2 -2 -3
c -1 -2 0 (X)
t -1 1 -3 -1
g -1 -2 0 8
g -1 -2 6 -1
t -1 4 -3 -4
a 2 -2 -3 3 0 0 3 -3 -3 3 0 0 -3 0 0 -3 -3 0 6 -3 -3 -3 0
g -1 -2 4 1 -2 1 -2 -2 4 1 -2 -2 -2 -2 -2 -2 1 4 -2 -2 -2 1 1
t -1 2 -1 -1 2 -1 -1 2 -1 -1 -1 -1 -1 -1 -1 2 2 -1 -1 -1 2 2 2
a t g g t a a t g g c a c a a t t g a c t t t
The final result of the alignments is shown in Figure
G. The "best" alignment is found by seeking the highest value
in the edges and following the path along decreasing scores, starting
(in our example) at the upper right and moving downwards. Figure
F shows this by way of example for the two best diagonals
found.
g -1 1 3 -1 -1 3 1 3 3 -2 0 -1 -1 2 0 4 5 7 4 11 12 6 3
a 2 1 -3 0 4 2 4 1 -4 -2 0 0 3 1 5 6 5 5 12 13 7 4 3
a 2 -2 -3 5 0 2 2 -4 -1 1 -2 4 -1 3 7 3 6 10 14 6 5 4 4
g -1 -2 6 -1 -2 0 -3 1 2 -1 2 0 1 5 2 5 11 12 7 6 5 5 6
t -1 4 -3 -1 1 -3 2 0 -3 0 1 0 6 3 6 12 10 8 7 6 5 7 13
a 2 -2 0 -1 -2 3 -2 -2 1 2 1 7 4 7 10 8 9 8 7 3 5 11 10
t -1 1 0 -1 1 -4 -4 2 3 2 5 5 5 8 9 10 9 5 1 4 12 11 6
c -1 1 0 -1 -5 -3 -1 4 3 6 6 3 9 10 8 7 6 2 2 11 9 4 3
a 2 1 0 -4 -2 0 5 4 4 1 4 7 11 9 8 7 3 3 9 10 5 4 3
a 2 1 -3 -1 -2 3 5 5 2 5 5 12 7 6 8 4 3 7 11 6 2 1 4
a 2 -2 0 -1 1 3 6 3 6 6 10 8 4 6 5 4 8 9 7 3 -2 5 4
t -1 1 0 2 1 4 4 7 7 11 6 5 4 3 5 9 10 2 4 -1 3 2 -1
g -1 1 3 -1 -1 5 5 8 12 7 3 2 4 6 7 8 3 5 -2 1 0 -3 1
a 2 1 -3 -4 0 6 8 10 5 4 2 5 7 8 9 2 0 -1 2 1 -2 2 -3
a 2 -2 -3 -1 1 7 11 3 2 3 3 8 6 7 3 -1 -2 0 2 -4 3 -2 -2
c -1 -2 0 -1 2 9 4 3 4 -1 6 7 5 1 0 -1 1 0 -3 4 -2 -1 -1
t -1 1 -3 -1 10 2 1 5 -2 1 8 3 2 1 0 2 1 -3 2 -1 1 1 -2
g -1 -2 0 8 0 -1 3 -2 2 9 1 0 0 -2 -3 -4 -3 3 0 -1 -1 -3 1
g -1 -2 6 -1 -3 4 -1 -2 7 2 -2 1 -2 -2 -5 -2 1 1 -5 0 4 2 1
t -1 4 -3 -4 5 -1 -1 5 -3 -2 2 -1 -1 -4 -1 2 -1 -4 -1 5 3 2 -4
a 2 -2 -3 3 0 0 3 -3 -3 3 0 0 -3 0 0 -3 -3 0 6 -2 -3 -3 0
g -1 -2 4 1 -2 1 -2 -2 4 1 -2 -2 -2 -2 -2 -2 1 4 -2 -2 -2 1 1
t -1 2 -1 -1 2 -1 -1 2 -1 -1 -1 -1 -1 -1 -1 2 2 -1 -1 -1 2 2 2
a t g g t a a t g g c a c a a t t g a c t t t
Figure G: Alignment path matrix in a
completed stage. Scores are calculated as described in the text
using values listed in Figure E. Note the positive numbers of
the two most promising alignments.
g 12
a 13/
a 14/
g ___12/
t 12/ 13
a 10/ 11/
t | 12/
c 10/ 11/
a 11/ |
a 12/ 11/
a 10/ 9/
t 11/ 10/
g 12/ 8/
a 10/ 9/
a 11/ ___7/
c 9/ 7/
t 10/ 8/
g 8/ 9/
g 6/ 7/
t 4/ 5/
a 2/ 3/
g 1/
t 2/
a t g g t a a t g g c a c a a t t g a c t t t
tgatggtcaagtaaac.tatgaag tgatgg.tcaagtaaactatgaag
||||| | | | | || | |||| ||| | ||| |
atggtaatggcacaatt.gacttt atggtaatggcacaatt.gacttt
Comparison Programs
Two Sequences of Similar Length
Two Sequences of Different Length
DNA and Protein Sequences
Programs to Display Two Aligned Sequences
Significance Evaluation