[ Previous chapter ][
This chapter ][ Next chapter ]
Let us assume the following two sequences:
One additional value
is missing:
If the two sequences have different size, some symbols of
one sequence
will never have a counterpart. The score of
any symbol to "nothing"
is therefore assumed to be
0.
To get started, we will write the two sequences amongst each other like in
the painting above.
However, we can either align the beginning, the end, or
arrange the sequences arbitrarily.
Figure A
below shows our two sequences
shifted
by various positions. The
score
is
determined according to the
table above.
Figure A:
Sequence alignments
produced by shifting.
Scores are calculated using a Match value of 1 and a mismatch
value of
0. Numbers in parenthesis refer to the calculation
with mismatch values of -0.5.
Figure A
allows to conclude elementary findings:
The scoring table, if written as best-score listing of the
top four alignments, will read
as:
This type of scoring will favour
long alignments
and will produce higher
scores the longer the alignments
are. However, the mismatches are not penalised, which implies
that
long stretches of different sequences might be in the alignment.
The result will be that
the score gets better if the alignment gets longer,
regardless the amount of mismatches encountered.
In order to discriminate better between similar sequences and
those which have
accidental
similarity on a long range of symbols, such as expected in G/C rich
sequences, we need
to change scoring to
penalise
mismatches. As an example, we use the scoring
The
dotplot
method allows visual inspection
of all possible alignments
in schematic fashion, and is shown in
Figures
B and C .
Figure B displays the very basic
dotplot:
Two sequences are
plotted as a matrix, and identical symbols
get an
x
(for technical reasons
only, in graphic output this is a "dot").
As the two sequences are 28 and 36 base pairs in
length, we will
have 28 x 36 = 1008 positions to calculate. Our DNA alphabet is
a 4-letter
alphabet (as we treat T and U as equal), which means that
the
random chance
of
an identical symbol at any given position is 1/4 = 0.25. Therefore,
if the two sequences are
totally unrelated, we expect 0.25 x 1008 = 252
dots, or, as formula,
Figure B:
Dot-plot created by painting a dot (x) at
each match.
Obviously, we still need to improve the so-called
signal - to - noise
ratio.
The
signal
is a line or otherwise visible pattern which we could use in the
visual inspection. The
noise
is what we can expect from statistics: If
we use a mathematical
approximation to count the probability in a four-letter alphabet,
we
expect 25% of a random hit probability, which is too high if
we ask for weak similarities -
remember that the best score we had
was
We need an
improvement for the
dotplots
with the
word method:
In our example,
on a length of 28 base pairs, we have
only 15 matches which is
approximately every second nucleotide. This is a relatively
weak
signal. Therefore, we use a first approximation: We do
no longer point a dot in if each nucleotide
matches, but we use
oligomers
(called words) and paint a dot if these
words
match. This reduces the
chance of a random match. If we use di-nucleotides, accidental
matches will be (1/4)*(1/4) = 1/16 = 6.25% which is already much
lower than the 25% obtained
earlier. The GCG program suite uses the default
word size
of
6
- this is (1/4) to the power of 6, which results in a random
choice probability of
0.025%. There is, however, an undesired side-effect:
the larger the word size, the lower is
the probability
that a given word matches in between two different sequences.
Expressed as
formula, this will read
The problem
with the sensitivity (too few subsequent identities
in the case of low similarity) can be overcome
with the permission
of
mismatches
in a word. This is the already known
windows technique :
We select a
window
and request that a minimal number of matches within this
window is
obtained. The GCG programs call this a
stringency.
Therefore,
using a
window/stringency
algorithm,
we will be able to paint dots at the middle
of a
window
rather than a
word, which means that, given the values
9/5
for
window/stringency,
we obtain the plot as shown in
Figure D.
Figure
D:
Dot-plot with
a dot (0) at each matching window of 9 -
with a minimum of 5 matches (stringency) per window.
Looking at
Figure D,
two main diagonals
can be identified:
Section 9.1: Schematic Comparison
Subsection 9.1.1 Principle of Sequence Alignment
My1.seq tgatggtcaagtaaactatgaagagttt
unknown seq atggtaatggcacaattgactttcctgaatttctga
If we want to
align
those, we will try to write the two sequences
in a
way which allows a pairwise comparison of each sequence
symbol. As you might guess,
there are lots of possible options
to do so, and the longer the sequences are the more options
to align two sequences will exist. In order to find the
best
alignment,
we need to judge the quality of the alignment.
To allow computations and comparisons, this
judgement shall result
in a numerical value, which is called a
score
.
The determination of this score relies on a
symbol comparison table,
where
each symbol pairing gets a value assigned, in order to
determine the overall score by adding
up the comparison value
of each observed pair in our alignment. These tables are
very important in the protein field, but also used in DNA
comparison. A typical, simple scoring table for
nucleotides
will give a value of
1
to a match (treating U and T as "match"),
and assign a value of
0
to each mismatch:
Match value: 1
Mismatch value: 0
+-----+-----+-----+-----+-----+-----+
| | A | G | C | T | U |
+-----+-----+-----+-----+-----+-----+
| A | 1 | 0 | 0 | 0 | 0 |
+-----+-----+-----+-----+-----+-----+
| G | 0 | 1 | 0 | 0 | 0 |
+-----+-----+-----+-----+-----+-----+
| C | 0 | 0 | 1 | 0 | 0 |
+-----+-----+-----+-----+-----+-----+
| T | 0 | 0 | 0 | 1 | 1 |
+-----+-----+-----+-----+-----+-----+
| U | 0 | 0 | 0 | 1 | 1 |
+-----+-----+-----+-----+-----+-----+
This matrix is perfectly symmetric and would
be sufficient
if printed as half-populated table.
tgatggtcaagtaaactatgaagagttt
| | | || shift 4: score 5 (-4.5)
atggtaatggcacaattgactttcctgaatttctga
tgatggtcaagtaaactatgaagagttt
| || || || | | shift 3: score 9 (+1.0)
atggtaatggcacaattgactttcctgaatttctga
tgatggtcaagtaaactatgaagagttt
||||| | | | || shift 2: score 10 (+2.0)
atggtaatggcacaattgactttcctgaatttctga
tgatggtcaagtaaactatgaagagttt
| | | | | | shift 1: score 6 (-5.5)
atggtaatggcacaattgactttcctgaatttctga
tgatggtcaagtaaactatgaagagttt
| | shift 0: score 2 (-11.0)
atggtaatggcacaattgactttcctgaatttctga
tgatggtcaagtaaactatgaagagttt
|| | || | shift -1: score 6 (-5.0)
atggtaatggcacaattgactttcctgaatttctga
tgatggtcaagtaaactatgaagagttt
| | | | shift -2: score 4 (-8.0)
atggtaatggcacaattgactttcctgaatttctga
tgatggtcaagtaaactatgaagagttt
| || || shift -3: score 5 (-6.5)
atggtaatggcacaattgactttcctgaatttctga
tgatggtcaagtaaactatgaagagttt
| |||| | | ||| || ||| shift -4: score 15 (+8.5)
atggtaatggcacaattgactttcctgaatttctga
tgatggtcaagtaaactatgaagagttt
| ||| | | | | || shift -5: score 10 (+1.0)
atggtaatggcacaattgactttcctgaatttctga
Score Shift Length
------------------------
15 -4 29
10 2 27
-5 29
9 3 26
This means that one alignment with
shift -4
is calculated to be
"best"
but the alignments with
shift 2, -5 and 3
are
of a similar score.
Match value: +1.0
Mismatch value: -0.5
and recalculate scores.
Figure A
shows the values in parenthesis.
The scoring table, if written as best-score listing
of the
top four alignments, will now read as:
Score Shift Length
------------------------
+8.5 -4 28
+2.0 2 26
+1.0 -5 28
3 25
The main benefit of this scoring schema is a
quality
discrimination:
All alignments which have twice as much mismatches than matches
will
score
negatively.
This implies that we can now introduce a
threshold
and indicate a "reasonable" alignment to be of a
"positive score".
However, we have not tried all of the possible shifts, and it is
not easily
feasible to compare several kb of sequences this way. Therefore,
we need an automatism which
allows to judge sequence alignments
after visual inspection.
Subsection 9.1.2 Principle of Dotplots
Number of possible dots =
(probability of pair) * (length of sequence A) * (length of Sequence B)
Counting the
x
in Figure B gives 278 dots, which is fairly close to
the expected value.
t x x x x x x x x x x x x x
t x x x x x x x x x x x x x
__t x x x x x x x x x x x x x
25g x x x x x x x
a x x x x x x x x x x
g x x x x x x x
a x x x x x x x x x x
__a x x x x x x x x x x
20g x x x x x x x
t x x x x x x x x x x x x x
a x x x x x x x x x x
t x x x x x x x x x x x x x
__c x x x x x x
15a x x x x x x x x x x
a x x x x x x x x x x
a x x x x x x x x x x
t x x x x x x x x x x x x x
__g x x x x x x x
10a x x x x x x x x x x
a x x x x x x x x x x
c x x x x x x
t x x x x x x x x x x x x x
__g x x x x x x x
5 g x x x x x x x
t x x x x x x x x x x x x x
a x x x x x x x x x x
g x x x x x x x
t x x x x x x x x x x x x x
a t g g t a a t g g c a c a a t t g a c t t t c c t g a a t t t c t g a
|5 |10 |15 |20 |25 |30 |35
Looking at
Figure B,
we can draw several conclusions:
tgatggtcaagtaaactatgaagagttt
| |||| | | ||| || ||| shift -4: score 15 (+8.5)
atggtaatggcacaattgactttcctgaatttctga
Subsection 9.1.3 Dotplot Principle - Improved
Number of possible dots =
(probability of word) * (length of sequence A) * (length of Sequence B)
Figure C:
Dot-plot painting a dot (o) at each matching di-nucleotide
- more
details are described in the text.
t o o o o o
t o o o o o
__t o
25g
a o o
g
a o o
__a o o o
20g o o o o
t o o o
a o o o o
t .o o o
__c o .o
15a o . o . o
a o . o . o
a . .
t o . .
__g . .
10a o. . o o
a . o. o
c . . o o
t .o .
__g .o .o
5 g .o .o o o
t .o .o o o
a . o o
g o o o o o
t
a t g g t a a t g g c a c a a t t g a c t t t c c t g a a t t t c t g a
|5 |10 |15 |20 |25 |30 |35
The result of the application in case of a di-nucleotide
match (word size 2) is
shown in
Figure C:
In total, 65 dots (painted as
o)
are painted in the view of (1008 x 0.0625) = 63
expected. In the figure,
dots (.) have been painted in
suggestively in order to show the position of the two best
hits obtained in the alignments displayed in
Figure A . The reason for the weak appearance
is the low similarity of the two sequences.
Subsection 9.1.4 Dotplot Principle - Improved Again
t
t
__t
25g O O
a O O
g O
a
__a
20g
t
a O
t O
__c O O
15a O
a O
a O O
t O
__g O
10a O O
a O O
c O O O O
t O O
__g O
5 g O O O
t
a
g
t
a t g g t a a t g g c a c a a t t g a c t t t c c t g a a t t t c t g a
|5 |10 |15 |20 |25 |30 |35
Three conclusions can be drawn from
Figure D:
Subsection 9.1.5 Interpretation of Dotplots
vertical horizontal
sequence sequence
short diagonal 5-10 4-9
long diagonal 5-15 9-21
This is an important conclusion, as the vertical sequence
has obviously a region in
its beginning which is similar to the
horizontal sequence in two different areas (4-9, and
9-15, respectively).
However,
be careful if you use window/stringency
as
the coordinates plotted as diagonals will be affected by the
size of the
window.
The actual region of similarity, therefore, will need to be expanded
by (window size/2).
If we schematically write the sequences
in a letter-by-letter format, however, it will become
immediately
obvious that the window/stringency algorithm
averages
tremendously:
tgatggtcaagtaaactatgaagagttt vertical sequence
||||| | | | || (short diagonal)
atggtaatggcacaattgactttcctgaatttctga horizontal sequence
| |||| | | ||| || ||| (long diagonal)
tgatggtcaagtaaactatgaagagttt vertical sequence
In this case,
experimental evidence
will be required to consolidate
the computer prediction
of whether either the "short" or "long" diagonal are of biological
relevance. The
protein
comparison will be valuable if available or possible.
Tip for the Interpretation of Dotplots:
Always try to write down diagonals of interest in the way as depicted
above. If you
need computerised assistance, use the 'gap' program
of the
GCG package
with very high gap penalty values (e.g.,
50). Explanations on 'gap'
can be found in a later section of this chapter.
[ previous chapter ],[
this chapter ][ next chapter ]
, [next page/section] , or [overview] , or [table of contents]