JAMF Archive

BioCompanion as published in 1995
THIS IS THE REFERENCE CODE AS PUBLISHED.
		Doelz, R.   
		Optimal production of biological documentation: the JAM format.
		Comput. Applic. Biosci. 11, 224-226 (1995).    
		
The version you are currently viewing is the one printed and distributed via the Internet from the server of BioComputing Basel. Version 3.1 of the BioCompanion was published with version 2 of the JAMF software. The server that was indicated in the documentation has ceased to exist.

Version 3.2 of the BioCompanion was not freely available to the public; it was shareware distributed with GCG's software release 9. To improve editing, JAMF was partially rewritten, and the proprietary version 3.x of JAMF was used from 1996 onwards. A current version of the BioCompanion is available from the publisher. It has changed significantly in both software and content.


Chapter 9: Comparison of Two Sequences


Schematic Comparison

Principle of Sequence Alignment

Let us assume the following two sequences:

 
  
My1.seq       tgatggtcaagtaaactatgaagagttt  
unknown seq   atggtaatggcacaattgactttcctgaatttctga  
  
If we want to align these, we try to write the two sequences in a way that allows a pairwise comparison of each sequence symbol. As you might guess, there are many ways to do so, and the longer the sequences are, the more possible alignments exist. In order to find the best alignment, we need to judge the quality of each alignment. To allow computations and comparisons, this judgement must result in a numerical value, which is called a score. The score is determined from a symbol comparison table, in which each symbol pairing is assigned a value; the overall score is the sum of the comparison values of all pairs observed in the alignment. These tables are very important in the protein field, but are also used in DNA comparison. A typical, simple scoring table for nucleotides gives a value of 1 to a match (treating U and T as a match) and a value of 0 to each mismatch:
 
  
            Match value: 1  
         Mismatch value: 0  
           
           
         +-----+-----+-----+-----+-----+-----+  
         |     |  A  |  G  |  C  |  T  |  U  |  
         +-----+-----+-----+-----+-----+-----+  
         |  A  |  1  |  0  |  0  |  0  |  0  |  
         +-----+-----+-----+-----+-----+-----+  
         |  G  |  0  |  1  |  0  |  0  |  0  |  
         +-----+-----+-----+-----+-----+-----+  
         |  C  |  0  |  0  |  1  |  0  |  0  |  
         +-----+-----+-----+-----+-----+-----+  
         |  T  |  0  |  0  |  0  |  1  |  1  |  
         +-----+-----+-----+-----+-----+-----+  
         |  U  |  0  |  0  |  0  |  1  |  1  |  
         +-----+-----+-----+-----+-----+-----+  
  
This matrix is perfectly symmetric, so printing half of it (one triangle including the diagonal) would be sufficient.

One additional value is missing: If the two sequences have different size, some symbols of one sequence will never have a counterpart. The score of any symbol to "nothing" is therefore assumed to be 0.
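The comparison table and the "nothing" rule can be written down as a small function. This is only a sketch: the names MATCH, MISMATCH, NOTHING and pair_score are illustrative and do not come from any GCG program.

```python
# Values from the simple nucleotide table: match 1, mismatch 0.
MATCH, MISMATCH = 1, 0
# Assumed score of a symbol paired with "nothing" (no counterpart).
NOTHING = 0

def pair_score(a, b):
    """Score one symbol pair; None stands for 'no counterpart'."""
    if a is None or b is None:
        return NOTHING
    # Treat U like T so that RNA and DNA symbols count as a match.
    a = a.upper().replace("U", "T")
    b = b.upper().replace("U", "T")
    return MATCH if a == b else MISMATCH

print(pair_score("t", "u"))   # T and U are treated as identical
print(pair_score("a", "g"))   # a mismatch
print(pair_score("c", None))  # symbol without counterpart
```

Because the table is symmetric, pair_score(a, b) and pair_score(b, a) always give the same value.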

To get started, we write the two sequences one below the other, as shown above. However, we can align the beginnings, align the ends, or arrange the sequences arbitrarily. Figure A below shows our two sequences shifted by various positions. The score is determined according to the table above.

Figure A: Sequence alignments produced by shifting. Scores are calculated using a match value of 1 and a mismatch value of 0. Numbers in parentheses refer to the calculation with a mismatch value of -0.5.

 
            
tgatggtcaagtaaactatgaagagttt  
         | |     |       ||              shift  4: score 5   (-4.5)  
    atggtaatggcacaattgactttcctgaatttctga  
  
tgatggtcaagtaaactatgaagagttt  
     |  ||    || ||  |   |               shift  3: score 9   (+1.0)  
   atggtaatggcacaattgactttcctgaatttctga  
  
tgatggtcaagtaaactatgaagagttt  
  ||||| | |  |    ||                     shift  2: score 10  (+2.0)  
  atggtaatggcacaattgactttcctgaatttctga  
  
tgatggtcaagtaaactatgaagagttt  
    |     | | | |         |              shift  1: score 6   (-4.5)  
 atggtaatggcacaattgactttcctgaatttctga  
  
tgatggtcaagtaaactatgaagagttt  
                |        |               shift  0: score 2   (-11.0)  
atggtaatggcacaattgactttcctgaatttctga  
  
 tgatggtcaagtaaactatgaagagttt  
 ||    |     ||   |                      shift -1: score 6   (-5.0)  
atggtaatggcacaattgactttcctgaatttctga  
  
  tgatggtcaagtaaactatgaagagttt  
           |  |           |  |           shift -2: score 4   (-8.0)  
atggtaatggcacaattgactttcctgaatttctga  
  
   tgatggtcaagtaaactatgaagagttt  
        | ||                 ||          shift -3: score 5   (-6.5)  
atggtaatggcacaattgactttcctgaatttctga  
  
    tgatggtcaagtaaactatgaagagttt  
    | ||||   | |  |||     || |||         shift -4: score 15  (+8.5)  
atggtaatggcacaattgactttcctgaatttctga  
  
     tgatggtcaagtaaactatgaagagttt  
         |  ||| | |  |      | ||         shift -5: score 10  (+1.0)   
atggtaatggcacaattgactttcctgaatttctga  
  

Figure A allows us to draw some elementary conclusions:

The scoring table, if written as best-score listing of the top four alignments, will read as:

 
  
       Score    Shift   Length   
       ------------------------  
         15      -4       28  
         10       2       26  
                 -5       28  
          9       3       25  
  
This means that one alignment with shift -4 is calculated to be "best" but the alignments with shift 2, -5 and 3 are of a similar score.
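The shifting procedure is easy to automate. The following is a minimal sketch, assuming the shift convention of Figure A (a positive shift moves the second sequence to the right, a negative shift moves the first one); the function name shift_scores is illustrative. Because the figure was counted by hand, a program may report slightly different match counts for some shifts.

```python
SEQ_A = "tgatggtcaagtaaactatgaagagttt"           # My1.seq
SEQ_B = "atggtaatggcacaattgactttcctgaatttctga"   # unknown seq

def shift_scores(seq_a, seq_b, shifts, match=1.0, mismatch=0.0):
    """Return (score, shift, overlap length) tuples, best score first."""
    results = []
    for shift in shifts:
        # Pair seq_a[i] with seq_b[i - shift] wherever both symbols exist.
        pairs = [(seq_a[i], seq_b[i - shift])
                 for i in range(len(seq_a))
                 if 0 <= i - shift < len(seq_b)]
        score = sum(match if x == y else mismatch for x, y in pairs)
        results.append((score, shift, len(pairs)))
    return sorted(results, reverse=True)

# Best-score listing of the top four alignments for shifts -5 .. 4.
print(" Score  Shift Length")
for score, shift, length in shift_scores(SEQ_A, SEQ_B, range(-5, 5))[:4]:
    print(f"{score:6.1f} {shift:6d} {length:6d}")
```

Symbols without a counterpart are simply left out of the sum, which is equivalent to scoring them as 0. Passing mismatch=-0.5 gives a penalised variant of the same listing.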

This type of scoring favours long alignments: the longer the alignment, the higher the score can become. Mismatches are not penalised, which implies that long stretches of differing sequence may be part of the alignment. As a result, the score improves as the alignment gets longer, regardless of the number of mismatches encountered. In order to discriminate better between truly similar sequences and those with accidental similarity over a long range of symbols, as expected in G/C-rich sequences, we need to change the scoring to penalise mismatches. As an example, we use the scoring

 
  
            Match value: +1.0  
         Mismatch value: -0.5  
  
and recalculate scores. Figure A shows the values in parenthesis. The scoring table, if written as best-score listing of the top four alignments, will now read as:
 
  
       Score    Shift   Length   
       ------------------------  
        +8.5     -4       28  
        +2.0      2       26  
        +1.0     -5       28  
                  3       25  
  
The main benefit of this scoring scheme is quality discrimination: all alignments with more than twice as many mismatches as matches will score negatively. This implies that we can now introduce a threshold and define a "reasonable" alignment as one with a positive score. However, we have not tried all possible shifts, and it is not feasible to compare several kb of sequence this way. Therefore, we need an automated procedure that lets us judge all possible alignments by visual inspection.
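The threshold idea can be checked with a few lines of arithmetic. This sketch assumes the values +1.0 and -0.5 from the text; the function names and the default threshold of 0 are illustrative.

```python
def ungapped_score(n_match, n_mismatch, match=1.0, mismatch=-0.5):
    """Score of an ungapped alignment from its match and mismatch counts."""
    return n_match * match + n_mismatch * mismatch

def is_reasonable(n_match, n_mismatch, threshold=0.0):
    """True if the alignment scores above the chosen threshold."""
    return ungapped_score(n_match, n_mismatch) > threshold

# Shift -4 as listed in Figure A: 15 matches over a length of 28.
print(ungapped_score(15, 28 - 15))
# Exactly twice as many mismatches as matches gives a score of zero.
print(ungapped_score(10, 20), is_reasonable(10, 20))
```

With these values the break-even point lies at two mismatches per match; a different mismatch penalty moves that point accordingly.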

Principle of Dotplots

The dotplot method allows visual inspection of all possible alignments in schematic fashion, and is shown in Figures B and C.

Figure B displays the very basic dotplot: Two sequences are plotted as a matrix, and identical symbols get an x (for technical reasons only, in graphic output this is a "dot"). As the two sequences are 28 and 36 base pairs in length, we will have 28 x 36 = 1008 positions to calculate. Our DNA alphabet is a 4-letter alphabet (as we treat T and U as equal), which means that the random chance of an identical symbol at any given position is 1/4 = 0.25. Therefore, if the two sequences are totally unrelated, we expect 0.25 x 1008 = 252 dots, or, as formula,

 
Expected number of dots =   
        (probability of pair) * (length of sequence A) * (length of sequence B)  

Counting the x in Figure B gives 278 dots, which is fairly close to the expected value.
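The basic dotplot and the dot count can be reproduced with a short sketch. The names dot_matrix and expected_dots are illustrative, and the pair probability of 0.25 is the four-letter assumption made in the text.

```python
SEQ_V = "tgatggtcaagtaaactatgaagagttt"           # vertical axis, 28 bases
SEQ_H = "atggtaatggcacaattgactttcctgaatttctga"   # horizontal axis, 36 bases

def dot_matrix(vertical, horizontal):
    """One row per vertical symbol; True wherever the two symbols match."""
    return [[v == h for h in horizontal] for v in vertical]

def expected_dots(len_a, len_b, p_pair=0.25):
    """Expected number of dots for two unrelated sequences."""
    return p_pair * len_a * len_b

matrix = dot_matrix(SEQ_V, SEQ_H)
observed = sum(sum(row) for row in matrix)
print("expected:", expected_dots(len(SEQ_V), len(SEQ_H)))
print("observed:", observed)

# Print the plot with the vertical sequence running from bottom to top.
for base, row in zip(reversed(SEQ_V), reversed(matrix)):
    print(base, " ".join("x" if dot else " " for dot in row))
print(" ", " ".join(SEQ_H))
```

The observed count depends on the base composition of both sequences, which is why it need not equal the expected value exactly.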

Figure B: Dot-plot created by painting a dot (x) at each match.

 
  
  t   x     x     x               x x       x x x     x       x x x   x  
  t   x     x     x               x x       x x x     x       x x x   x  
__t   x     x     x               x x       x x x     x       x x x   x  
25g     x x         x x               x                 x               x  
  a x         x x         x   x x       x                 x x             x  
  g     x x         x x               x                 x               x  
  a x         x x         x   x x       x                 x x             x  
__a x         x x         x   x x       x                 x x             x  
20g     x x         x x               x                 x               x  
  t   x     x     x               x x       x x x     x       x x x   x  
  a x         x x         x   x x       x                 x x             x  
  t   x     x     x               x x       x x x     x       x x x   x  
__c                     x   x             x       x x               x  
15a x         x x         x   x x       x                 x x             x  
  a x         x x         x   x x       x                 x x             x  
  a x         x x         x   x x       x                 x x             x  
  t   x     x     x               x x       x x x     x       x x x   x  
__g     x x         x x               x                 x               x  
10a x         x x         x   x x       x                 x x             x  
  a x         x x         x   x x       x                 x x             x  
  c                     x   x             x       x x               x  
  t   x     x     x               x x       x x x     x       x x x   x  
__g     x x         x x               x                 x               x  
5 g     x x         x x               x                 x               x  
  t   x     x     x               x x       x x x     x       x x x   x  
  a x         x x         x   x x       x                 x x             x  
  g     x x         x x               x                 x               x  
  t   x     x     x               x x       x x x     x       x x x   x  
    a t g g t a a t g g c a c a a t t g a c t t t c c t g a a t t t c t g a  
            |5        |10       |15       |20       |25       |30       |35  
  
Looking at Figure B, we can draw several conclusions:

Obviously, we still need to improve the so-called signal-to-noise ratio. The signal is a line or other visible pattern that we can use in the visual inspection. The noise is what we expect from statistics: in a four-letter alphabet, we expect a random hit probability of 25%, which is too high if we are looking for weak similarities. Remember that the best score we had was

 
  
    tgatggtcaagtaaactatgaagagttt  
    | ||||   | |  |||     || |||         shift -4: score 15  (+8.5)  
atggtaatggcacaattgactttcctgaatttctga  
  

Dotplot Principle - Improved

We improve the dotplot with the word method. In our example, over a length of 28 base pairs, we have only 15 matches, which is approximately every second nucleotide. This is a relatively weak signal. Therefore, as a first approximation, we no longer paint a dot for each matching nucleotide; instead we use oligomers (called words) and paint a dot where these words match. This reduces the chance of a random match. If we use di-nucleotides, the probability of an accidental match is (1/4)*(1/4) = 1/16 = 6.25%, which is already much lower than the 25% obtained earlier. The GCG program suite uses a default word size of 6, which gives (1/4) to the power of 6, a random match probability of about 0.024%. There is, however, an undesired side effect: the larger the word size, the lower the probability that a given word matches between two different sequences. Expressed as a formula, this reads

 
Expected number of dots =   
        (probability of word) * (length of sequence A) * (length of sequence B)  

The result for a di-nucleotide match (word size 2) is shown in Figure C: in total, 65 dots (painted as o) appear, compared with (1008 x 0.0625) = 63 expected. In the figure, dots (.) have been added to indicate the positions of the two best hits obtained in the alignments displayed in Figure A. The signal appears weak because the similarity of the two sequences is low.
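The word method can be sketched as follows. This is not the GCG implementation (which is not shown in this text); word_dots and random_word_probability are illustrative names, and a dot is recorded at the start position of each matching word.

```python
def word_dots(vertical, horizontal, word=2):
    """Return (i, j) start positions where words of the given size match."""
    dots = []
    for i in range(len(vertical) - word + 1):
        for j in range(len(horizontal) - word + 1):
            if vertical[i:i + word] == horizontal[j:j + word]:
                dots.append((i, j))
    return dots

def random_word_probability(word, alphabet=4):
    """Chance that two random words of this size are identical."""
    return (1 / alphabet) ** word

seq_v = "tgatggtcaagtaaactatgaagagttt"
seq_h = "atggtaatggcacaattgactttcctgaatttctga"
for size in (1, 2, 6):
    print(size, random_word_probability(size), len(word_dots(seq_v, seq_h, size)))
```

Larger words suppress random dots quickly, but they also remove true similarities that are interrupted by a single mismatch.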

Figure C: Dot-plot painting a dot (o) at each matching di-nucleotide - more details are described in the text.

 
  
  t                                o         o o               o o       
  t                                o         o o               o o       
__t        o                                                             
25g                                                                       
  a                                    o                 o                  
  g                                                                       
  a                            o                           o                
__a                                    o                 o               o  
20g    o           o                 o                 o                   
  t  o           o                                           o           
  a          o                       o                 o               o   
  t                                       .o         o               o   
__c                        o            .o                            
15a            o          .    o      .                    o                 
  a            o        .      o    .                      o                 
  a                   .           .                                          
  t        o        .           .                                        
__g               .           .                                          
10a            o.           .  o                           o                 
  a           .          o.  o                                               
  c         .           .                         o                 o   
  t       .o          .                                                 
__g     .o          .o                                                     
5 g   .o          .o                 o                 o                   
  t .o          .o               o                            o            
  a           .                                          o               o  
  g    o           o                 o                 o               o  
  t                                                                      
    a t g g t a a t g g c a c a a t t g a c t t t c c t g a a t t t c t g a     
            |5        |10       |15       |20       |25       |30       |35  
  

Dotplot Principle - Improved Again

The problem with sensitivity (too few consecutive identities in the case of low similarity) can be overcome by permitting mismatches within a word. This is the well-known window technique: we select a window and require that a minimal number of matches is obtained within this window. The GCG programs call this minimum the stringency. Using a window/stringency algorithm, we paint dots at the middle of a window rather than at a word. Given the values 9/5 for window/stringency, we obtain the plot shown in Figure D.
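A window/stringency comparison can be sketched in the same style. The code below is a simplified illustration and not the algorithm used by the GCG 'compare' program; window_dots is a hypothetical name, and the dot is placed at the centre of each qualifying window.

```python
def window_dots(vertical, horizontal, window=9, stringency=5):
    """Return centre positions (i, j) of windows with enough matches."""
    half = window // 2
    dots = []
    for i in range(len(vertical) - window + 1):
        for j in range(len(horizontal) - window + 1):
            # Count identical symbols along the diagonal inside the window.
            matches = sum(vertical[i + k] == horizontal[j + k]
                          for k in range(window))
            if matches >= stringency:
                dots.append((i + half, j + half))
    return dots

seq_v = "tgatggtcaagtaaactatgaagagttt"
seq_h = "atggtaatggcacaattgactttcctgaatttctga"
dots = window_dots(seq_v, seq_h, window=9, stringency=5)
print(len(dots), "dots for window 9, stringency 5")
```

Setting stringency equal to window reduces this to the word method, apart from the position at which the dot is drawn.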



Figure D: Dot-plot with a dot (O) at each matching window of 9 with a minimum of 5 matches (stringency) per window.

 
  
  t                                                                      
  t                                                                      
__t                                                                      
25g                                       O                 O              
  a                                     O                 O                  
  g                                                           O            
  a                                                                          
__a                                                                          
20g                                                                        
  t                                                                      
  a                               O                                          
  t                                                               O      
__c                                       O                     O      
15a                                     O                                    
  a                                   O                                      
  a                                 O O                                      
  t                                 O                                    
__g                               O                                        
10a               O             O                                            
  a             O             O                                              
  c           O           O O                           O              
  t         O           O                                                
__g                   O                                                    
5 g                 O                       O                 O            
  t                                                                      
  a                                                                          
  g                                                                        
  t                                                                      
    a t g g t a a t g g c a c a a t t g a c t t t c c t g a a t t t c t g a     
            |5        |10       |15       |20       |25       |30       |35  
  
Three conclusions can be drawn from Figure D:

Interpretation of Dotplots

Looking at Figure D, two main diagonals can be identified:

 
                        vertical      horizontal                 
                        sequence       sequence                    
  
short diagonal           5-10            4-9  
  
long diagonal            5-15            9-21  
  
This is an important conclusion, as the vertical sequence obviously has a region near its beginning which is similar to the horizontal sequence in two different areas (4-9, and 9-15, respectively). However, be careful when you use window/stringency, as the coordinates plotted as diagonals are affected by the size of the window. The actual region of similarity, therefore, needs to be expanded by (window size/2). If we schematically write the sequences in a letter-by-letter format, however, it becomes immediately obvious that the window/stringency algorithm averages tremendously:
 
  
tgatggtcaagtaaactatgaagagttt             vertical sequence   
  ||||| | |  |    ||                     (short diagonal)  
  atggtaatggcacaattgactttcctgaatttctga   horizontal sequence  
      | ||||   | |  |||     || |||       (long diagonal)  
      tgatggtcaagtaaactatgaagagttt       vertical sequence  
  
In this case, experimental evidence will be required to establish whether the "short" or the "long" diagonal predicted by the computer is of biological relevance. A protein comparison will be valuable if available or possible.

Tip for the Interpretation of Dotplots: Always try to write down diagonals of interest in the way depicted above. If you need computerised assistance, use the 'gap' program of the GCG package with very high gap penalty values (e.g., 50). Explanations of 'gap' can be found in a later section of this chapter.


GCG's Implementation of Schematic Comparison

We will assume that all these setup operations have been successfully completed. Note that the methods described below are valid for both DNA and protein sequences.

Comparison Calculation

compare calculates the dots to be displayed later. It comes with two different algorithms, with "window/stringency" as the default. You might try compare/word for really large sequences.

Display Program

dotplot displays dots calculated by compare, and will look nicer if you use additional options like the following. For an overview of possible options, use dotplot with the "check" option.

Recommendation for the 'dotplot' program:

Nicer figures will be obtained if you give the following command line options. If you use WPI, make sure that the command line options are ticked as indicated below.

dotplot /tickaxes /symbol=2 /font=3

================================= Begin Exercise 8

Schematic pairwise DNA analysis: Compare two sequences using the 'dotplot' technique.

Using previous exercise results, you should have two DNA sequences by now: my1.seq as the typed-in sequence, and my2.seq as the reading-frame extracted DNA sequence from the seqed exercise. You shall compare these two sequences now.

To solve this problem, follow this schedule:

================================= End Exercise 8

Detection of Internal Repeats

It is important to know as much about the sequence of interest as possible. The dotplot as explained above may be used to analyse sequences internally, i.e., you may compare a sequence against itself. Such a dotplot is perfectly symmetrical. The GCG implementation of the 'dotplot' program recognises that the sequences on both axes are identical and therefore plots only half of the matrix. You might want to force a full display with

$ dotplot /all

The benefit of an internal repeat analysis will be obvious if you encounter gene duplication or the occurrence of several functional protein motifs.

================================= Begin Exercise 9

Internal repeat analysis: Analyse a single sequence using the 'dotplot' technique in order to find internal repeats.

Using previous exercise results, you should have a database DNA sequence, my2.seq, and the translated sequence, m19311.pep, as peptide. You shall now compare each of these two sequences with itself, at the DNA and at the protein level, and compare the results. Note that, in particular at the protein level, the adjustment of window and stringency values might be a lengthy process.

To solve this problem, follow this schedule:

Tip for the Superimposition of Dotplots: All GCG graphics routines offer a "density" to plot the output. Do not accept the default value; instead use a number divisible by three for DNA and the corresponding number for the protein comparison.

================================= End Exercise 9


Principle of the Analytical Comparison of Two Sequences

The problem with the schematic alignment is its inability to deal with gaps, i.e. elements which have no counterpart in the opposite sequence. The mathematics becomes difficult because this element is introduced with two properties rather than one:

Motivation

If we try this with our two sample sequences, already known from the schematic comparison, using

 
  
            Match value:  2  
         Mismatch value: -1  
            Gap penalty: -5  
     Gap length penalty: -1  
  
We can improve some of our alignments. For technical reasons of printability, the two sequences are shortened to tgatggtcaagtaaactatgaag and atggtaatggcacaattgacttt (which does not affect the principle). Examples of alignments could be
 
tgatggtcaa.gtaaactat.gaag       score: 14*2+5*(-1)+4*(-5)+4*(-1) = -1  
  ||||| || |   || || ||                14 match 5 mismatch         
  atggt.aatg.gcacaattgacttt             4 gaps 4 gap length   
  
   tgatggtcaagtaaactatgaag       score:9*2+11*(-1)+1*(-5)+4*(-1) = -2  
   || |  |   |  |    |||                9 match 11 mismatch         
  atggtaat...ggcacaattgacttt            1 gap 3 gap length   
The assignment of values is fairly arbitrary and allows for significant changes in the result. If we take the standard values of the gap program (as detailed below) we end up with the following calculation:
 
  
            Match value:  1  
         Mismatch value:  0  
            Gap penalty: -5  
     Gap length penalty: -0.3  
  
tgatggtcaa.gtaaactat.gaag       score: 14*1+5*(0)+4*(-5)+4*(-0.3) = -7.2  
  ||||| || |   || || ||                14 match 5 mismatch         
  atggt.aatg.gcacaattgacttt             4 gaps 4 gap length   
  
   tgatggtcaagtaaactatgaag       score:9*1+11*(0)+1*(-5)+4*(-0.3) =  2.8  
   || |  |   |  |    |||                9 match 11 mismatch         
  atggtaat...ggcacaattgacttt            1 gap 3 gap length   
As these two examples show, the insertion and modification of gaps allows for a very broad variation of possibilities. Additionally, the judgement of the two alignments is not really satisfactory, as we have not evaluated other possible alignments of the two sequences. In particular, optimising the score for a given set of parameters, i.e. calculating the best alignment, would require trying out all possible combinations, which is a very time-consuming task. An automatic procedure, therefore, is required to evaluate the possible solutions exhaustively. The algorithm for this purpose was first described by Needleman and Wunsch, was later adapted to local alignments by Smith and Waterman, and is usually implemented as a dynamic programming approach.
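The score formula used in the examples of this section can be sketched in a few lines. The function names are illustrative; count_columns assumes two aligned strings of equal length with "." as the gap symbol, and it counts every gap symbol towards the gap length, as the formulas in the text do.

```python
def gapped_score(n_match, n_mismatch, n_gaps, gap_length,
                 match=2, mismatch=-1, gap=-5, gap_len=-1):
    """Alignment score from the four counts used in the text."""
    return (n_match * match + n_mismatch * mismatch
            + n_gaps * gap + gap_length * gap_len)

def count_columns(top, bottom, gap_char="."):
    """Count matches, mismatches, gaps and gap symbols of an alignment."""
    n_match = n_mismatch = n_gaps = gap_length = 0
    previous_gap_in = None            # row that held the previous gap symbol
    for x, y in zip(top, bottom):
        if x == gap_char or y == gap_char:
            row = "top" if x == gap_char else "bottom"
            if row != previous_gap_in:
                n_gaps += 1           # a new gap is opened
            gap_length += 1
            previous_gap_in = row
        else:
            previous_gap_in = None
            if x == y:
                n_match += 1
            else:
                n_mismatch += 1
    return n_match, n_mismatch, n_gaps, gap_length

# First example from the text: 14 matches, 5 mismatches, 4 gaps, 4 gap symbols.
print(gapped_score(14, 5, 4, 4))
print(gapped_score(14, 5, 4, 4, match=1, mismatch=0, gap_len=-0.3))
print(gapped_score(*count_columns("acg..tacg", "acgttta.g")))
```

Changing the four parameters changes the ranking of candidate alignments, which is why the parameter choice should always be reported together with a score.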

Letter-by-Letter Alignment Prerequisites

We need a defined scoring table, which defines values for

More advanced implementations also use a parameter to define the change of the gap elongation penalty, which results in the creation of "affine gaps":

The value for a match need not be a single number. Amino acid comparisons, in particular, need to reflect the relation between two amino acids in more detail than a yes/no decision. The value for this comparison is read from a symbol comparison table which reflects either evolutionary or more pragmatic interpretations of symbol values.

Symbol Comparison Tables

Nucleic acid sequence alignment will succeed if the scoring proceeds according to a simple match/mismatch schema. Sophisticated approaches will include ambiguity tables and score a match of G to S as half the match score - S is the ambiguity letter for G or C. However, most DNA sequences follow a 4-letter alphabet and, therefore, will be satisfactorily handled with straightforward matching score calculation.
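One possible way to derive such ambiguity scores is sketched below. The rule used here (match score times the number of shared bases, divided by the product of the set sizes) is an assumption chosen because it reproduces the "G to S is half a match" example; it is not taken from any GCG table. Only a subset of the IUPAC codes is listed.

```python
# IUPAC nucleotide codes and the bases they stand for (subset).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "N": "ACGT",
}

def ambiguity_score(a, b, match=1.0, mismatch=0.0):
    """Score two (possibly ambiguous) nucleotide symbols."""
    set_a = set(IUPAC[a.upper()])
    set_b = set(IUPAC[b.upper()])
    shared = len(set_a & set_b)
    if shared == 0:
        return mismatch
    # Assumed rule: fraction of base combinations that agree.
    return match * shared / (len(set_a) * len(set_b))

print(ambiguity_score("G", "S"))   # G against "G or C"
print(ambiguity_score("G", "G"))
print(ambiguity_score("A", "S"))
```

For plain four-letter sequences this reduces to the simple match/mismatch scheme.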

Protein comparisons are more sophisticated. One option to compare amino acids by property is to set up a classification for

etc.

As it is difficult to quantify these properties, measurable values could be used like

etc.

Comparison tables generated this way might be based on correct values and still miss the target, because the value of a comparison matrix is determined by how well it follows the basic paradigm which we imply in performing the alignment:

 
  
                DNA sequence   
         
       determines  |   
                   V  
                                                                
                protein sequence   
                  
       determines  |  
                   V  
                     
                secondary structure   
                  
       determines  |  
                   V  
                     
                tertiary structure (protein)  
                  
       determines  |  
                   V  
                     
                protein function  
  
Two approaches have been used successfully in standard implementations of biocomputing software:

Many more matrices have been proposed but are not currently widely used.

NOTE: It is essential to understand the impact of the symbol comparison matrix for the result of the calculation; in particular as the statistics or numerical values of results will reflect the underlying matrix. "Good" or "meaningful" alignments as numerically computed by a program are only as good as the algorithm, but the relevance for biology is also affected by the relevance or applicability of the matrix used for comparison.

Alignment Path Matrices

NOTE:

The following description tries to explain the method but, for reasons of simplicity and brevity, does not show the real implementation. Please refer to the original papers for further reference. Advanced users should skip this section and proceed to the "programs" section.

The basic principle of an alignment path matrix is the stepwise creation of values. Figure E shows the initial step: As in the 'dotplot' program, sequences were painted in a crossword-like fashion. Instead of a dot, the match or mismatch value is printed.

Next, additional values are calculated by adding the value of the current field (match or mismatch value) to the value of the alignment path followed before arriving at this step.

Figure F shows the alignment path matrix in an intermediate stage. Each new value is printed as the maximum of several possibilities. To get to the value (X) in the figure, one could use one of several possible pathways:

 
  
                              one gap: 1 (previous) -5 (gap) -1 (mismatch)  
                             /                                    = -5  
                            / direct: -3 (previous) -1 (mismatch) = -4   
     +------+------+------+// one gap: 0 (previous) -5 (gap) -1 (mismatch)  
  c  |  -1  |  -2  |   0  /// one longer gap:                     = -6  
     +------+------______////----+     6 (previous) -5 (gap) -1 (gap length)   
  t  |  -1  |   1 /|  -3 /// -1  |                  -1 (mismatch) = -1  
     +------+------+-----||------+  
  g  |  -1  |  -2  |   0/ |   8  |  
     +------+------+-----/+------+     max of (-5, -4, -6, -1)    = -1  
  g  |  -1  |  -2  |   6/ |  -1  |     value printed, therefore:    -1  
     +------+------+------+------+  
  ...|      |      |      |  
         a      t      g      g   
  
To compute all useful fields quickly, some implementations do not compute pathways that have no chance of being optimal, and thereby reduce memory and time requirements. This typically applies to the edges of sequence alignment path matrices (in our example, the upper left and lower right corner). The GCG program implementation warns of a "value not guaranteed to be optimal", but usually the edges do not yield a much better value.
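The stepwise filling of an alignment path matrix can be sketched as below. This is a deliberately simple illustration and not the GCG implementation: it tries every gap length explicitly instead of using an optimised recurrence, it assumes that a gap of k symbols costs gap + k * gap_len, and the names path_matrix and best_edge_score are hypothetical. Index 0 of the vertical sequence corresponds to the bottom row of the figures.

```python
def path_matrix(vertical, horizontal, match=2, mismatch=-1, gap=-5, gap_len=-1):
    """Fill the alignment path matrix; each cell keeps its best path value."""
    n, m = len(vertical), len(horizontal)
    H = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = match if vertical[i] == horizontal[j] else mismatch
            if i == 0 or j == 0:
                H[i][j] = s                   # no previous path available
                continue
            candidates = [H[i - 1][j - 1]]    # direct (diagonal) step
            for k in range(1, i):             # skip k vertical symbols
                candidates.append(H[i - 1 - k][j - 1] + gap + k * gap_len)
            for k in range(1, j):             # skip k horizontal symbols
                candidates.append(H[i - 1][j - 1 - k] + gap + k * gap_len)
            H[i][j] = s + max(candidates)
    return H

def best_edge_score(H):
    """Highest value in the last row or last column of the matrix."""
    return max(max(H[-1]), max(row[-1] for row in H))

H = path_matrix("tgatggtcaagtaaactatgaag", "atggtaatggcacaattgacttt")
for row in reversed(H):                       # print bottom row last
    print(" ".join(f"{value:3d}" for value in row))
print("best edge score:", best_edge_score(H))
```

Individual values may differ from the hand-made figures wherever the gap cost convention differs; the first row and first column contain only the plain match and mismatch values.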



Figure E: Alignment path matrix, step 1. Scores are computed as outlined in the text.

 
  
  g -1  
  a  2  
  a  2  
  g -1  
  t -1  
  a  2  
  t -1  
  c -1  
  a  2  
  a  2  
  a  2  
  t -1  
  g -1  
  a  2  
  a  2  
  c -1  
  t -1  
  g -1  
  g -1  
  t -1  
  a  2  
  g -1  
  t -1  2 -1 -1  2 -1 -1  2 -1 -1 -1 -1 -1 -1 -1  2  2 -1 -1 -1  2  2  2  
     a  t  g  g  t  a  a  t  g  g  c  a  c  a  a  t  t  g  a  c  t  t  t  
    
Figure F: Alignment path matrix in an advanced stage. Scores are calculated as described in the text using the following values: match = 2, mismatch = -1, gap = -5, gap length = -1. A direct step adds the score to the previous value; a gap step adds the score to the previous value and subtracts 5. Note the (X), which is explained in detail in the text.
 
            
  g -1  1  3  
  a  2  1 -3  
  a  2 -2 -3  
  g -1 -2  6  
  t -1  4 -3  
  a  2 -2  0   
  t -1  1  0   
  c -1  1  0   
  a  2  1  0   
  a  2  1 -3  
  a  2 -2  0  
  t -1  1  0  
  g -1  1  3  
  a  2  1 -3  
  a  2 -2 -3  
  c -1 -2  0 (X)  
  t -1  1 -3 -1  
  g -1 -2  0  8  
  g -1 -2  6 -1  
  t -1  4 -3 -4  
  a  2 -2 -3  3  0  0  3 -3 -3  3  0  0 -3  0  0 -3 -3  0  6 -3 -3 -3  0  
  g -1 -2  4  1 -2  1 -2 -2  4  1 -2 -2 -2 -2 -2 -2  1  4 -2 -2 -2  1  1  
  t -1  2 -1 -1  2 -1 -1  2 -1 -1 -1 -1 -1 -1 -1  2  2 -1 -1 -1  2  2  2  
     a  t  g  g  t  a  a  t  g  g  c  a  c  a  a  t  t  g  a  c  t  t  t  
    
The final result of the alignments is shown in Figure G. The "best" alignment is found by seeking the highest value at the edges and following the path of decreasing scores, starting (in our example) at the upper right and moving downwards. Figure F shows this by way of example for the two best diagonals found.



Figure G: Alignment path matrix in a completed stage. Scores are calculated as described in the text using the values listed in Figure F. Note the positive numbers of the two most promising alignments.

 
    
  g -1  1  3 -1 -1  3  1  3  3 -2  0 -1 -1  2  0  4  5  7  4 11 12  6  3  
  a  2  1 -3  0  4  2  4  1 -4 -2  0  0  3  1  5  6  5  5 12 13  7  4  3  
  a  2 -2 -3  5  0  2  2 -4 -1  1 -2  4 -1  3  7  3  6 10 14  6  5  4  4  
  g -1 -2  6 -1 -2  0 -3  1  2 -1  2  0  1  5  2  5 11 12  7  6  5  5  6  
  t -1  4 -3 -1  1 -3  2  0 -3  0  1  0  6  3  6 12 10  8  7  6  5  7 13  
  a  2 -2  0 -1 -2  3 -2 -2  1  2  1  7  4  7 10  8  9  8  7  3  5 11 10  
  t -1  1  0 -1  1 -4 -4  2  3  2  5  5  5  8  9 10  9  5  1  4 12 11  6  
  c -1  1  0 -1 -5 -3 -1  4  3  6  6  3  9 10  8  7  6  2  2 11  9  4  3  
  a  2  1  0 -4 -2  0  5  4  4  1  4  7 11  9  8  7  3  3  9 10  5  4  3  
  a  2  1 -3 -1 -2  3  5  5  2  5  5 12  7  6  8  4  3  7 11  6  2  1  4  
  a  2 -2  0 -1  1  3  6  3  6  6 10  8  4  6  5  4  8  9  7  3 -2  5  4  
  t -1  1  0  2  1  4  4  7  7 11  6  5  4  3  5  9 10  2  4 -1  3  2 -1  
  g -1  1  3 -1 -1  5  5  8 12  7  3  2  4  6  7  8  3  5 -2  1  0 -3  1  
  a  2  1 -3 -4  0  6  8 10  5  4  2  5  7  8  9  2  0 -1  2  1 -2  2 -3  
  a  2 -2 -3 -1  1  7 11  3  2  3  3  8  6  7  3 -1 -2  0  2 -4  3 -2 -2  
  c -1 -2  0 -1  2  9  4  3  4 -1  6  7  5  1  0 -1  1  0 -3  4 -2 -1 -1  
  t -1  1 -3 -1 10  2  1  5 -2  1  8  3  2  1  0  2  1 -3  2 -1  1  1 -2  
  g -1 -2  0  8  0 -1  3 -2  2  9  1  0  0 -2 -3 -4 -3  3  0 -1 -1 -3  1  
  g -1 -2  6 -1 -3  4 -1 -2  7  2 -2  1 -2 -2 -5 -2  1  1 -5  0  4  2  1  
  t -1  4 -3 -4  5 -1 -1  5 -3 -2  2 -1 -1 -4 -1  2 -1 -4 -1  5  3  2 -4   
  a  2 -2 -3  3  0  0  3 -3 -3  3  0  0 -3  0  0 -3 -3  0  6 -2 -3 -3  0  
  g -1 -2  4  1 -2  1 -2 -2  4  1 -2 -2 -2 -2 -2 -2  1  4 -2 -2 -2  1  1  
  t -1  2 -1 -1  2 -1 -1  2 -1 -1 -1 -1 -1 -1 -1  2  2 -1 -1 -1  2  2  2  
     a  t  g  g  t  a  a  t  g  g  c  a  c  a  a  t  t  g  a  c  t  t  t  
    
Figure G (continued): Alignment path matrix in a completed stage, reduced to the paths of the two most promising alignments. Note the positive numbers along these paths.
 
    
  g                                                             12    
  a                                                          13/  
  a                                                       14/  
  g                                                 ___12/  
  t                                             12/                   13  
  a                                           10/                  11/  
  t                                           |                 12/     
  c                                        10/              11/    
  a                                     11/                  |     
  a                                  12/                  11/   
  a                               10/                   9/     
  t                            11/                  10/    
  g                         12/                   8/   
  a                      10/                   9/    
  a                   11/                ___7/    
  c                 9/                7/    
  t             10/                8/    
  g           8/                9/    
  g        6/                7/    
  t     4/                5/   
  a  2/                3/    
  g                 1/    
  t              2/    
     a  t  g  g  t  a  a  t  g  g  c  a  c  a  a  t  t  g  a  c  t  t  t  
  
tgatggtcaagtaaac.tatgaag                 tgatgg.tcaagtaaactatgaag   
  ||||| | |  |   |  ||                   | ||||  ||| |  ||| |    
  atggtaatggcacaatt.gacttt           atggtaatggcacaatt.gacttt     
    


Comparison Programs

Programs may distinguish between the following cases:

Two Sequences of Similar Length

Best suited for comparing homologous sequences from different species, or similar sequences with approximately the same length:

$ gap

Two Sequences of Different Length

Best suited for comparing sequences discovered in searches, or sequences that share local sites of homology rather than similarity over their whole length.

$ bestfit

DNA and Protein Sequences

Best suited for comparing sequences discovered in searches, or sequences with site homology and a suspected reading frame shift.

$ framealign

Programs to Display Two Aligned Sequences

Text

publish can display alignments (DNA or protein) in a formatted fashion. The EGCG version epublish can display the translation of the second sequence in one- or three-letter code.

Graphic

Best suited for visualising overlaps or regions of homology. This requires graphics: if you work with GCG locally, make sure you have set up the graphics environment correctly with setplot. X-Windows setups also need the DISPLAY environment variable to be set correctly.

$ gap/out

or

$ bestfit/out

Next, display the graphics using the ".out" files generated by the commands given above.

$ gapshow

Significance Evaluation

Preparation of Data

Randomisation during Alignment

Best suited for estimating whether the alignment produced is statistically significant. Use considerably more than the default 10 randomisations (try at least 50).

$ gap/ran=50

or

$ bestfit/ran=50
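The principle behind such a randomisation test can be sketched in Python: one sequence is shuffled repeatedly, each shuffled copy is scored against the other sequence, and the real score is compared with the mean and standard deviation of the shuffled scores. This is only an illustration and not the GCG implementation. The function names are hypothetical, and diagonal_score is a deliberately simple gap-free stand-in for a real alignment program.

```python
import random
import statistics

def diagonal_score(a, b, match=2, mismatch=-1):
    """Best gap-free score over all diagonals (stand-in for a real aligner)."""
    best = None
    for offset in range(-(len(a) - 1), len(b)):
        total = sum(match if a[i] == b[i + offset] else mismatch
                    for i in range(len(a)) if 0 <= i + offset < len(b))
        best = total if best is None else max(best, total)
    return best

def randomisation_test(a, b, runs=50, seed=0):
    """Score a against b, then against `runs` shuffled copies of b."""
    rng = random.Random(seed)            # fixed seed keeps the run repeatable
    real = diagonal_score(a, b)
    shuffled_scores = []
    for _ in range(runs):
        letters = list(b)
        rng.shuffle(letters)             # same composition, random order
        shuffled_scores.append(diagonal_score(a, "".join(letters)))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    return real, mean, sd, shuffled_scores

seq = "tgatggtcaagtaaactatgaag"
real, mean, sd, scores = randomisation_test(seq, seq, runs=50)
print(f"real score {real}, shuffled {mean:.1f} +/- {sd:.1f}")
```

A real score that lies several standard deviations above the shuffled mean suggests that the alignment is unlikely to be a product of sequence composition alone. With only 10 runs the estimate of the standard deviation is rather unreliable, which is why more runs are recommended.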

If you have access to W. Pearson's sequence analysis software, you could try the rdf (or rdf2) program. This requires that you first convert the sequences of interest to STADEN format with the program tostaden (or the program readseq).

================================= Begin Exercise 10

Pairwise sequence analysis: Understand the use of comparison matrices in the alignment procedure for protein sequences. Apply different algorithms to the sequences obtained from DNA after translation, and evaluate the significance of the result at both the DNA and the protein level.

From the previous exercises, you should have two DNA sequences by now: my1.seq as the typed-in sequence, and my2.seq as the reading-frame extracted DNA sequence from the seqed exercise. You will now compare these two sequences at the DNA and the protein level. If you have not yet translated the sequences to protein, you should do so now.

To solve this problem, follow this schedule:

================================= End Exercise 10

