Section 11-2: Sequence Searching with Heuristic Methods

[ Previous chapter ][ This chapter ][ Next chapter ]


Subsection 11.2.1

Principle of Similarity Detection

A powerful heuristic of a fast screening can be expressed as follows:

Two sequences are similar if a sufficient number of identical oligomers is found at a given arrangement of the two.

The internal loop of a sequence searching program (see above) requires the identification of a score, which is a numerical value describing the quality of a sequence comparison. The observed score depends on

The numerical values of scores are specific for a given program and must not be used for comparison between different algorithms or result output. Unless a statistical significance or another type of normalisation is computed, (see below) the score values are relative in relation to the search sequence and its algorithm.

The following example shall demonstrate how a heuristic sequence search algorithm might be implemented. We do this on DNA level but no fundamental difference will be seen in protein comparisons, as the heuristic algorithms score with identities of short sequence fragments rather than using comparison tables . To a certain extent, the comparison using identities of oligomers can be visualised as a dotplot-type of comparison. We reuse the example of earlier chapters, with

 

  
atggtaatggcacaattgactttcctgaatttctga   Seq. A (formerly, horizontal sequence)
  
tgatggtcaagtaaactatgaagagttt           Seq. B (formerly, vertical sequence)
  

  
In contrast to the dotplot, however, we now create a table of oligomers and note the location of the occurrence of these oligomers in the sequences. We are going to use di-nucleotides here, which is for the sake of brevity only, as nucleotide sequences typically use larger "words" (such as six or more). Sixteen oligomers are possible, but not all of them are found in the two sequences. We tabulate the oligomers and the begin of the sequence position for sequences A and B as follows:
 

  
      A           B                            A            B
  

  
AA    6,14,28     9,13,14,21             CA    11,13        8
  
AC    12,19       15                     CC    24           -
  
AG    -           10,22,24               CG    -            -
  
AT    1,7,15,29   3,18                   CT    20,25,33     16
  

  
GA    18,27,35    2,20,23                TA    5            12,17
  
GC    10          -                      TC    23,32        7
  
GG    3,9         5                      TG    2,8,17,26,34 1,4,19
  
GT    4           6,11,25                TT    16,21,22,30,31 26,27
  

  
Having created such a table, we made already progress to determine the largest segment of identity between the two sequences. The next question to answer is whether we can find oligomers which match at a given, but identical shift of the two sequences. This can be readily achieved by calculating the difference of sequence position (called a shift ) between A and B for all oligomeris of the previous table. We now tabulate the position start of sequence A versus the shift to sequence B:
 
Shift
  
observed    -10 -9 -8 -7 -6 -5 -4 -3 -2 -1  0  1  2  3  4  5  6  7  8  9  10
  
(in A) ----------------------------------------------------------------------
  

  
Locations    25     8    13  7 11     2 14     1  6 22 18  4  5  6  9  16
  
                      26    14  8       14        2 12    21 21  6   
  
                      27    30  9                 3 15    22    14
  
                      28    31 19                 4
  
                               20                17
  
                               27                18
  
                               30
  
                               31
  
The result could be counted in a very primitive fashion as in the dotplot-type of example. In dotplots, a point was shown for each matching oligomer. The columns in the table above represent identities of a diagonal. The most matches on a single diagonal occurs at a shift of -4 from sequence A to sequence B, compare this to what was already observed in the schematic comparison section . Similarly, the second-best column with a shift of +2 was already known. However, this type of crude approach might not be sufficient to find the best matching segment of identity as we ignore gaps entirely and just count occurences of oligomer matches on a diagonal.

To improve sensitivity, we need to search the columns of identical shift which occur at continuously increasing origin, such as 7,8,9 in the column of shift -4 above. Doing this, we find the following table of diagonals, sorted by the length of the observed diagonal

 

  
Length = (number of oligomers -1) + (size of oligomer) 
  

  
Segment with shift          2    -4    -7    -5  
  
Segment start position      1     7    26    13    
  
Length                      5     3     4     3
  

  
The occurrence of mismatches will be noticed as an interruption of the segment. Gaps will occur as a change of shift. Interruptions and changes of shift, therefore, will not be considered in the initial identification, and many algorithms will not consider gaps at later stages either.

The 'fasta' program (as listed in the "programs" section below) has the init scores in the first two columns of its numerical output which refer to the crude and the joined diagonals, respectively. Joined diagonals are found by extending the segment. The extension will be counted if adjacent diagonals can be joined to increase the overall score of the newly found segment.

The 'blast' type of programs, also mentioned below, will gain speed by determining the occurrence of oligomers in the database sequences in a preprocessing step (to be executed once only at formatting time of the database) before the actual search is started. This reduces the need to scan the database each time for every comparison. However, as the pre-computation is done on the entire database, the arbitrary search of subsections of the database will no longer be possible without executing the format process into a 'blast' database).

After collecting sequences of interest, the ranking and display of the found alignments is performed, which is also a crucial step for successful result evaluation. The most difficult problem of reliable sequence similarity searching is the proper selection of criteria which describe statistical relevance and map as advantageous as possible to biological relevance.

The 'fasta' type of programs will redo the comparison of the top-score entries as found in the database with a rigorous algorithm (such in the GCG program bestfit and in rigorous searches as described below). A more accurate alignment of the database sequences with the query sequence, which eventually goes along with a re-scoring of the hits, will allow to increase the power of a sequence comparison. It should be kept in mind that the re-scoring (in the 'fasta' programs called the opt score) will not be reflected in the listing order of top-scoring entries as these are sorted for the so-called "initn" score (the joined fragment score).

NOTE: The fasta 2.x version of this package adds very comprehensive statistics. However, at the time of writing, the GCG software did not include this version of the fasta package.

The 'blast' programs use a statistical approach in order to list only sequence segments in the result which are not expected by chance. To do this, it is necessary to determine a score which can discriminate between "hit" and "random". The information unit is a "bit" and might be explained as follows: The amount of information stored in biological sequences is, from a statistical point of view, a function of the sequence composition. E.g., a poly-A stretch in a DNA sequence has the information only A and the length of the sequence. Information can be measured in bits, which is either YES or NO, or 0 or 1 in computer science world. A nucleotide, therefore, which can be any of four characters, can be written in two bits as the relevant information of four different symbols requires four different combinations:

 

  
     A 00
  
     G 01
  
     C 10
  
     T 11
  

  
The information contents of a match, if expressed in bits of information, will therefore contribute to the ranking of the sequence. The 'blast' type of programs uses this method to determine the relevance of a sequence searching result. Additionally, the probability of finding the segment of interest by chance is calculated on the basis of the database and query information content. The lower the probability in its value, the higher the significance, as the hit will less favourably occur by chance. The concept of information contents can be used to introduce thresholds in sequence comparison, which will determine whether a match is carried or a score is computed at all: If the information content of a sequence is below such a threshold, no comparison will be made, and the result of a search will be that no relevant hits will be reported. This is a rare case but will occasionally occur in 'blast' searches.


Subsection 11.2.2

Expectations

The amount of data you will need to review after a sequence search might be enormous. To help you evaluating the result, most searching programs will permit to generate a histogram which displays the scoring table graphically: The number of found scores is plotted against the score. The very large peak in the area of low scores can be attributed to "random" hits which occur by chance. Depending on algorithm and sequence, the hits of remote similarity will occur in the downhill area of this peak. Sequences from other organisms score at higher values, and the identity score is, typically, at vary high values if a sequence is found in the database which matches the query sequence directly. Figure I shall illustrate this description of a search histogram .

Figure I: Score distribution as observed in a sequence similarity search. The peak at low scores has been purposely truncated and is not shown in its full size.

 

  
number 
  
of hits
  

  
^    
  
|    ////   statistical
  
|    *  *   noise 
  
|    *  *  <----------
  
|    *  *
  
|    *  *                   species 
  
|    *   *                similarity
  
|    *    *   remote       score
  
|   *     *  similarity      |            
  
|   *      *   |             |          identities
  
|   *       *  v             v             |
  
|  *         * *            *              v
  
| *             *          * ***           *
  
+-----------------*********-----***********-***--> 
  
                                             score                                   
  

  

  
                                             
  
                                                                                                                                       
  
The discrimination between "remote similarity" and "statistical noise" is crucial and severely depends on the algorithm and the data used for comparison.

NOTE: Unless a reliable statistical estimation is part of the search program, you should investigate the region of the beginning "statistical noise" carefully. However, keep in mind that statistical relevance might not match biological requirements and might be misleading.

The significance of searches can be improved if the search is conducted on the protein level rather than the DNA level. This is due to the fact that codon usage differences between different algorithms will increase the number of mismatches even if the protein resulting from the two different sequence fragments will be identical. Programs are available which will search the DNA sequence database after an on-the-fly translation to all possible six reading frames. Doing these kinds of searches, two major considerations will apply:

The 'framesearch' program of the GCG package takes care of this effect to some extend, but utilises extreme resources for completion. This program uses a rigorous approach and is described later .


Subsection 11.2.3

Programs

% blast

This program will be able to do searches on protein level if you use DNA sequence databases. The original NCBI blast program package as available from the NCBI includes the following programs:

The GCG implementation of the 'blast' program suite uses a single program - 'blast' - which launches any of the programs mentioned above. (This is called a front-end program).

The "blast" suite is a program which may run either locally or via network. The "blast" system on your UNIX system might not run locally but rather via networks. The HASSLE system in Europe, or the NCBI server system in the US or generic GCG package installations, can use the network to send sequences to a server which computes the data for you.

NOTE: Be aware of the fact that sending sequences over networks is a potential breach of confidentiality or security. In particular, this applies to sites which do not operate on public data networks (i.e., for-profit installations).

Most useful databases available at the NCBI, or via the HASSLE system, include:

 

  
Database name          GCG name          contents 
  
----------------------------------------------------------------
  
EMBL + Updates   
  
GENBANK + Updates 
  
(GB as exclusion set)  nr      	         all DNA databases (1)
  
SWISSPROT              swissprot	 most proteins
  
SWISSPROT +
  
PIR International 
  
+ PATCHX + OWL         nr                all peptide databases (2) 
  

1) The definition for "nr" might vary. Depending on the location, you might use either GENBANK with an exclusion set of EMBL data not in GENBANK, or use EMBL with an exclusion set of GENBANK (eg, in Basel). Depending on whether or not you are connected to a network which is used to update the data on a periodic basis, the "nr" set includes also daily updates. At Basel, both EMBL and GENBANK are updated weekly, and EMBL is the basis of exclusion using the NCBI's 'nrdb' program.

2)The definition of "nr" might vary from site to site. At Basel, all available databases and their corresponding updates are computed on a weekly basis with NCBI's 'nrdb' program.

Additional databases are available at some sites, and will be displayed at the menu if you start the "blast" program suite. It might happen that the 'blast' suite of programs is not enabled or temporarily unavailable at the time of your command due to resource limitations. It should be kept in mind that 'blast' programs are good for screening but should always be supplemented with other screening methods (e.g., swsearch or MPsrch, see below) in order to confirm findings.

General purpose, fast and reliable searches are done using

% fasta

The GCG implementation of the 'fasta' program is typically a (s)lower version than the one distributed by the Author William Pearson. As statistical evaluations are not implemented yet, GCG users will have to run the shuffle program as described below. Problems with the 'fasta' programs are most usually observed when the database has been specified incorrectly. Use the database name and a colon with an asterisk to give the correct specification, following the instructions below.

Databases available at the local site usually include:

 
Database name          GCG name            contents 
  
----------------------------------------------------------------
  
EMBL + Updates   
  
GENBANK + Updates 
  
(GB as exclusion set)  GENEMBLPLUS:        all DNA databases (1)
  

  
SWISSPROT              SWISSPROT:          most proteins
  
PIR International      PIR:                most proteins
  
NEW entries of EMBL    EM_NEW:             EMBL new entries (2) 
  
GENBANK updates        GB_NEW:             GENBANK new entries (3)
  

1) The definition of GENEMBL can vary. Depending on the location, you can use either GENBANK with an exclusion set of EMBL data not found in GENBANK, or vice versa. Depending on whether you are connected to a network which is used to update data on a periodic basis, the GENEMBL set may include also daily updates.

2),3) The definitions vary. XEMBL, EM_NEW, EMBL_DAILY, GB_NEW, XSWISS, SW_NEW, PIR4, etc. are names that denote the character of the preliminary entries. Depending on your site and/or affiliation, those entries which are not found in either the EMBL or GENBANK update sets yet, possibly show up in the corresponding other set as so-called "exclusion set". Other site's GB_NEW and EM_NEW may contain all entries of GENBANK and EMBL, respectively, which can cause duplications.

NOTE: The term GENEMBLPLUS, introduced in GCG version 8.1, is equivalent to GENEMBL, which was used before version 8.1.

If you need to search for peptide in the DNA database, this can be achieved with the tfasta program, which translates the DNA database on-the-fly into all possible reading frames. This search is computationally more expensive but considered to be more sensitive (see above, and below).


[ previous chapter ],[ this chapter ][ next chapter ] , [next page/section] , or [overview] , or [table of contents]