Chapter 11: Sequence Searching



Section 11.1: Tools for Sequence Searching

[ Previous chapter ][ This chapter ][ Next chapter ] The requirement to identify a single sequence in a database is very much different from a keyword search. Keywords are expected to match exactly. This type of keyword searching has been described earlier .

Pattern searching programs will match a pattern exactly at single, defined positions in a sequence.

A sequence searching program , however, is expected to report

and, most important,

The major problem of sequence searching, therefore, is to find a reasonable definition for the similarity of sequences. Application programs doing "sequence searching" in general imply that each entry of a sequence database is compared to the query sequence sequentially, and the result is a list of database entries which are similar to the query sequence. The receipt of this program type reads as follows:

 

  
 start program 
  
 initialise top-scoring list
  
 for each entry in database                               \
  
    {                                                      |
  
     compare database sequence with query sequence         |
  
     evaluate similarity as "score"                        |search loop 
  
     compare "score" to top-scoring list                   |across all 
  
     if "score" better than lowest entry                   |entries in 
  
        {                                                  |the database
  
         place entry in top-scoring list                   |
  
        }                                                  |
  
    }                                                     /
  
 normalise top-scoring list
  
 for each entry in top-scoring list                       \ evaluation
  
    {                                                      |of result 
  
     determine statistics                                  |using all
  
     align search sequence and database sequence           |entries in 
  
     print out result                                      |top-scoring
  
    }                                                     / list 
  

  
Depending on the algorithm or implementation, some of the steps might be missing or integrated in other steps. Sophisticated sequence searching which performs extremely fast might be based on two approaches as described below.


Subsection 11.1.1

Sensitive searching, using special computers

The use of so-called rigorous searching algorithms such as the Smith and Watermann type of search already known from the 'bestfit' program , doing an accurate comparison. These programs execute the central searching loop in a specifically tuned fashion, and run best on special hardware such as the MasPar and Bioccelerator systems.


Subsection 11.1.2

Extremely fast searching, using approximations

The use of heuristic searching algorithms which reduce the compute time needed in the central searching loop. Approximations in comparison are based on the assumption that sequence similarity can be expressed as tightly accumulated regions of identical oligomers or short peptides. Either computed on-the-fly ( 'fasta' type of program) or in a pre-processing step ( 'blast' program type), the heuristic searching programs will run on smaller, general-purpose machines and even PCs might occasionally be used.


[ previous chapter ],[ this chapter ][ next chapter ] , [next page/section] , or [overview] , or [table of contents]