Chapter 10: Searching Patterns

Programs to find patterns in DNA sequences were already mentioned in the single sequence analysis section . The programs mentioned there search for patterns using compositional analysis. However, restriction mapping programs do also use pattern approaches.



Section 10.1: Pattern Principles

[ Previous chapter ][ This chapter ][ Next chapter ] Computers are fast if the comparison of letters or numbers tests whether these are equal or not. In order to apply this principle to biology, we need to define this "matching" in a simple table. E.g., if we compare two protein sequences and find a letter D, this is a symbol for aspartic acid. The test for "equal" or "not equal", therefore, reads as

 

  
       D    meeting D            => match 
  
            meeting any other    => mismatch
  

  
This is not very sophisticated, in particular, as certain proteins with specific sites (such as calcium binding) tend to be ambivalent and allow either aspartic or glutamic acid. We could use the comparison matrices as mentioned in the pairwise sequence comparison section. As already explained there, the comparison of sequences on the basis of matrices is computationally expensive. This is not the main reason, however, for using patterns.

The main benefit of patterns is that defined substitutions can occur as the result of a combination of examples - in contrast to a sequence comparison based on a comparison matrix derived from a sequence-independent generalisation of alignments.


Subsection 10.1.1

Example of Pattern Benefit

This example refers to a calcium binding site where we have proven examples of sequences containing E or D at a given position. To write our pattern, we will allow either of the two sequence characters to be a 'match' at this position. An example of this in a short alignment stretch is printed below:

 

  
       sequence fragment 1:     GDRD
  
       sequence fragment 2:     GERL
  

  
Our sequence match table, therefore, reads for aspartic acid (for the symbol D in position 2)
 
       D    meeting D            => match 
  
            meeting E            => match 
  
            meeting any other    => mismatch
  
A suitable matrix would allow this as well. However, keep in mind that a matrix covers symbol pairings at any position. To make this clearer, look at the fourth position of the sequence fragments above. If we have a leucine residue at a position where we find aspartic acid usually, this were (for the symbol D in position 4)
 

  
       D    meeting D            => match 
  
            meeting L            => match 
  
            meeting any other    => mismatch
  

  
Again, a suitable matrix would possibly cover this occurence, but we would have a problem with our first case: As we now allowed L in parallel to D, this were bad for our definition D or E earlier. The solution, therefore, is to define substitutions as a function of the sequence:
 

  
       D    meeting D            => match 
  
            meeting E at pos. 2  => match 
  
            meeting L at pos. 4  => match 
  
            meeting any other    => mismatch
  

  


Subsection 10.1.2

Definition of a Pattern Language

Our example requires that we apply different criteria to different positions of a sequence alignment. In order to have the position-specific comparison done by a computer, a sophisticated program is required which will do this type of calculation for us. Note the difference to the pairwise alignment programs : The comparison of symbols was position-independent there. Now, using patterns, we do position-dependent comparison calculations. As we cannot expect to get a specific program for each special pattern, we need to have a pattern matching program which will define the rules of patterns in a pattern language.

The creation of such a convention to describe patterns in a flexible fashion is not as difficult as one might assume because patterns are typically short in length (a few amino acids up to some dozen, but rarely more than hundred symbols) if compared to a query sequence which we want to screen for the occurrence of a pattern. By reading pattern definitions, the pattern matching program will be able to search a given sequence for this pattern defined in a specific language. There are various ways to define such a language, and the "regular expressions" of some essential programs of the UNIX operating system use such a language. The GCG software package searches with a straightforward definition, the most important features are listed below.

 
       sequence fragment 1:     GDRD
  
       sequence fragment 2:     GERL
  

  
       pattern description:     G(D,E)R(D,L)
  

 
       sequence fragment 1:     GDGTRD
  
       sequence fragment 2:     GERL
  
       sequence fragment 3:     GDMRD
  

  
       pattern description:     G(D,E)(X){0,2}R(D,L)
  

 
       sequence fragment 1:     GDD
  
       sequence fragment 2:     GEE
  
       sequence fragment 3:     GNN
  
       sequence fragment 4:     GQQ
  

  
       pattern description:     G~(A,C,F,G,H,I,K,L,M,P,R,S,T,V,W,Y)(D,E,N,Q)
  

[
previous chapter ],[ this chapter ][ next chapter ] , [next page/section] , or [overview] , or [table of contents]