Section 10-2: Creation of Patterns

Generally, creating patterns from protein sequences is more straightforward and requires less experience and time than creating DNA patterns. For details on nucleotide patterns and the limitations of the method, see the corresponding section.


Subsection 10.2.1

Iterative Schedule

Creating your own patterns involves several steps.


Subsection 10.2.2

Considerations: Pattern Sensitivity

Length

If the pattern is long, its sensitivity might be rather high. The sensitivity will decrease, however, if the pattern was derived from too few sequences. In the extreme case, using a single sequence as the pattern results in a search that identifies this (and only this) sequence.

Information Content

If two or more very similar sequences are used for pattern creation, the additional sequences may contribute little information. The reason is that the choice of "conserved" residues is biased by the similarity of the sequences. If four aligned sequences show identical residues, the information content of the pattern is identical to that of an individual sequence fragment.
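The effect of sequence similarity can be sketched in a few lines of Python. The helper `derive_pattern` is a hypothetical name, not part of any pattern software. It assumes equal-length, already-aligned fragments and simply lists the residues seen in each column, written in the G(D,E) style used later in this section. It is a naive illustration under those assumptions, not the method used by the actual software.

```python
def derive_pattern(fragments):
    """Build a naive pattern from aligned, equal-length fragments.

    A column with one residue yields that residue; a column with several
    residues yields them in parentheses, e.g. (D,E).
    """
    if not fragments:
        raise ValueError("need at least one fragment")
    length = len(fragments[0])
    if any(len(f) != length for f in fragments):
        raise ValueError("fragments must be aligned to the same length")

    parts = []
    for column in zip(*fragments):
        residues = sorted(set(column))
        if len(residues) == 1:
            parts.append(residues[0])
        else:
            parts.append("(" + ",".join(residues) + ")")
    return "".join(parts)


# Four identical fragments add nothing beyond a single one.
print(derive_pattern(["GDRD", "GDRD", "GDRD", "GDRD"]))  # GDRD
# Two differing fragments reveal which positions may vary.
print(derive_pattern(["GDRD", "GERL"]))  # G(D,E)R(D,L)
```

Identical input sequences give back the fragment itself, which is why adding near-duplicates to an alignment does not improve the pattern.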

False Positives

If a pattern is too stringent, it will miss sequences that carry a slight variant of the desired sequence motif. Conversely, the sequences used for the alignment may yield a pattern that is well defined but also occurs in a totally different environment. Pattern refinement helps only if the "false positives" can be discriminated by their flanking sequences. The current software does not permit exclusion based on additional information, such as occurrence in specific cellular compartments.
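The trade-off between stringency and false positives can be illustrated with Python's standard `re` module. The three test sequences and their names are invented for this illustration, and writing alternatives as a character class such as [DE] is a regular-expression convention, not the notation of the pattern software.

```python
import re

# Hypothetical test sequences, invented for this illustration.
sequences = {
    "member_1": "MKTGDRDAAL",
    "member_2": "MKSGERLAAV",
    "unrelated": "PLVGDRLQQW",
}

stringent = re.compile(r"GDRD")        # a single fragment used as pattern
relaxed = re.compile(r"G[DE]R[DL]")    # alternatives at positions 2 and 4


def hits(pattern, seqs):
    """Return the names of all sequences containing the pattern."""
    return [name for name, seq in seqs.items() if pattern.search(seq)]


print(hits(stringent, sequences))  # ['member_1']
print(hits(relaxed, sequences))    # ['member_1', 'member_2', 'unrelated']
```

The relaxed pattern recovers the second family member but also picks up the unrelated sequence, which is the kind of false positive that only flanking residues could exclude.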

Significance

Patterns with very large flexibility (in the extreme, "XXXXX") are not useful. In addition, the statistical expectation has to be considered. For a pattern of four amino acids, the probability of encountering the pattern by chance is calculated assuming that the distribution of amino acids is "random". As there are 20 amino acid symbols, the probability that a single residue matches is one in 20, which is

        1
      ----   =   0.05 = 5 %
       20

Four residues, therefore, account for

        1      1      1      1
      ---- * ---- * ---- * ----   =   0.00000625 = 0.000625 %  (less than 0.001 %)
       20     20     20     20

This seems to be rather significant. Let us consider the example as derived above and compute the probability of two options:

          sequence fragment 1:     GDRD
          sequence fragment 2:     GERL

option 1: pattern description:     G(D,E)R(D,L)
option 2: pattern description:     G(D,E)X(D,L)

For option 1, the values at positions 2 and 4 change: two alternatives are allowed at each of these positions, so the probability of a match there is 2 in 20, or 1 in 10:

        1      1      1      1
      ---- * ---- * ---- * ----   =   0.000025 = 0.0025 %
       20     10     20     10

Option 2 has a "joker" (wildcard) at position 3. As sequence databases will not contain sequences shorter than four residues, the probability of finding an X at position 3 of a tetrapeptide can be set to 1:

        1      1          1
      ---- * ---- * 1 * ----   =   0.0005 = 0.05 %
       20     10         10

A value of less than one per mille looks low, but it is much too high to allow a sensitive search. Assume a database of 50000 sequences, and assume further that a match can occur only once per sequence. (This is a crude estimate, as the chance of a pattern occurring depends on the sequence composition.) By chance alone, we would expect 0.0005 multiplied by 50000, that is, 25 hits.
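The arithmetic above can be reproduced with a short Python sketch. The function names `chance_probability` and `expected_hits` are hypothetical, and the code keeps the same simplifying assumptions as the text: 20 equally likely amino acids and at most one match per sequence. Exact fractions are used so that the results are not blurred by floating-point rounding.

```python
from fractions import Fraction

ALPHABET_SIZE = 20  # amino acid symbols, assumed equally likely


def chance_probability(alternatives_per_position):
    """Probability that a random peptide matches the pattern.

    Each entry is the number of residues accepted at that position;
    a wildcard ("X") accepts all 20.
    """
    p = Fraction(1)
    for n in alternatives_per_position:
        p *= Fraction(n, ALPHABET_SIZE)
    return p


def expected_hits(probability, n_sequences):
    """Expected chance hits, assuming at most one match per sequence."""
    return probability * n_sequences


exact = chance_probability([1, 1, 1, 1])       # e.g. GDRD
option_1 = chance_probability([1, 2, 1, 2])    # G(D,E)R(D,L)
option_2 = chance_probability([1, 2, 20, 2])   # G(D,E)X(D,L)

print(float(exact))     # 6.25e-06
print(float(option_1))  # 2.5e-05
print(float(option_2))  # 0.0005
print(float(expected_hits(option_2, 50000)))  # 25.0
```

Changing the list of alternatives or the database size shows how quickly a relaxed position inflates the number of chance hits.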