JAMF Archive

BioCompanion as published in 1995
THIS IS THE REFERENCE CODE AS PUBLISHED.
		Doelz, R.   
		Optimal production of biological documentation: the JAM format.
		Comput. Applic. Biosci. 11, 224-226 (1995).    
		
The version you are currently viewing is the one printed and distributed via the Internet from the server of BioComputing Basel. Version 3.1 of the BioCompanion was published with version 2 of the JAMF software. The server that was indicated in the documentation has ceased to exist.

Version 3.2 of the BioCompanion was not freely available to the public; it was shareware distributed with GCG's software release 9. For the purpose of enhanced editing, JAMF was partially rewritten, and the proprietary version 3.x of JAMF was used from 1996 onwards. The BioCompanion is available in a current version from the publisher. It has changed significantly in both software and content.


Chapter 10: Searching Patterns

Programs to find patterns in DNA sequences were already mentioned in the single sequence analysis section. The programs mentioned there search for patterns using compositional analysis. However, restriction mapping programs also use pattern approaches.


Pattern Principles

Computers are fast at comparing letters or numbers when the test is simply whether they are equal or not. In order to apply this principle to biology, we need to define this "matching" in a simple table. For example, if we compare two protein sequences and find the letter D, the symbol for aspartic acid, the test for "equal" or "not equal" reads as

 
  
       D    meeting D            => match   
            meeting any other    => mismatch  
  
This is not very sophisticated, in particular as certain proteins with specific sites (such as calcium binding sites) tend to be ambivalent and allow either aspartic or glutamic acid. We could use the comparison matrices mentioned in the pairwise sequence comparison section. As already explained there, the comparison of sequences on the basis of matrices is computationally expensive. This is not the main reason, however, for using patterns.

The main benefit of patterns is that defined substitutions can occur as the result of a combination of examples - in contrast to a sequence comparison based on a comparison matrix derived from a sequence-independent generalisation of alignments.

Example of Pattern Benefit

This example refers to a calcium binding site where we have proven examples of sequences containing E or D at a given position. To write our pattern, we will allow either of the two sequence characters to be a 'match' at this position. An example of this in a short alignment stretch is printed below:

 
  
       sequence fragment 1:     GDRD  
       sequence fragment 2:     GERL  
  
Our sequence match table, therefore, reads for aspartic acid (for the symbol D in position 2)
 
       D    meeting D            => match   
            meeting E            => match   
            meeting any other    => mismatch  
A suitable matrix would allow this as well. However, keep in mind that a matrix covers symbol pairings at any position. To make this clearer, look at the fourth position of the sequence fragments above. If we have a leucine residue at a position where we usually find aspartic acid, the table would read (for the symbol D in position 4)
 
  
       D    meeting D            => match   
            meeting L            => match   
            meeting any other    => mismatch  
  
Again, a suitable matrix could cover this occurrence, but it would conflict with our first case: having allowed L alongside D, we would undermine the earlier definition of D or E. The solution, therefore, is to define substitutions as a function of the position in the sequence:
 
  
       D    meeting D            => match   
            meeting E at pos. 2  => match   
            meeting L at pos. 4  => match   
            meeting any other    => mismatch  
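
The position-dependent table can be sketched in a few lines of Python. This is only an illustration of the principle, not part of any GCG program; the names ALLOWED and matches are hypothetical, and the table covers just the two fragments GDRD and GERL used above.

```python
# Hypothetical position-specific match table for the fragments GDRD / GERL.
# Keys are 1-based positions; values are the residues accepted there.
ALLOWED = {
    1: {"G"},
    2: {"D", "E"},
    3: {"R"},
    4: {"D", "L"},
}


def matches(fragment: str) -> bool:
    """Return True if every residue is allowed at its own position."""
    if len(fragment) != len(ALLOWED):
        return False
    return all(res in ALLOWED[pos] for pos, res in enumerate(fragment, start=1))


print(matches("GDRD"))  # True
print(matches("GERL"))  # True
print(matches("GLRD"))  # False: L is accepted at position 4 only
print(matches("GDRE"))  # False: E is accepted at position 2 only
```

A position-independent matrix could not express the last two cases, because it would have to accept or reject the pairing D/L and D/E everywhere.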
  

Definition of a Pattern Language

Our example requires that we apply different criteria to different positions of a sequence alignment. To have this position-specific comparison done by a computer, we need a program which will do this type of calculation for us. Note the difference from the pairwise alignment programs: there, the comparison of symbols was position-independent. Now, using patterns, we do position-dependent comparisons. As we cannot expect to get a specific program for each special pattern, we need a pattern matching program which reads the rules of a pattern written in a pattern language.

Creating such a convention to describe patterns flexibly is not as difficult as one might assume, because patterns are typically short (from a few amino acids up to some dozen, but rarely more than a hundred symbols) compared to the query sequence which we want to screen for the occurrence of a pattern. By reading pattern definitions, the pattern matching program can search a given sequence for a pattern defined in a specific language. There are various ways to define such a language; the "regular expressions" of some essential programs of the UNIX operating system are one example. The GCG software package uses a straightforward definition; its most important features are illustrated below.

 
       sequence fragment 1:     GDRD  
       sequence fragment 2:     GERL  
  
       pattern description:     G(D,E)R(D,L)  

 
       sequence fragment 1:     GDGTRD  
       sequence fragment 2:     GERL  
       sequence fragment 3:     GDMRD  
  
       pattern description:     G(D,E)(X){0,2}R(D,L)  

 
       sequence fragment 1:     GDD  
       sequence fragment 2:     GEE  
       sequence fragment 3:     GNN  
       sequence fragment 4:     GQQ  
  
       pattern description:     G~(A,C,F,G,H,I,K,L,M,P,R,S,T,V,W,Y)(D,E,N,Q)  
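
The three pattern descriptions above map closely onto the regular expressions mentioned earlier. The following Python sketch translates them; it is an assumption-laden illustration, not the GCG implementation. It handles only the constructs visible in the examples, read here as: a list in parentheses for alternatives, X for any residue, a {min,max} repeat count, and ~ for exclusion. The function name gcg_to_regex is hypothetical.

```python
import re

# One token: an optional "~" plus a parenthesised list, a single letter,
# or a repeat count such as {0,2}.
TOKEN = re.compile(r"(~?)\(([A-Z](?:,[A-Z])*)\)|([A-Z])|(\{\d+(?:,\d+)?\})")


def gcg_to_regex(pattern: str) -> str:
    """Translate the pattern constructs shown above into a Python regex."""
    out = []
    pos = 0
    while pos < len(pattern):
        m = TOKEN.match(pattern, pos)
        if m is None:
            raise ValueError(f"unsupported syntax at position {pos + 1}")
        negate, group, single, repeat = m.groups()
        if group is not None:
            letters = group.replace(",", "")
            if "X" in letters and not negate:
                out.append(".")                  # (X) means any residue
            elif negate:
                out.append("[^" + letters + "]")  # ~(...) excludes residues
            else:
                out.append("[" + letters + "]")   # (...) lists alternatives
        elif single is not None:
            out.append("." if single == "X" else single)
        else:
            out.append(repeat)                    # {0,2} is valid regex as is
        pos = m.end()
    return "".join(out)


regex = gcg_to_regex("G(D,E)(X){0,2}R(D,L)")
print(regex)  # G[DE].{0,2}R[DL]
for fragment in ("GDGTRD", "GERL", "GDMRD"):
    print(fragment, bool(re.fullmatch(regex, fragment)))
```

All three fragments of the second example are accepted by the translated expression. Any syntax beyond these examples raises an error instead of being guessed at.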


Creation of Patterns

Generally, creating patterns from protein sequences is more straightforward and requires less experience and time than creating them from DNA sequences. You might review the corresponding section for details on nucleotide patterns and limitations of the method.

Iterative Schedule

Creating your own patterns involves several steps.

Considerations: Pattern Sensitivity

Length

If the pattern is long, the sensitivity might be rather high. However, the sensitivity will decrease if the basis for the pattern creation was too small. If you use a single sequence as pattern, this will result in a pattern search which identifies this (and only this) sequence.

Information Content

If two or more very similar sequences are used for pattern creation, the information added by the additional sequences might be low. The reason is that the choice of "conserved" residues is biased by the similarity of the sequences. If four aligned sequences show identical residues, the information content of the pattern is identical to that of an individual sequence fragment.
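
A small Python sketch makes this point concrete. It derives a pattern column by column from gap-free fragments of equal length; the function name pattern_from_alignment is hypothetical, and the approach is a simplification for illustration rather than a description of any GCG tool.

```python
def pattern_from_alignment(fragments):
    """Build a pattern from equal-length, gap-free aligned fragments."""
    if len({len(f) for f in fragments}) != 1:
        raise ValueError("fragments must be non-empty and of equal length")
    parts = []
    for column in zip(*fragments):
        residues = sorted(set(column))
        if len(residues) == 1:
            parts.append(residues[0])                      # conserved position
        else:
            parts.append("(" + ",".join(residues) + ")")   # alternatives
    return "".join(parts)


print(pattern_from_alignment(["GDRD", "GERL"]))  # G(D,E)R(D,L)
print(pattern_from_alignment(["GDRD"] * 4))      # GDRD
```

Four identical fragments yield the same pattern as one fragment alone: the extra sequences add nothing.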

False Positives

If a pattern is too stringent, it will miss sequences which have a slight variant of the desired sequence motif. It might be, however, that the sequences used for alignment result in a pattern which is well-defined but also occurs in a totally different environment. Pattern refinement might help only if the "false positives" can be discriminated by flanking sequences. The current software, however, does not permit exclusion based on additional information such as occurrence in specific cellular compartments.

Significance

Patterns with very large flexibility (such as "XXXXX" in the extreme) will not be useful. Considerations of statistical expectation apply as well. If we have a pattern of four amino acids, the probability of meeting this pattern by chance is calculated assuming that the distribution of amino acids is "random". As we have 20 amino acid symbols, the probability that a single residue matches is one in 20, which is

 
  
                   1  
                 -----   =  5%  
                  20  
  
Four residues, therefore, account for
 
  
        1      1      1      1  
      ---- * ---- * ---- * ----   =   0.00000625 = less than  0.001 %  
       20     20     20     20  
  
This seems to be rather significant. Let us consider the example as derived above and compute the probability of two options:
 
  
          sequence fragment 1:     GDRD  
          sequence fragment 2:     GERL  
  
option 1: pattern description:     G(D,E)R(D,L)  
option 2: pattern description:     G(D,E)X(D,L)  
  
The probability of option 1 is calculated with changed values at position 2 and 4 as we have two alternatives each:
 
  
        1      1      1      1  
      ---- * ---- * ---- * ----   =    0.000025 = 0.0025 %  
       20     10     20     10  
  
Option 2 has a "joker" in position 3. As we will not find sequences shorter than four residues in sequence databases, we can set the probability of finding an X at position three of a tetrapeptide to 1:
 
  
        1      1          1  
      ---- * ---- * 1 * ----   =    0.0005 = 0.05 %  
       20     10         10  
  
A value of less than one per mille looks low, but it is much too high to allow a very sensitive search. Assume a database of 50000 sequences and, further, that a match can occur only once per sequence. (This is a crude guess, as the chance of pattern occurrence depends on the composition.) We would then expect 0.0005 multiplied by 50000, which results in 25 hits by random chance!
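
The arithmetic of this section can be checked with a short Python sketch. It uses the same simplifying assumptions as the text (20 equally likely residues, one possible match per sequence); the function name chance_probability and the list encoding of a pattern are illustrative choices only.

```python
from fractions import Fraction

ALPHABET_SIZE = 20  # assumes all 20 amino acids are equally likely


def chance_probability(allowed_per_position):
    """Multiply, per position, (number of allowed residues) / 20."""
    p = Fraction(1)
    for n in allowed_per_position:
        p *= Fraction(n, ALPHABET_SIZE)
    return p


option_1 = chance_probability([1, 2, 1, 2])    # G(D,E)R(D,L)
option_2 = chance_probability([1, 2, 20, 2])   # G(D,E)X(D,L), X allows all 20

print(option_1, float(option_1))   # 1/40000 2.5e-05
print(option_2, float(option_2))   # 1/2000 0.0005
print(option_2 * 50000)            # 25 expected chance hits in 50000 sequences
```

Replacing a single fixed residue by X raises the number of expected chance hits twentyfold, from 1.25 to 25.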


Programs

The 'findpatterns' Program

findpatterns searches databases (e.g., genembl:*), a file of sequence names (e.g., @my.fil, or my.msf{*}, see later), or single sequences (e.g., my.seq) for patterns. The patterns are reported with exact matches, as shown below. If databases are searched, the nomonitor option is recommended.

Databases available at Basel University include:

 
Database name          GCG name            contents   
----------------------------------------------------------------  
EMBL + Updates     
GENBANK + Updates   
(GB as exclusion set)  GENEMBL:            all DNA databases (1)  
  
SWISSPROT              SWISSPROT:          most proteins (2)  
PIR International      PIR:                most proteins  
PATCHX + PIR           MIPSX:              MIPS merged database (3)  
  
NEW entries of EMBL    XEMBL:              EMBL new entries (4)   
UPDATED entries EMBL   XXEMBL:             EMBL updated entries (5)  
  
GENBANK update excl.   GB_NEW:             GENBANK exclusion (6)  

1) The definition of GENEMBL can vary. Depending on the location, you can use either GENBANK with an exclusion set of EMBL data not found in GENBANK, or vice versa (e.g., in Basel). Depending on whether you are connected to a network which is used to update data on a periodic basis, the GENEMBL set may include also daily updates.

2) containing weekly updates

3) PATCHX is updated quarterly and includes the previous release of SWISSPROT, an automatic translation of EMBL, and some other databases.

4) The definitions vary. XEMBL, EM_NEW, EMBL_DAILY, GB_NEW, XSWISS, SW_NEW, PIR4, etc. are names that denote the character of the preliminary entries.

5) This is a Basel-specific item. The main purpose of this database is to find new data in the annotation, as updates rarely include changes in the sequence. In order to have the main EMBL database show not too many entries in FASTA runs, the XXEMBL database is not included in the usual GENEMBL set.

6) This is a Basel-specific item. The weekly updated GENBANK database is compared against EMBL and XEMBL to find those entries which are not yet in the EMBL updates.

Additional databases are available at Basel. Their names are displayed when you start the molecular biology environment. Examples are Amos Bairoch's PROSITE database of protein motifs and Rich Roberts' REBASE database of restriction enzymes.

NOTE: The term GENEMBLPLUS, introduced in GCG version 8.1, is equivalent to GENEMBL. This is a deviation from the standard GCG installation which uses GENEMBL:* to describe all databases except EST and STS sections.

$ findpatterns/nomon

 
FINDPATTERNS identifies sequences with short pattern queries like   
GAATTC or YRYRYRYR.  You can define the patterns ambiguously and   
allow mismatches. You can provide the patterns in a file or simply   
type them in from the terminal.   
FINDPATTERNS in what sequence(s) ?  genembl:*   
Enter patterns individually, one per line. End the list with a blank line.   
               Pattern 1:   G(D,E)(X){0,2}R(D,L)                  
               Pattern 2:  
What should I call the output file (* FindPatterns.Find *) ? <RETURN>          
  
The data can also be searched using the mismatch option, which allows a pre-defined number of mismatches. Depending on the question asked, the output can be fairly voluminous.

$ findpatterns/mismatch=4
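
To see why tolerating mismatches inflates the output, consider the following Python sketch. It is not how findpatterns works internally, which this text does not describe; it is a simple sliding-window count under the assumption of a fixed-length pattern, and the function name and the test sequence are made up for illustration.

```python
def find_with_mismatches(sequence, allowed, max_mismatch):
    """Return (1-based start, window, mismatches) for windows within tolerance."""
    width = len(allowed)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Count positions whose residue is not in the allowed set.
        mismatches = sum(res not in ok for res, ok in zip(window, allowed))
        if mismatches <= max_mismatch:
            hits.append((start + 1, window, mismatches))
    return hits


pattern = [{"G"}, {"D", "E"}, {"R"}, {"D", "L"}]   # G(D,E)R(D,L)
sequence = "MKGDRDAAGEKLT"                          # invented test sequence

print(find_with_mismatches(sequence, pattern, 0))  # [(3, 'GDRD', 0)]
print(find_with_mismatches(sequence, pattern, 1))  # [(3, 'GDRD', 0), (9, 'GEKL', 1)]
print(len(find_with_mismatches(sequence, pattern, 4)))  # 10: every window
```

With four mismatches allowed on a four-residue pattern, every window of the sequence is reported, so the tolerance should be chosen in relation to the pattern length.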

A PROSITE Database Searching Program

PROSITE is the protein site database from A. Bairoch. It can be searched with

$ motifs

If the full text of the abstract is required it can also be searched with

$ motifs /reference

The normal PROSITE search for a pattern does not include "frequently" found patterns such as glycosylation sites. If you want those to be shown as well use

$ motifs /frequent

The SRS system allows you to search for annotation items in the PROSITE database effectively. After a search in PROSITE or PROSITEDOC any resulting hit can be linked into any other sequence database. Similarly, any EMBL or SWISSPROT entry can be linked into the PROSITE database within navigation mode. Alternatively, a whole set can be linked with

[X] Expression

and then something like

SQ1 > PROSITE

As a result of research projects, other protein pattern databases, such as BLOCKS and PRODOM, have been produced and are available in various ways.

BLOCKS and PRODOM are not installed for protein searching, but HASSLE access is in preparation. Both databases are available, however, in the SRS system (see also the description in the related section).

================================= Begin Exercise 11

Patterns: Use of the motif searching program to detect motifs in a protein sequence. Definition of your own pattern derived from previous analysis.

 
|Vertical:from-to | Horizontal:from-to|	weak/strong/other  
|-----------------+-------------------+------------------  
|                 |                   |    
|                 |                   |    
|                 |                   |    
|                 |                   |    
|                 |                   |    
|                 |                   |    
|                 |                   |    
  

 
|  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 | 10 | 11 | 12 | 13  
|----+----+----+----+----+----+----+----+----+----+----+----+----  
|    |    |    |    |    |    |    |    |    |    |    |    |  
|    |    |    |    |    |    |    |    |    |    |    |    |  
|    |    |    |    |    |    |    |    |    |    |    |    |  
|    |    |    |    |    |    |    |    |    |    |    |    |  
|    |    |    |    |    |    |    |    |    |    |    |    |  
|    |    |    |    |    |    |    |    |    |    |    |    |  
|    |    |    |    |    |    |    |    |    |    |    |    |  
  

================================= End Exercise 11

