Section 10.3: Programs


Subsection 10.3.1

The 'findpatterns' Program

findpatterns searches databases (e.g., genembl:*), a list of sequence names (e.g., @my.fil or my.msf{*}; see later), or single sequences (e.g., my.seq) for patterns. Each hit is reported together with the exact sequence that matched, as shown below. When whole databases are searched, the -nomonitor option is recommended, since it suppresses the progress display on the screen.

Databases available at the local site usually include:

 
Database name          GCG name            contents
----------------------------------------------------------------
EMBL + Updates
GENBANK + Updates
(GB as exclusion set)  GENEMBLPLUS:        all DNA databases (1)

SWISSPROT              SWISSPROT:          most proteins
PIR International      PIR:                most proteins
NEW entries of EMBL    EM_NEW:             EMBL new entries (2)
GENBANK updates        GB_NEW:             GENBANK new entries (3)
  

1) The definition of GENEMBL can vary. Depending on the site, it is built either from GENBANK plus an exclusion set of EMBL entries not found in GENBANK, or vice versa. If your site is connected to a network through which the data are updated periodically, the GENEMBL set may also include daily updates.

2), 3) The definitions vary. XEMBL, EM_NEW, EMBL_DAILY, GB_NEW, XSWISS, SW_NEW, PIR4, etc. are names for sets of preliminary entries. Depending on your site and/or affiliation, entries that are not yet found in either the EMBL or the GENBANK update set may show up in the corresponding other set as a so-called "exclusion set". At other sites, GB_NEW and EM_NEW may contain all entries of GENBANK and EMBL, respectively, which can cause duplicate entries.

NOTE: The term GENEMBLPLUS, introduced in GCG version 8.1, is equivalent to GENEMBL, which was used before version 8.1.
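
The "exclusion set" idea from notes 1) to 3) can be pictured as a simple set operation. The following is only a minimal Python sketch, not how GCG builds its databases; the function name and the accession numbers are hypothetical and chosen purely for illustration.

```python
def build_combined_set(primary, secondary):
    """Combine two accession -> entry mappings without duplicates.

    Entries from `secondary` are added only if their accession is absent
    from `primary`; those additions form the "exclusion set".
    """
    exclusion_set = {acc: entry for acc, entry in secondary.items()
                     if acc not in primary}
    combined = dict(primary)
    combined.update(exclusion_set)
    return combined, exclusion_set


# Hypothetical accession numbers, for illustration only.
genbank = {"ACC001": "entry A", "ACC002": "entry B"}
embl = {"ACC002": "entry B", "ACC003": "entry C"}

genembl, excluded = build_combined_set(genbank, embl)
print(sorted(genembl))   # ['ACC001', 'ACC002', 'ACC003']
print(sorted(excluded))  # ['ACC003']
```

If the second set were added in full instead of as an exclusion set, ACC002 would be searched twice, which is the kind of duplication described in notes 2) and 3).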

% findpatterns -nomon

 
FINDPATTERNS identifies sequences with short pattern queries like
GAATTC or YRYRYRYR.  You can define the patterns ambiguously and
allow mismatches. You can provide the patterns in a file or simply
type them in from the terminal.

FINDPATTERNS in what sequence(s) ?  genembl:*

Enter patterns individually, one per line. End the list with a blank line.

               Pattern 1:   G(D,E)(X){0,2}R(D,L)
               Pattern 2:

What should I call the output file (* FindPatterns.Find *) ? <RETURN>
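
To see how a pattern such as G(D,E)(X){0,2}R(D,L) is read, it can help to translate it into a regular expression. The Python sketch below is an illustration only and is not part of GCG. It assumes that (A,B) means "A or B", that X stands for any symbol, and that {m,n} gives a repeat range; it handles only this subset of the syntax and does not expand ambiguity codes such as Y or R. The function names are hypothetical.

```python
import re


def _symbol(sym):
    """Map one pattern symbol to regex text; X is assumed to mean 'any symbol'."""
    sym = sym.strip().upper()
    return "." if sym == "X" else re.escape(sym)


def gcg_pattern_to_regex(pattern):
    """Translate a small subset of the pattern syntax shown above into a regex."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "(":
            # (A,B) lists alternatives for one position.
            close = pattern.index(")", i)
            choices = pattern[i + 1:close].split(",")
            out.append("(?:" + "|".join(_symbol(c) for c in choices) + ")")
            i = close + 1
        elif ch == "{":
            # {m,n} repeat ranges are already valid regex syntax.
            close = pattern.index("}", i)
            out.append(pattern[i:close + 1])
            i = close + 1
        else:
            out.append(_symbol(ch))
            i += 1
    return "".join(out)


regex = gcg_pattern_to_regex("G(D,E)(X){0,2}R(D,L)")
print(regex)                                  # G(?:D|E)(?:.){0,2}R(?:D|L)
print(re.search(regex, "AAGDQRLAA").group())  # GDQRL
```

In words: G, then D or E, then zero to two arbitrary symbols, then R, then D or L.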
  

  
The data can also be searched using the mismatch option, which allows a pre-defined number of mismatches. Depending on the question asked and the number of mismatches allowed, the output can be fairly voluminous.

% findpatterns -mismatch=4
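
The effect of allowing mismatches can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm used by findpatterns: it assumes a fixed-length pattern without ambiguity symbols and simply counts differing positions in every window. The function name and the test sequence are made up for the example.

```python
def find_with_mismatches(sequence, pattern, max_mismatches):
    """Return (start, window, mismatches) for each window within the limit."""
    seq, pat = sequence.upper(), pattern.upper()
    hits = []
    for start in range(len(seq) - len(pat) + 1):
        window = seq[start:start + len(pat)]
        # Count the positions where window and pattern differ.
        mismatches = sum(1 for a, b in zip(window, pat) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits


seq = "TTGAATTCAAGACTTCGG"  # made-up test sequence
print(find_with_mismatches(seq, "GAATTC", 0))
# [(2, 'GAATTC', 0)]
print(find_with_mismatches(seq, "GAATTC", 1))
# [(2, 'GAATTC', 0), (10, 'GACTTC', 1)]
```

Each additional mismatch admits more windows, which is why a setting such as -mismatch=4 on a short pattern can produce a very long output file.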


Subsection 10.3.2

A PROSITE Database Searching Program

PROSITE is the database of protein sites and patterns from A. Bairoch. It can be searched with

% motifs

If the full text of the abstract for each motif found is also required, use

% motifs -reference

The normal PROSITE search does not report "frequently" found patterns such as glycosylation sites. If you want those to be shown as well, use

% motifs -frequent
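
Why such patterns are found "frequently" becomes clear when one of them is written out. The PROSITE pattern for the N-glycosylation site is N-{P}-[ST]-{P}: only four positions, two of which merely exclude proline. The Python sketch below converts this notation into a regular expression; it is an illustration only, unrelated to how motifs is implemented. It covers single residues, [..] alternatives, {..} exclusions, x, and (n) or (n,m) repeats, and ignores other PROSITE notation such as anchors. The function name and the test sequence are hypothetical.

```python
import re


def prosite_to_regex(pattern):
    """Convert a basic PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.strip(".").split("-"):
        # Split off an optional repeat suffix such as (3) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            rx = "."                        # any residue
        elif core.startswith("{"):
            rx = "[^" + core[1:-1] + "]"    # any residue except those listed
        else:
            rx = core                       # single residue or [ABC] set
        if low:
            rx += "{" + low + ("," + high if high else "") + "}"
        parts.append(rx)
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}")
print(regex)  # N[^P][ST][^P]

seq = "MNGSANPTQNVTK"  # made-up test sequence
print([(m.start(), m.group()) for m in re.finditer(regex, seq)])
# [(1, 'NGSA'), (9, 'NVTK')]
```

Even this 13-residue toy sequence contains two hits, so in real proteins such short patterns match very often by chance.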

The SRS system allows you to search for annotation items in the PROSITE database effectively. After a search in PROSITE or PROSITEDOC any resulting hit can be linked into any other sequence database. Similarly, any EMBL or SWISSPROT entry can be linked into the PROSITE database within navigation mode. Alternatively, a whole set can be linked with

[X] Expression

and then something like

SQ1 > PROSITE


Subsection 10.3.3

Other Pattern Motif Databases

Research projects have produced other protein pattern databases, such as BLOCKS, SBASE and PRODOM, which are available in various ways.

BLOCKS and PRODOM are usually not installed for protein searching within GCG programs.

================================= Begin Exercise 11

Patterns: Use the motif searching program to detect motifs in a protein sequence. Define a pattern of your own, derived from the previous analysis.

 
|Vertical:from-to | Horizontal:from-to| weak/strong/other
|-----------------+-------------------+------------------
|                 |                   |
|                 |                   |
|                 |                   |
|                 |                   |
|                 |                   |
|                 |                   |
|                 |                   |

|  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 | 10 | 11 | 12 | 13
|----+----+----+----+----+----+----+----+----+----+----+----+----
|    |    |    |    |    |    |    |    |    |    |    |    |
|    |    |    |    |    |    |    |    |    |    |    |    |
|    |    |    |    |    |    |    |    |    |    |    |    |
|    |    |    |    |    |    |    |    |    |    |    |    |
|    |    |    |    |    |    |    |    |    |    |    |    |
|    |    |    |    |    |    |    |    |    |    |    |    |
|    |    |    |    |    |    |    |    |    |    |    |    |
  

  

================================= End Exercise 11

