JAMF Archive: BioCompanion as published in 1995
THIS IS THE REFERENCE CODE AS PUBLISHED. Doelz, R. Optimal production of biological documentation: the JAM format. Comput. Applic. Biosci. 11, 224-226 (1995).
The version you are currently viewing is the one printed and distributed via the Internet from the server of BioComputing Basel. Version 3.1 of the BioCompanion was published with version 2 of the JAMF software. The server that was indicated in the documentation has ceased to exist. Version 3.2 of the BioCompanion was not publicly available for free but was shareware that was distributed with GCG's software release 9. For the purpose of enhanced editing, JAMF was partially rewritten and the proprietary version 3.x of JAMF was used from 1996 onwards. The BioCompanion is available in a current version from the publisher. It has changed significantly in both software and content.
Programs to find patterns in DNA sequences were already mentioned in the single sequence analysis section. The programs mentioned there search for patterns using compositional analysis. However, restriction mapping programs also use pattern approaches.
Computers are fast at comparing letters or numbers when the test is simply whether two symbols are equal or not. In order to apply this principle to biology, we need to define this "matching" in a simple table. For example, if we compare two protein sequences and find the letter D, this is the symbol for aspartic acid. The test for "equal" or "not equal", therefore, reads as
D meeting D => match
meeting any other => mismatch
This is not very sophisticated, in particular as certain
proteins with specific sites (such as calcium binding sites) tend to
be ambivalent and allow either aspartic or glutamic acid. We
could use the
comparison matrices mentioned in the pairwise sequence
comparison section. As already explained there, the comparison
of sequences on the basis of matrices is computationally expensive.
This is not the main reason, however, for using patterns.
The main benefit of patterns is that defined substitutions can occur as the result of a combination of examples - in contrast to a sequence comparison based on a comparison matrix derived from a sequence-independent generalisation of alignments.
This example refers
to a calcium binding site where we have proven examples of sequences
containing E or D at a given
position. To write our pattern, we will allow either of the two
sequence characters to be a 'match' at this position. An example
of this in a short alignment stretch is printed below:
Our example requires that we apply different criteria to different
positions of a sequence alignment. In order to have the position-specific
comparison done by a computer, a sophisticated program is required
which will do this type of calculation for us. Note the difference
from the pairwise alignment
programs: there, the comparison of symbols was position-independent.
Now, using patterns, we do position-dependent comparison
calculations. As we cannot expect to get a specific program for
each special pattern, we need a general pattern matching
program which reads the rules of a pattern written in a
pattern language.
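The difference between position-independent and position-dependent comparison can be sketched in a few lines of Python. This is only an illustration, not the way any GCG program is implemented: the rule list RULES and the function names are hypothetical, and the rules assume a four-residue site that allows G, then D or E, then R, then D or L (taken from the fragments GDRD and GERL used in this chapter).

```python
# Position-specific match rules: one set of allowed residues per position.
# RULES is a hypothetical encoding of "G, then D or E, then R, then D or L".
RULES = [{"G"}, {"D", "E"}, {"R"}, {"D", "L"}]

def matches_at(sequence, start, rules=RULES):
    """Return True if every rule is satisfied at the given 0-based offset."""
    window = sequence[start:start + len(rules)]
    if len(window) < len(rules):
        return False  # not enough residues left for a full match
    return all(residue in allowed for residue, allowed in zip(window, rules))

def find_sites(sequence, rules=RULES):
    """Return all 0-based offsets where the position-specific rules match."""
    return [i for i in range(len(sequence) - len(rules) + 1)
            if matches_at(sequence, i, rules)]

if __name__ == "__main__":
    print(find_sites("AAGDRDKKGERLAA"))
    print(find_sites("GLRD"))  # L is allowed at position 4 only, not at position 2
```

Note that L is accepted in the last position but rejected in the second one, which a single position-independent comparison table could not express.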
Creating such a convention to describe patterns in a flexible
fashion is not as difficult as one might assume, because patterns
are typically short (from a few amino acids up to a few dozen,
but rarely more than a hundred symbols) compared with a query
sequence which we want to screen for the occurrence of a pattern.
By reading pattern definitions, the pattern matching program
is able to search a given
sequence for the pattern defined in a specific language. There
are various ways to define such a language, and the "regular
expressions" of some essential programs of the UNIX operating
system use such a language. The GCG software package
searches with a straightforward definition; its most
important features are listed below.
The residues to be allowed in a given position are listed with
a comma (,) as a separator and are enclosed in parentheses, (
and ). To specify our example above,
it would be sufficient to write G(D,E)R(D,L).
Occasionally, the number of residues to be allowed in a given
position becomes fairly large. Then, it is easier to define which
residues are not found at this place. The pattern
language uses a tilde (~) in front of a character or a list enclosed
in parentheses in order to define these undesired residues.
Note that, in the description of the four sequence motifs below,
positions 2 and 3 are identical for the sake of demonstration.
Whereas exclusion was used to define position 2, the explicit
listing of allowed residues in position 3 seems to be more efficient
in this case. However, real-world examples exist where exclusion
can be used beneficially. Generally, pattern creation
from protein sequences is more straightforward and requires less
experience and time than pattern creation from DNA. You might review the corresponding section
for details on nucleotide patterns and limitations of the method.
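To experiment with this notation outside GCG, a pattern can be translated into a regular expression. The following is a minimal sketch under stated assumptions: the function name gcg_to_regex is hypothetical, and it covers only the subset of the notation used in this chapter (single residues, lists in parentheses, the tilde for exclusion, X for any residue, and a repeat count in braces as in (X){0,2}). It does no error checking and is not the parser used by findpatterns.

```python
import re

def gcg_to_regex(pattern):
    """Translate a small subset of the pattern notation into a Python regex."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        negate = False
        if ch == "~":                      # exclusion applies to the next item
            negate = True
            i += 1
            ch = pattern[i]
        if ch == "(":                      # list of residues: (D,E)
            close = pattern.index(")", i)
            residues = pattern[i + 1:close].split(",")
            i = close + 1
        elif ch == "{":                    # repeat count: {0,2} is copied as is
            close = pattern.index("}", i)
            out.append(pattern[i:close + 1])
            i = close + 1
            continue
        else:                              # single residue
            residues = [ch]
            i += 1
        if "X" in residues and not negate:
            out.append(".")                # X stands for any residue
        else:
            out.append("[" + ("^" if negate else "") + "".join(residues) + "]")
    return "".join(out)

if __name__ == "__main__":
    regex = gcg_to_regex("G(D,E)(X){0,2}R(D,L)")
    print(regex)
    for fragment in ("GDGTRD", "GERL", "GDMRD"):
        print(fragment, bool(re.fullmatch(regex, fragment)))
```

Writing every position as a character class keeps the translation uniform; the exclusion list simply becomes a negated class.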
Creating your own patterns involves several steps.
The 'bestfit' program
calculates the best local homology and can be applied to several sequences.
It might be suited to identify patterns as revealed by a
'dotplot' generated from a
'compare' run. The identification of repetitive patterns
requires using the pairwise sequence alignment program with
varying start and end regions, as only the best hits occurring
in a given sequence are reported.
A good approach is to take a piece of paper
and write down the sequences of interest. Next, decide whether
several amino acids can be defined as "alternatives" at a position,
e.g., E OR
D, to be written as (E,D). Before assigning an X (for "any"),
check whether you can use the NOT instead: even in a heterogeneous
motif there may be amino acids which are "never" observed
(e.g., never seeing W or H is written as ~(W,H)). Use the command
genhelp motifs define to view details in the
documentation on how to write patterns, or refer to the description
above.
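The pencil-and-paper step can be mimicked with a short script. This is a sketch under simplifying assumptions: the fragments are already aligned, have equal length and contain no gaps, and the function name pattern_from_alignment is hypothetical. It only lists the residues observed in each column; deciding where an X or an exclusion with ~ is more appropriate remains a matter of judgement.

```python
def pattern_from_alignment(fragments):
    """Build a pattern string from equal-length, ungapped aligned fragments."""
    if len({len(f) for f in fragments}) != 1:
        raise ValueError("fragments must be non-empty and of equal length")
    parts = []
    for column in zip(*fragments):         # walk through the alignment columns
        observed = sorted(set(column))
        if len(observed) == 1:
            parts.append(observed[0])      # conserved position
        else:
            parts.append("(" + ",".join(observed) + ")")  # alternatives
    return "".join(parts)

if __name__ == "__main__":
    print(pattern_from_alignment(["GDRD", "GERL"]))
    print(pattern_from_alignment(["GDD", "GEE", "GNN", "GQQ"]))
```

A pattern derived this way matches exactly what was observed and nothing else, so it is only as good as the set of fragments it was built from.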
Use the created pattern to re-screen the sequence(s)
you used to create the pattern. Make sure you find all sequence
fragments which were used to construct the pattern. Use the program
findpatterns (see below). If you do not detect
the pattern as expected, consider weakening the stringency by
allowing mismatches in the pattern search (see below) and estimate
the loss of sensitivity caused by this flexibility - occasionally,
it is advantageous to have more alternatives (resulting in more
complex patterns) rather than allowing mismatches.
Consider rerunning the findpatterns program
with the option "mismatch=2" in order to allow for an even more
flexible definition. Locate variations which have been assigned
"X" and determine whether the symbols found permit a more detailed
definition, and check whether an extension of the pattern is
feasible. The output of
pattern searching programs usually shows the "overhang" on
each side of the pattern. If you feel that you miss important
positions because this overhang is too small, append (X){10,10}
to each side of the pattern (which will add ten more
symbols to each side without affecting the rest of the pattern)
and rerun the pattern search.
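The effect of allowing mismatches can be explored with a small sketch. It assumes a fixed-length pattern given as one set of allowed residues per position (None standing for X) and simply counts the positions that violate their rule; how findpatterns itself counts mismatches may differ, and the names find_with_mismatches and RULES are hypothetical.

```python
# One entry per pattern position: a set of allowed residues, or None for X.
RULES = [{"G"}, {"D", "E"}, {"R"}, {"D", "L"}]

def find_with_mismatches(sequence, rules, max_mismatch=0):
    """Return (offset, mismatches) for every window within the mismatch limit."""
    hits = []
    width = len(rules)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        mismatches = sum(1 for residue, allowed in zip(window, rules)
                         if allowed is not None and residue not in allowed)
        if mismatches <= max_mismatch:
            hits.append((start, mismatches))
    return hits

if __name__ == "__main__":
    sequence = "AAGDRDKKGQRLAA"
    print(find_with_mismatches(sequence, RULES, max_mismatch=0))
    print(find_with_mismatches(sequence, RULES, max_mismatch=1))
```

Raising the limit by one already doubles the number of hits in this toy sequence, which illustrates why adding an alternative such as Q at position 2 can be preferable to a general mismatch allowance.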
Consider the output of the steps performed above and evaluate
the possibility of writing down more than one pattern. This is
possibly a more sensitive approach. To do this, fetch
the file "patterns.dat" and call a text editor. Once you have started the EDIT or TPU editor,
enter your patterns into this file. Possibly, you will need
more than one pattern of similar length. The purpose of
this approach is to exclude the negative effects of arbitrary
combinations (see considerations below).
Rerun findpatterns with the new pattern file
on the whole database and check whether the results are satisfactory.
Length
If the pattern is long, the sensitivity might
be rather high. However, the sensitivity
will decrease if the basis for the pattern creation
was too small. If you use a single sequence as pattern, this
will result in a pattern search which identifies this (and only
this) sequence.
Information Content
If two or more very similar sequences are used for pattern creation,
the information added by multiple sequences might be low. The
reason for this is obvious, as the decision on "conserved"
residues is biased by the similarity of the sequences. If four
aligned sequences show identical residues, the information
content of the pattern is identical to that of an individual sequence
fragment.
False Positives
If a pattern is too stringent, it will miss sequences which have a slight variant in the
desired sequence motif. It might be, however, that the sequences
used for alignment will result in a pattern which is well-defined
but occurs in a totally different environment. Pattern refinement
might help only if the "false positives" can be discriminated
by flanking sequences. The current software, however, does not
permit exclusion based on additional information such as occurrence
in specific cellular compartments or similar.
Significance
Patterns with a very large flexibility (such as "XXXXX" in the
extreme) will not be useful. Additionally, considerations of
statistical expectation apply. If we have a pattern of four
amino acids, the probability of meeting this pattern
by chance is calculated assuming that the distribution of amino
acids is "random". As we have 20 amino acid symbols, the probability
of a single residue matching is one in 20.
findpatterns searches
databases (e.g., genembl:*),
a file of sequence names (e.g., @my.fil, or my.msf{*}, see later), or
single sequences (e.g., my.seq) for patterns. The patterns are
reported with exact matches, as shown below. If databases are
searched, the nomonitor option is recommended.
Databases available at Basel
University include:
1) The definition of GENEMBL can vary. Depending on the
location, you can use either GENBANK with an exclusion set of
EMBL data not found in GENBANK, or vice versa (e.g., in Basel).
Depending on whether you are connected to a network which is
used to update data on a periodic basis, the GENEMBL set may
also include daily updates.
2) Containing weekly updates.
3) PATCHX is updated quarterly and includes the previous release
of SWISSPROT, an automatic translation of EMBL, and some other
databases.
4) The definitions vary. XEMBL, EM_NEW, EMBL_DAILY, GB_NEW, XSWISS,
SW_NEW, PIR4, etc. are names that denote the character of the
preliminary entries.
5) This is a Basel-specific item. The main purpose of this database
is to find new data in the annotation, as updates rarely include
changes in the sequence. In order to keep the main EMBL database
from showing too many entries in FASTA runs, the XXEMBL database
is not included in the usual GENEMBL set.
6) This is a Basel-specific item. The weekly updated GENBANK
database is calculated against EMBL and XEMBL to find those entries
which are not in the EMBL updates yet.
Additional databases are
available at Basel. Their names are displayed when you start
the molecular biology environment. Examples are Amos Bairoch's
PROSITE database of protein motifs, or Rich Roberts' REBASE database
of restriction enzymes.
NOTE: The term GENEMBLPLUS, introduced in GCG version 8.1, is
equivalent to GENEMBL. This is a deviation from the standard
GCG installation which uses GENEMBL:* to describe all databases
except EST and STS sections.
$ findpatterns/nomon
$ findpatterns/mismatch=4
PROSITE is the protein
site database by A. Bairoch. It can be searched with
$ motifs
If the full text of the abstract is required it can also be
searched with
$ motifs /reference
The normal PROSITE search for a pattern
does not include "frequently" found patterns such as glycosylation
sites. If you want those to be shown as well use
$ motifs /frequent
The SRS system allows you
to search for annotation items in the PROSITE database effectively.
After a search in PROSITE or PROSITEDOC any resulting hit can
be linked into any other sequence database. Similarly, any EMBL
or SWISSPROT entry can be linked into the PROSITE database within
navigation mode. Alternatively, a whole set can be linked with
[X] Expression
and then something like
SQ1 > PROSITE
As a result of research projects,
other protein pattern databases have been produced, and are available
in various ways, such as BLOCKS and PRODOM.
BLOCKS and PRODOM are not installed
for protein searching but HASSLE access is in preparation. Both
databases are available, however, in the SRS
system (see also the description in the
related section).
================================= Begin Exercise 11
Patterns: Use of the motif searching program to detect
motifs in a protein sequence. Definition of your own pattern derived
from previous analysis.
Generate a schematic comparison (using the compare
and dotplot programs) with your peptide and
use parameters like window 30, stringency 15. Measure the positions
where "diagonals" indicate local homology:
Run the bestfit program on the fragments
determined above and write down the aligned sequences here (6
to 10 might be sufficient, fewer than 4 are too few):
Create a pattern and search it with the findpatterns
program. Compare the pattern and the search output
with the entry you found in the first exercise.
Search your own sequence from a previous exercise (my1.pep)
with the motifs program in the PROSITE database.
Compare the output with your pattern derived in the experiment
above.
================================= End Exercise 11
sequence fragment 1: GDRD
sequence fragment 2: GERL
Our sequence match table, therefore, reads for aspartic
acid (for the symbol D in position 2)
D meeting D => match
meeting E => match
meeting any other => mismatch
A suitable matrix would allow this as well. However, keep
in mind that a matrix covers symbol pairings at any position.
To make this clearer, look at the fourth position of the sequence
fragments above. If we have a leucine residue at a position where
we usually find aspartic acid, this would read (for the symbol D in
position 4)
D meeting D => match
meeting L => match
meeting any other => mismatch
Again, a suitable matrix would possibly cover this occurrence,
but we would have a problem with our first case: as we have now allowed
L as an alternative to D, this would be
bad for our earlier definition of D or E.
The solution, therefore, is to define substitutions as a function
of the sequence:
D meeting D => match
meeting E at pos. 2 => match
meeting L at pos. 4 => match
meeting any other => mismatch
Definition of a Pattern Language
sequence fragment 1: GDRD
sequence fragment 2: GERL
pattern description: G(D,E)R(D,L)
sequence fragment 1: GDGTRD
sequence fragment 2: GERL
sequence fragment 3: GDMRD
pattern description: G(D,E)(X){0,2}R(D,L)
sequence fragment 1: GDD
sequence fragment 2: GEE
sequence fragment 3: GNN
sequence fragment 4: GQQ
pattern description: G~(A,C,F,G,H,I,K,L,M,P,R,S,T,V,W,Y)(D,E,N,Q)
Creation of Patterns
Iterative Schedule
Considerations: Pattern Sensitivity
1/20 = 5 %
Four residues, therefore, account for
1/20 * 1/20 * 1/20 * 1/20 = 0.00000625 = less than 0.001 %
This seems to
be rather significant. Let us consider the example as derived
above and compute the probability of two options:
sequence fragment 1: GDRD
sequence fragment 2: GERL
option 1: pattern description: G(D,E)R(D,L)
option 2: pattern description: G(D,E)X(D,L)
The probability of option 1 is calculated with changed
values at position 2 and 4 as we have two alternatives each:
1/20 * 1/10 * 1/20 * 1/10 = 0.000025 = 0.0025 %
The probability of option 2 has a "joker" in position
3, and as we will not find sequences shorter than four residues
in sequence databases we can set the probability of finding an X
in a tetra-peptide at position three to 1:
1/20 * 1/10 * 1 * 1/10 = 0.0005 = 0.05 %
The value of less than one per mille looks low, but is much
too high to allow a very sensitive search. Assume
a database of 50000 sequences and further that a match can occur
only once per sequence. (This is a crude guess as the chance of pattern occurrence
depends on the composition.) We would expect 0.0005
multiplied by 50000, which results in 25 hits - by random chance!
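The arithmetic above can be repeated for any pattern with a few lines of Python. The sketch makes the same crude assumptions as the text: all 20 amino acids are equally likely, positions are independent, and a match can occur only once per sequence. The function names are hypothetical; each pattern is described by the number of residues allowed at each position (1 for a fixed residue, 2 for (D,E), 20 for X).

```python
from fractions import Fraction

def chance_probability(allowed_counts, alphabet_size=20):
    """Probability that a random stretch matches, given allowed residues per position."""
    p = Fraction(1)
    for count in allowed_counts:
        p *= Fraction(count, alphabet_size)
    return p

def expected_random_hits(allowed_counts, n_sequences, alphabet_size=20):
    """Expected number of chance hits, assuming one possible match per sequence."""
    return float(chance_probability(allowed_counts, alphabet_size) * n_sequences)

if __name__ == "__main__":
    option_1 = [1, 2, 1, 2]    # G(D,E)R(D,L)
    option_2 = [1, 2, 20, 2]   # G(D,E)X(D,L)
    print(float(chance_probability(option_1)))
    print(float(chance_probability(option_2)))
    print(expected_random_hits(option_2, 50000))
```

Exact fractions are used so that the products are not affected by floating-point rounding before the final conversion.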
Programs
The 'findpatterns' Program
Database name            GCG name      contents
----------------------------------------------------------------
EMBL + Updates
GENBANK + Updates
(GB as exclusion set)    GENEMBL:      all DNA databases       (1)
SWISSPROT                SWISSPROT:    most proteins           (2)
PIR International        PIR:          most proteins
PATCHX + PIR             MIPSX:        MIPS merged database    (3)
NEW entries of EMBL      XEMBL:        EMBL new entries        (4)
UPDATED entries EMBL     XXEMBL:       EMBL updated entries    (5)
GENBANK update excl.     GB_NEW:       GENBANK exclusion       (6)
FINDPATTERNS identifies sequences with short pattern queries like
GAATTC or YRYRYRYR. You can define the patterns ambiguously and
allow mismatches. You can provide the patterns in a file or simply
type them in from the terminal.
FINDPATTERNS in what sequence(s) ? genembl:*
Enter patterns individually, one per line. End the list with a blank line.
Pattern 1: G(D,E)(X){0,2}R(D,L)
Pattern 2:
What should I call the output file (* FindPatterns.Find *) ? <RETURN>
The data can also be searched using the mismatch option, which
allows a pre-defined number of mismatches. Depending on the question
asked, the output can be fairly voluminous.
A PROSITE Database Searching Program
|Vertical:from-to | Horizontal:from-to| weak/strong/other
|-----------------+-------------------+------------------
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13
|----+----+----+----+----+----+----+----+----+----+----+----+----
| | | | | | | | | | | | |
| | | | | | | | | | | | |
| | | | | | | | | | | | |
| | | | | | | | | | | | |
| | | | | | | | | | | | |
| | | | | | | | | | | | |
| | | | | | | | | | | | |