Generally, creating patterns from protein sequences is more straightforward and requires less experience and time than creating DNA patterns. You might review the corresponding section for details on nucleotide patterns and the limitations of the method.
Creating your own patterns involves several steps.
The 'bestfit' program calculates the best local homology as applied to several sequences. It may be suited to identifying patterns revealed by a 'dotplot' generated from a 'compare' run.
Identifying repetitive patterns requires running the pairwise sequence alignment program with varying start and end regions, as only the best hits occurring in a given sequence are reported.
A good approach is to take a piece of paper and write down the sequences of interest. Next, decide whether several amino acids can be defined as alternatives at a position, e.g., E OR D, written as (E,D). Before assigning an X (for "any"), check whether you can use NOT instead: even a heterogeneous position may have amino acids which are "never" observed (e.g., never seen W or H is written as ~(W,H)). Use the command genhelp motifs define to view the documentation on how to write patterns, or refer to the description above.
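The notation described in this step can be sketched outside the package as a translation into a Python regular expression. This is only an illustration: the function name `pattern_to_regex` is hypothetical, it covers just the three notations mentioned here ((E,D), ~(W,H) and X), and it makes no claim to reproduce how the pattern programs parse their input.

```python
import re


def pattern_to_regex(pattern):
    """Translate a simplified motif pattern into a Python regular expression.

    Only the notations mentioned in the text are handled (an assumption):
      (E,D)   -> [ED]    one of the listed residues
      ~(W,H)  -> [^WH]   any residue except the listed ones
      X       -> .       any residue
    All other letters are taken literally.
    """
    out = []
    i = 0
    while i < len(pattern):
        negate = pattern[i] == "~"
        if negate:
            i += 1
            if i >= len(pattern) or pattern[i] != "(":
                raise ValueError("'~' must be followed by a parenthesised list")
        ch = pattern[i]
        if ch == "(":
            close = pattern.index(")", i)  # raises ValueError if unbalanced
            residues = pattern[i + 1:close].replace(",", "").replace(" ", "")
            out.append("[" + ("^" if negate else "") + residues + "]")
            i = close + 1
        elif ch == "X":
            out.append(".")
            i += 1
        elif ch.isalpha():
            out.append(ch)
            i += 1
        else:
            raise ValueError("unsupported symbol %r" % ch)
    return "".join(out)


regex = pattern_to_regex("G(E,D)X~(W,H)")
print(regex)                                  # G[ED].[^WH]
print(bool(re.search(regex, "AAGEKLAA")))     # True: GEKL fits the pattern
print(bool(re.search(regex, "AAGDRWAA")))     # False: W is excluded at position 4
```

Writing the NOT form first and falling back to X only when nothing can be excluded keeps the pattern as restrictive as the data allow.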
Use the created pattern to re-screen the sequence(s) you used to create it. Make sure you find all sequence fragments which were used to construct the pattern. Use the program findpatterns (see below).
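The same sanity check can be sketched in a few lines of Python, as a stand-in for (not a replacement of) a findpatterns run. The helper name `missed_fragments` and the regular expressions are assumptions made for this illustration; the two fragments are the ones used in the Significance example of this section.

```python
import re


def missed_fragments(regex, fragments):
    """Return the source fragments that the pattern fails to find."""
    compiled = re.compile(regex)
    return [frag for frag in fragments if compiled.search(frag) is None]


source_fragments = ["GDRD", "GERL"]

# A pattern covering both fragments leaves nothing behind.
print(missed_fragments("G[DE]R[DL]", source_fragments))   # []

# A pattern that forgot the L alternative misses the second fragment.
print(missed_fragments("G[DE]RD", source_fragments))      # ['GERL']
```

An empty result only shows that the pattern is consistent with its own training fragments; it says nothing yet about false positives.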
If you do not detect the pattern as expected, consider weakening the stringency by allowing mismatches in the pattern search (see below), and estimate the loss of sensitivity caused by this flexibility. Occasionally, it is advantageous to have more alternatives (resulting in more complex patterns) rather than allowing mismatches.
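To get a feeling for what allowing mismatches does, the following minimal Python sketch slides the pattern along a sequence and counts the positions that do not fit. The function name, the representation of the pattern as a list of allowed-residue sets (with `None` standing for X) and the test sequence are all made up for this illustration; how the package itself counts mismatches is not assumed here.

```python
def find_with_mismatches(sequence, allowed, max_mismatch=0):
    """Report every window that fits the pattern with few enough mismatches.

    allowed: one entry per pattern position, either a set of permitted
             residues or None for "any residue" (X).
    Returns a list of (start, window, mismatches) tuples, 0-based start.
    """
    hits = []
    width = len(allowed)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        mismatches = sum(
            1
            for residue, ok in zip(window, allowed)
            if ok is not None and residue not in ok
        )
        if mismatches <= max_mismatch:
            hits.append((start, window, mismatches))
    return hits


# G(D,E)R(D,L) as allowed-residue sets
pattern = [{"G"}, {"D", "E"}, {"R"}, {"D", "L"}]
sequence = "MKGDRDAAGERLTTGAKDQ"   # invented test sequence

print(find_with_mismatches(sequence, pattern))
# [(2, 'GDRD', 0), (8, 'GERL', 0)]

for hit in find_with_mismatches(sequence, pattern, max_mismatch=2):
    print(hit)   # now also reports windows such as GAKD at position 14
```

Comparing the two hit lists shows directly how many additional, possibly unwanted, windows each permitted mismatch brings in.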
Search an alternative sequence or a subset of the database (e.g., SWISSPROT:*YEAST) to detect which other possible patterns apply. Mismatches (see above) and wildcards (X in proteins) in particular give rise to the detection of 'false positives'. Write down the desired and the undesired exceptions. This might sound odd, as the 'computer' should do all this for you, but pattern refinement is manual work and can rarely be automated satisfactorily.
Consider rerunning the findpatterns program with the option "mismatch=2" to allow for an even more flexible definition. Locate variations which have been assigned "X" and determine whether the symbols found permit a more detailed definition, and check whether an extension of the pattern is feasible. The output of pattern searching programs usually shows the "overhang" on each side of the pattern. If you feel that you miss important positions because this overhang is too small, append (X){10,10} to each side of the pattern (which will add ten more symbols to each side without affecting the rest of the pattern) and rerun the pattern search.
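The effect of padding a pattern with (X){10,10} can be imitated with a Python regular expression, where `.{10}` plays the role of ten arbitrary symbols. This is a sketch under that assumption; the function name and the test sequences are invented, and the behaviour of the pattern programs near sequence ends is not claimed to be the same.

```python
import re


def hits_with_flanks(core_regex, sequence, flank=10):
    """Find the core pattern together with a fixed number of flanking symbols.

    Returns (core_start, left_flank, core, right_flank) for each hit.
    """
    padded = re.compile(
        "(?P<left>.{%d})(?P<core>%s)(?P<right>.{%d})" % (flank, core_regex, flank)
    )
    return [
        (m.start("core"), m.group("left"), m.group("core"), m.group("right"))
        for m in padded.finditer(sequence)
    ]


sequence = "A" * 12 + "GDRD" + "C" * 12
print(hits_with_flanks("G[DE]R[DL]", sequence))
# [(12, 'AAAAAAAAAA', 'GDRD', 'CCCCCCCCCC')]

# In this regex sketch a hit closer than `flank` symbols to either end is lost,
# because the padding must match real symbols.
print(hits_with_flanks("G[DE]R[DL]", "AAGDRDCC"))            # []
print(hits_with_flanks("G[DE]R[DL]", "AAGDRDCC", flank=2))   # [(2, 'AA', 'GDRD', 'CC')]
```

The flanking columns collected this way are the raw material for deciding whether the pattern can be extended.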
Consider the output of the steps above and evaluate the possibility of writing down more than one pattern. This is possibly a more sensitive approach. To do this, fetch the file "patterns.dat" and open it in a text editor. Once you have started 'vi', 'emacs' or similar, enter your patterns into this file. You may need more than one pattern of similar length. The purpose of this approach is to exclude the negative effects of arbitrary combinations (see considerations below).
Rerun findpatterns with the new pattern file on the whole database and check whether the results are satisfactory.
Length
If the pattern is long, the sensitivity might be rather high. However, the sensitivity will decrease if the basis for the pattern creation was too small. If you use a single sequence as a pattern, this will result in a pattern search which identifies this (and only this) sequence.
Information Content
If two or more very similar sequences are used for pattern creation, the information added by multiple sequences might be low. The reason is that the decision on which residues are "conserved" is biased by the similarity of the sequences. If four aligned sequences show identical residues, the information content of the pattern is identical to that of an individual sequence fragment.
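This point can be made concrete with a small Python sketch that builds a pattern column by column from aligned fragments of equal length. The function name is hypothetical, and the rule "list every residue seen in a column" is a deliberately naive assumption; real pattern design involves the manual judgement described above. The fragments GDRD and GERL are the ones used in the Significance example of this section.

```python
def pattern_from_alignment(fragments):
    """Build a naive pattern from aligned, equal-length fragments.

    A column with one residue becomes that residue; a column with several
    residues becomes an alternative list such as (D,E).
    """
    if len({len(frag) for frag in fragments}) != 1:
        raise ValueError("need at least one fragment, all of equal length")
    parts = []
    for column in zip(*fragments):
        residues = sorted(set(column))
        if len(residues) == 1:
            parts.append(residues[0])
        else:
            parts.append("(" + ",".join(residues) + ")")
    return "".join(parts)


# Two different fragments add information to the pattern ...
print(pattern_from_alignment(["GDRD", "GERL"]))                   # G(D,E)R(D,L)

# ... whereas four identical fragments add nothing beyond a single one.
print(pattern_from_alignment(["GDRD", "GDRD", "GDRD", "GDRD"]))   # GDRD
print(pattern_from_alignment(["GDRD"]))                           # GDRD
```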
False Positives
If a pattern is too stringent, it will miss sequences which have a slight variant of the desired sequence motif. It might be, however, that the sequences used for alignment result in a pattern which is well-defined but also occurs in a totally different environment. Pattern refinement might help only if the "false positives" can be discriminated by flanking sequences. The current software, however, does not permit exclusion based on additional information such as occurrence in specific cellular compartments or similar.
Significance
Patterns with a very large flexibility (such as "XXXXX" in its extreme) will not be useful. Additionally, considerations on the statistical expectation apply. If we have a pattern of four amino acids, the probability of meeting this pattern by chance is calculated assuming that the distribution of amino acids is "random". As we have 20 amino acid symbols, the probability that a single residue matches is one in 20, which is
$$\frac{1}{20} = 5\,\%$$
Four residues, therefore, account for
$$\frac{1}{20} \cdot \frac{1}{20} \cdot \frac{1}{20} \cdot \frac{1}{20} = 0.00000625 = 0.000625\,\%$$
which is less than 0.001 %.
This seems to be rather significant. Let us consider the example as derived above and compute the probability of two options:
sequence fragment 1: GDRD
sequence fragment 2: GERL
option 1: pattern description: G(D,E)R(D,L)
option 2: pattern description: G(D,E)X(D,L)
The probability of option 1 is calculated with changed values at positions 2 and 4, as we have two alternatives each:
$$\frac{1}{20} \cdot \frac{1}{10} \cdot \frac{1}{20} \cdot \frac{1}{10} = 0.000025 = 0.0025\,\%$$
The probability of option 2 has a "joker" in position 3, and as we will not find sequences shorter than four residues in sequence databases, we can set the probability of finding an X at position three of a tetra-peptide to 1:
$$\frac{1}{20} \cdot \frac{1}{10} \cdot 1 \cdot \frac{1}{10} = 0.0005 = 0.05\,\%$$
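The three products above follow one rule: each position contributes the number of permitted residues divided by 20. A minimal Python sketch of that rule, assuming (as the text does) a uniform "random" distribution over 20 amino acids, is given here; the function name is made up for the illustration, and exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction

ALPHABET_SIZE = 20  # amino acid symbols, all assumed equally likely


def chance_probability(alternatives_per_position):
    """Probability that a random peptide of the same length fits the pattern.

    alternatives_per_position: number of permitted residues at each position
    (1 for a fixed residue, 2 for (D,E), 20 for X).
    """
    p = Fraction(1)
    for k in alternatives_per_position:
        p *= Fraction(k, ALPHABET_SIZE)
    return p


print(float(chance_probability([1, 1, 1, 1])))    # four fixed residues: 6.25e-06
print(float(chance_probability([1, 2, 1, 2])))    # option 1, G(D,E)R(D,L): 2.5e-05
print(float(chance_probability([1, 2, 20, 2])))   # option 2, G(D,E)X(D,L): 0.0005
```

Replacing a single fixed residue by X raises the chance probability twentyfold, which is why each X deserves a second look.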
The value of less than one per mille looks low, but is much too high to allow a very sensitive search. Assume a database of 50000 sequences and further that a match can occur only once per sequence. (This is a crude guess, as the chance of pattern occurrence depends on the composition.) By chance, we would have 0.0005 multiplied by 50000, which results in 25 hits by random chance alone!
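The estimate can be reproduced, and loosened, with a short Python sketch. The function name is hypothetical. The second call drops the "only once per sequence" simplification by counting every possible start position; the figure of 347 windows per sequence is an invented illustration value (a 350-residue sequence offers 347 start positions for a four-residue pattern), not a property of any real database.

```python
from fractions import Fraction


def expected_random_hits(p_match, n_sequences, windows_per_sequence=1):
    """Expected number of chance hits under the uniform random model."""
    return p_match * n_sequences * windows_per_sequence


p_option2 = Fraction(1, 2000)   # 0.0005, the chance probability of G(D,E)X(D,L)

# The crude estimate from the text: one possible match per sequence.
print(expected_random_hits(p_option2, 50000))        # 25

# Hypothetical refinement: 347 candidate windows per sequence.
print(expected_random_hits(p_option2, 50000, 347))   # 8675
```

Under either assumption the conclusion of the text stands: a four-residue pattern with a wildcard is far too permissive for a database search.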