Programs to find patterns in DNA sequences were already mentioned in the single sequence analysis section . The programs mentioned there search for patterns using compositional analysis. However, restriction mapping programs do also use pattern approaches.
[ Previous chapter ][
This chapter ][ Next chapter ]
Computers are
fast if the comparison of letters or numbers tests
whether these are
equal
or
not. In order to apply this principle to biology, we need to
define this "matching" in a simple
table. E.g., if we compare
two protein sequences and find a letter
D,
this is a symbol for
aspartic
acid.
The test for "equal" or "not equal", therefore, reads as
The main benefit of patterns is that
defined substitutions
can occur as
the result of a
combination of examples
- in contrast to a sequence comparison
based on a comparison
matrix derived from a sequence-independent
generalisation of alignments.
This example refers to a calcium binding site where
we have
proven examples of sequences containing
E
or
D
at
a given position. To write our pattern,
we will allow either of the two sequence characters
to be a 'match' at this position. An example of this in a short
alignment stretch is printed
below:
Our example requires
that we apply different criteria
to different positions of a sequence alignment. In order to
have the
position-specific comparison done by a computer,
a sophisticated program is required
which will do this type of calculation
for us. Note the difference to the
pairwise alignment programs :
The comparison of symbols was position-independent there.
Now, using
patterns, we do position-dependent comparison calculations.
As we cannot expect
to get a specific program for each special pattern,
we need to have a
pattern matching
program
which will define the rules of patterns in a
pattern language.
The creation of such a convention to describe patterns in a flexible
fashion is not as difficult
as one might assume
because patterns are typically short in length
(a few amino acids up to some dozen, but rarely more than
hundred symbols) if compared to
a query sequence which we want
to screen for the occurrence of a pattern. By reading pattern
definitions, the pattern matching program will be
able
to search a given sequence for this pattern defined in
a specific language. There are various
ways to define such a
language, and the "regular expressions" of some essential
programs of
the UNIX operating system use such a language.
The
GCG software package
searches
with a straightforward definition, the most important
features are listed below.
The residues to be allowed in a given position are listed with
a comma (,) as a separator and
are embraced by parenthesis
(
and
).
To specify our example
above, it were sufficient to have
The number of residues in a pattern might be repeated or vary
in any symbol in a certain range. The pattern language allows
repetitions in the syntax
{minimum, maximum}
which consists of two numbers in curled braces.
Loops
in proteins, for example,
might be described as
Occasionally, the number of residues to be allowed in a given position
becomes fairly large.
Then, it is easier to define which residues are
not
found at this place. The
pattern language uses a tilde (~) in front of a character or a list embraced with parenthesis
in order to define these undesired residues. Note that, in the description
of the four sequence
motifs below, the positions 2 and 3 are identical for
the sake of demonstration. Whereas exclusion
was used to define
position 2, the explicit listing of allowed residues in position 3 seems
to be
more efficient in this case. However, real-world examples exist where exclusion
can
be used beneficially.
D meeting D => match
meeting any other => mismatch
This is not very sophisticated, in
particular, as certain proteins
with specific sites (such as calcium binding)
tend to be
ambivalent and allow either aspartic or glutamic acid. We could
use the
comparison matrices as mentioned
in the pairwise sequence
comparison section. As already
explained there, the comparison of sequences on the basis of
matrices is
computationally expensive. This is not the main reason, however, for
using patterns.
Subsection 10.1.1 Example of Pattern Benefit
sequence fragment 1: GDRD
sequence fragment 2: GERL
Our sequence match table, therefore, reads for aspartic acid
(for the symbol D in
position 2)
D meeting D => match
meeting E => match
meeting any other => mismatch
A suitable matrix would allow this as well.
However, keep in mind
that a matrix covers symbol pairings at
any
position.
To make this clearer, look at the fourth position
of the sequence fragments above. If we have
a
leucine residue at a position where we find aspartic acid usually, this were
(for the symbol
D in position 4)
D meeting D => match
meeting L => match
meeting any other => mismatch
Again, a suitable matrix would possibly cover this occurence, but we
would have a
problem with our first case: As we now allowed
L
in parallel to
D,
this were bad for our definition
D
or
E
earlier.
The solution, therefore, is to define substitutions
as a
function of the sequence:
D meeting D => match
meeting E at pos. 2 => match
meeting L at pos. 4 => match
meeting any other => mismatch
Subsection 10.1.2 Definition of a Pattern Language
sequence fragment 1: GDRD
sequence fragment 2: GERL
pattern description: G(D,E)R(D,L)
sequence fragment 1: GDGTRD
sequence fragment 2: GERL
sequence fragment 3: GDMRD
pattern description: G(D,E)(X){0,2}R(D,L)
sequence fragment 1: GDD
sequence fragment 2: GEE
sequence fragment 3: GNN
sequence fragment 4: GQQ
pattern description: G~(A,C,F,G,H,I,K,L,M,P,R,S,T,V,W,Y)(D,E,N,Q)
[ previous chapter ],[
this chapter ][ next chapter ]
, [next page/section] , or [overview] , or [table of contents]