[ Previous chapter ][
This chapter ][ Next chapter ]
A powerful heuristic of a fast screening can be expressed as follows:
Two sequences are similar if a sufficient number of identical oligomers
is found at a
given arrangement of the two.
The internal loop of a sequence searching program (see above)
requires the identification
of a
score,
which is a numerical value describing the quality of a sequence
comparison. The observed score depends on
The numerical values of scores are specific for a given program and must
not be used for
comparison between different algorithms or result output.
Unless a statistical significance or another type
of
normalisation is computed, (see below) the score
values are
relative
in relation to the search sequence and its algorithm.
The following
example shall demonstrate how a heuristic
sequence search algorithm
might be implemented. We do this on DNA level but no fundamental
difference
will be seen in protein comparisons, as the heuristic algorithms score
with
identities of short sequence fragments
rather than
using comparison tables . To a certain extent,
the
comparison using identities of oligomers can be visualised as
a
dotplot-type of comparison. We reuse the example of
earlier chapters, with
To improve sensitivity, we need to search the
columns of identical shift which occur at continuously
increasing origin, such as
7,8,9
in the column of shift -4 above. Doing
this, we find the following table
of diagonals, sorted by the length of the observed diagonal
The
'fasta'
program (as listed in the "programs" section below) has the
init
scores in the first two columns of its numerical output
which refer to the crude and the joined diagonals, respectively.
Joined diagonals are found
by extending the segment. The extension
will be counted if adjacent diagonals can be joined
to increase
the overall score of the newly found segment.
The
'blast'
type of programs, also mentioned below, will gain speed by determining
the occurrence of oligomers in the database sequences in a preprocessing
step (to be executed
once only at formatting time of the database)
before
the actual search is
started. This reduces the need to scan the
database each time for every comparison. However,
as the pre-computation
is done on the entire database, the arbitrary search of
subsections
of the database will no longer be possible without
executing the format process into a 'blast'
database).
After collecting
sequences of interest, the
ranking
and display of the found alignments is
performed, which is
also a crucial step for successful result evaluation.
The most difficult
problem of reliable sequence similarity
searching is the proper selection of criteria which
describe
statistical relevance
and map as advantageous as possible to
biological
relevance.
The
'fasta'
type of programs will redo the comparison of the top-score
entries as found in the database with a rigorous algorithm
(such in the GCG program
bestfit
and in rigorous searches as described below). A more accurate
alignment
of the database sequences with the query
sequence, which eventually
goes along with a
re-scoring
of the hits, will
allow to increase the power of a sequence comparison.
It should be kept in mind that the re-scoring
(in the
'fasta'
programs called the
opt
score) will
not be reflected in the listing
order of top-scoring entries as these are sorted for the so-called
"initn"
score (the joined fragment score).
NOTE: The fasta 2.x version of this package adds very comprehensive statistics.
However,
at the time of writing, the GCG software did not include this
version of the fasta package.
The
'blast'
programs use a statistical approach in order to list only
sequence
segments in the result which are not expected by chance. To do this,
it is necessary
to determine a score which can discriminate
between "hit" and "random". The information unit
is a "bit" and
might be explained as follows:
The amount of information stored in biological
sequences
is, from a statistical point of view, a function of the sequence
composition. E.g., a poly-A stretch in a DNA sequence has the
information
only A
and the
length
of the sequence. Information can be measured in
bits,
which is either
YES
or
NO,
or
0
or
1
in
computer science world. A nucleotide, therefore,
which can be any of four characters, can be
written in
two bits
as the relevant information of four different symbols
requires
four different combinations:
The amount of data you will need to review after a sequence search might
be enormous. To
help you evaluating the result, most searching programs will permit to generate a
histogram
which displays the scoring table
graphically: The number of found scores
is plotted against the score. The very large peak in
the area of
low scores can be attributed to "random" hits which occur by chance.
Depending
on algorithm and sequence, the hits of remote similarity
will occur in the downhill area of
this peak. Sequences from other organisms
score at higher values, and the identity score is,
typically, at vary
high values if a sequence is found in the database which matches
the query
sequence directly.
Figure I
shall illustrate this
description of a
search histogram
.
Figure I:
Score distribution as observed in a sequence
similarity search.
The peak at low scores has been purposely truncated
and is not shown in
its full size.
NOTE: Unless a reliable statistical estimation is part of the search
program, you should
investigate the region of the beginning
"statistical noise" carefully. However, keep in mind
that
statistical relevance might not match biological requirements
and might be misleading.
The significance of searches can be improved if the search is
conducted on the
protein
level
rather than the DNA level. This
is due to the fact that codon
usage differences between different algorithms will increase
the number of mismatches even if the protein resulting from the
two different sequence fragments
will be identical. Programs
are available which will search the DNA sequence database
after
an on-the-fly translation to all possible six reading
frames. Doing these kinds of searches,
two major considerations
will apply:
The
'framesearch'
program of the GCG package
takes care of this effect to some extend,
but utilises extreme resources for completion. This
program uses
a rigorous approach and is described
later .
% blast
This program will be able to do searches on
protein level if you
use DNA sequence databases. The original
NCBI blast program
package
as available from the NCBI includes the following programs:
The
GCG implementation
of the 'blast' program suite uses a single program
-
'blast'
- which launches any of the programs mentioned above. (This
is called a front-end program).
The "blast" suite is a program which may run either locally or via
network.
The "blast"
system on your UNIX system
might not run locally but rather via networks.
The HASSLE system
in Europe, or the NCBI server system in the US or
generic GCG package installations, can use
the network to send sequences
to a server which computes the data for you.
NOTE: Be aware of the fact that sending sequences over networks is a potential
breach
of confidentiality or security. In particular, this applies to
sites which do not operate on
public data networks (i.e., for-profit installations).
Most useful databases available at the NCBI, or via the HASSLE system, include:
1) The definition for "nr" might vary. Depending on the location,
you might use either
GENBANK with an exclusion set of EMBL data
not in GENBANK, or use EMBL with an exclusion set
of GENBANK
(eg, in Basel). Depending on whether or not you are connected to
a network which
is used to update the data on a periodic basis,
the "nr" set includes also daily updates. At
Basel, both EMBL and
GENBANK are updated weekly, and EMBL is the basis of exclusion
using
the NCBI's 'nrdb' program.
2)The definition of "nr" might vary from site to site. At Basel,
all available databases and
their corresponding updates are
computed on a weekly basis with NCBI's 'nrdb' program.
Additional databases are available at some sites, and will be
displayed at the menu if
you start the "blast" program suite.
It might happen that the 'blast' suite of
programs is not enabled or temporarily unavailable
at
the time of your command due to resource limitations. It should
be kept in mind that
'blast'
programs are good for screening but
should always be supplemented with other screening methods
(e.g.,
swsearch
or
MPsrch,
see below) in order to confirm
findings.
General purpose, fast and reliable searches are done using
% fasta
The
GCG implementation
of the 'fasta' program is typically a (s)lower
version than the one
distributed by the Author William Pearson.
As statistical evaluations
are not implemented yet, GCG users will
have to run the shuffle program
as described below.
Problems with the
'fasta'
programs are most usually
observed when the database has been specified
incorrectly. Use the database name and a colon
with an asterisk to give
the correct specification, following the instructions below.
Databases available
at the local site usually include:
1) The definition of GENEMBL can vary. Depending on the
location, you can use either
GENBANK with an exclusion set of
EMBL data not found in GENBANK, or vice versa.
Depending
on whether you are connected to a network which is used to update data on a
periodic basis,
the GENEMBL set may include also daily updates.
2),3) The definitions vary. XEMBL, EM_NEW, EMBL_DAILY, GB_NEW,
XSWISS, SW_NEW, PIR4, etc. are
names that denote the character of the
preliminary entries. Depending on your site and/or affiliation,
those entries which are not found in either the EMBL or GENBANK update sets yet,
possibly show
up in the corresponding other set as so-called "exclusion set".
Other site's GB_NEW and EM_NEW
may contain all entries of GENBANK and EMBL, respectively,
which can cause duplications.
NOTE: The term GENEMBLPLUS, introduced in GCG version 8.1, is equivalent
to GENEMBL, which was
used before version 8.1.
If you need
to search for peptide in the DNA database, this can be achieved
with the tfasta
program, which translates the DNA database on-the-fly
into all possible reading frames. This
search is computationally more
expensive but considered to be more sensitive (see above, and
below).
Subsection 11.2.1 Principle of Similarity Detection
atggtaatggcacaattgactttcctgaatttctga Seq. A (formerly, horizontal sequence)
tgatggtcaagtaaactatgaagagttt Seq. B (formerly, vertical sequence)
In contrast to the dotplot, however, we now create a
table
of oligomers
and note the location of the occurrence of these oligomers
in the sequences. We are going to
use di-nucleotides here, which is for
the sake of brevity only, as nucleotide sequences typically
use larger
"words" (such as six or more). Sixteen oligomers are possible, but not
all of them
are found in the two sequences. We tabulate the oligomers and
the begin of the sequence position
for sequences A and B as follows:
A B A B
AA 6,14,28 9,13,14,21 CA 11,13 8
AC 12,19 15 CC 24 -
AG - 10,22,24 CG - -
AT 1,7,15,29 3,18 CT 20,25,33 16
GA 18,27,35 2,20,23 TA 5 12,17
GC 10 - TC 23,32 7
GG 3,9 5 TG 2,8,17,26,34 1,4,19
GT 4 6,11,25 TT 16,21,22,30,31 26,27
Having created such a table, we made already progress to determine
the largest segment
of identity between the two sequences. The next question to answer
is whether we can find
oligomers which match at a given, but identical
shift of the two sequences. This can be readily
achieved by calculating the difference of sequence position (called a
shift
)
between A and B for all oligomeris of the previous table.
We now tabulate the position start
of sequence A versus the
shift to sequence B:
Shift
observed -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10
(in A) ----------------------------------------------------------------------
Locations 25 8 13 7 11 2 14 1 6 22 18 4 5 6 9 16
26 14 8 14 2 12 21 21 6
27 30 9 3 15 22 14
28 31 19 4
20 17
27 18
30
31
The result could be counted in a very
primitive fashion as in the
dotplot-type of example. In dotplots, a point was shown for each
matching oligomer. The columns in the table above represent identities
of a diagonal. The
most matches on a single diagonal occurs at a shift of
-4
from sequence A
to sequence B, compare this to
what was already observed in the schematic
comparison section .
Similarly, the second-best column with a shift of
+2
was
already known. However, this type of crude
approach might not be sufficient to find the
best
matching segment of identity as we ignore gaps entirely
and just count occurences of oligomer
matches on a diagonal.
Length = (number of oligomers -1) + (size of oligomer)
Segment with shift 2 -4 -7 -5
Segment start position 1 7 26 13
Length 5 3 4 3
The occurrence of
mismatches
will be noticed as an interruption
of the segment.
Gaps
will occur as a change of shift. Interruptions and changes
of shift,
therefore, will not be considered in the initial identification, and
many algorithms
will not consider gaps at later stages either.
A 00
G 01
C 10
T 11
The
information contents
of a match, if expressed in bits of information,
will therefore
contribute to the
ranking
of the sequence. The
'blast'
type of programs uses this method to determine the
relevance of a sequence searching
result. Additionally,
the
probability
of finding the segment of interest
by chance is calculated on the
basis of the database and query information content. The lower
the
probability in its value, the higher the significance, as the
hit will less favourably
occur by chance.
The concept of information contents can be used to introduce
thresholds
in sequence comparison, which will determine whether a match is
carried or a score
is computed at all: If the information content
of a sequence is below such a threshold, no
comparison will
be made, and the result of a search will be that no relevant
hits will be
reported. This is a rare case but will occasionally
occur in
'blast'
searches.
Subsection 11.2.2 Expectations
number
of hits
^
| //// statistical
| * * noise
| * * <----------
| * *
| * * species
| * * similarity
| * * remote score
| * * similarity |
| * * | | identities
| * * v v |
| * * * * v
| * * * *** *
+-----------------*********-----***********-***-->
score
The
discrimination
between "remote similarity" and "statistical
noise" is crucial and
severely depends on the algorithm and the data used for comparison.
Subsection 11.2.3 Programs
Database name GCG name contents
----------------------------------------------------------------
EMBL + Updates
GENBANK + Updates
(GB as exclusion set) nr all DNA databases (1)
SWISSPROT swissprot most proteins
SWISSPROT +
PIR International
+ PATCHX + OWL nr all peptide databases (2)
Database name GCG name contents
----------------------------------------------------------------
EMBL + Updates
GENBANK + Updates
(GB as exclusion set) GENEMBLPLUS: all DNA databases (1)
SWISSPROT SWISSPROT: most proteins
PIR International PIR: most proteins
NEW entries of EMBL EM_NEW: EMBL new entries (2)
GENBANK updates GB_NEW: GENBANK new entries (3)
[ previous chapter ],[
this chapter ][ next chapter ]
, [next page/section] , or [overview] , or [table of contents]