Sequences of very low similarity might not be easily detected if the search is performed against the entire database. Because of the noise level of similarities scored by chance, important matches might be missed. The use of filters is essential in this case. A filter is any procedure applied to reduce the total number of sequences searched, preferably using criteria which match the expectation of the search performed. Such filters might be database sub-libraries, Sequence Lists, or Multiple Sequence Files.
The basic difference between these methods is the way in which the sub-libraries are addressed. Depending on the algorithm, some of the procedures might not be available to you. For example, the 'blast' database searching programs do not allow the use of user-specified subsets.
A special manual of the GCG package describes the database sub-libraries (see below). Depending on whether your site uses the EMBL database or the GENBANK database as its base set, the corresponding counterpart will be available as a subset. As a result, the GCG program package always has both EMBL and GENBANK logicals defined, even if a subset contains only a small number of sequences. On rare occasions, these subsets might even be entirely empty; this happens if the EMBL and GENBANK subsections are perfectly in sync.
The WPI version of the interface will present these database subsections to you in a database-neutral fashion if you use the correct window.
To see which sub-libraries are supported, you might obtain an on-line list in the command line version as follows:
% name | egrep 'EM | GB'
Use the resulting names as GCG libraries. Additional help is provided in the data set manual of the GCG package. For example, the EST:* specification applies if you are interested only in the expressed sequence tag section of the EMBL data library.
WARNING
The EST section of the DNA databases usually covers all sorts of species. If you want to use a data subsection by organism rather than in its entirety, you would presumably need to employ large lists (such as those created with suitable search programs) and process these as described below.
TIP
The SWISS-PROT database uses the organism name as part of the entry name. E.g., Swissprot:*yeast will cover all yeast sequences.
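The effect of such a wildcard can be illustrated with a small sketch. It assumes entry names of the form NAME_ORGANISM and imitates the wildcard with Python's fnmatch module. The entry names and the function name select_entries are chosen for illustration only; the real matching is done by the GCG software itself, not by this code.

```python
from fnmatch import fnmatchcase

def select_entries(entry_names, pattern):
    """Return the entry names matching a shell-style wildcard, ignoring case."""
    wanted = pattern.lower()
    return [name for name in entry_names if fnmatchcase(name.lower(), wanted)]

# Illustrative entry names only; not read from any database.
entries = ["ADH1_YEAST", "ADH1_HUMAN", "CYC1_YEAST", "HBA_MOUSE"]
print(select_entries(entries, "*yeast"))   # ['ADH1_YEAST', 'CYC1_YEAST']
```

The pattern is lowercased on both sides because the specification above is written in lower case while the entry names are in upper case.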
To use groups of sequences, each program package supplies a mechanism with a specified syntax. This syntax tells the software that the specification given shall be used as a group of sequences rather than as a single sequence. The GCG package calls this mechanism a Sequence List. Documentation before the 8.0 release of the package might refer to this feature as a File of Sequence Names (FOSN). The idea is straightforward: programs no longer read the sequence from a file which contains the sequence data, but rather use a file as a "pointer" to where the data can be found, i.e., they read the sequence from a file which is named in a list file.
To maintain compatibility with the established input handling, files which specify a list of sequences rather than a sequence directly are tagged with the character @ (the "at" character).
A Sequence List is produced by a number of programs, such as:
To use the resulting file of sequence names, put the @ character in front of the file name in any program which accepts multiple files. Sequence searching is such an application. To use a Sequence List as a library in the fasta program, e.g., @my.fil, enter this name at the prompt "which sequence(s)?".
NOTES
1) The file-of-sequence-names method might not be available if you run your sequence analysis via networks.
2) The 'blast' suite of programs cannot use files of sequence names and requires its own database formatting (see below).
3) WPI users may use Sequence Lists much more conveniently by using the correct window (see below).
To reformat Sequence Lists into other formats, refer to the reformatting section.
The sequences of a Sequence List are not stored in the list file itself. Rather, the list file is a file of pointers to the files which shall be worked on. This implies the danger that, if a file being pointed to is deleted, the list of sequences is no longer valid.
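A list can be checked for such dangling pointers before it is used. The following is a minimal sketch, not part of the GCG package. It assumes a simplified list layout: an optional header ending with a line that contains only "..", one sequence specification at the start of each following line, and a ":" in any specification that refers to a database rather than to a local file. The function name missing_files and all file names are hypothetical.

```python
import os

def missing_files(list_lines, exists=os.path.exists):
    """Return the local file names in a list that no longer exist.

    Assumed, simplified layout (an assumption, not the documented format):
    an optional header ends with a line containing only "..", each later
    line starts with one sequence specification, and specifications that
    contain ":" point into a database rather than to a local file.
    """
    lines = [line.strip() for line in list_lines]
    if ".." in lines:
        lines = lines[lines.index("..") + 1:]   # skip the header
    missing = []
    for line in lines:
        if not line:
            continue                             # ignore blank lines
        spec = line.split()[0]
        if ":" in spec:
            continue                             # database entry, not a file
        if not exists(spec):
            missing.append(spec)
    return missing

# Hypothetical list content; "exists" is replaced so the example runs anywhere.
example = ["my sequences", "..", "seq1.seq", "embl:hsabc", "seq2.seq"]
on_disk = {"seq1.seq"}
print(missing_files(example, exists=lambda name: name in on_disk))   # ['seq2.seq']
```

In real use, the lines would be read from the list file and the default check against the file system would apply.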
An alternative to lists of sequences, therefore, is the option to write all sequence data into a single file. This will enlarge the file size, and also requires a specific format which allows multiple sequences rather than a single sequence in one file. Most conveniently, such a file is produced by the program 'pileup'. This application produces a multiple sequence alignment automatically, and stores the result in a single file, including gaps and the specific shift for each sequence. Multiple Sequence Files (MSF) (*.msf) are specified as
my.msf{*}
Details on the 'pileup' program are in the section on multiple sequence analysis.
NOTE: Due to some technical problems with localised keyboards, it might be difficult for you to enter the characters "{" and "}" by typing the corresponding keys on the keyboard. In that case, use the command 'genhelp distances example' and use the COPY option of your terminal or terminal emulator to take the {*} into the paste buffer. PASTE the resulting keystrokes where appropriate.
To reformat multiple sequence files into other formats, refer to the reformatting section.
Since Version 8, you may use the Wisconsin Package Interface (WPI) via the X-Windows system. Lists are readily handled there and are the basic principle of this user interface. Specifically, lists might be expanded with a mouse click to select individual sequences. Refer to the corresponding WPI section for details.
The usage of Lists might be restricted, as not all databases are available at each site. Specifically, if you run your sequence analysis via networks or move from one site to another, the lists might be affected if site-specific features are included.
Keep in mind that Lists are created at a defined point in time. If you use keyword searching, your List will reflect the status of the database at this specific point in time. You might want to redo the keyword search frequently in order to maintain an up-to-date set of sequences. See also the notes below on the creation of own databases.
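After a keyword search has been repeated, it can be useful to see how the new List differs from the old one. The sketch below compares two sets of entries; it assumes that each List has already been reduced to one entry name per item, and the function name compare_lists and the entry names are hypothetical.

```python
def compare_lists(old_entries, new_entries):
    """Return (added, removed) entries between two versions of a list."""
    old, new = set(old_entries), set(new_entries)
    return sorted(new - old), sorted(old - new)

# Hypothetical entry names from an earlier and a repeated keyword search.
old = ["em_pl:atabc", "em_pl:atdef"]
new = ["em_pl:atabc", "em_pl:atxyz"]
added, removed = compare_lists(old, new)
print(added)     # ['em_pl:atxyz']
print(removed)   # ['em_pl:atdef']
```

Entries reported as removed are those which the repeated search no longer returns.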
Lists are notorious troublemakers if disk space is tight and references are made to specific user-provided files. Any 'cleaning' of sequence files from your directories might render lists unusable, because the references become obsolete. Similarly, if you work on several machines, the simultaneous use of Lists on different computers requires an identical directory structure and the presence of all desired files in the expected locations. The output of the profilesearch program (see next chapter) is a List as well, and is known to inherit the location of the data used for searching. Manual editing may be required to overcome this limitation.
If you change location (e.g., move from the US to Europe or vice versa), you will possibly change environments. DNA databases are usually based on EMBL with GENBANK as exclusion (such as in Europe), or on GENBANK with EMBL as exclusion (such as in the US). Pacific rim sites will use DDBJ as base database. Unfortunately, the database names used by GCG software reflect the database origin, and therefore are not easily transferable.
The only option to circumvent this 'feature' is to map one List into another using the SRS software. However, keep in mind that this option will consume a considerable amount of resources if you are not used to such work or cannot rely on local experts.
Lists of sequences or Files of Sequence Names can be very time-consuming if you need to search a large amount of data. If you have enough disk space, you can create your own database with the command dataset, or ask your system manager to create one for you.
NOTE:
================================= Begin Exercise 12
Sequence searching:
Use of the 'blast' and 'fasta' searching programs to analyse DNA and protein sequences derived from previous analysis.
Use the blast and fasta programs to search the typed-in sequence in the largest database you can access. Write down which of the sequence features occur where in the database:
Use the blast and fasta programs to search the translated sequence in the largest database you can access. Write down which of the sequence features occur where in the database:
Subsection 11.5.1 Database Sub-Libraries
Subsection 11.5.2 Sequence Lists (formerly File Of Sequence Names (FOSN))
Subsection 11.5.3 Multiple Sequence Files (MSF)
Subsection 11.5.4 Lists within the Wisconsin Package Interface (WPI)
Subsection 11.5.5 Impact of Electronic Networks, Time Effects and Location
Subsection 11.5.6 Creation of own Databases
|  Query  |      Database      | Feature or name of
| from-to |  entry  | from-to  | identified sequence
|---------+---------+----------+---------------------
|         |         |          |
|         |         |          |
|         |         |          |
|         |         |          |
|         |         |          |
|         |         |          |
|         |         |          |
| Reading   |  Query  |      Database      | Feature or name of
| frame no. | from-to |  entry  | from-to  | identified sequence
|-----------+---------+---------+----------+---------------------
|           |         |         |          |
|           |         |         |          |
|           |         |         |          |
|           |         |         |          |
|           |         |         |          |
|           |         |         |          |
|           |         |         |          |
================================= End Exercise 12