The collection and maintenance of data is performed at centres like the
EBI (European Bioinformatics Institute, an outstation of EMBL) or the
NCBI (National Center for Biotechnology Information). Other centres are
similarly active; these two serve only as examples.
The end user is not expected to employ the sophisticated software which
these institutions use to collect, maintain, and curate data. After an
export procedure to a so-called flat file, the data are distributed to
the end users' sites in various formats.
The main paradigm is that each biological sequence is described in an
entry which has a title, the sequence data, and associated reference
information. In a "real" database system, these data are accessible in a
smooth and interlinked fashion. To benefit from the databases in their
original form, however, users would need to install very expensive and
staff-intensive database software (so-called relational database systems).
During the export to flat files, a considerable part of the structural
information is lost, and auxiliary information must therefore be written
into each entry. The application software at the end user's site must
rely on various conventions (called a format) to present the information
in a form as close as possible to the original comprehensive set.
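How software can use such a format convention to recover structure from a flat file can be sketched as follows. This is a minimal illustration only: it assumes a simplified EMBL-like layout (a two-letter line code, then the data, with `//` ending the entry), and the sample entry, the field selection, and the function name `parse_entry` are invented for this example.

```python
from collections import defaultdict

# Invented toy entry in a simplified EMBL-like layout (not real data).
SAMPLE_ENTRY = """\
ID   TOY0001
AC   X00001;
DE   Hypothetical example gene
DE   used for illustration only.
SQ   Sequence 12 BP;
     acgtacgtac gt
//
"""


def parse_entry(text):
    """Group the lines of one flat-file entry by their two-letter line code."""
    fields = defaultdict(list)
    residues = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):  # assumed end-of-entry marker
            break
        code = line[:2]
        if in_sequence and not code.strip():
            # Sequence lines carry no line code; keep letters only.
            residues.append("".join(ch for ch in line if ch.isalpha()))
            continue
        in_sequence = code == "SQ"
        fields[code].append(line[5:].strip())
    return {
        "id": " ".join(fields["ID"]),
        "accession": " ".join(fields["AC"]).rstrip(";"),
        "description": " ".join(fields["DE"]),
        "sequence": "".join(residues),
    }


entry = parse_entry(SAMPLE_ENTRY)
```

Real flat files carry many more line types than the four handled here, so a production reader would normally rely on an established parsing library rather than on a sketch like this.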
Each entry has
Some data which serve administrative purposes, such as section
information or dates of creation or updating, are not listed.
Optionally, one or more of the following data are attached to an entry
if known:
If you want to retrieve an entry from the database, it is important to
decide what type of query will be most effective:
Today's sequence databases have a significant number of cross-references
to other databases. A protein sequence, for example, will have one or
more references to the DNA sequence(s) coding for the protein, and
possibly also pointers to databases describing protein motifs (such as
the PROSITE database) or to organism-specific databases.
Recently, the interest of researchers has focused on genome projects.
An entry may therefore also contain information on the genetic locus,
together with pointers to other databases which deal specifically with
genomics. All these entries refer to publications which are described
in the literature databases.
Your computer does not necessarily have all these databases available
within the application software used for sequence analysis (such as the
GCG package), but browser programs, like the SRS database browser, are
capable of handling these complex networks of databases.
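The idea of following cross-references from one entry through a network of databases can be sketched as a simple graph traversal. This is a toy model only: the database names, the entry identifiers, and the function `linked_entries` are invented for illustration and do not describe how any particular browser program works.

```python
from collections import deque

# Invented identifiers; each key is (database, entry id) and each value
# lists the entries it cross-references.
CROSS_REFS = {
    ("protein", "P_TOY1"): [("dna", "D_TOY1"), ("motif", "M_TOY1")],
    ("dna", "D_TOY1"): [("literature", "REF_TOY1"), ("protein", "P_TOY1")],
    ("motif", "M_TOY1"): [("literature", "REF_TOY2")],
}


def linked_entries(start, cross_refs):
    """Return every entry reachable from `start`, in breadth-first order."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        entry = queue.popleft()
        order.append(entry)
        for target in cross_refs.get(entry, []):
            if target not in seen:  # avoid looping on mutual references
                seen.add(target)
                queue.append(target)
    return order


reachable = linked_entries(("protein", "P_TOY1"), CROSS_REFS)
```

Tracking the entries already seen matters because cross-references are often mutual: here the DNA entry points back to the protein entry it was reached from.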
To make the best use of the widely available databases, you first need
to find out which databases store the information you are looking for in
the most comprehensive fashion. If you only search for a given accession
number, you will be able to search all the sequence databases
simultaneously. However, searching for the genetic locus of a disease or
for a protein motif associated with a specific protein function will
succeed more efficiently if you use one of the databases made
specifically for this purpose. In the two examples mentioned, the
databases of choice are OMIM and PROSITE, respectively. Once you
encounter hits in one database, you should use this information to
expand to other databases as well - once you have found one description
of a sequence, your search is not finished.
Access to databases no longer necessarily takes place on the same
computer where you usually do sequence analysis. Some programs operate
via networks exclusively, such as the famous SRSWWW browser. The sections
below reflect this fact. It is, however, important to note that the
retrieved sequences will be in specific formats. The data will be
arranged in a way that the software you want to use for further analysis
may or may not be able to interpret correctly. Therefore, you must
determine the formats of the entries you get via computer networks and
apply appropriate procedures for reformatting if the data are to be used
in the GCG program package.
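Determining the format of a retrieved entry can often start from its first lines. The sketch below is a heuristic under stated assumptions: it checks for the conventional leading markers of FASTA (`>`), EMBL-like (`ID   `), and GenBank (`LOCUS`) entries, and for a GCG-style header line containing `Length:` and ending in `..`. The function name `guess_format` and the returned labels are invented here, and real files may need stricter checks.

```python
def guess_format(text):
    """Guess the flat-file format of a retrieved entry from its leading lines."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "unknown"
    first = lines[0]
    if first.startswith(">"):
        return "fasta"
    if first.startswith("ID   "):
        return "embl-like"  # the same line code is shared by related formats
    if first.startswith("LOCUS"):
        return "genbank"
    # Assumed GCG-style header: a line with "Length:" that ends in "..".
    if any("Length:" in ln and ln.rstrip().endswith("..") for ln in lines):
        return "gcg"
    return "unknown"


examples = {
    "fasta": ">toy1 invented example\nACGT\n",
    "embl-like": "ID   TOY0001\n",
    "genbank": "LOCUS       TOY0001\n",
    "gcg": "toy1  Length: 4  Check: 0  ..\n  1 ACGT\n",
}
guesses = {name: guess_format(text) for name, text in examples.items()}
```

A result of "unknown", or anything other than the format your analysis software expects, signals that a reformatting step is needed before the data are used.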
SECURITY NOTICE: Once you use wide area computer networks, you will most
probably access databases and computers which are not under local
control. The usual assurances of information quality may therefore not
apply. This consideration is particularly important for environments
behind firewalls (commercial companies).
Section 6.1: Principle
Subsection 6.1.1 Production of Databases
Subsection 6.1.2 Contents of an Entry
Subsection 6.1.3 Networks of Databases
Subsection 6.1.4 Computer Networks