Chapter 6: How to Get Information from the Databases



Section 6.1: Principle



Subsection 6.1.1

Production of Databases

Data are collected and maintained at centres such as the EBI (European Bioinformatics Institute, an outstation of EMBL) and the NCBI (National Center for Biotechnology Information). Other centres are similarly active; these two serve only as examples.

The end user is not expected to employ the sophisticated software which these institutions use to collect, maintain, and curate data. Instead, the data are exported to so-called flat files and distributed to the end users' sites in various formats. The main paradigm is that each biological sequence is described in an entry consisting of a title, the sequence data, and associated reference information. In a "real" database system, these data are accessible in a smooth and interlinked fashion. To benefit from the databases in their original form, however, users would need to install expensive and staff-intensive database software (so-called relational database systems). During the export to flat files, a considerable part of the structuring information is lost; auxiliary information must therefore be written into each entry. The application software at the end user's site relies on a set of conventions (called a format) to present information that is as close as possible to the original comprehensive set.
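To make the idea of a flat-file entry concrete, the following is a minimal sketch of how application software might read one. It assumes an EMBL-style layout, in which each line starts with a two-letter code (ID, AC, DE, SQ) and an entry ends with a line containing "//". The sample entry, its identifiers, and the function name parse_entries are hypothetical and serve only as illustration; real entries carry many more line types.

```python
from collections import defaultdict

# Hypothetical, heavily shortened entry in an EMBL-style layout.
SAMPLE = """\
ID   EXAMPLE1   standard; DNA; 24 BP.
AC   X00001;
DE   Hypothetical example sequence
DE   used only for illustration.
SQ   Sequence 24 BP;
     acgtacgtac gtacgtacgt acgt        24
//
"""

def parse_entries(text):
    """Yield one dict per entry; a line starting with '//' ends an entry."""
    fields = defaultdict(list)
    sequence = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):
            # Join continuation lines that share the same line code.
            entry = {code: " ".join(parts) for code, parts in fields.items()}
            entry["sequence"] = "".join(sequence)
            yield entry
            fields, sequence, in_sequence = defaultdict(list), [], False
        elif in_sequence:
            # Sequence lines: keep letters only, drop spacing and the counter.
            sequence.append("".join(ch for ch in line if ch.isalpha()))
        elif line[:2].strip():
            code, value = line[:2], line[5:].strip()
            fields[code].append(value)
            if code == "SQ":
                in_sequence = True

entries = list(parse_entries(SAMPLE))
print(entries[0]["AC"], len(entries[0]["sequence"]))
```

The line codes are the auxiliary information mentioned above: they restore, line by line, some of the structure that was lost during the export.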


Subsection 6.1.2

Contents of an Entry

Each entry has

Some data which serve administrative purposes, such as section information or dates of creation or updating, are not listed. Optionally, one or more of the following data are attached to an entry if known:

If you want to retrieve an entry from the database, it is important to decide what type of query will be most effective:


Subsection 6.1.3

Networks of Databases

Today's sequence databases contain a significant number of cross-references to other databases. A protein sequence, for example, will have one or more references to the DNA sequence(s) coding for the protein, and possibly also pointers to databases describing protein motifs (such as the PROSITE database) or to organism-specific databases. Recently, researchers' interest has focused on genome projects. An entry might therefore also contain information on the genetic locus, together with pointers to other databases which deal specifically with genomics. All these entries refer to publications which are described in the literature databases. Your computer does not necessarily have all these databases available within the application software used for sequence analysis (such as the GCG package), but browser programs, like the SRS database browser, are capable of handling these complex networks of databases.

To make the best use of the widely available databases, you first need to find out which databases store the information you are looking for in the most comprehensive fashion. If you are only looking for a given accession number, you can search all the sequence databases simultaneously. However, searching for the genetic locus of a disease, or for a protein motif associated with a specific protein function, will succeed more efficiently if you use one of the databases made specifically for this purpose. In the two examples mentioned, the databases of choice are OMIM and PROSITE, respectively. Once you encounter hits in one database, you should use this information to expand the search to other databases as well: finding one description of a sequence does not mean your search is finished.
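Expanding from one hit to other databases amounts to following cross-references until no new records turn up. The sketch below illustrates this with a breadth-first traversal. The database names, the identifiers, and the CROSS_REFS table are all hypothetical; in practice the links would be read from the entries themselves or followed in a browser program.

```python
from collections import deque

# Hypothetical cross-reference table: (database, identifier) -> linked records.
# All names and identifiers are made up for illustration.
CROSS_REFS = {
    ("protein_db", "P_EX1"): [("dna_db", "D_EX1"), ("motif_db", "M_EX1")],
    ("dna_db", "D_EX1"): [("literature_db", "L_EX1"), ("protein_db", "P_EX1")],
    ("motif_db", "M_EX1"): [("literature_db", "L_EX2")],
}

def expand(start, cross_refs):
    """Return every record reachable from `start`, in breadth-first order."""
    seen = {start}
    order = [start]
    queue = deque([start])
    while queue:
        record = queue.popleft()
        for linked in cross_refs.get(record, []):
            if linked not in seen:  # avoid looping on mutual references
                seen.add(linked)
                order.append(linked)
                queue.append(linked)
    return order

reachable = expand(("protein_db", "P_EX1"), CROSS_REFS)
for database, identifier in reachable:
    print(database, identifier)
```

Keeping track of the records already seen matters because cross-references frequently point in both directions, as between the protein and DNA records in this made-up table.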


Subsection 6.1.4

Computer Networks

Databases are no longer necessarily accessed on the same computer on which you usually do sequence analysis. Some programs, such as the well-known SRSWWW browser, operate exclusively via networks. The sections below reflect this fact. It is important to note, however, that the retrieved sequences arrive in specific formats, and the software you want to use for further analysis may or may not be able to interpret them correctly. You must therefore determine the format of the entries you obtain via computer networks and apply the appropriate reformatting procedures if the data are to be used in the GCG program package.
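Determining the format of a retrieved entry can often be done by inspecting its first lines. The following is a rough sketch of such a check, not a complete detector. It assumes the common conventions that FASTA entries begin with ">", GenBank entries with "LOCUS", EMBL-style entries (including Swiss-Prot) with "ID" followed by three spaces, and that GCG single-sequence files contain a header line with ".." before the sequence. The function name guess_format and the sample strings are hypothetical.

```python
def guess_format(text):
    """Guess the flat-file format of `text` from simple line markers."""
    lines = text.splitlines()
    for line in lines:
        if not line.strip():
            continue  # skip leading blank lines
        if line.startswith(">"):
            return "fasta"
        if line.startswith("LOCUS"):
            return "genbank"
        if line.startswith("ID   "):
            return "embl"
        break  # first non-empty line matched none of the markers
    # Heuristic only: GCG files mark the end of the header with "..".
    if any(".." in line for line in lines):
        return "gcg"
    return "unknown"

# Hypothetical sample inputs.
print(guess_format(">example1\nACGTACGT\n"))
print(guess_format("ID   EXAMPLE1   standard; DNA; 8 BP.\n"))
```

A result of "unknown", or a format other than GCG, signals that a reformatting step is needed before the data can be used with the GCG programs. The ".." test is deliberately loose and can misfire on free text, so it should be treated as a hint rather than proof.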

SECURITY NOTICE: Once you use wide area computer networks, you will most probably access databases and computers which are not under local control. The usual assurances of information quality might therefore not apply. This consideration is particularly important for environments behind firewalls (commercial companies).

