JAMF Archive

BioCompanion as published in 1995
THIS IS THE REFERENCE CODE AS PUBLISHED.
		Doelz, R.   
		Optimal production of biological documentation: the JAM format.
		Comput. Applic. Biosci. 11, 224-226 (1995).    
		
The version you are currently viewing is the one printed and distributed via the Internet from the server of BioComputing Basel. Version 3.1 of the BioCompanion was published with version 2 of the JAMF software. The server that was indicated in the documentation has ceased to exist.

Version 3.2 of the BioCompanion was not publicly available for free but was shareware that was distributed with GCG's software release 9. For the purpose of enhanced editing, JAMF was partially rewritten and the proprietary version 3.x of JAMF was used from 1996 onwards. The BioCompanion is available in a current version from the publisher. It has changed significantly in both software and content.


Chapter 6: How to Get Information from the Databases


Principle

Production of Databases

The collection and maintenance of data is performed at centres like the EBI (European Bioinformatics Institute, an outstation of EMBL) or the NCBI (National Center for Biotechnology Information). Other centres are similarly active; these two serve only as examples.

The end user is not expected to employ the sophisticated software which these institutions use to collect, maintain, and curate data. After an export procedure to a so-called flat file, the data are distributed to the end users' sites in various formats. The main paradigm is that each biological sequence is described in an entry which has a title, the sequence data and associated reference information. In a "real" database system, these data are accessible in a smooth and interlinked fashion. To benefit from the databases in their original form, however, the customers would need to install the very expensive and staff-intensive database software (so-called relational database systems). During the export to flat files, a considerable part of structuring information is lost and, therefore, auxiliary information must be printed into each entry. The application software at the end user's site must use various conventions (called a format) to bring you the information as close to the original comprehensive set as possible.
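
As a rough illustration of the flat-file idea, the following Python sketch groups the lines of one EMBL-style entry by their line code. The two-letter codes (ID, AC, XX) are taken from the pBR322 excerpt shown later in this chapter; the function name parse_flat_entry is hypothetical, and real entries contain many more line types than this sketch handles.

```python
from collections import defaultdict

def parse_flat_entry(text):
    """Group the lines of one EMBL-style flat-file entry by line code.

    Assumption: every line starts with a two-letter code followed by
    spaces and the data; 'XX' lines are empty spacers and are skipped.
    """
    fields = defaultdict(list)
    for line in text.splitlines():
        code, data = line[:2], line[2:].strip()
        if not code.strip() or code == "XX":
            continue
        fields[code].append(data)
    return dict(fields)

# Sample lines modelled on the pBR322 excerpt in this chapter.
SAMPLE = """ID   PBR322     standard DNA SYN 4363 BP.
XX
AC   V01119
XX
"""

entry = parse_flat_entry(SAMPLE)
print(entry["ID"][0].split()[0])  # entry name
print(entry["AC"])                # accession number(s)
```

Because each line carries its own code, the auxiliary information printed into the entry can be recovered without any database software.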

Contents of an Entry

Each entry has

(Some data which serve administrative purposes, such as section information or dates of creation or updating, are not listed.)

Optionally, one or more of the following data are attached to an entry if known:

If you want to retrieve an entry from the database, it is important to decide what type of query will be most effective:

Networks of Databases

Today's sequence databases have a significant number of cross-references to other databases. A protein sequence, for example, will have one or more references to the DNA sequence(s) coding for the protein, and possibly also hints to databases describing protein motifs (such as the PROSITE database) or organism-specific databases. Recently, the interest of researchers has focused on genome projects. Therefore, information on the genetic locus might be contained in the database, together with pointers to other databases which deal with genomics specifically. All these entries will refer to publications which are described in the literature databases. Your computer does not necessarily have all these databases available within the application software used for sequence analysis (such as the GCG package), but browser programs, like the SRS database browser, are capable of handling these complex networks of databases.

To make the best use of the widely available databases, you first need to find out which databases store the information you are looking for in the most comprehensive fashion. If you only search for a given accession number, you will be able to search all the sequence databases simultaneously. However, searching for the genetic locus of a disease or for a protein motif associated with a specific protein function will succeed more efficiently if you use one of the databases specifically made for this purpose. In the two examples mentioned, the databases of choice are OMIM and PROSITE, respectively. Once you encounter hits in one database, you should use this information to expand to other databases as well: once you have found one description of a sequence, your search is not finished.

Computer Networks

Access to databases no longer necessarily takes place on the same computer on which you usually do sequence analysis. Some programs operate via networks exclusively, such as the famous SRSWWW browser. The sections below reflect this fact. It is, however, important to note that the retrieved sequences will be in specific formats, and the software you want to use for further analysis may or may not be able to interpret them correctly. Therefore, you must determine the formats of the entries you get via computer networks and apply appropriate procedures for reformatting if the data are to be used in the GCG program package.
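
Determining the format of a retrieved entry can often be done by looking at its first lines. The following Python sketch is a heuristic only, not part of GCG or any other package: it assumes that EMBL-style entries start with 'ID', GenBank-style entries with 'LOCUS', FASTA files with '>', and that GCG sequence files contain a line ending in '..'. The function name guess_format is hypothetical.

```python
def guess_format(text):
    """Guess the format of a retrieved sequence file from its first lines.

    Assumptions (heuristics only): EMBL-style entries start with 'ID',
    GenBank-style entries with 'LOCUS', FASTA files with '>', and GCG
    sequence files contain a line ending in '..' that separates the
    documentation from the sequence.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "empty"
    first = lines[0]
    if first.startswith("ID   "):
        return "embl"
    if first.startswith("LOCUS"):
        return "genbank"
    if first.startswith(">"):
        return "fasta"
    if any(ln.rstrip().endswith("..") for ln in lines):
        return "gcg"
    return "unknown"

print(guess_format("ID   PBR322     standard DNA SYN 4363 BP.\n"))
print(guess_format(">pbr322\nTTCTCATGTTTGACAGCT\n"))
```

A result of "unknown" or anything other than "gcg" suggests that the file should be reformatted before it is used with GCG programs.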

SECURITY NOTICE: Once you use wide area computer networks, you will most probably access databases and computers which are not under local control. Information quality, therefore, might not apply in the usual way. This consideration is particularly important for environments beyond firewalls (commercial companies).


Obtaining Data from Local Databases

The following sections describe GCG software as well as additional software which may not be part of your installation.

Databases available at Basel University include:

 
Database name          GCG name            contents   
----------------------------------------------------------------  
EMBL + Updates     
GENBANK + Updates   
(GB as exclusion set)  GENEMBL:            all DNA databases (1)  
  
SWISSPROT              SWISSPROT:          most proteins (2)  
PIR International      PIR:                most proteins  
PATCHX + PIR           MIPSX:              MIPS merged database (3)  
  
NEW entries of EMBL    XEMBL:              EMBL new entries (4)   
UPDATED entries EMBL   XXEMBL:             EMBL updated entries (5)  
  
GENBANK update excl.   GB_NEW:             GENBANK exclusion (6)  

1) The definition of GENEMBL can vary. Depending on the location, you can use either GENBANK with an exclusion set of EMBL data not found in GENBANK, or vice versa (e.g., in Basel). Depending on whether you are connected to a network which is used to update data on a periodic basis, the GENEMBL set may also include daily updates.

2) containing weekly updates

3) PATCHX is updated quarterly and includes the previous release of SWISSPROT, an automatic translation of EMBL, and some other databases.

4) The definitions vary. XEMBL, EM_NEW, EMBL_DAILY, GB_NEW, XSWISS, SW_NEW, PIR4, etc. are names that denote the character of the preliminary entries.

5) This is a Basel-specific item. The main purpose of this database is to find new data in the annotation, as updates rarely include changes in the sequence. In order to have the main EMBL database show not too many entries in FASTA runs, the XXEMBL database is not included in the usual GENEMBL set.

6) This is a Basel-specific item. The weekly updated GENBANK database is calculated against EMBL and XEMBL to find those entries which are not in the EMBL updates yet.

Additional databases are available at Basel. Their names are displayed when you start the molecular biology environment. Examples are Amos Bairoch's PROSITE database of protein motifs, or Rich Roberts' REBASE database of restriction enzymes.

NOTE: The term GENEMBLPLUS, introduced in GCG version 8.1, is equivalent to GENEMBL. This is a deviation from the standard GCG installation which uses GENEMBL:* to describe all databases except EST and STS sections.
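
The exclusion sets mentioned in notes 1 and 6 can be thought of as a set difference. The following Python sketch assumes that entries are compared by accession number, which is not stated in the text; the accession numbers are made up and the function name exclusion_set is hypothetical. It is not the procedure actually used at Basel.

```python
def exclusion_set(candidates, reference_sets):
    """Return accession numbers in `candidates` found in no reference set."""
    seen = set()
    for ref in reference_sets:
        seen.update(ref)
    return sorted(set(candidates) - seen)

# Made-up accession numbers, for illustration only.
genbank_update = ["A00001", "A00002", "A00003", "A00004"]
embl = ["A00001", "A00003"]
xembl = ["A00004"]

# Entries of the GENBANK update that are in neither EMBL nor XEMBL.
gb_new = exclusion_set(genbank_update, [embl, xembl])
print(gb_new)
```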

Using the GCG Software: 'lookup'

The program 'lookup', introduced in GCG version 8.1, is GCG's implementation of the SRS software. In contrast to the original package, it searches only sequence databases. 'lookup' has several levels of menus. The first one presents a list of sequence libraries which you can select to search. The next level provides the following option list:

 
  
Complete the query form below:  
  
                 All text:  
               Definition:  
                   Author:  
                  Keyword:  
            Sequence name:  
         Accession number:  
                 Organism:  
                Reference:  
                    Title:  
                  Feature:  
  On or after (dd-mmm-yy):                On or before (dd-mmm-yy):  
 Shortest sequence length:                Longest sequence length:  
  
     Inter-field operator:  AND           Form of output list:  Whole Entries  
                                              
There are several types of fields you can fill out:

Keep in mind that the searches are very fast, but may give you a lot of entries. Therefore, the program does not exit after the query has been launched with <CTRL><Z>, but offers a third menu:

 
  
 17110 entries were found.  
  
 Do you wish to:  
  
   1) write out this list to a file  
   2) preview the results  
   3) refine the query  
   4) choose different libraries  
  
   q) quit  
  
 Please choose one (* 1 *):  
  
If you select option 1, a file is created which can be used by other programs. Option 2 displays the description of the sequence and option 3 allows you to refine the query.
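
The effect of the "Inter-field operator" in the query form above can be sketched in a few lines of Python. This is not how 'lookup' is implemented: the case-insensitive substring match, the miniature entry list (including the made-up entry HSGLOB), and the function name matches are assumptions for illustration.

```python
def matches(entry, query, operator="AND"):
    """Check one entry against a field -> search term mapping.

    Each term is a case-insensitive substring test on the named field.
    `operator` combines the fields: "AND" needs all, "OR" needs any.
    """
    tests = [
        term.lower() in entry.get(field, "").lower()
        for field, term in query.items()
    ]
    if not tests:
        return True
    return all(tests) if operator == "AND" else any(tests)

# Hypothetical miniature database; field names follow the query form.
ENTRIES = [
    {"name": "PBR322", "definition": "Plasmid pBR322 complete sequence",
     "organism": "synthetic"},
    {"name": "ARBLSKP", "definition": "pBluescript SK(+) vector DNA",
     "organism": "synthetic"},
    {"name": "HSGLOB", "definition": "Human globin gene",
     "organism": "Homo sapiens"},
]

query = {"definition": "vector", "organism": "synthetic"}
hits_and = [e["name"] for e in ENTRIES if matches(e, query, "AND")]
hits_or = [e["name"] for e in ENTRIES if matches(e, query, "OR")]
print(hits_and)
print(hits_or)
```

With AND, every filled-in field narrows the result; with OR, every filled-in field widens it, which is one reason why an unselective query can return many thousands of entries.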

NOTE:

1) Since the 'lookup' program requires significant resources, it may not be supported at the local site. The SRS software is more powerful and should be preferred if you decide to work with large databases and non-sequence information.

2) The 'lookup' program may generate lists which cannot be saved due to disk shortage. All lists generated are stored in files. If your query was not selective enough, these lists will become rather big.

3) Keep in mind that the lists will not be updated automatically. As sequence databases grow very fast, your search has to be repeated periodically if you use the lists for other purposes (e.g., sequence searching).

Consecutive Searches

Consecutive searches are queries which do not search an entire database but a list of files created earlier. 'lookup' can search these lists effectively if the following syntax is used:

$ lookup/infile=@lookup.list

Using the GCG Software: 'stringsearch'

The program 'stringsearch' is not as fast as 'lookup' and may not be optimally suited for your purpose. (Use this program only if you work with a GCG version older than version 8.1.) 'stringsearch' has 2 menu options:

The program identifies entries by searching the sequence documentation with keywords like 'globin' or 'human'. Example:

$ stringsearch

 
STRINGSEARCH through what sequence(s) (* GenEMBL:* *) ?   
Do you want to search through:       
	A) definitions       
	B) complete sequence records   
Please choose one (* A *):   
Search for what text patterns ?  bluescript   
What should I call the output file (* genembl.strings *) ?  
  
...   
*** Em_syn:ARBLSKP ***  
pBluescript SK(+) vector DNA, phagemid excised from lambda ZAP 2,958bp  
...       
  
Sequences searched:    69842   
Sequences with matches:        8          
Patterns sought: bluescript              
Output file: genembl.strings  
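
The kind of search shown in this dialogue can be sketched in Python as follows. The sketch assumes a case-insensitive substring match on the definition line, which is not confirmed for 'stringsearch'; the function name string_search is hypothetical, and the entry name given to pBR322 is assumed for illustration.

```python
def string_search(definitions, pattern):
    """Return (entries searched, matching entries) for a text pattern.

    `definitions` maps an entry name to its definition line; the match is
    a case-insensitive substring test, which is an assumption.
    """
    needle = pattern.lower()
    hits = {name: text for name, text in definitions.items()
            if needle in text.lower()}
    return len(definitions), hits

# Definition lines taken from this chapter; the second entry name is
# assumed. A real search covers tens of thousands of entries.
DEFINITIONS = {
    "Em_syn:ARBLSKP": "pBluescript SK(+) vector DNA, phagemid excised "
                      "from lambda ZAP 2,958bp",
    "Em_syn:PBR322": "Plasmid pBR322 complete sequence",
}

searched, hits = string_search(DEFINITIONS, "bluescript")
print("Sequences searched:", searched)
print("Sequences with matches:", len(hits))
for name, text in hits.items():
    print("***", name, "***")
    print(text)
```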
  
NOTE:

Find Sequences in the Databases with ATLAS

The ATLAS program is the successor of the XQS program, which replaced the NAQ and PSQ programs. None of this software is made or distributed by GCG. These programs have been created at Protein Identification Resource International (PIR) and can be obtained from either office of PIR. ATLAS has an extensive capability to search databases by author, entry name, accession number, or feature, just to name a few.

To get started, you should read the documentation provided with the software. Data can be viewed on the screen with

$ ATLAS

 
  
                               ATLAS of                           
                          PROTEIN and GENOMIC                                
                               SEQUENCE                        
                         Version 1.40, June 1992      
       (C) Copyright 1992 National Biomedical Research Foundation  
                National Biomedical Research Foundation  
         3900 Reservoir Road, NW Washington, DC 20007-2195 USA  
  Tel: 202-687-2121  FAX: 202-687-1662  E-Mail: PIRMAIL@GUNBRF.BITNET  
  
The following command can activate the databases:

ATLAS> bases

The next command activates all databases:

ATLAS> bases *

To search for titles, type

ATLAS> find

and for help

ATLAS> help

At the time of this writing, VMS and UNIX versions of ATLAS were available which are expected to have identical functionality. An MS-DOS version for the CD-ROM was also included.

Find Sequences in the Databases with SRS

The SRS program has been written by Thure Etzold, EMBL. The full SRS software is neither made nor currently distributed by GCG. The program has been created at EMBL and can be obtained from the author or by 'anonymous ftp'. SRS has an extensive capability to search databases by author, entry name, accession number, or feature, just to name a few. In addition, links provided in one database can be followed to get to the next entry, e.g., an EMBL entry can immediately be viewed as a SWISSPROT entry, provided that there is an equivalent.

NOTE: To run the SRS program, your terminal or terminal emulator must support "vt100" emulation to display the text correctly.

To get started, you should read the documentation provided with the software. Data can be viewed on the screen with

$ srs

Once started, select [U] for query and [S] for sequence and you will get the mask needed to compose a query. The [S] field opens when you press the <SPACE BAR> and enables you to select the databases. At the time of this writing, VMS and UNIX versions of SRS were available which are expected to have the same functionality. The new release of SRS, version 4.x, supports an ASCII interface if the srscurs extensions are installed. 'srscurs' can be run via HASSLE. There is also a very powerful command line interface which is called 'getz'. This program is included in GCG release 8.1, where it is called 'lookup', and is available from the author (T. Etzold) on request. At the time of this writing, a networked version of SRS was in preparation. SRS is also accessible via the World Wide Web.

Find Sequences in the Databases with ENTREZ

The ENTREZ program has been written by the programming staff at the NCBI. This software is neither made nor distributed by GCG. The program has been created at NCBI and can be obtained as CD-ROM distribution or by 'anonymous ftp'. ENTREZ runs on Mac/PCs, VMS, and UNIX. The latter two require an X-Windows interface.

The big advantage of ENTREZ is the inclusion of a subset of MEDLINE, covering the abstracts of entries submitted to the sequence databases. ENTREZ requires specific data sets which can be purchased on CD-ROM.

The BioComputing facility at Basel holds a subscription to the ENTREZ CD-ROM set. You can also access ENTREZ via network, if you prefer. In Europe, networked versions of SRS are expected to serve similar purposes and use similar or fewer resources. If you wish to access ENTREZ, please contact the BioComputing laboratory for further information.


View (Local) Sequence Data

View Data on the Screen

The GCG program typedata displays database entries on the screen. The entry specification must be provided as

database:name

or

database:accessionnumber

e.g.,

 
  
genembl:pbr322      (EMBL)   
genembl:synblue     (Genbank)  
  
Note that the entry name or accession number must be determined in advance via any of the methods described above. If you want to view annotation data on the screen, use the command line qualifier "ref":

$ typedata /ref

 
FETCH copies GCG sequences or data files from the GCG database into   
your directory or displays them on your terminal screen.   
  
FETCH what sequence(s) ?  genembl:pbr322   
  
pbr322_ref  
PBR322     - Plasmid pBR322 complete sequence  
ID   PBR322     standard DNA SYN 4363 BP.  
XX  
AC   V01119  
XX  
...  
  
The ATLAS program has an in-built display option. Use the command 'show' on the ATLAS prompt:

ATLAS> show

The SRS program has an in-built display option. Select [O] to set options and [E] to list entries, then select the entry of interest with the arrow keys and press <RETURN>.

The ENTREZ program has an in-built display option. Select the corresponding field to place the query after selecting the data source.

The programs 'gopher' and WWW allow you to view database entries in a very convenient fashion. See the corresponding sections below for details.

Copy Data to Your Directory

The GCG program fetch copies any (GCG) data on disk to your current directory. You can specify data files or database entries. The entry specification must be provided as

database:name

or

database:accessionnumber

e.g.,

 
  
genembl:pbr322      (EMBL)   
genembl:synblue     (Genbank)  
  
Note that the entry name or accession number must be determined in advance via any of the methods described above. The dialogue on the screen is as follows:

$ fetch

 
FETCH copies GCG sequences or data files from the GCG database  
into your directory or displays them on your terminal screen.   
  
FETCH what sequence(s) ?  genembl:pbr322 	  
  
pbr322.syn  
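
The database:name and database:accessionnumber specifications used by 'typedata' and 'fetch' can be taken apart mechanically. The following Python sketch is an illustration only: it assumes that accession numbers of this period consist of one letter followed by five digits (as in V01119) and treats every other identifier as an entry name. The function name parse_spec is hypothetical.

```python
import re

# Assumption: accession numbers of this period look like one letter
# followed by five digits (e.g. V01119); anything else is an entry name.
ACCESSION = re.compile(r"^[A-Za-z]\d{5}$")

def parse_spec(spec):
    """Split 'database:identifier' and classify the identifier."""
    database, sep, identifier = spec.partition(":")
    if not sep or not database or not identifier:
        raise ValueError(f"expected database:name, got {spec!r}")
    kind = "accession" if ACCESSION.match(identifier) else "name"
    return database.lower(), identifier.upper(), kind

print(parse_spec("genembl:pbr322"))
print(parse_spec("genembl:V01119"))
```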
  
The ATLAS program has an in-built copy option. Use the command 'copy' on the ATLAS prompt:

ATLAS> copy

If you want to see a list of sequence names which can be used in ATLAS later on, use

ATLAS> list

The SRS program has an in-built copy option. Select [O] to set options and [E] to list entries, then [H] to save buffer or [N] to write names.

The ENTREZ program has an in-built copy option. Select the <save> button.


Using Electronic Mail to Get Sequences via Network

Mail servers, like the famous European ones at EMBL and EBI (netserv@embl-heidelberg.de and netserv@ebi.ac.uk, respectively), send sequences via electronic mail on request. To use this service, you need to know the procedures for sending and receiving mail. The easiest way to get started is to follow one of the examples below and retrieve the HELP file to get more information. This HELP document explains how to retrieve the data submission form. It should be noted that electronic mail is not recommended for sequence retrieval if you have access to other alternatives (e.g., access via WWW).

The 'MAIL' Program in VMS

Example:

 
  
$  MAIL   
Mail> send   
To: SMTP%"netserv@embl-heidelberg.de"        
Subj: help  
<CTRL> <Z>  
Mail> quit  
$  
  
(The term "SMTP%" can be different on other computer systems.) The system will send the mail and after some time you should encounter the message:
 
  
New mail on node YOGI from SMTP%"netserv@embl-heidelberg.de"  
  
Note that even if you are not logged in, the message will be received anyway. In contrast to other technologies, like 'gopher' and WWW, electronic mail does not require that you wait for the reply. This is called asynchronous processing. If you received electronic mail while you were not logged in, you will see the following message when you start a new session:
 
  
You have 1 new mail message.  
  
Example:
 
  
$  mail   
You have 1 new mail message.  
Mail> last  
From:SMTP%"netserv@embl-heidelberg.de"  
Subj: Automatic reply......  
Mail> extract/noheader  
File: mail.dat  
Mail> delete  
Mail> quit  
$  
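
On systems without the VMS 'MAIL' program, the same request can be composed with a few lines of Python. The sketch below only builds the message and does not send it. It assumes, as in the "Subj: help" example above, that the command goes into the subject line and that the body may stay empty; the sender address is a placeholder, the function name build_request is hypothetical, and no assumption is made that the server still answers.

```python
from email.message import EmailMessage

def build_request(server, command, sender):
    """Compose a mail-server request; nothing is sent.

    Assumption: the command goes into the subject line, as in the
    'Subj: help' example, and the body is left empty.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = server
    msg["Subject"] = command
    return msg

# The sender address is a placeholder.
request = build_request("netserv@embl-heidelberg.de", "help",
                        "user@example.org")
print(request["To"], "|", request["Subject"])
```

Sending the message would additionally require a mail transport configured for your site, which is outside the scope of this sketch.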
  

The 'EAN' Program in VMS

The following example assumes that you have set up and configured 'ean' correctly. (This is an automatic procedure if you start the program for the first time.)

Example:

 
  
$  ean  
> comp  
To: netserv@embl-heidelberg.de  
Subj: help  
<CTRL> <Z>  
send   
options?  send  
> quit  
$  
  
If you have set the 'autoedit' option to TRUE, you can jump directly into the editor. The screen clears after you have typed the "Subj:"-line and you can enter the message as you would type it in your normal editor. To leave the editing mode, you must hit <CTRL><Z> twice! Then you can do the 'send' as shown above. The system will send the message. After some time, a message should show up if you are still logged in. If you received electronic mail while you were not logged in, you will only be notified when you start a new session if the line
 
  
$ eancheck  
  
is part of the so-called LOGIN.COM file. If this is not the case, you can either append this line to your LOGIN.COM file or type this command interactively. The message will look like
 
  
### New EAN mail ###  
  
The messages are received automatically when you start the 'EAN' program. You will get a list of messages (with numbers) to choose from, and the message is only displayed after proper selection. Refer to the command 'help' for possible options for the 'ean' program.

Example:

 
  
$ ean  
Accepting messages ....  
1 NU netserv@embl-heidelberg.de Thanks for the call, ...  
2 NU netserv@embl-heidelberg.de Automatic reply ...  
> 2  
From:netserv@embl-heidelberg.de  
Subj: Automatic reply......  
> print full on mail.dat  
> delete  
> quit  
$  
  


Find Sequences in the Databases with 'gopher' (via network)

The program 'gopher' is a browsing tool which was primarily used for retrieving text information. Some servers allow you to search for keywords in a database and retrieve database entries from a menu presented afterwards. The keywords should be of high significance (e.g., an accession number). Amongst others, the following 'gopher' servers are available:

 
  
Indiana:    GENBANK database  
Houston:    PIR database  
  
You must be connected to the Internet to contact a remote server. To connect to a 'gopher' server, you need to have the so-called 'gopher' client program. Both client and server software is available from the University of Minnesota, where 'gopher' was developed. Program versions for various platforms, including PCs and Macintoshes, exist. To start 'gopher' on a terminal or terminal emulator, type

$ gopher

You can also try other programs like 'xgopher' or 'mosaic'. (If you have access to 'lynx' or 'mosaic', you should use these programs instead of 'gopher'.) After you have started 'gopher', you should see a menu list. Use the cursor keys to select an option and hit <RETURN> to activate the field. In Europe, search for "EMBnet Information Resource", then "Database", and you will find the databases mentioned above.

To do a database search, select one of the databases and you will be prompted to enter a keyword. The hits will be presented as menu options. Selecting any of these displays the text on the screen. After having inspected a text file, you can continue with <RETURN>, save it by pressing the <S> key, or mail it. Save it now, and then convert the format of the retrieved sequence as described in section "Reformatting Sequences".

NOTE: The filename of the sequence saved will change during this procedure; see note in the reformat section.


Find Sequences in the Databases with World Wide Web (via Network)

The World Wide Web (WWW) is a network archiving system which is primarily used for retrieving text information. Some servers allow you to search for keywords in a database and retrieve database entries from a menu presented afterwards. The keywords should be of high significance (e.g., an accession number). Amongst others, the following WWW servers are available:

 
  
SRSWWW:    EMBL, SWISSPROT, other databases, many sites world-wide  
NCBI:      GENBANK, SWISSPROT, other databases  
HOUSTON:   various databases  
  
You must be connected to the Internet to contact a remote server. To connect to a WWW server (also known as 'httpd' daemon), you need to have the so-called WWW client program. Famous WWW clients are available for many platforms, including PCs and Macintoshes. The 'mosaic' client uses a graphical user interface and can only be used on systems which are equipped for this purpose, such as X-Windows or personal computers (Windows, Mac). Text information can be browsed with the text-oriented 'lynx' client. Both client and server software is available from the CERN laboratory, where WWW was developed. The 'mosaic' client was developed at NCSA, but the software is mirrored at many sites. To start WWW on a terminal or terminal emulator, type

$ LYNX

After you have started 'lynx', you should see a screen full of text. Use the cursor keys to select an option and hit <RETURN> to activate the field.

To do a database search, you first need to find the "page" which offers the option to search databases. Once there, select one of the databases and you will be prompted to enter a keyword. The hits will be presented as menu options. Selecting any of these displays the text on the screen. After having inspected a text file, you can either continue with <RETURN> or save it by pressing the <S> key. Save it now, and then convert the format of the retrieved sequence as described in section "Reformatting Sequences".

NOTE: The filename of the sequence saved will change during this procedure.

If you have access to 'mosaic' at your site, you should use 'mosaic' instead of 'lynx'. Make sure that you have configured your DISPLAY correctly and type

$ mosaic

The procedure is analogous to 'lynx', but you use the mouse to activate the desired field. Browsers other than 'mosaic' are available from various commercial enterprises. In academia, the Netscape browser is, at the time of this writing, licensed without charge. Other browsers are supplied as part of the workstation or communication software.


The SRSWWW System

The SRS system is accessible over the international networks via WWW.

SECURITY ADVICE: The databases accessed by the SRSWWW system do not run under the control of the local administration. Be aware that the results might not match the quality constraints applied usually. This is true in particular for non-academic sites.

At the time of this writing, the following servers were available:

 
  
URL                                        Site  
------------------------------------------+----------------------  
http://hubi.abc.hu:80/srs/srsc             ABC, Hungary  
http://ben.vub.ac.be:80/srs/srsc           BEN, Belgium  
http://wwwd.bmc.uu.se:80/srs/srsc          BMC, Sweden  
http://bioslave.uio.no:8001/srs/srsc       BiO, Norway  
http://www.ch.embnet.org:80/srs/srsc       Biozentrum, Basel, Switzerland  
http://www-srs.caos.kun.nl:80/srs/srsc     CAOS/CAMM, Netherlands  
http://cypress.csc.fi:8001/srs/srsc        CSC, Finland  
http://www.ebi.ac.uk:80/srs/srsc           EBI, Hinxton, UK  
http://www.embl-heidelberg.de:80/srs/srsc  EMBL, Heidelberg, Germany  
http://www.hgmp.mrc.ac.uk:80/srs/srsc      HGMP, Hinxton, UK  
http://www.infobiogen.fr:80/srs/srsc       INSERM, France  
http://iubio.bio.indiana.edu:81/srs/srsc   IUBio Archive, Indiana, USA  
http://seqnet.dl.ac.uk:80/srs/srsc         SEQNET, Daresbury, UK  
http://www.sanger.ac.uk:80/srs/srsc        Sanger, Hinxton, UK  
http://mcbi-34.med.nyu.edu:80/srs/srsc     Skirball Inst., NY, USA  
http://wehiz.wehi.edu.au:80/srs/srsc       WEHI, Australia  
http://dapsas.weizmann.ac.il:80/srs/srsc   Weizmann, Israel  
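
The addresses in this table all follow the same pattern: host name, port number, and the path /srs/srsc. The following Python sketch splits a few of them with the standard library; the function name describe is hypothetical, and the sketch does not contact any server.

```python
from urllib.parse import urlsplit

# Three rows copied from the table of SRSWWW servers.
SERVERS = {
    "Biozentrum, Basel, Switzerland": "http://www.ch.embnet.org:80/srs/srsc",
    "BiO, Norway": "http://bioslave.uio.no:8001/srs/srsc",
    "IUBio Archive, Indiana, USA": "http://iubio.bio.indiana.edu:81/srs/srsc",
}

def describe(url):
    """Return (host, port, path) of an SRSWWW address."""
    parts = urlsplit(url)
    # Port 80 is the HTTP default and is used when none is given.
    return parts.hostname, parts.port or 80, parts.path

for site, url in SERVERS.items():
    host, port, path = describe(url)
    note = "" if port == 80 else " (non-default port)"
    print(f"{site}: {host} port {port}{note}")
```

Since port 80 is the default for HTTP, the ":80" in most of the addresses can be omitted when typing them; the other port numbers must be given explicitly.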
  

Getting Started

'mosaic' Users

First, open an extra 'mosaic' window to work in. You can do this by clicking on the <clone> button at the bottom of the screen. Then, pull down the <File> menu, select <Open URL> and type in one of the addresses listed above to get to the main SRS page.

'netscape' Users

First, open an extra 'netscape' window to work in. You can do this by pulling down the <File> menu and selecting <New Window>. (You can also hit <ALT><N> on the keyboard.) Then, click on the <open> icon in the first row and type in one of the addresses listed above to get to the main SRS page.

Example: Searching Databases

Sequence databases are treated as one group of databases in SRS. They are indexed in the same way (as far as possible). You can either search all of them at the same time or select the ones you are interested in.

Sequence databases contain cross-references to each other and also to other biology databases, including the sequence-related databases.

The following is a step-by-step introduction.

This is only a short introduction. Use the system extensively to practice.


JAM produced file: HOW6.HTML