Section 6-2: Obtaining Data from Local Databases

The following sections describe GCG software as well as additional software which may not be part of your installation.

Databases available at the local site usually include:

 
Database name                GCG name       Contents
----------------------------------------------------------------
EMBL + Updates,
GENBANK + Updates
(GB as exclusion set)        GENEMBLPLUS:   all DNA databases (1)

SWISSPROT                    SWISSPROT:     most proteins
PIR International            PIR:           most proteins
NEW entries of EMBL          EM_NEW:        EMBL new entries (2)
GENBANK updates              GB_NEW:        GENBANK new entries (3)
  

1) The definition of GENEMBL can vary. Depending on the site, it is either GENBANK plus an exclusion set of EMBL entries not found in GENBANK, or vice versa. If your site is connected to a network over which the data are updated periodically, the GENEMBL set may also include daily updates.

2), 3) The definitions vary. Names such as XEMBL, EM_NEW, EMBL_DAILY, GB_NEW, XSWISS, SW_NEW, and PIR4 denote sets of preliminary entries. Depending on your site and/or affiliation, entries that are not yet found in the EMBL or GENBANK update set may show up in the corresponding other set as a so-called "exclusion set". At other sites, GB_NEW and EM_NEW may contain all entries of GENBANK and EMBL, respectively, which can cause duplicates.
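
Where GB_NEW and EM_NEW overlap, the same entry can appear twice in combined results. The following Python sketch shows one way to merge two result sets while keeping only the first occurrence of each accession number. It is not part of GCG; the function name, the entry names, and the accession numbers are made up for illustration.

```python
from typing import Iterable


def merge_without_duplicates(*entry_sets: Iterable[tuple[str, str]]) -> list[tuple[str, str]]:
    """Merge (accession, entry_name) pairs, keeping the first entry seen per accession."""
    seen: set[str] = set()
    merged: list[tuple[str, str]] = []
    for entries in entry_sets:
        for accession, name in entries:
            key = accession.upper()  # compare accession numbers case-insensitively
            if key not in seen:
                seen.add(key)
                merged.append((accession, name))
    return merged


# Hypothetical hits from two update sets (accession numbers are invented).
gb_new_hits = [("X00001", "GB_NEW:EXAMPLE1"), ("X00002", "GB_NEW:EXAMPLE2")]
em_new_hits = [("X00002", "EM_NEW:EXAMPLE2"), ("X00003", "EM_NEW:EXAMPLE3")]

unique_hits = merge_without_duplicates(gb_new_hits, em_new_hits)
for accession, name in unique_hits:
    print(accession, name)
```

The order of the arguments decides which copy survives; here the GB_NEW copy of a duplicated entry is kept because that set is passed first.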

NOTE: The term GENEMBLPLUS, introduced in GCG version 8.1, is equivalent to GENEMBL, which was used before version 8.1.


Subsection 6.2.1

Using the GCG Software: 'lookup'

The program 'lookup', introduced in GCG version 8.1, is GCG's implementation of the SRS software. In contrast to the original package, it searches only sequence databases. 'lookup' has several levels of menus. The first one presents a list of sequence libraries from which you select those to search. The next level provides the following query form:

 

  
Complete the query form below:

                 All text:
               Definition:
                   Author:
                  Keyword:
            Sequence name:
         Accession number:
                 Organism:
                Reference:
                    Title:
                  Feature:
  On or after (dd-mmm-yy):                On or before (dd-mmm-yy):
 Shortest sequence length:                Longest sequence length:

     Inter-field operator:  AND           Form of output list:  Whole Entries
  
                                            
  
There are several types of fields you can fill out:

Keep in mind that the searches are very fast, but may give you a lot of entries. Therefore, the program does not exit after the query has been launched with <CTRL><D>, but offers a third menu:

 

  
 17110 entries were found.

 Do you wish to:

   1) write out this list to a file
   2) preview the results
   3) refine the query
   4) choose different libraries

   q) quit

 Please choose one (* 1 *):
  

  
If you select option 1, a file is created which can be used by other programs. Option 2 displays the description of the sequence and option 3 allows you to refine the query.
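
The "Inter-field operator" on the query form (AND by default) controls how the filled-in fields are combined. The following Python sketch illustrates that combination logic only. It is not how 'lookup' is implemented; it assumes that OR is the alternative operator setting and that matching is a case-insensitive substring test. All entries and field values are hypothetical.

```python
def entry_matches(entry: dict[str, str], query: dict[str, str], operator: str = "AND") -> bool:
    """Check an entry against the filled-in query fields, combined with AND or OR."""
    checks = [
        term.lower() in entry.get(field, "").lower()
        for field, term in query.items()
        if term.strip()  # fields left empty on the form are ignored
    ]
    if not checks:
        return False  # sketch's choice: an empty form selects nothing
    return all(checks) if operator == "AND" else any(checks)


# Hypothetical entries; names and texts are invented for illustration.
entries = [
    {"name": "EXAMPLE1", "definition": "hypothetical human globin gene", "organism": "Homo sapiens"},
    {"name": "EXAMPLE2", "definition": "hypothetical mouse globin gene", "organism": "Mus musculus"},
    {"name": "EXAMPLE3", "definition": "hypothetical human kinase mRNA", "organism": "Homo sapiens"},
]

query = {"definition": "globin", "organism": "homo sapiens", "author": ""}

and_hits = [e["name"] for e in entries if entry_matches(e, query, "AND")]
or_hits = [e["name"] for e in entries if entry_matches(e, query, "OR")]
print("AND:", and_hits)
print("OR: ", or_hits)
```

With AND, only the entry that satisfies both filled-in fields is selected; with OR, every entry that satisfies at least one of them is selected, which is why a loosely specified query can return very long lists.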

NOTE:

1) Since the 'lookup' program requires significant resources, it may not be supported at the local site.

2) All lists generated by 'lookup' are stored in files. If your query is not selective enough, these lists can become rather large and may not be saved if disk space runs short.

3) Keep in mind that the lists are not updated automatically. Because sequence databases grow very fast, you have to repeat your search periodically if you use the lists for other purposes (e.g., sequence searching).
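
Because saved lists go stale, it can help to check the age of a list file before reusing it. The following Python sketch does this with the file's modification time. It is an illustration outside of GCG; the function name and the 30-day threshold are assumptions, and the file name lookup.list is taken from the example further below.

```python
import os
import time
from typing import Optional

SECONDS_PER_DAY = 86400.0


def list_is_stale(path: str, max_age_days: float = 30.0, now: Optional[float] = None) -> bool:
    """Return True if the file at `path` was last modified more than `max_age_days` ago."""
    current = time.time() if now is None else now
    age_days = (current - os.path.getmtime(path)) / SECONDS_PER_DAY
    return age_days > max_age_days


LIST_FILE = "lookup.list"  # name of a previously saved list file

if os.path.exists(LIST_FILE):
    if list_is_stale(LIST_FILE):
        print(LIST_FILE, "is older than 30 days; consider repeating the search.")
    else:
        print(LIST_FILE, "is recent enough to reuse.")
else:
    print(LIST_FILE, "not found; run the search first.")
```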

Consecutive Searches

Consecutive searches are queries that search a list file created earlier rather than an entire database. 'lookup' can search these lists effectively if the following syntax is used:

$ lookup -infile=@lookup.list


Subsection 6.2.2

Using the GCG Software: 'stringsearch'

The program 'stringsearch' is not as fast as 'lookup' and may not be optimally suited for your purpose. (Use this program only if you work with a GCG version older than version 8.1.) 'stringsearch' has two menu options: it can search through the definitions only or through the complete sequence records.

The program identifies entries by searching the sequence documentation with keywords like 'globin' or 'human'. Example:

% stringsearch

 
STRINGSEARCH through what sequence(s) (* GenEMBL:* *) ?

Do you want to search through:

	A) definitions
	B) complete sequence records

Please choose one (* A *):

Search for what text patterns ?  bluescript

What should I call the output file (* genembl.strings *) ?

...
*** Em_syn:ARBLSKP ***
pBluescript SK(+) vector DNA, phagemid excised from lambda ZAP 2,958bp
...

Sequences searched:    69842
Sequences with matches:        8
Patterns sought: bluescript
Output file: genembl.strings
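
The difference between the two menu options can be illustrated with a small Python sketch. It is not the GCG implementation; it only mimics the behavior visible in the session above, where the pattern "bluescript" matches "pBluescript" regardless of case. The function name, the record layout, and the second record are hypothetical.

```python
def string_search(records, pattern, scope="definitions"):
    """Search (name, definition, full_text) records for a text pattern.

    scope="definitions" looks at the definition line only (menu option A);
    any other value searches the complete record text (menu option B).
    Returns the number of records searched and the list of matching records.
    """
    needle = pattern.lower()
    searched = 0
    hits = []
    for name, definition, full_text in records:
        searched += 1
        haystack = definition if scope == "definitions" else full_text
        if needle in haystack.lower():
            hits.append((name, definition))
    return searched, hits


# The first definition is copied from the session above; everything else is invented.
records = [
    ("Em_syn:ARBLSKP",
     "pBluescript SK(+) vector DNA, phagemid excised from lambda ZAP 2,958bp",
     "pBluescript SK(+) vector DNA, phagemid excised from lambda ZAP 2,958bp"),
    ("EXAMPLE:REC2",
     "Hypothetical cloning vector",
     "Hypothetical cloning vector\nComment: backbone derived from Bluescript"),
]

searched, definition_hits = string_search(records, "bluescript", scope="definitions")
_, record_hits = string_search(records, "bluescript", scope="records")

print("Sequences searched:", searched)
print("Matches in definitions:", len(definition_hits))
print("Matches in complete records:", len(record_hits))
```

Searching complete records finds entries that mention the pattern only in comments or references, at the cost of scanning much more text per entry.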
  

  
NOTE:


Subsection 6.2.3

Find Sequences in the Databases with ATLAS

The ATLAS program is the successor of the XQS program, which replaced the NAQ and PSQ programs. None of this software is made or distributed by GCG. These programs were created at Protein Identification Resource International (PIR) and can be obtained from either office of PIR. ATLAS has extensive capabilities for searching databases by author, entry name, accession number, or feature, to name a few.

 

  
                               ATLAS of
                          PROTEIN and GENOMIC
                               SEQUENCE
                         Version 1.40, June 1992
        (C) Copyright 1992 National Biomedical Research Foundation

                National Biomedical Research Foundation
                        3900 Reservoir Road, NW
                     Washington, DC 20007-2195 USA
                          Tel: 202-687-2121
                          FAX: 202-687-1662
                    E-Mail: PIRMAIL@GUNBRF.BITNET
  

  
We do not currently support the ATLAS software in Basel.


Subsection 6.2.4

Find Sequences in the Databases with SRS

The SRS program was written by Thure Etzold at EMBL. The full SRS software is neither made nor currently distributed by GCG; it can be obtained from the author or by 'anonymous ftp'. SRS has extensive capabilities for searching databases by author, entry name, accession number, or feature, to name a few. In addition, links provided in one database can be followed to the corresponding entry in another, e.g., an EMBL entry can immediately be viewed as a SWISSPROT entry, provided that there is an equivalent.

NOTE: To run the SRS program, your terminal must support "vt100" emulation so that the text is displayed properly.

% srs

Once started, select [U] for query and [S] for sequence, and you will get the mask needed to compose a query. The [S] field opens when you press <SPACE BAR> and lets you select the databases. At the time of this writing, VMS and UNIX versions of SRS were available, which are expected to have the same functionality. The new release of SRS, version 4.x, supports an ASCII interface if the srscurs extensions are installed. 'srscurs' can be run via HASSLE. There is also a very powerful command line interface called 'getz'. This program is included in GCG release 8.1, where it is called lookup, and is available from the author (T. Etzold) on request. At the time of this writing, a networked version of SRS was in preparation. SRS is also accessible via the World Wide Web.


Subsection 6.2.5

Find Sequences in the Databases with ENTREZ

The ENTREZ program was written by the programming staff at the NCBI. This software is neither made nor distributed by GCG; it can be obtained as a CD-ROM distribution or by 'anonymous ftp'. ENTREZ runs on Mac/PCs, VMS, and UNIX. The latter two require an X-Windows interface.

The big advantage of ENTREZ is the inclusion of a subset of MEDLINE, covering the abstracts of entries submitted to the sequence databases. ENTREZ requires specific data sets which can be purchased on CD-ROM.

