Section 11-5: Use of Specific Searching Libraries

Very low similarity of sequences might not be easily detected if the search is performed in the entire database. Due to the noise level of similarities scored by chance, important matches might be missed. The use of filters is essential in this case. A filter is any procedure applied to reduce the total number of sequences searched, most desirably using criteria which match the expectation of the performed search. These might be

The basic difference between these methods is how the sub-libraries are addressed. Depending on the algorithm, some of the procedures might not be available to you. E.g., the 'blast' database searching programs will not allow the use of user-specified subsets.


Subsection 11.5.1

Database Sub-Libraries

There is a special manual of the GCG package which will tell you about database sub-libraries (see below). Depending on whether your site uses the EMBL database or the GENBANK database as its base set, the corresponding counterpart will be available as a subset. As a result, the GCG program package always has both EMBL and GENBANK logicals defined, even if a subset contains only a small number of sequences. On rare occasions, these subsets might even be entirely empty - this will happen if the EMBL and GENBANK subsections are perfectly in sync.

The WPI version of the interface will present these database subsections to you in a database-neutral fashion if you use the correct window.

To see what sub-libraries are supported, you might try to obtain an on-line list as follows in the command line version:

% name | egrep 'EM | GB'

Use the resulting names as GCG libraries. Additional help is provided in the data set manual of the GCG package. E.g., the EST:* specification applies if you are interested only in the expressed sequence tag section of the EMBL data library.

WARNING

The EST section of the DNA databases usually covers all sorts of species. If you want to use data subsections by organism rather than a section in its entirety, you will presumably need to employ large lists (such as those created with suitable search programs) and process these as described below.

TIP

The SWISS-PROT database uses the organism name as part of the entry name. E.g., Swissprot:*yeast will cover all yeast sequences.
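To picture what the wildcard in this TIP selects, the following Python sketch applies the same pattern to a handful of entry names. It only illustrates the matching idea and says nothing about how the GCG software resolves the specification internally. The helper name select_entries and the short sample of entry names are made up for the example, and case-insensitive matching is an assumption.

```python
from fnmatch import fnmatchcase


def select_entries(entry_names, pattern):
    """Return the entry names matching a shell-style wildcard pattern.

    Both sides are lower-cased first, so matching is case-insensitive
    (an assumption made for this sketch).
    """
    pat = pattern.lower()
    return [name for name in entry_names if fnmatchcase(name.lower(), pat)]


# A small illustrative sample of entry names of the form PROTEIN_ORGANISM.
entries = ["ADH1_YEAST", "CYC_HUMAN", "CYC1_YEAST", "LACY_ECOLI"]

yeast_only = select_entries(entries, "*yeast")
print(yeast_only)
```

Because the organism code is the trailing part of the entry name, a leading wildcard is enough to pick out one organism.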


Subsection 11.5.2

Sequence Lists (formerly File Of Sequence Names (FOSN))

To use groups of sequences, each program package supplies a mechanism with a specified syntax. This syntax tells the software that the specification given shall be used as a group of sequences rather than as a single sequence. The GCG package calls this mechanism a Sequence List. Documentation before the 8.0 release of the package might refer to this feature as a File of Sequence Names (FOSN). The idea is straightforward: programs no longer read the sequence data from the file named on input but use that file as a "pointer" to where the data can be found, i.e., they read the sequences from the files which are specified in a list file.

To maintain compatibility with the established input handling, files which specify a list of sequences rather than a sequence directly are tagged by the character @ (English spelling: "at" character). A Sequence List is produced by a number of programs, such as:

To utilize the resulting file of sequence names, you might use the @ character in front of the file name in all programs which use multiple files. Sequence searching is such an application. To use a Sequence List as a library, e.g., @my.fil in the fasta program, you may use this nomenclature at the prompt "which sequence(s)?"

NOTES

1) The file-of-sequence-names method might not be available if you run your sequence analysis via networks.
2) The 'blast' suite of programs cannot use a file of sequence names and requires its own database formatting (see below).
3) WPI users may use Sequence Lists much more conveniently by using the correct window - see below.

To reformat Sequence Lists into other formats, refer to the reformatting section.


Subsection 11.5.3

Multiple Sequence Files (MSF)

The sequences of a List of sequences are not stored in the list file itself. Rather, the List file is a file of pointers to the files which shall be worked on. This implies the danger that, if a file being pointed to is deleted, the list of sequences is no longer valid.
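The danger of dangling pointers can be checked for before a search is started. The Python sketch below reports which file references in a list file no longer exist. It rests on several assumptions that are not taken from this text: the list file is assumed to consist of free text, a line containing only "..", and then one entry per line; "!" is assumed to start a comment; entries containing a colon are taken to be database specifications such as EST:* and are skipped; relative names are resolved against the directory of the list file. The function name missing_files and all file names are hypothetical.

```python
import os
import tempfile


def missing_files(list_path):
    """Return the file references in a list file that no longer exist on disk."""
    missing = []
    in_body = False
    base_dir = os.path.dirname(os.path.abspath(list_path))
    with open(list_path) as handle:
        for raw in handle:
            # Drop comments (assumed '!' syntax) and surrounding whitespace.
            line = raw.split("!", 1)[0].strip()
            if not in_body:
                # Assumed layout: entries start after a line holding only "..".
                in_body = (line == "..")
                continue
            if not line:
                continue
            name = line.split()[0]
            if ":" in name:
                continue  # looks like a database entry, not a user file
            path = name if os.path.isabs(name) else os.path.join(base_dir, name)
            if not os.path.exists(path):
                missing.append(name)
    return missing


# Demo in a scratch directory: one file is present, one was "cleaned up".
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "kept.seq"), "w") as f:
        f.write("placeholder\n")
    list_file = os.path.join(tmp, "my.fil")
    with open(list_file, "w") as f:
        f.write("Hypothetical list for the demo\n..\nkept.seq\ndeleted.seq\nEST:*\n")
    demo_missing = missing_files(list_file)
    print(demo_missing)
```

Running such a check before a long search avoids discovering a stale list only after the job has failed.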

An alternative to lists of sequences, therefore, is to write all sequence data into a single file. This will enlarge the file size, and it also requires that a specific format is defined which allows multiple sequences rather than a single sequence in one file. Most conveniently, such a file is produced by the program 'pileup'. This application produces a multiple sequence alignment automatically, and stores the result in a single file, including gaps and the specific shift for each sequence. Multiple Sequence Files (MSF) (*.msf) are named as my.msf{*}. Details on the 'pileup' program are in the section on multiple sequence analysis.

NOTE: Due to some technical problems with localised keyboards, it might be difficult for you to produce the characters "{" and "}" by typing the corresponding keys. Use the command 'genhelp distances example' and use the COPY option of your terminal or terminal emulator to take the {*} into the paste buffer. PASTE the resulting keystrokes where appropriate.

To reformat multiple sequence files into other formats, refer to the reformatting section.


Subsection 11.5.4

Lists within the Wisconsin Package Interface (WPI)

Since Version 8, you may use the Wisconsin Package Interface (WPI) via the X-Windows system. Lists are readily handled and are the basic principle of this user interface. Specifically, lists might be expanded with a mouse click to select individual sequences. Refer to the corresponding WPI section for details.


Subsection 11.5.5

Impact of Electronic Networks, Time Effects and Location

The usage of Lists might be restricted, as not all databases are available at each site. Specifically, if you run your sequence analysis via networks or move from one site to another, Lists might be affected if they include site-specific features.

Keep in mind that Lists are created at a defined point in time. If you use keyword searching, your List will reflect the status of the database at this specific point in time. You might want to redo the keyword search frequently in order to maintain an up-to-date set of sequences. See also the notes below on the creation of your own databases.

Lists are notorious troublemakers if disk space is tight and references are made to specific user-provided files. Any 'cleaning' of sequence files from your directories might render Lists unusable if the references become obsolete. Similarly, if you work on several machines, using the same Lists on different computers requires an identical directory structure and the presence of all desired files in the expected locations. The output of the profilesearch program (see next chapter) is a List as well, and is known to inherit the location of the data used for searching. Manual editing may be required to overcome this limitation.
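When a List has to follow its files to another machine or directory, the manual editing mentioned above often amounts to replacing one directory prefix with another. The Python sketch below shows that idea on a few entry lines. The function name relocate_entries and all paths are hypothetical, and the sketch assumes that each entry line begins with the file path; database specifications such as EST:* are left untouched because they do not start with the old prefix.

```python
def relocate_entries(lines, old_prefix, new_prefix):
    """Replace a leading directory prefix in each list entry line."""
    result = []
    for line in lines:
        if line.startswith(old_prefix):
            line = new_prefix + line[len(old_prefix):]
        result.append(line)
    return result


# Hypothetical entries as they might appear in the body of a List.
entries = [
    "/home/alice/seqs/a.seq",
    "/home/alice/seqs/b.seq",
    "EST:*",
]

moved = relocate_entries(entries, "/home/alice/seqs/", "/data/shared/seqs/")
print(moved)
```

Keeping the original List and writing the relocated entries to a new file makes it easy to go back if the new location turns out to be incomplete.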

If you change location (e.g., move from the US to Europe or vice versa), you will possibly change environments. DNA databases are usually based on EMBL with GENBANK as exclusion (such as in Europe), or on GENBANK with EMBL as exclusion (such as in the US). Pacific rim sites will use DDBJ as base database. Unfortunately, the database names used by GCG software reflect the database origin, and therefore are not easily transferable. The only option to circumvent this 'feature' is to map one List into another using the SRS software. However, keep in mind that this option will consume a considerable amount of resources if you are not used to such work or cannot rely on local experts.


Subsection 11.5.6

Creation of own Databases

Lists of sequences or Files of Sequence Names can be very time-consuming if you need to search a large amount of data. If you have enough disk space, you can create your own database with the command dataset, or ask your system manager how to create one.

NOTE:

================================= Begin Exercise 12

Sequence searching: Use of the 'blast' and 'fasta' searching programs to analyse DNA and protein sequences derived from previous analysis.

 
| Query  |    Database    | Feature or name of
| from-to| entry  |from-to| identified sequence
|--------+--------+-------+---------------------
|        |        |       |
|        |        |       |
|        |        |       |
|        |        |       |
|        |        |       |
|        |        |       |
|        |        |       |
  

 
| Reading | Query |    Database    | Feature or name of
|frame no.|from-to| entry  |from-to| identified sequence
|---------+-------+--------+-------+---------------------
|         |       |        |       |
|         |       |        |       |
|         |       |        |       |
|         |       |        |       |
|         |       |        |       |
|         |       |        |       |
|         |       |        |       |
  
================================= End Exercise 12

