JAMF Archive

BioCompanion as published in 1995
THIS IS THE REFERENCE CODE AS PUBLISHED.
		Doelz, R.   
		Optimal production of biological documentation: the JAM format.
		Comput. Applic. Biosci. 11, 224-226 (1995).    
		
The version you are currently viewing is the one printed and distributed via the Internet from the server of BioComputing Basel. Version 3.1 of the BioCompanion was published with version 2 of the JAMF software. The server that was indicated in the documentation has ceased to exist.

Version 3.2 of the BioCompanion was not publicly available for free but was shareware that was distributed with GCG's software release 9. For the purpose of enhanced editing, JAMF was partially rewritten, and the proprietary version 3.x of JAMF was used from 1996 onwards. The BioCompanion is available in a current version from the publisher. It has changed significantly in both software and content.


Chapter 11: Sequence Searching


Tools for Sequence Searching

The requirement to identify a single sequence in a database is very much different from a keyword search. Keywords are expected to match exactly. This type of keyword searching has been described earlier.

Pattern searching programs will match a pattern exactly at single, defined positions in a sequence.

A sequence searching program, however, is expected to report

and, most important,

The major problem of sequence searching, therefore, is to find a reasonable definition for the similarity of sequences. Application programs doing "sequence searching" in general imply that each entry of a sequence database is compared to the query sequence sequentially, and the result is a list of database entries which are similar to the query sequence. The recipe for this type of program reads as follows:

 
  
 start program   
 initialise top-scoring list  
 for each entry in database  
    {   
     compare database sequence with query sequence   
     evaluate similarity as "score"   
     compare "score" to top-scoring list  
     if "score" better than lowest entry   
        {   
         place entry in top-scoring list  
        }  
    }  
 normalise top-scoring list  
 for each entry in top-scoring list   
    {  
     determine statistics  
     print out result   
    }  
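The recipe above might be sketched in Python as follows. This is a minimal illustration only: the function names, the toy identity score and the three-entry database are assumptions made for this example, and the normalisation and statistics steps of the recipe are left out.

```python
import heapq


def identity_score(query, target):
    """Toy score (an assumption): most identical symbols at any ungapped shift."""
    best = 0
    for shift in range(-(len(query) - 1), len(target)):
        matches = sum(
            1
            for i, symbol in enumerate(query)
            if 0 <= i + shift < len(target) and target[i + shift] == symbol
        )
        best = max(best, matches)
    return best


def search(query, database, list_size=40):
    """Scan every entry once and keep the list_size best-scoring entries."""
    top = []  # min-heap of (score, name); top[0] is the lowest entry kept
    for name, sequence in database.items():
        score = identity_score(query, sequence)
        if len(top) < list_size:
            heapq.heappush(top, (score, name))
        elif score > top[0][0]:
            # better than the lowest entry: replace it in the top-scoring list
            heapq.heapreplace(top, (score, name))
    return sorted(top, reverse=True)


# Hypothetical three-entry database, for illustration only
database = {
    "entry1": "tgatggtcaagtaaactatgaagagttt",
    "entry2": "cccccccccccc",
    "entry3": "atggtaatggcacaattgactttcctgaatttctga",
}
hits = search("atggtaatggcacaattgactttcctgaatttctga", database, list_size=2)
for score, name in hits:
    print(name, score)
```

The heap keeps the lowest retained score at hand, so each database entry needs only one comparison against the top-scoring list.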
  
Depending on the algorithm or implementation, some of the steps might be missing or integrated in other steps. Sophisticated sequence searching which performs extremely fast might be based on two approaches:


Sequence Searching with Heuristic Methods

Principle of Similarity Detection

A powerful heuristic of a fast screening can be expressed as follows:

Two sequences are similar if a sufficient number of identical oligomers is found at a given arrangement of the two.

The internal loop of a sequence searching program (see above) requires the identification of a score, which is a numerical value describing the quality of a sequence comparison. The observed score depends on

The numerical values of scores are specific to a given program and must not be used for comparison between different algorithms or result outputs. Unless a statistical significance or another type of normalisation is computed (see below), the score values are only meaningful relative to the search sequence and the algorithm used. The following example demonstrates how a heuristic sequence search algorithm might be implemented. We do this on the DNA level, but no fundamental difference will be seen in protein comparisons, as the heuristic algorithms score identities of short sequence fragments rather than using comparison tables. To a certain extent, the comparison using identities of oligomers can be visualised as a dotplot-type comparison. We reuse the example of earlier chapters, with

 
  
atggtaatggcacaattgactttcctgaatttctga   Seq. A (formerly, horizontal sequence)  
tgatggtcaagtaaactatgaagagttt           Seq. B (formerly, vertical sequence)  
  
In contrast to the dotplot, however, we now create a table of oligomers and note the location of the occurrence of these oligomers in the sequences. We are going to use di-nucleotides here, which is for the sake of brevity only, as nucleotide sequences typically use larger "words" (such as six or more). Sixteen oligomers are possible, but not all of them are found in the two sequences:
 
  
      A           B                            A            B  
  
AA    6,14,28     9,13,14,21             CA    11,13        8  
AC    12,19       15                     CC    24           -  
AG    -           10,22,24               CG    -            -  
AT    1,7,15,29   3,18                   CT    20,25,33     16  
  
GA    18,27,35    2,20,23                TA    5            12,17  
GC    10          -                      TC    23,32        7  
GG    3,9         5                      TG    2,8,17,26,34 1,4,19  
GT    4           6,11,25                TT    16,21,22,30,31 26,27  
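Such an oligomer table can be produced mechanically. The following Python sketch is one possible way to do it; the function name and the 1-based position convention are choices made for this illustration.

```python
from collections import defaultdict


def oligomer_positions(sequence, word_size=2):
    """Map each oligomer to the 1-based positions where it starts."""
    table = defaultdict(list)
    for start in range(len(sequence) - word_size + 1):
        word = sequence[start:start + word_size].upper()
        table[word].append(start + 1)
    return dict(table)


seq_a = "atggtaatggcacaattgactttcctgaatttctga"
seq_b = "tgatggtcaagtaaactatgaagagttt"
table_a = oligomer_positions(seq_a)
table_b = oligomer_positions(seq_b)

# One line per oligomer: positions in A, then positions in B ("-" if absent)
for word in sorted(set(table_a) | set(table_b)):
    print(word, table_a.get(word, "-"), table_b.get(word, "-"))
```

With a larger word size the same function applies unchanged; only the number of possible words grows.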
  
Having created such a table, we can now compute the largest segment of identity between the two sequences. The question, therefore, is whether we can find oligomers which match at one and the same shift of the two sequences. This can be readily achieved by calculating, for every oligomer found in both sequences, the difference between its position in B and its position in A, and by noting where each difference occurs:
 
Shift
observed    -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3   4   5   6   7   8   9  10
(in A)     ------------------------------------------------------------------------------------

Locations        25       8      13   7  11       2  14       1   6  22  18  21   4   6      16
                         26      14   8  30      14           2  12      21       5
                         27      31   9                       3  15      22       6
                         28          19                       4                  14
                                     20                      17
                                     27                      18
                                     30
                                     31
The result could be counted in a very primitive fashion, as in the dotplot-type example. In dotplots, a point was shown for each matching oligomer. Each column in the table above represents the identities on one diagonal. The largest number of matches on a single diagonal occurs at a shift of -4 from sequence A to sequence B; compare this to what was already observed in the schematic comparison section. Similarly, the second-best column with a shift of +2 was already known. However, this crude approach might not be sufficient to find the best matching segment of identity, as we ignore gaps entirely and just count occurrences of oligomer matches on a diagonal.
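The counting of matches per diagonal can be sketched in Python as follows. The function names are chosen for this illustration, and the shift is taken as the position in B minus the position in A, which reproduces the shift of -4 discussed above.

```python
from collections import defaultdict


def oligomer_positions(sequence, word_size=2):
    """Map each oligomer to the 1-based positions where it starts."""
    table = defaultdict(list)
    for start in range(len(sequence) - word_size + 1):
        table[sequence[start:start + word_size]].append(start + 1)
    return table


def shifts(seq_a, seq_b, word_size=2):
    """Group the start positions in A by shift (position in B minus position in A)."""
    index_b = oligomer_positions(seq_b, word_size)
    by_shift = defaultdict(list)
    for word, starts_a in oligomer_positions(seq_a, word_size).items():
        for pos_a in starts_a:
            for pos_b in index_b.get(word, []):
                by_shift[pos_b - pos_a].append(pos_a)
    return {shift: sorted(found) for shift, found in by_shift.items()}


seq_a = "atggtaatggcacaattgactttcctgaatttctga"
seq_b = "tgatggtcaagtaaactatgaagagttt"
diagonals = shifts(seq_a, seq_b)

# The diagonal carrying the most oligomer matches
best_shift = max(diagonals, key=lambda shift: len(diagonals[shift]))
print(best_shift, diagonals[best_shift])
```

Only oligomers present in both sequences contribute, so the work grows with the number of shared words rather than with the full dotplot area.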

To improve sensitivity, we need to search each column of identical shift for positions which occur at continuously increasing origin, such as 7,8,9 in the column of shift -4 above. Doing this, we find the following table of diagonals, sorted by the length of the observed diagonal:

 
  
Length = (number of oligomers -1) + (size of oligomer)   
  
Segment with shift          2    -4    -7    -5
Segment start position      1     7    26    13
Length                      5     4     4     3
  
The occurrence of mismatches will be noticed as an interruption of the segment. Gaps will occur as a change of shift. Interruptions and changes of shift, therefore, will not be considered in the initial identification, and many algorithms will not consider gaps at later stages either.
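The search for uninterrupted segments within one column of identical shift might be sketched as follows. The function name is hypothetical, and the position lists are copied by hand from the columns for the shifts +2, -4 and -7.

```python
def segments(positions, word_size=2):
    """Split sorted start positions into runs of consecutive values.

    Returns (start, length) pairs, where
    length = (number of oligomers - 1) + (size of oligomer).
    """
    runs = []
    run_start = previous = positions[0]
    for pos in positions[1:]:
        if pos != previous + 1:
            # interruption: close the current run and start a new one
            runs.append((run_start, previous - run_start + word_size))
            run_start = pos
        previous = pos
    runs.append((run_start, previous - run_start + word_size))
    return runs


columns = {
    2: [1, 2, 3, 4, 17, 18],
    -4: [7, 8, 9, 19, 20, 27, 30, 31],
    -7: [8, 26, 27, 28],
}
for shift, positions in columns.items():
    longest = max(segments(positions), key=lambda run: run[1])
    print(shift, longest)
```

For example, the positions 7,8,9 at shift -4 are three consecutive dinucleotides and therefore form a segment of length (3 - 1) + 2 = 4.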

The 'fasta' program (as listed in the "programs" section below) has the init scores in the first two columns of its numerical output which refer to the crude and the joined diagonals, respectively. Joined diagonals are found by extending the segment. The extension will be counted if adjacent diagonals can be joined to increase the overall score of the newly found segment.

The 'blast' type of programs, also mentioned below, gain speed by determining the occurrence of oligomers in the database sequences in a preprocessing step (to be executed only once, at formatting time of the database) before the actual search is started. This reduces the need to scan the database each time for every comparison. However, as the pre-computation is done on the entire database, an arbitrary search of subsections of the database is no longer possible without executing the format process that turns the subsection into a 'blast' database.

After the sequences of interest have been collected, the alignments found are ranked and displayed, which is also a crucial step for successful evaluation of the result. The most difficult problem of reliable sequence similarity searching is the proper selection of criteria which describe statistical relevance and map as closely as possible to biological relevance.

The 'fasta' type of programs will redo the comparison of the top-scoring entries found in the database with a rigorous algorithm (such as in the GCG program bestfit and in rigorous searches as described below). A more accurate alignment of the database sequences with the query sequence, which may go along with a re-scoring of the hits, increases the power of a sequence comparison. It should be kept in mind that the re-scoring (in the 'fasta' programs called the opt score) will not be reflected in the listing order of top-scoring entries, as these are sorted by the so-called "initn" score (the joined fragment score).

NOTE: The fasta 2.x version of this package adds very comprehensive statistics. However, at the time of writing, the GCG software did not include this version of the fasta package.

The 'blast' programs use a statistical approach in order to list only those sequence segments in the result which are not expected by chance. To do this, it is necessary to determine a score which can discriminate between "hit" and "random". The information unit is a "bit" and might be explained as follows: The amount of information stored in biological sequences is, from a statistical point of view, a function of the sequence composition. E.g., a poly-A stretch in a DNA sequence carries only the information A and the length of the sequence. Information can be measured in bits, each of which is either YES or NO, or 0 or 1 in the computer science world. A nucleotide, therefore, which can be any of four characters, can be written in two bits, as four different symbols require four different combinations:

 
  
     A 00  
     G 01  
     C 10  
     T 11  
  
The information content of a match, if expressed in bits of information, will therefore contribute to the ranking of the sequence. The 'blast' type of programs uses this method to determine the relevance of a sequence searching result. Additionally, the probability of finding the segment of interest by chance is calculated on the basis of the database and query information content. The lower this probability, the higher the significance, as the hit is less likely to occur by chance. The concept of information content can be used to introduce thresholds in sequence comparison, which determine whether a match is carried out or a score is computed at all: If the information content of a sequence is below such a threshold, no comparison will be made, and the search will report no relevant hits. This is a rare case but will occasionally occur in 'blast' searches.
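The dependence of information content on sequence composition can be illustrated with the Shannon entropy of the symbol frequencies. Note that this is an illustration chosen for this text and an assumption, not the statistic actually computed by the 'blast' programs; the function names and the threshold of 1.0 bit are hypothetical.

```python
import math
from collections import Counter


def bits_per_symbol(sequence):
    """Shannon entropy of the symbol composition, in bits per symbol."""
    counts = Counter(sequence.upper())
    total = len(sequence)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def is_informative(sequence, threshold=1.0):
    """Hypothetical filter: skip sequences whose composition carries too little information."""
    return bits_per_symbol(sequence) >= threshold


poly_a = "AAAAAAAAAAAA"
mixed = "ACGTACGTACGT"
print(bits_per_symbol(poly_a), is_informative(poly_a))
print(bits_per_symbol(mixed), is_informative(mixed))
```

A poly-A stretch yields zero bits per symbol, whereas a sequence using all four nucleotides equally often reaches the two bits per symbol of the coding table above.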

Expectations

The amount of data you will need to review after a sequence search might be enormous. To help you evaluate the result, most searching programs can generate a histogram which displays the scoring table graphically: The number of scores found is plotted against the score. The very large peak in the area of low scores can be attributed to "random" hits which occur by chance. Depending on algorithm and sequence, the hits of remote similarity will occur in the downhill area of this peak. Sequences from other organisms score at higher values, and the identity score is, typically, at very high values if a sequence is found in the database which matches the query sequence directly. The following graph illustrates this description of a search histogram (NOTE that the peak at low scores has been purposely truncated and is not shown in its full size):

 
  
number   
of hits  
  
^      
|    ////   statistical  
|    *  *   noise   
|    *  *  <----------  
|    *  *  
|    *  *                   species   
|    *   *                similarity  
|    *    *   remote       score  
|   *     *  similarity      |              
|   *      *   |             |          identities  
|   *       *  v             v             |  
|  *         * *            *              v  
| *             *          * ***           *  
+-----------------*********-----***********-***-->   
                                             score                                     
The discrimination between "remote similarity" and "statistical noise" is crucial and severely depends on the algorithm and the data used for comparison.

NOTE: Unless a reliable statistical estimation is part of the search program, you should investigate the region of the beginning "statistical noise" carefully. However, keep in mind that statistical relevance might not match biological requirements and might be misleading.

The significance of searches can be improved if the search is conducted on the protein level rather than the DNA level. This is due to the fact that codon usage differences between different organisms will increase the number of mismatches even if the proteins resulting from the two different sequence fragments are identical. Programs are available which will search the DNA sequence database after an on-the-fly translation into all six possible reading frames. When doing these kinds of searches, two major considerations will apply:

The 'framesearch' program of the GCG package takes care of this effect to some extent, but requires extreme resources for completion.

Programs

$ blast

This program can also search on the protein level when DNA sequence databases are used. The original NCBI blast program package as available from the NCBI includes the following programs:

The GCG implementation of the 'blast' program suite uses a single program - 'blast' - which launches any of the programs mentioned above. (This is called a front-end program).

The 'blast' suite is a program which may run either locally or via network. The 'blast' system is not implemented at the VMS site, but runs on the UNIX cluster via the HASSLE system. Databases available at Basel, or via the HASSLE system include:

 
  
Database name          GCG name          contents
----------------------------------------------------------------
EMBL + Updates
GENBANK + Updates
(GB as exclusion set)  nr                all DNA databases (1)
SWISSPROT              swissprot         most proteins
SWISSPROT +
PIR International
+ PATCHX + OWL         nr                all peptide databases (2)

1) The definition for "nr" might vary. Depending on the location, you might use either GENBANK with an exclusion set of EMBL data not in GENBANK, or use EMBL with an exclusion set of GENBANK (e.g., in Basel). Depending on whether or not you are connected to a network which is used to update the data on a periodic basis, the "nr" set includes also daily updates. At Basel, both EMBL and GENBANK are updated weekly, and EMBL is the basis of exclusion using the NCBI's 'nrdb' program.

2) The definition of "nr" might vary from site to site. At Basel, all available databases and their corresponding updates are computed on a weekly basis with NCBI's 'nrdb' program.

Additional databases are available at some sites, and will be displayed at the menu if you start the 'blast' program suite. It might happen that the 'blast' suite of programs is not enabled or temporarily unavailable at the time of your command due to resource limitations. It should be kept in mind that 'blast' programs are good for screening but should always be supplemented with other screening methods (e.g., swsearch or MPsrch, see below) in order to confirm findings.

General purpose, fast and reliable searches are done using

$ fasta

The GCG implementation of the 'fasta' program is typically a lower, and slower, version than the one distributed by the author, William Pearson. The "original" version of the 'fasta' program (i.e., not the version adopted by GCG) is much faster and searches only one strand in DNA searches. In addition, version 2.x and higher of the fasta package perform an automatic statistical evaluation of the result. Problems with the 'fasta' programs are most often observed when the database has been specified incorrectly. Use the database name followed by a colon and an asterisk to give the correct specification, following the instructions below.

Databases available at Basel University include:

 
Database name          GCG name            contents   
----------------------------------------------------------------  
EMBL + Updates     
GENBANK + Updates   
(GB as exclusion set)  GENEMBL:            all DNA databases (1)  
  
SWISSPROT              SWISSPROT:          most proteins (2)  
PIR International      PIR:                most proteins  
PATCHX + PIR           MIPSX:              MIPS merged database (3)  
  
NEW entries of EMBL    XEMBL:              EMBL new entries (4)   
UPDATED entries EMBL   XXEMBL:             EMBL updated entries (5)  
  
GENBANK update excl.   GB_NEW:             GENBANK exclusion (6)  

1) The definition of GENEMBL can vary. Depending on the location, you can use either GENBANK with an exclusion set of EMBL data not found in GENBANK, or vice versa (e.g., in Basel). Depending on whether you are connected to a network which is used to update data on a periodic basis, the GENEMBL set may include also daily updates.

2) containing weekly updates

3) PATCHX is updated quarterly and includes the previous release of SWISSPROT, an automatic translation of EMBL, and some other databases.

4) The definitions vary. XEMBL, EM_NEW, EMBL_DAILY, GB_NEW, XSWISS, SW_NEW, PIR4, etc. are names that denote the character of the preliminary entries.

5) This is a Basel-specific item. The main purpose of this database is to find new data in the annotation, as updates rarely include changes in the sequence. In order to have the main EMBL database show not too many entries in FASTA runs, the XXEMBL database is not included in the usual GENEMBL set.

6) This is a Basel-specific item. The weekly updated GENBANK database is calculated against EMBL and XEMBL to find those entries which are not in the EMBL updates yet.

Additional databases are available at Basel. Their names are displayed when you start the molecular biology environment. Examples are Amos Bairoch's PROSITE database of protein motifs, or Rich Roberts' REBASE database of restriction enzymes.

NOTE: The term GENEMBLPLUS, introduced in GCG version 8.1, is equivalent to GENEMBL. This is a deviation from the standard GCG installation which uses GENEMBL:* to describe all databases except EST and STS sections.

If you need to search for a peptide in the DNA database, this can be achieved with the tfasta program, which translates the DNA database on-the-fly into all possible reading frames. This search is computationally more expensive but considered to be more sensitive (see above, and below).


Rigorous Searching in the Twilight Zone

Principle

Very sensitive search program implementations use the "Smith and Waterman" algorithm. In contrast to the heuristic methods mentioned above, rigorous searching computes a complete alignment for each pair of query sequence and database sequence. Depending on the program or implementation, various matrices will be used at that time. Refer to the pairwise comparison section for details.
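To give an impression of what such a rigorous comparison involves, the following Python sketch computes the best local alignment score of two sequences in the manner of Smith and Waterman. It is a minimal illustration under stated assumptions: the simple match and mismatch values and the linear gap penalty are chosen for this example and stand in for the comparison matrices and gap parameters of real programs, and only the score, not the alignment itself, is returned.

```python
def smith_waterman_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty (assumed scoring values)."""
    previous = [0] * (len(seq_b) + 1)  # scores of the previous row
    best = 0
    for symbol_a in seq_a:
        current = [0]
        for j, symbol_b in enumerate(seq_b, start=1):
            diagonal = previous[j - 1] + (match if symbol_a == symbol_b else mismatch)
            # a local alignment may restart anywhere, hence the floor of 0
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


seq_a = "atggtaatggcacaattgactttcctgaatttctga"
seq_b = "tgatggtcaagtaaactatgaagagttt"
print(smith_waterman_score(seq_a, seq_b))
```

Every cell of the comparison is visited for every database entry, which explains why rigorous searches take so much longer than the heuristic methods.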

Programs

The framesearch program of the GCG package does this type of search for a protein sequence if a suitable DNA library is specified.

Programs running the Smith and Waterman type of rigorous searching might use quite a long time to achieve completion, or require special hardware in order to complete in shorter time. In particular, searches in DNA databases can take significant resources. Famous programs are 'swsearch' on the Bioccelerator and 'MPsrch' on the MasPar Computer.

If you happen to have access to W.Pearson's sequence analysis software you could try the ssearch program. This requires that you first convert the sequences of interest to STADEN format with the command tostaden. Alternatively, you can use the program readseq.

Also, via HASSLE, the programs by Coulson et al. (marketed by IntelliGenetics, Inc.) are available as

$ mpsrch

If you can get hold of an alignment of several sequences, and can produce a profile, use the program 'profilesearch' (see section on patterns).


Searching Strategies

Frequently, the combination of methods will get more comprehensive results than a single search. Therefore, even if the first trial of a sequence search produces apparently satisfactory results, it is suggested to run all available methods. Additionally, the following measures will help.

Tuning of your Sequence

Use the sequence editor seqed to create smaller sequences (100 bp, or 30 AA), or cut out frequently occurring parts such as ALU I repeats. The following criteria might be used to split DNA sequences:

Translate DNA

Determine the reading frame with the single-sequence analysis methods (e.g., frames), convert the DNA sequence to a protein with map followed by extractpeptide, and run the search on protein level with tfasta instead of on DNA level.

Tuning of the 'fasta' Parameter "word size"

Default settings are:

 
  
           2 for proteins            6 for DNA  
  
To get different output, try
 
  
           1 for proteins           3 for DNA  (or even 1)  
  

Tuning of the 'fasta' Parameter "list size"

Default setting:

 
  
           40   
  
to get longer lists, try:
 
  
          100  
  

Statistics Analysis of Hits

The software package 'fasta' from W.Pearson contains the 'prdf' program to analyse the results of arbitrary sequence pairs. However, version 2.x of fasta does statistical analysis automatically.

If the EGCG package programs are installed you might try fastacheck to check for significance of the results obtained.

In case of doubt, you might use the 'bestfit' program with the randomise option. Make sure that you give at least 200 randomisations to get a reasonable statistical distribution. Alternatively, you might use the shuffle program to generate a new sequence with identical length and composition. However, as the ordering of the symbols is different, the subsequent search should give significantly different groups of hits than the original search sequence.
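The idea behind randomisation can be sketched in Python as follows. This mimics the principle only and is not the GCG implementation: the function names are hypothetical, and the toy score (length of the longest identical ungapped segment) is an assumption standing in for a real alignment score.

```python
import random
import statistics


def longest_common_run(seq_a, seq_b):
    """Toy score (an assumption): length of the longest identical ungapped segment."""
    best = 0
    previous = [0] * (len(seq_b) + 1)
    for symbol_a in seq_a:
        current = [0]
        for j, symbol_b in enumerate(seq_b, start=1):
            run = previous[j - 1] + 1 if symbol_a == symbol_b else 0
            current.append(run)
            best = max(best, run)
        previous = current
    return best


def shuffle_sequence(sequence, rng):
    """New sequence of identical length and composition, symbols in random order."""
    symbols = list(sequence)
    rng.shuffle(symbols)
    return "".join(symbols)


def shuffled_scores(query, target, randomisations=200, seed=0):
    """Scores of the query against shuffled versions of the target."""
    rng = random.Random(seed)
    return [
        longest_common_run(query, shuffle_sequence(target, rng))
        for _ in range(randomisations)
    ]


seq_a = "atggtaatggcacaattgactttcctgaatttctga"
seq_b = "tgatggtcaagtaaactatgaagagttt"
observed = longest_common_run(seq_a, seq_b)
background = shuffled_scores(seq_a, seq_b)
at_least = sum(score >= observed for score in background) / len(background)
print(observed, statistics.mean(background), at_least)
```

If a large fraction of the shuffled sequences scores as high as the observed pair, the observed similarity is probably due to composition rather than to the order of the symbols.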

If you assume that your sequence is similar to a given group but have failed to detect this with the selected search algorithm, you might consider running a "prototype" search and using the list of sequences as a subset (see below).

The fastalert program as developed by F.Eggenberger at BioComputing Basel is a network application which will do the statistical analysis for you.

Mapping Result Data

Sequence similarity searches will result in a list of sequences which are reported to be similar to the original. However, in contrast to a pattern search, query sequences might be of considerable length and, therefore, show similarity to other sequences in several regions. This requires that the sequence searching output is inspected and classified by sequence coordinates of the query sequence. As no programs currently exist which allow for an automatic assignment, manual mapping of the detected sequence features is required. This manual mapping might also go along with the labelling of additional sequence features as revealed by the methods of single sequence analysis.

Analysis of Target Sequences

If several hits are encountered in the result of a sequence search, a close inspection of the hits actually occurring is essential. It might sound trivial, but the title of a sequence, if listed in a search output, does not allow the conclusion that the segment of similarity actually accounts for the functionality of the protein found. Rather, a look at the annotation of the sequence is required in order to confirm that the segment of similarity is relevant for protein function. In order to determine whether the similarity is accidental or meaningful, the seqed sequence editor might be used to partition the sequence of interest and to search with the detected similarity as a separate sequence. The following figure illustrates this technique schematically:

 
                      
          Region of          
         similarity  
 |------------------------------>  query sequence   
           ||||||  
        |------------------------------>  database sequence   
          :      :  
          :      : Redo the sequence search with the   
           ------  isolated fragment of database sequence   
  
This second sequence search should retrieve a pattern similar to that of the original search if the homology was significant. Careful inspection might also be useful to identify this segment as a member of a sequence family, which can be used further on to validate the originally found sequence.


Use of Specific Searching Libraries

Very low similarity of sequences might not be easily detected if the search is performed in the entire database. Due to the noise level of similarities scored by chance, important matches might be missed. The use of filters is essential in this case. A filter is any procedure applied to reduce the total number of sequences searched, most desirably using criteria which match the expectation of the performed search. These might be

The basic difference between these methods is the way in which the sublibraries are addressed. Depending on the algorithm, some of the procedures might not be available to you. E.g., the 'blast' database searching programs will not allow the use of user-specified subsets.

Database Sub-Libraries

There is a special manual of the GCG package which will tell you about database sub-libraries (see below). Depending on whether your site uses the EMBL database or the GENBANK database as its base set, the corresponding counterpart will be available as a subset. As a result, the GCG program package always has both EMBL and GENBANK logicals defined, even if a subset contains only a small number of sequences. On rare occasions, these subsets might even be entirely empty; this will happen if the EMBL and GENBANK subsections are perfectly in sync.

The WPI version of the interface will present these database subsections to you in database-neutral fashion if you use the correct window.

To see what sub-libraries are supported, you might try to obtain an on-line list as follows in the command line version:

$ show log EM*

$ show log GB*

Use the resulting names as GCG libraries. Additional help is provided in the data set manual of the GCG package. E.g., the EST:* specification applies if you are interested only in the expressed sequence tag section of the EMBL data library.

WARNING

The EST section of the DNA databases usually covers all sorts of species. If you want to use data subsections by organism rather than the section in its entirety, you would presumably need to employ large lists (such as those created with suitable search programs) and process these as described below.

TIP

The SWISS-PROT database uses the organism name as part of the entry name. E.g., Swissprot:*yeast will cover all yeast sequences.

Sequence Lists (formerly File Of Sequence Names (FOSN))

To use groups of sequences, each program package supplies a suitable mechanism with its own syntax. This syntax tells the software that the specification given shall be used as a group of sequences rather than as a single sequence. The GCG package calls this mechanism a Sequence List. Documentation before the 8.0 release of the package might refer to this feature as a File of Sequence Names (FOSN). The idea is straightforward: Programs no longer read the sequence from a file which contains the sequence data but rather use a file as a "pointer" to where to look for data, i.e., they read the sequence from a file which is specified in a list file.

To maintain compatibility with the established input handling, files which specify a list of sequences rather than a sequence directly are tagged by the character @ (English spelling: "at" character). A Sequence List is produced by a number of programs, such as:

To utilize the resulting file of sequence names, you might use the @ character in front of the file name in all programs which use multiple files. Sequence searching is such an application. To use a Sequence List as a library, e.g., @my.fil in the fasta program, you may use this nomenclature at the prompt "which sequence(s)?"

NOTES

1) The file-of-sequence-names method might not be available if you run your sequence analysis via networks.

2) The 'blast' suite of programs cannot use files of sequence names and requires its own database formatting (see below).

3) WPI users may use Sequence Lists much more conveniently by using the correct window (see below).

To reformat Sequence Lists into other formats, refer to the reformatting section.

Multiple Sequence Files (MSF)

The sequences of a List of sequences are not stored in the list file itself. Rather, the List file is a file of pointers to the files which shall be worked on. This implies the danger that, if a file being pointed to is deleted, the list of sequences is no longer valid.
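The danger of dangling pointers can be checked for mechanically. The following Python sketch assumes a strongly simplified list file with one file name per line, which is not the actual GCG list file format; the function name and the file names are hypothetical.

```python
import tempfile
from pathlib import Path


def missing_entries(list_file):
    """Return the names in a list file that no longer point to an existing file."""
    list_path = Path(list_file)
    names = [line.strip() for line in list_path.read_text().splitlines()]
    # names are resolved relative to the directory of the list file (an assumption)
    return [name for name in names if name and not (list_path.parent / name).exists()]


# Small demonstration in a temporary directory
with tempfile.TemporaryDirectory() as folder:
    folder = Path(folder)
    (folder / "first.seq").write_text("atggtaatgg\n")
    (folder / "my.fil").write_text("first.seq\nsecond.seq\n")
    missing = missing_entries(folder / "my.fil")
print(missing)
```

Running such a check before a long search avoids discovering an invalid list only after the search has failed.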

An alternative to the lists of sequences, therefore, is the option to write all sequence data into a single file. This will enlarge the file size, and also requires that a specific format is defined which allows multiple sequences rather than a single sequence in one file. Most conveniently, such a file is produced by the program 'pileup'. This application produces a multiple sequence alignment automatically, and stores the result in a single file, including gaps and the specific shift for each sequence. Multiple Sequence Files (MSF) (*.msf) are named as my.msf{*}. Details on the 'pileup' program are in the section on multiple sequence analysis.

NOTE: Due to some technical problems with localised keyboards it might be difficult for you to enter the characters "{" and "}" by typing the corresponding keys on the keyboard. Use the command 'genhelp distances example' and use the COPY option of your terminal or terminal emulator to take the {*} into the Paste buffer. PASTE the resulting keystrokes where appropriate.

To reformat multiple sequence files into other formats, refer to the reformatting section.

Lists within the Wisconsin Package Interface (WPI)

Since Version 8, you may use the Wisconsin Package Interface (WPI) via the X-Windows system. Lists are readily handled and are the basic principle of this user interface. Specifically, lists might be expanded with a mouse click to select individual sequences. Refer to the corresponding WPI section for details.

Impact of Electronic Networks and Time Effects

The usage of Lists might be restricted as not all databases are available at each site. Specifically, if you run your sequence analysis via networks or move from one site to another the lists might become affected if site-specific features are included.

Keep in mind that Lists are created at a defined point in time. If you use keyword searching, your List will reflect the status of the database at this specific time point. You might want to redo the keyword search frequently in order to maintain an up-to-date set of sequences. See also the notes below on the creation of own databases.

Lists are notorious troublemakers if disk space is tight and references are made to specific user-provided files. This implies that any 'cleaning' of sequence files from your directories might render lists unusable if the references become obsolete. Similarly, if you work on several machines, the simultaneous use of Lists on different computers requires an identical directory structure and the presence of all desired files in the expected locations. The output of the profilesearch program (see next chapter) is a List as well, and is known to inherit the location of the data used for searching. Manual editing may be required to overcome this limitation.

Creation of own Databases

Searching with Lists of sequences or Files of Sequence Names can be very time-consuming if you need to search a large amount of data. If you have enough disk space, you can create your own database with the command dataset, or ask your system manager how to create one.

NOTE:

================================= Begin Exercise 12

Sequence searching: Use of the 'blast' and 'fasta' searching programs to analyse DNA and protein sequences derived from previous analysis.

 
| Query  |    Database    | Feature or name of   
| from-to| entry  |from-to| identified sequence    
|--------+--------+-------+------------------  
|        |        |       |    
|        |        |       |    
|        |        |       |    
|        |        |       |    
|        |        |       |    
|        |        |       |    
|        |        |       |    

 
| Reading | Query |    Database    | Feature or name of   
|frame no.|from-to| entry  |from-to| identified sequence    
|---------+-------+--------+-------+---------------------  
|         |       |        |       |  
|         |       |        |       |  
|         |       |        |       |  
|         |       |        |       |  
|         |       |        |       |  
|         |       |        |       |  
|         |       |        |       |  
================================= End Exercise 12

