JAMF Archive

BioCompanion as published in 1995
THIS IS THE REFERENCE CODE AS PUBLISHED.
		Doelz, R.   
		Optimal production of biological documentation: the JAM format.
		Comput. Applic. Biosci. 11, 224-226 (1995).    
		
The version you are currently viewing is the one printed and distributed via the Internet from the server of BioComputing Basel. Version 3.1 of the BioCompanion was published with version 2 of the JAMF software. The server that was indicated in the documentation has ceased to exist.

Version 3.2 of the BioCompanion was not freely available to the public but was shareware distributed with GCG's software release 9. For the purpose of enhanced editing, JAMF was partially rewritten and the proprietary version 3.x of JAMF was used from 1996 onwards. The BioCompanion is available in a current version from the publisher. It has changed significantly in both software and content.



Chapter 12: Sequence Families


Principle of Multiple Sequence Alignment

Once a sequence search is completed, the question arises whether the sequences found also share similarity with each other. This can be examined either automatically or manually by using programs which align the sequences of interest.

Prerequisites

If you drew a map from the result of your sequence search as described earlier, it might be obvious that sequences usually share similarity only in parts. This leaves the ends or overhanging parts of two sequences badly aligned due to low similarity. Therefore, before alignments are attempted, it is good practice to create sequence fragments of approximately the same length, which allows programs to operate more easily.

If sequences are not specifically tailored for multiple sequence alignment, programs might fail or report unreliable alignments.
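The tailoring step can be illustrated with a minimal Python sketch. It assumes that the matching region of each sequence is known from the search result as a begin/end pair counted from 1 and inclusive, in the same way as the Begin and End prompts of the GCG programs. The function names `cut` and `tailor` are hypothetical and not part of GCG.

```python
def cut(seq, begin, end):
    """Return the fragment from begin to end (1-based, inclusive)."""
    if not 1 <= begin <= end <= len(seq):
        raise ValueError("range %d..%d outside sequence of length %d"
                         % (begin, end, len(seq)))
    return seq[begin - 1:end]


def tailor(seqs, ranges):
    """Cut every named sequence to the range taken from the search result.

    seqs   maps a sequence name to its full sequence string
    ranges maps the same names to (begin, end) tuples
    """
    return {name: cut(seqs[name], begin, end)
            for name, (begin, end) in ranges.items()}


if __name__ == "__main__":
    full = {"x": "AAMADQLTKK", "y": "MADQLT"}
    hits = {"x": (3, 8), "y": (1, 6)}
    for name, fragment in tailor(full, hits).items():
        print(name, len(fragment), fragment)
```

Fragments produced in this way have comparable lengths, which is the condition the alignment programs work best under.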

Finding the Best

The approach used for automatic sequence alignment can be described as "clustering" of the most similar sequences. In a first step, the program needs to find the sequence pair(s) which share(s) the most obvious similarity. To achieve this, each sequence is compared with every other one, which results in n*(n-1)/2 comparisons (roughly (n*n)/2 for large n) if we have n sequences to compare; four sequences, for example, require six comparisons. As in rigorous sequence searching, each comparison uses sequence comparison tables to compute the best possible alignment and to score it appropriately. (Note that the scores will not be as desired if the sequences have not been tailored as mentioned above.)
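The all-against-all step can be sketched in a few lines of Python. This is only an illustration: the score used here is a plain fraction of identical positions without gaps, standing in for the comparison-table alignment score a real program would compute, and the names `identity`, `all_pairs` and `best_pair` are hypothetical.

```python
from itertools import combinations


def identity(a, b):
    """Toy score: fraction of identical positions over the shorter length."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def all_pairs(seqs):
    """Score every unordered pair once: n*(n-1)/2 comparisons for n sequences."""
    return {(p, q): identity(seqs[p], seqs[q])
            for p, q in combinations(seqs, 2)}


def best_pair(scores):
    """Return the pair of names with the highest score."""
    return max(scores, key=scores.get)


if __name__ == "__main__":
    seqs = {"a": "MADQLTEEQI", "b": "MADQLTEEQV",
            "c": "MSDKLTEDQI", "d": "AAAAAAAAAA"}
    scores = all_pairs(seqs)
    print(len(scores), "comparisons; best pair:", best_pair(scores))
```

Because every pair has to be scored, the work grows roughly with the square of the number of sequences.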

Grouping

Once the comparison for each possible sequence pair has been completed, the "best" candidates serve as nuclei, and additional sequences are aligned to the already existing alignment. This will work well with similar proteins, but too many gaps, in particular at the DNA level, will most probably not yield the desired result. The largest errors will occur if regions with low similarity are used as the "closest" set, as these will make it difficult to match additional sequences.

If problems are encountered because similarity cannot be determined well enough automatically, either manual alignment is required or the selection of sequences must be improved by tailoring or by omitting very remotely related fragments.
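The grouping idea can be sketched as a greedy ordering in Python. This is a simplified illustration and not the clustering procedure of any particular program: the best-scoring pair becomes the nucleus, and the remaining sequences are added in order of their average score against the members already placed. The function name `join_order` and the use of `frozenset` keys are assumptions made for this example.

```python
def join_order(names, score):
    """Return the order in which sequences would join the alignment.

    names  list of sequence names
    score  dict mapping frozenset({a, b}) to the pairwise score of a and b
    """
    nucleus = max(score, key=score.get)      # most similar pair
    order = sorted(nucleus)                  # the nucleus comes first
    remaining = set(names) - set(nucleus)
    while remaining:
        def average(candidate):
            # mean score of the candidate against everything already placed
            return sum(score[frozenset((candidate, member))]
                       for member in order) / len(order)
        nxt = max(sorted(remaining), key=average)
        order.append(nxt)
        remaining.remove(nxt)
    return order


if __name__ == "__main__":
    pairs = {("a", "b"): 0.9, ("a", "c"): 0.7, ("b", "c"): 0.6,
             ("a", "d"): 0.1, ("b", "d"): 0.1, ("c", "d"): 0.0}
    score = {frozenset(k): v for k, v in pairs.items()}
    print(join_order(["a", "b", "c", "d"], score))
```

A sequence that scores poorly against every member already placed ends up last, which is where a badly tailored or very remote fragment tends to disturb the alignment.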

Result Evaluation

The result of a multiple sequence alignment is a block of sequences displayed on top of each other. Programs exist which plot the degree of similarity along the sequence coordinate. Other programs allow you to print or plot the output nicely. The GCG programs also produce a figure which schematically displays the level of similarity as a dendrogram. As outlined below, the dendrogram which illustrates sequence similarity must not be mistaken for a phylogenetic tree; it can, however, be used to verify that the alignment proceeded as expected.

Limitations

Multiple Sequence Alignment is NOT the tool for you if you are working on fragment assembly or shotgun sequencing. In order to align multiple sequences reliably, the similarity amongst the members of the alignment should extend along the entire length rather than being confined to overlapping fragments.


Programs to Deal with Multiple Sequences

Manual Editing with the Multisequence Editor

If you start from scratch, use the command

$ lineup

The screen will ask

 
  
Lineup of what sequence group ?  
  
You are asked to type in a name (try to use fewer than 9 characters, and use only letters or numbers). Use all-lowercase letters. The screen will then open with a display similar to that of the normal sequence editor seqed.

Add a New Sequence

Move to the command line with <CTRL><Z>, give the command 'new', and type the new sequence name.

 
  
      <CTRL><Z>      
      : new  
Create a NEW sequence (ten letter max.):   
test1  
  
Move to desired position with arrows. Press <RETURN> to select position.
 
  
<CURSOR-UP> <RETURN>  
NEW sequence named test1 placed at 0,1  
  
Include an Existing Sequence

Move to the command line with <CTRL><Z>, give the 'get' command, and type the name of the sequence to include. The sequence given is either a sequence in your own directory, as created with commands from the GCG package, or a sequence from the database.

 
  
      <CTRL><Z>      
      : get  
GET what sequence:   
protein:mchu  
LINEUP get of protein:mchu   
  
      Begin (* 1 *) ?      <RETURN>  
      End (* 149 *) ?      <RETURN>  
      Reverse (* No *) ?   <RETURN>  
      That Begins: MADQLTEEQI   
         and Ends: EEFQMMTAK  
Is this what you want to include (* Yes *) ? <RETURN>  
  
Move to desired position with arrows. Press RETURN to select position.
 
  
<CURSOR-UP> <RETURN>  
Enter name (ten letter max.): Mchu <RETURN>  
LINEUP get of "protein:mchu" from: 1 to: 149   
....   
  
Navigation

The 'lineup' editor works similarly to the 'seqed' program discussed earlier for single sequence input. However, as multiple sequences are shown as several lines, the <CURSOR-UP> and <CURSOR-DOWN> keys are used to jump between different sequences in the alignment. The period (.) key is used to insert gaps.

CAUTION: If you have a key mapping file in the current directory, such as one set up for advanced use of the 'seqed' program, the period might be missing from it and will therefore not work in 'lineup'. SOLUTION: Delete the file or add the period to it.

Consensus Calculation

One of the sequences (the one at line 0) is special: it can hold a consensus sequence which is automatically updated upon gap insertion or sequence shifting. To activate this mechanism, move to the command line (<CTRL><Z>), and type auto.

Get Help

Move to the command line (<CTRL><Z>), and type help.

Exit

Move to the command line (<CTRL><Z>), and type exit.

Manual Editing of Sequence Alignments

If you used the lineup program earlier, it will have created a so-called file of sequence names (FOSN, extension *.fil) and numerous fragments which represent your sequences in their lined-up form. For example, to reload an alignment of the group eco, call the lineup program with

$ lineup eco

If you used another program to produce a multisequence alignment (e.g., the program pileup), this might be in the multiple sequences format (MSF, extension *.msf). To use 'lineup' on a file called eco.msf, call it as

$ lineup/msf eco

You can convert either format into the other with the command reformat. Use the corresponding section of genhelp to learn how to convert MSF to FOSN format and FOSN to MSF format.

Manual Editing of File of Sequence Names

If you need to name more than one sequence, you can use asterisks (*) as "wildcards". (See the section on file handling for the use of wildcards in file naming conventions.) The GCG programs, however, also allow you to write a file which contains only file names rather than the files themselves. To create such a list file, call the system editor (see the section on editing), enter a line with two periods ("..") as the first line, and then enter the file names of all sequences you are interested in. You can mix your own sequences and names from the database. Refer to the section on Lists for details on GCG list files.
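Instead of typing the list in an editor, the same layout can be produced by a short script. The following Python sketch only follows the description given here, a first line with two periods followed by one sequence name per line; the function name `list_file_text` and the file name `my.fil` in the usage part are chosen for illustration.

```python
def list_file_text(names):
    """Build the text of a file of sequence names.

    The first line holds the two periods; every following line
    holds exactly one sequence name. Empty entries are skipped.
    """
    cleaned = [name.strip() for name in names if name.strip()]
    return "\n".join([".."] + cleaned) + "\n"


if __name__ == "__main__":
    # own sequence files and database entries can be mixed
    text = list_file_text(["my.seq", "protein:mchu"])
    with open("my.fil", "w") as handle:
        handle.write(text)
    print(text, end="")
```

The resulting file would then be passed to a GCG program as @my.fil.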

Automatic Creation of File of Sequence Names

As described above, you can use several GCG and GCG-like programs to produce a file of sequence names (also known as Lists). Remember that the file should contain only sequence names after the two periods. (This is taken care of by the GCG programs automatically if applied correctly.)

Automatic Generation of a Multiple Sequence Alignment

The GCG program pileup can align many sequences by specifying them either as single files using wildcards (e.g., *.seq) or by using a file of sequence names and specifying these as @my.fil (if the file my.fil contains the sequence names). (This is usually called automated multisequence alignment). As an example, the result of pileup utilising data from a findpatterns run is shown:

$ pileup

 
PILEUP creates a multiple sequence alignment from a group of   
related sequences using progressive, pairwise alignments.  It can   
also plot a tree showing the clustering relationships used to   
create the alignment.   
PileUp of what sequences ?  @findpatterns.find     
1            MCHU   149 aa     
2            MCRB   148 aa     
3            MCRT   149 aa     
4            MCBO   148 aa   
What is the gap weight (* 3.00 *) ? <RETURN>   
What is the gap length weight (* 0.10 *) ? <RETURN>   
This program can display the clustering relationships  graphically.   
Do you want to:       
           A) Plot to a FIGURE file called "PileUp.Figure"       
           B) Plot graphics on HP7550 attached to /dev/tty       
           C) Suppress the plot   
Please choose one (* A *): <RETURN>   
The minimum density for a one-page plot is 4.0 sequences/100 platen units.   
What density do you want (* 4.0 *) ? <RETURN>   
What should I call the output file name (* pileup.msf *) ? <RETURN>   
Determining pairwise similarity scores...     
1   x     2       1.49     
1   x     3       1.50     
1   x     4       1.50     
2   x     3       1.49     
2   x     4       1.49     
3   x     4       1.50   
Aligning...     
1     .......-.     
2     .......-.     
3     .......-.   
  
FIGURE instructions are now being written into pileup.figure.          
Total sequences:          4         
Alignment length:        149                 
CPU time:      00.84              
Output file:/biox/biocomputing/doelz/pileup.msf        
  
(The output file name would be slightly different but the procedure is identical on VMS and UNIX).

Display of the Dendrogram Generated by the 'pileup' program

The program pileup generates an output file which shows the results of the clustering process.

NOTE: This visual representation of sequence similarity must not be used as a phylogenetic tree because the length and ordering of sequences are based on sequence similarity and not on phylogenetic algorithms. For a coarse review of sequence relationships, however, the dendrogram can be considered.

Otherwise, use the programs distances and growtree as described below. To visualise the dendrogram, remember that you need to define the plotting environment with setplot if you did not do this earlier or if you do not work with the Wisconsin Package Interface (WPI). If necessary, define the X-Windows environment correctly. Next, issue the command

$ figure pileup.figure

Presentation of the Alignment

The program pretty generates an output file which shows the results of the automatic sequence alignment letter by letter. If you happen to have the extensions to the GCG programs installed, you can also use the command prettyplot (written by Peter Rice at EMBL), which uses graphics (use setplot if needed). 'prettyplot' takes the same command options as pretty. Additionally, you might try prettybox to shade the symbols (PostScript printer required). To visualise the alignment, you can use a variety of special command line parameters. To see them, use the option

$ pretty /check

It is important that you specify the multiple sequence alignment correctly, e.g.,

 
  
pileup.msf{*}   
  
is the correct way to specify a sequence alignment. To generate a sophisticated print with a consensus sequence and showing the differences only, use the command

$ pretty/cons/diff="-" pileup.msf{*}
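What such a display conveys can be sketched in Python. This is an illustration under stated assumptions and not the algorithm used by pretty: the consensus is taken as the most frequent symbol in each column, gaps are written as periods, and every symbol that agrees with the consensus is replaced by "-". The function names `consensus` and `show_differences` are hypothetical.

```python
from collections import Counter


def consensus(aligned, gap="."):
    """Most frequent non-gap symbol per column (simple majority rule)."""
    result = []
    for column in zip(*aligned):
        counts = Counter(symbol for symbol in column if symbol != gap)
        result.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(result)


def show_differences(aligned, mark="-", gap="."):
    """Replace every symbol that agrees with the consensus by the mark."""
    cons = consensus(aligned, gap)
    return ["".join(mark if symbol == c else symbol
                    for symbol, c in zip(row, cons))
            for row in aligned]


if __name__ == "__main__":
    rows = ["MADQ", "MADK", "MSDQ"]
    print(consensus(rows))
    for line in show_differences(rows):
        print(line)
```

Only the deviating symbols remain visible, which makes variable positions easy to spot in a long alignment.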

Improving 'pretty' Output

Sometimes, the file name descriptors of the pretty output file are not needed. In this case, the replace program can be used to have the file name replaced by spaces. To accomplish this, create a text file (see the editing section for help on how to edit files) and write two periods, as well as the replacement string. If you use all default settings of pileup, such a file would be named my.replace and look like

 
..   
"@pileup.msf"  
"           "  

where pileup.msf was the name of the msf file. The command to be issued for replacement is

$ replace pretty.pretty my.replace new.pretty

Improving 'prettyplot' Output

Occasionally, the file name descriptors of the 'prettyplot' output file are not needed. To accomplish the removal, you can either proceed as above, or specify

$ prettyplot/cons/diff="-"/shortname pileup.msf{*}

Graphic Presentation of Similarity in the Alignment

The program plotsimilarity uses a window to slide across the sequence alignment and plots the similarity of the sequences. To learn more about the options of plotsimilarity, use the check option. This program requires graphics and should only be used after the plotting environment has been defined (setplot). The sequence alignment should be specified as an msf file, e.g., pileup.msf{*}.
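The sliding-window idea can be sketched in Python. The sketch is an assumption-laden illustration and not the scoring used by plotsimilarity: the similarity of a column is taken as the fraction of identical pairs of symbols, gaps (written as periods) never count as a match, and the window value is the plain average over its columns. The names `column_identity` and `similarity_profile` are hypothetical.

```python
from itertools import combinations


def column_identity(column, gap="."):
    """Fraction of symbol pairs in one column that are identical."""
    pairs = list(combinations(column, 2))
    if not pairs:
        return 0.0
    same = sum(1 for a, b in pairs if a == b and a != gap)
    return same / len(pairs)


def similarity_profile(aligned, window=10, gap="."):
    """Average column identity inside a window slid along the alignment."""
    length = min(len(row) for row in aligned)
    per_column = [column_identity([row[i] for row in aligned], gap)
                  for i in range(length)]
    window = max(1, min(window, length))
    return [sum(per_column[i:i + window]) / window
            for i in range(length - window + 1)]


if __name__ == "__main__":
    rows = ["AAAA", "AAAT", "AACT"]
    for start, value in enumerate(similarity_profile(rows, window=2), 1):
        print(start, round(value, 2))
```

A larger window smooths the curve and hides short variable stretches; a smaller window shows them but produces a noisier plot.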

Schematic Presentation of Sequence Similarity

The program 'distances' writes a text file with a matrix showing each-to-each comparison scores. It has been changed in version 8 of the GCG software and is described below.


Phylogeny

Creation of a Tree

The program pileup (see above) is used to create a multiple sequence alignment. Then, the program distances is used to compute the distances between the aligned sequences, and growtree can draw a figure which shows the sequences arranged in a tree representing the distances as computed from the alignment.

NOTE: This way of tree creation depends entirely on the alignment. If the similarity of the sequences considered is low, the alignment produced by 'pileup' might be faulty and require manual refinement. The resulting tree must then be considered to be of significantly reduced value, as possible errors in the alignment will affect the tree significantly.
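How a distance matrix follows from an alignment can be sketched in Python. The measure used here, the fraction of mismatching positions among the columns where neither sequence has a gap, is a simple assumption for illustration; it is not claimed to be one of the correction methods offered by the distances program. The names `distance` and `distance_matrix` are hypothetical, and gaps are assumed to be written as periods.

```python
def distance(a, b, gap="."):
    """Fraction of mismatches over the columns without a gap in a or b."""
    compared = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not compared:
        return 1.0                      # nothing comparable: maximal distance
    mismatches = sum(x != y for x, y in compared)
    return mismatches / len(compared)


def distance_matrix(aligned, gap="."):
    """Each-to-each distances for a dict of name -> aligned sequence."""
    names = list(aligned)
    return {p: {q: distance(aligned[p], aligned[q], gap) for q in names}
            for p in names}


if __name__ == "__main__":
    alignment = {"a": "MADQ", "b": "MADK", "c": "M.DQ"}
    matrix = distance_matrix(alignment)
    for p in matrix:
        print(p, " ".join("%.2f" % matrix[p][q] for q in matrix))
```

Every value in the matrix is read directly from the aligned columns, so a misplaced gap changes the distances and, with them, any tree drawn from the matrix.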

PAUP-based Methods

At the time of writing, the programs paupsearch and paupdisplay were unfortunately not included in the GCG distribution.

Other Packages

The EGCG program tophylip allows conversion to the format required by the 'Phylip' program suite. Phylip (by J. Felsenstein) and other packages are also available on personal computers and Macintoshes. Various public domain and shareware programs might be useful to you. Refer to the discussion on additional software earlier in the BioCompanion for the benefits and problems of adding software.


Manual Creation of Sequence Alignments

The programs for pairwise comparison, such as gap, must be run with the "out" option in order to generate files suitable for manual alignment:

$ gap/out

See the section on pairwise comparison for details of the presentation of the output. As described in the description of the lineup editor above, sequence files are loaded into 'lineup' with the get command. The manipulation of the .out files is straightforward.

NOTE: The manual creation of multiple sequence alignments might be considerably influenced by the user and is not recommended for publication or tree creation.


Profiles

Principle

Rigorous searching implements the alignment methods used by programs like bestfit in a sequence database searching routine. The usefulness of this kind of searching can be enhanced by using so-called profiles: once a sequence search has revealed homologies to several sequences, it is desirable to identify shared regions of homology in a multiple sequence alignment. The information contained in the alignment can then be reused in further analyses and searches. Remote similarities in the "twilight zone" are not necessarily easily detected by heuristic searching methods. Various algorithms implement alignment procedures known from pairwise alignments but require significantly more resources. The GCG program package currently features the profile search method of Gribskov et al.

Profile searching unites the benefit of comparison matrices with the sequence-specific allowance of exchanges already used in the pattern approach. However, the substitutions in patterns follow a yes/no scheme. To enhance sensitivity, the matrix values for a given exchange in profiles are weighted according to the observed alignment.

Profile searching is a complex method, and whether the extensive work is justified depends heavily on the diversity of the input sequences. Please make sure that you have read suitable introductory literature. The GCG Program Reference Manual, for example, has a Profile Analysis Essay which you should read before you use the methods extensively.
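The difference between a yes/no pattern and a weighted, position-specific score can be illustrated with a minimal Python sketch. It is deliberately simplified and does not reproduce the method of Gribskov et al.: the weights are plain symbol frequencies per alignment column, no comparison matrix is mixed in, and the candidate is scored without gaps. The function names `make_profile` and `profile_score` are hypothetical.

```python
from collections import Counter


def make_profile(aligned, gap="."):
    """One dict per column: symbol -> observed frequency (gaps ignored)."""
    profile = []
    for column in zip(*aligned):
        residues = [symbol for symbol in column if symbol != gap]
        counts = Counter(residues)
        total = len(residues)
        profile.append({symbol: n / total for symbol, n in counts.items()}
                       if total else {})
    return profile


def profile_score(profile, seq):
    """Sum of the column weights of the symbols found in seq (no gaps)."""
    return sum(column.get(symbol, 0.0)
               for column, symbol in zip(profile, seq))


if __name__ == "__main__":
    profile = make_profile(["MADQ", "MADK", "MSDQ"])
    for candidate in ("MADQ", "MSDK", "WWWW"):
        print(candidate, round(profile_score(profile, candidate), 2))
```

A candidate carrying the rarer symbols of the alignment still receives a partial score at those positions, whereas a comparison against the consensus sequence alone would treat them as plain mismatches.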

Formats of Sequences

The data used for profile searching must be in GCG format. See reformat or genmanual sequence_exchange for details. For multiple sequence alignment, there are several possible file formats to start with.

 
  
|                         | file     |  
|     type of file        | ending   |   called as (example)  
+-------------------------+----------+-------------------------  
| normal sequence file    | .seq or  |  
|                         | .pep     |   my.seq  
+-------------------------+----------+-------------------------  
| file of sequence names  | .frg and |   
| (from 'lineup', etc.)   | .fil     |   @my.fil  
+-------------------------+----------+-------------------------  
| multiple sequence files |          |  
| (from 'pileup')         | .msf     |   my.msf{*}  

Profile Generation

The program profilemake generates a profile from a set of aligned sequences in msf format.

Profile Searching

The program profilesearch uses a profile generated by profilemake and produces a listing of the best-fitting sequences in a database. To align these with the profile, the program profilesegments is required. The EGCG program tprofilesearch allows searching with a protein profile in a DNA database translated on the fly.

Profile Analysis

The program profilegap uses a profile generated by profilemake and compares it to a sequence using an end-to-end alignment algorithm (as in gap).

================================= Begin Exercise 13

Understand the benefit, scope and limitations of a rigorous searching method. Generate a profile and show the difference in searching the alignment vs. searching the consensus.

================================= End Exercise 13

