Section 12-2: Programs to Deal with Multiple Sequences



Subsection 12.2.1

Manual Editing with the Multisequence Editor

NOTE: To use the programs described below, you must be familiar with single-sequence editors and file handling.

If you start from scratch, use the command

% lineup

The screen will ask

Lineup of what sequence group ?

You are then asked to type a name (use fewer than 9 characters, and only letters or digits). Use all-lowercase letters, in particular on UNIX. The screen then opens with a display similar to that of the normal sequence editor seqed.
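
The naming advice above can be checked before starting the editor. The following is a minimal sketch in Python, assuming the rule is exactly "1 to 8 characters, lowercase letters and digits only"; the function name is_safe_group_name is hypothetical and not part of the GCG package.

```python
import re

# Assumed rule, taken from the advice above: 1 to 8 characters,
# lowercase letters and digits only.
GROUP_NAME = re.compile(r"[a-z0-9]{1,8}")


def is_safe_group_name(name: str) -> bool:
    """Return True if the name follows the naming advice for lineup groups."""
    return GROUP_NAME.fullmatch(name) is not None


# Try a few candidate names (all made up for illustration).
for candidate in ["eco", "Eco", "my_group", "globins12"]:
    print(candidate, is_safe_group_name(candidate))
```

Only the first candidate passes: the others contain an uppercase letter, an underscore, or nine characters.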

Add a New Sequence

Move to the command line with <CTRL><D>, give the command 'new', and type the new sequence name.

 

  
      <CTRL><D>
      : new
Create a NEW sequence (ten letter max.): test1

Move to desired position with arrows. Press <RETURN> to select position.

<CURSOR-UP> <RETURN>
NEW sequence named test1 placed at 0,1
  

  
Include an Existing Sequence

Move to the command line with <CTRL><D>, give the command 'get', and type the name of the sequence to include. This can be either a sequence in your own directory, as created with commands from the GCG package, or a sequence from the database. You might need to use the import functions of the GCG package.

 

  
      <CTRL><D>
      : get
GET what sequence: protein:mchu
LINEUP get of protein:mchu
      Begin (* 1 *) ?     <RETURN>
      End (* 149 *) ?     <RETURN>
      Reverse (* No *) ?  <RETURN>
      That Begins: MADQLTEEQI
         and Ends: EEFQMMTAK
Is this what you want to include (* Yes *) ? <RETURN>

Move to desired position with arrows. Press <RETURN> to select position.

<CURSOR-UP> <RETURN>
Enter name (ten letter max.): Mchu <RETURN>
LINEUP get of "protein:mchu" from: 1 to: 149 ....
  

  
Navigation

The 'lineup' editor works much like the 'seqed' program discussed earlier for single-sequence input. However, because multiple sequences are shown as several lines, the <CURSOR-UP> and <CURSOR-DOWN> keys are used to jump between the different sequences in the alignment. The period (.) key is used to insert gaps.

CAUTION: If the current directory contains a key mapping file, such as one set up for sophisticated use of the 'seqed' program, the period might be missing from it and will therefore not work in 'lineup'. SOLUTION: Delete the file or add the period to it.

Consensus Calculation

One of the sequences (the one at line 0) is special: it can hold a consensus sequence which is automatically updated upon gap insertion or sequence shifting. To activate this mechanism, move to the command line (<CTRL><D>) and type auto.
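
To make the idea of a consensus line concrete, here is a minimal sketch in Python. It assumes a simple plurality rule (the most frequent non-gap symbol in each column) and is not the rule that lineup itself applies, which is not stated here. The function name and the three toy rows are made up for illustration; the period marks a gap, as in the editor.

```python
from collections import Counter

GAP = "."  # the editor uses the period for gaps


def plurality_consensus(rows):
    """Most common non-gap symbol per column; a gap if the column has only gaps.

    Ties go to the symbol seen first (topmost row), because Counter keeps
    insertion order for equal counts.
    """
    width = max(len(row) for row in rows)
    padded = [row.ljust(width, GAP) for row in rows]  # pad short rows with gaps
    consensus = []
    for column in zip(*padded):
        residues = [symbol for symbol in column if symbol != GAP]
        if not residues:
            consensus.append(GAP)
        else:
            consensus.append(Counter(residues).most_common(1)[0][0])
    return "".join(consensus)


# Toy alignment: three made-up variants of a ten-letter fragment.
rows = ["MADQLTEEQI", "MADQLTDEQI", "MA.QLTEEQI"]
print(plurality_consensus(rows))
```

Recomputing such a line after every gap insertion or shift is what the auto mode spares you from doing by hand.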

Get Help

Move to the command line (<CTRL><D>), and type help.

Exit

Move to the command line (<CTRL><D>), and type exit.


Subsection 12.2.2

Manual Editing of Sequence Alignments

If you used the lineup program earlier, it will have created a so-called file of sequence names (FOSN, extension *.fil) and numerous fragments which represent your sequences in their lined-up form. For example, to reload an alignment of the group eco, call the lineup program with

% lineup eco

If you used another program to produce a multisequence alignment (e.g., the program pileup), the result might be in the multiple sequence format (MSF, extension *.msf). To use 'lineup' on a file called eco.msf, call it as

% lineup -msf eco

You can convert each of the two formats into the other with the command reformat. Consult genhelp to learn how to convert MSF to FOSN format and FOSN to MSF format.


Subsection 12.2.3

Manual Editing of File of Sequence Names

If you need to name more than one sequence, you can use asterisks (*) as "wildcards" (see the section on file handling for the use of wildcards in file naming conventions). The GCG programs, however, also allow you to write a file which contains only the file names rather than the files themselves. To create such a list file, call the system editor (see the section on editing), enter a line with two periods ("..") as the first line, and then enter the file names of all sequences you are interested in. You can mix your own sequences with names from the database. Refer to the section on Lists for details on GCG list files.
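
A list file of this kind can also be written by a small script instead of the system editor. The following Python sketch only follows the layout described above (a first line with two periods, then one name per line); the function name list_file_text and the entry test1.seq are hypothetical, while protein:mchu and my.fil are names used elsewhere in this chapter.

```python
import tempfile
from pathlib import Path


def list_file_text(names):
    """Build the text of a list file: '..' on the first line, then one name per line."""
    return "\n".join(["..", *names]) + "\n"


# Write the list into a scratch directory and show the result.
with tempfile.TemporaryDirectory() as scratch:
    target = Path(scratch) / "my.fil"
    target.write_text(list_file_text(["protein:mchu", "test1.seq"]))
    print(target.read_text(), end="")
```

In real use you would write the file into your working directory rather than a scratch directory, so that GCG programs can find it.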


Subsection 12.2.4

Automatic Creation of File of Sequence Names

As described above, you can use several GCG and GCG-like programs to produce a file of sequence names (also known as a list). Remember that the file should contain only sequence names after the two periods. (The GCG programs take care of this automatically if applied correctly.)


Subsection 12.2.5

Automatic Generation of a Multiple Sequence Alignment

The GCG program pileup can align many sequences, which you specify either as single files using wildcards (e.g., *.seq) or through a file of sequence names given as @my.fil (if the file my.fil contains the sequence names). This is usually called automated multisequence alignment. As an example, the result of a pileup run using data from a findpatterns run is shown:

% pileup

 
PILEUP creates a multiple sequence alignment from a group of
related sequences using progressive, pairwise alignments.  It can
also plot a tree showing the clustering relationships used to
create the alignment.

PileUp of what sequences ?  @findpatterns.find

1            MCHU   149 aa
2            MCRB   148 aa
3            MCRT   149 aa
4            MCBO   148 aa

What is the gap weight (* 3.00 *) ? <RETURN>
What is the gap length weight (* 0.10 *) ? <RETURN>

This program can display the clustering relationships graphically.
Do you want to:

           A) Plot to a FIGURE file called "PileUp.Figure"
           B) Plot graphics on HP7550 attached to /dev/tty
           C) Suppress the plot

Please choose one (* A *): <RETURN>

The minimum density for a one-page plot is 4.0 sequences/100 platen units.
What density do you want (* 4.0 *) ? <RETURN>

What should I call the output file name (* pileup.msf *) ? <RETURN>

Determining pairwise similarity scores...

1   x     2       1.49
1   x     3       1.50
1   x     4       1.50
2   x     3       1.49
2   x     4       1.49
3   x     4       1.50

Aligning...

1     .......-.
2     .......-.
3     .......-.

FIGURE instructions are now being written into pileup.figure.

Total sequences:          4
Alignment length:        149
CPU time:      00.84
Output file:/biox/biocomputing/doelz/pileup.msf
  

  
(The output file name would be slightly different, but the procedure is identical on VMS and UNIX.)
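
The six lines under "Determining pairwise similarity scores" in the run above correspond to every unordered pair of the four input sequences, that is, n(n-1)/2 = 6 comparisons for n = 4. The short Python sketch below only enumerates these pairs in the same order; it does not compute any scores, and the variable names are chosen for illustration.

```python
from itertools import combinations

# Sequence names in the order pileup numbered them in the run above.
names = ["MCHU", "MCRB", "MCRT", "MCBO"]

# All unordered pairs of the 1-based sequence numbers.
pairs = list(combinations(range(1, len(names) + 1), 2))

for i, j in pairs:
    print(f"{i} x {j}   ({names[i - 1]} vs {names[j - 1]})")

print("comparisons:", len(pairs))
```

Because the number of pairs grows quadratically, the pairwise step takes noticeably longer as more sequences are added to the list.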


Subsection 12.2.6

Display of the Dendrogram Generated by the 'pileup' program

The program pileup generates an output file which shows the results of the clustering process.

NOTE: This visual representation of sequence similarity must not be used as a phylogenetic tree, because the length and ordering of sequences are based on sequence similarity and not on phylogenetic algorithms. For a coarse review of sequence relationships, however, the dendrogram can be considered.

Otherwise, use the programs distances and growtree as described below. To visualise the dendrogram, remember that you need to define the plotting environment with setplot, unless you did this earlier or are working with the Wisconsin Package Interface (WPI). If necessary, also define the X-Windows environment correctly. Next, issue the command

% figure pileup.figure


Subsection 12.2.7

Presentation of the Alignment

The program pretty generates an output file which shows the result of the automatic sequence alignment letter by letter. A variety of special command line parameters control how the alignment is presented. To see them, use the option

% pretty -check

It is important that you specify the multiple sequence alignment correctly. For example,

pileup.msf{*}

is the correct way to specify a sequence alignment. To generate a sophisticated printout with a consensus sequence that shows only the differences, use the command

% pretty -cons -diff='.' pileup.msf{*}
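
What a "differences only" display looks like can be illustrated with a small Python sketch. It assumes that every symbol agreeing with the consensus is replaced by the chosen character (here a period) and that only deviating symbols are printed; this is an interpretation of the description above, not the code or the exact rule used by pretty. The function name and the toy rows are made up.

```python
def show_differences(rows, consensus, same="."):
    """Replace every symbol that matches the consensus with the 'same' character.

    Assumes all rows and the consensus have the same length.
    """
    masked = []
    for row in rows:
        masked.append(
            "".join(same if symbol == ref else symbol for symbol, ref in zip(row, consensus))
        )
    return masked


consensus = "MADQLTEEQI"
rows = ["MADQLTEEQI", "MADQLTDEQI"]  # made-up variants of a ten-letter fragment

print(consensus)
for line in show_differences(rows, consensus):
    print(line)
```

Such a display makes single substitutions stand out in an otherwise well conserved alignment.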

Improving 'pretty' Output

Sometimes the file name descriptors in the pretty output file are not needed. In this case, the replace program can be used to replace the file name with spaces. To accomplish this, create a text file (see the editing section for help on how to edit files) containing a line with two periods, followed by the string to be replaced and the replacement string. If you used all default settings of pileup, such a file, here named my.replace, would look like

 
..
"@pileup.msf"
"           "
  
where pileup.msf was the name of the msf file. The command to be issued for replacement is

% replace pretty.pretty my.replace new.pretty


Subsection 12.2.8

Graphic Presentation of Similarity in the Alignment

The program plotsimilarity slides a window across the sequence alignment and plots the similarity of the sequences within it. To learn more about the options of plotsimilarity, use the check option. This program requires graphics and should only be used after the plotting environment has been defined (setplot). The sequence alignment should be specified as an msf file, e.g., pileup.msf{*}.
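
The sliding-window idea can be sketched in a few lines of Python. The sketch assumes a deliberately simple per-column score (the fraction of identical sequence pairs) averaged over each window; plotsimilarity itself uses its own scoring, which is not reproduced here. The function names and the toy alignment are made up for illustration.

```python
from itertools import combinations


def column_identity(column):
    """Fraction of sequence pairs that carry the same symbol in this column."""
    pairs = list(combinations(column, 2))
    return sum(a == b for a, b in pairs) / len(pairs)


def windowed_similarity(rows, window):
    """Average column identity for every window position along the alignment.

    Assumes at least two rows of equal length and window <= alignment length.
    """
    scores = [column_identity(column) for column in zip(*rows)]
    return [
        sum(scores[start:start + window]) / window
        for start in range(len(scores) - window + 1)
    ]


rows = ["MADQ", "MADE", "MAEQ"]  # toy alignment, three sequences of length 4
for start, value in enumerate(windowed_similarity(rows, window=2), start=1):
    print(f"window starting at column {start}: {value:.2f}")
```

A wider window smooths the curve and hides short variable stretches; a narrower one shows more detail but is noisier.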


Subsection 12.2.9

Schematic Presentation of Sequence Similarity

The program 'distances' writes a text file with a matrix showing each-to-each comparison scores. It was changed in version 8 of the GCG software and is described below.
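
The shape of such an each-to-each matrix can be illustrated with the following Python sketch. It uses percent identity between aligned rows as a stand-in score, which is an assumption made only for illustration; the scores that 'distances' actually writes depend on the method selected in that program. Function names and the toy rows are made up.

```python
def percent_identity(a, b):
    """Percentage of aligned positions where both rows carry the same symbol.

    Assumes both rows have the same, non-zero length.
    """
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def score_matrix(rows):
    """Square matrix of each-to-each scores; the diagonal compares a row with itself."""
    return [[percent_identity(a, b) for b in rows] for a in rows]


rows = ["MADQ", "MADE", "MAEQ"]  # toy alignment, three sequences of length 4
for line in score_matrix(rows):
    print("  ".join(f"{value:6.1f}" for value in line))
```

The matrix is symmetric, so only one triangle carries independent information.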

